geneSpark

Installation

Requirements

Python >= 2.7
Pandas >= 0.14.1
Spark / PySpark
- https://spark.apache.org/downloads.html

How to run

Regular version:

python src/geneSpark.py [-o OUTPUT_FILE] [-u UPSTREAM_BASE_PAIRS] [-d DOWNSTREAM_BASE_PAIRS] INPUT_FILE
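
For readers scripting around this command, the usage line can be mirrored with argparse. This is only a sketch of an equivalent parser, not the code in src/geneSpark.py: the destination names, the integer types, the None defaults, and the example file names are all assumptions made for illustration.

```python
import argparse


def make_parser():
    # Mirrors: geneSpark.py [-o OUTPUT_FILE] [-u UPSTREAM_BASE_PAIRS]
    #          [-d DOWNSTREAM_BASE_PAIRS] INPUT_FILE
    parser = argparse.ArgumentParser(prog="geneSpark.py")
    parser.add_argument("input_file", metavar="INPUT_FILE")
    parser.add_argument("-o", dest="output_file", metavar="OUTPUT_FILE",
                        default=None)
    # Assumption: base-pair counts are whole numbers. The real defaults are
    # not documented here, so None is used as a stand-in.
    parser.add_argument("-u", dest="upstream", metavar="UPSTREAM_BASE_PAIRS",
                        type=int, default=None)
    parser.add_argument("-d", dest="downstream", metavar="DOWNSTREAM_BASE_PAIRS",
                        type=int, default=None)
    return parser


if __name__ == "__main__":
    # Hypothetical arguments, for demonstration only.
    args = make_parser().parse_args(
        ["-o", "out.txt", "-u", "1000", "-d", "500", "genes.txt"]
    )
    print(args)
```

All three options are optional in the usage line, so the sketch requires only INPUT_FILE.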

Spark version:

/PATH/TO/spark-VERSION/bin/spark-submit --master local[2] src/geneSpark_spark.py [-o OUTPUT_FILE] [-u UPSTREAM_BASE_PAIRS] [-d DOWNSTREAM_BASE_PAIRS] INPUT_FILE
This example uses 2 cores of a local machine. Change local[2] to match the number of cores you want to use. To find out how many cores your machine has, run sysctl -n hw.ncpu in a macOS terminal (on Linux, nproc serves the same purpose). For more options, see the Spark documentation.
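
The core count can also be detected from Python instead of being typed by hand. The sketch below builds the spark-submit argument list from the command shown above; `build_submit_command` is a hypothetical helper introduced for illustration and is not part of geneSpark, and the paths are placeholders.

```python
import multiprocessing


def build_submit_command(spark_submit, input_file, cores=None):
    """Return the spark-submit argument list for a local run.

    `spark_submit` is the path to the spark-submit binary; `cores` is the
    number of local cores to request (detected automatically when omitted).
    """
    if cores is None:
        # Portable across macOS and Linux, unlike `sysctl -n hw.ncpu`.
        cores = multiprocessing.cpu_count()
    return [
        spark_submit,
        "--master", "local[{0}]".format(cores),
        "src/geneSpark_spark.py",
        input_file,
    ]


if __name__ == "__main__":
    # Placeholder paths: substitute your own Spark location and input file.
    cmd = build_submit_command(
        "/PATH/TO/spark-VERSION/bin/spark-submit", "INPUT_FILE"
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.call` once the placeholders are replaced with real paths.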

The Apache Spark version of geneSpark runs approximately 5X faster than the regular version (which uses only the pandas and NumPy libraries) on a local machine with 2 cores. On a local machine with 8 cores, the Spark version runs 11X faster. The more cores your system has, the faster the Spark version finishes its run.
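
To put the two quoted figures side by side, the short calculation below divides each speedup by the number of cores used. It is only arithmetic on the numbers stated above, not a new measurement. Because the baseline is the regular pandas version rather than a single-core Spark run, the result should not be read as strict parallel efficiency.

```python
def speedup_per_core(speedup, cores):
    """Speedup over the regular version divided by the cores used."""
    return float(speedup) / cores


# Figures quoted above: 5X on 2 cores, 11X on 8 cores.
for speedup, cores in [(5, 2), (11, 8)]:
    print("{0} cores: {1:.3f}X per core".format(
        cores, speedup_per_core(speedup, cores)))
# prints:
# 2 cores: 2.500X per core
# 8 cores: 1.375X per core
```

On these figures, total speed keeps improving with more cores, while the gain contributed by each additional core shrinks.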

The Spark version of geneSpark is designed to scale up and use the thousands of computing cores available in an HPC environment via the MapReduce framework.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

geneSpark

Installation

Requirements

How to run

Regular version:

Spark version:

About

Releases

Packages

Contributors 2

Languages

License

Bohdan-Khomtchouk/geneSpark

Folders and files

Latest commit

History

Repository files navigation

geneSpark

Installation

Requirements

How to run

Regular version:

Spark version:

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages