Extension to the sns pipeline
This program is an extension to the sns wes pipeline for bioinformatic analysis of whole/targeted exome sequencing data.
snsxt is a BYOC (Bring Your Own Code) framework for running downstream analysis tasks on sns-wes pipeline output. Use this framework to run any extra analysis tasks you like after an sns pipeline analysis has finished.
- Create a new directory for your analysis
mkdir /path/to/analysis
cd /path/to/analysis
- Clone this repository and navigate to its directory
git clone --recursive https://github.com/NYU-Molecular-Pathology/snsxt.git
cd snsxt
- Run the run.py script
snsxt/run.py -d .
- -d, --analysis_dir: Path to the directory to use for the analysis. For a new sns analysis, this will become the output directory. For an existing sns analysis output, this will become the input directory
- -f, --fastq_dir: Directories containing .fastq.gz files (required for a new sns analysis)
- -a, --analysis_id: An identifier for the analysis (e.g. NextSeq run ID)
- -r, --results_id: A sub-identifier for the analysis (e.g. a timestamp)
- -t, --task-list: A YAML formatted list of downstream analysis tasks for snsxt, defaults to task_lists/default.yml
- --targets: A .bed file with genomic regions for the analysis, defaults to the included targets.bed file
- --probes: Probes .bed file with regions for CNV analysis, defaults to the included probes.bed file
- --pairs_sheet: "samples.pairs.csv" samplesheet to use for paired analysis
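As a rough illustration of how the options above fit together, the sketch below declares them with Python's standard argparse module. It is not the actual run.py code: the function name build_parser, the dest names, the choice to make -d required, and the choice to let -f accept several directories are all assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Sketch of the snsxt command line options")
    # assumption: the analysis directory is treated as mandatory here
    parser.add_argument("-d", "--analysis_dir", dest="analysis_dir", required=True,
                        help="Directory to use for the analysis")
    # assumption: one or more fastq directories may follow -f
    parser.add_argument("-f", "--fastq_dir", dest="fastq_dirs", nargs="*", default=[],
                        help="Directories containing .fastq.gz files")
    parser.add_argument("-a", "--analysis_id", dest="analysis_id", default=None,
                        help="Identifier for the analysis")
    parser.add_argument("-r", "--results_id", dest="results_id", default=None,
                        help="Sub-identifier for the analysis")
    parser.add_argument("-t", "--task-list", dest="task_list", default="task_lists/default.yml",
                        help="YAML formatted list of downstream analysis tasks")
    parser.add_argument("--targets", dest="targets", default="targets.bed",
                        help=".bed file with genomic regions for the analysis")
    parser.add_argument("--probes", dest="probes", default="probes.bed",
                        help="Probes .bed file with regions for CNV analysis")
    parser.add_argument("--pairs_sheet", dest="pairs_sheet", default=None,
                        help="samples.pairs.csv samplesheet for paired analysis")
    return parser


# parse an explicit argument list instead of sys.argv so the example runs anywhere
args = build_parser().parse_args(["-d", ".", "-a", "example_run"])
print("analysis_dir={0} task_list={1}".format(args.analysis_dir, args.task_list))
```

Options that are not supplied fall back to the defaults described in the list, which is why only -d needs to be given for an existing analysis.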
Names and locations of these items may change with development
Starting at the parent snsxt directory (this repo's parent dir):
- snsxt: main directory containing all code for the program
- snsxt/config: configuration module for the main program
- snsxt/fixtures: dummy analysis output files and directories for unit testing
- snsxt/logs/: default program log output directory
- snsxt/sns_classes: submodule with Python classes for interacting with sns pipeline output
- snsxt/sns_tasks: submodule containing additional analysis tasks to be performed in the program
- snsxt/util: submodule with utility functions and classes for usage in the program
- snsxt/report: directory containing files and configuration for the parent analysis report
- snsxt/logging.yml: configurations for program logging
- snsxt/test.py: script to run all unit tests in the program and its submodules
- snsxt/run.py: main script used to run the program
The sns_tasks submodule contains code for the various analysis tasks to be run in the program, which are derived from the AnalysisTask custom class. Examples of other analysis task classes are included in the submodule, along with a class template. Task classes must be imported into the sns_tasks/__init__.py file in order to be made accessible to the rest of the program.
Tasks can come in a few flavors:
- tasks that operate on the entire analysis at once
- tasks that operate on a single sample at a time
Additionally, tasks can be run a few different ways:
- run in the current program session
- submitted as a compute job to the HPC cluster with qsub
Each combination of task type and run type utilizes a separate 'run' function, which should be wrapped by the task's run() method.
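The relationship between task types, run types, and the run() wrapper can be sketched as follows. This is a simplified stand-in, not the real AnalysisTask class: the class name AnalysisTaskSketch, the method names run_analysis_task, run_sample_task, and run_qsub_sample_task, the example subclasses, and the job.sh script name are all hypothetical. The qsub variant only builds the command it would submit, so the example runs without a cluster.

```python
class AnalysisTaskSketch(object):
    """Hypothetical stand-in for the AnalysisTask base class."""

    def __init__(self, samples):
        self.samples = list(samples)

    def main(self, target):
        """Task logic; subclasses override this."""
        raise NotImplementedError

    # one 'run' function per combination of task type and run type
    def run_analysis_task(self):
        """Whole-analysis task, run in the current session."""
        return [self.main(self.samples)]

    def run_sample_task(self):
        """Per-sample task, run in the current session."""
        return [self.main(sample) for sample in self.samples]

    def run_qsub_sample_task(self):
        """Per-sample task as cluster jobs; commands are built but not submitted here."""
        return [["qsub", "-N", "{0}-{1}".format(type(self).__name__, sample), "job.sh"]
                for sample in self.samples]


class CountSamples(AnalysisTaskSketch):
    """Example task that operates on the entire analysis at once."""

    def main(self, samples):
        return len(samples)

    def run(self):
        # run() wraps the parent 'run' function matching this task's type
        return self.run_analysis_task()


class SampleLabel(AnalysisTaskSketch):
    """Example task that operates on a single sample at a time."""

    def main(self, sample):
        return sample.upper()

    def run(self):
        return self.run_sample_task()


print(CountSamples(["s1", "s2"]).run())
print(SampleLabel(["s1", "s2"]).run())
```

Keeping the choice of 'run' function inside each task's run() method means the caller can invoke every task the same way, regardless of its type or how it is executed.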
The snsxt program uses a YAML formatted 'task list' file to determine which tasks should be run, and in what order. By default, the task_lists/default.yml file is used. Task names listed should correspond to the name of the Python class for each analysis task, and extra parameters to be passed to the task's run() function can be included.
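A minimal sketch of this lookup is shown below, assuming the task list has already been parsed from YAML into an ordered mapping of task names to optional run() parameters. The exact layout of the real task list file may differ; the TaskA and TaskB classes, the AVAILABLE_TASKS registry, and the run_task_list function are hypothetical names used only for illustration.

```python
from collections import OrderedDict


class TaskA(object):
    """Hypothetical task that accepts extra run() parameters."""

    def run(self, **kwargs):
        return ("TaskA", kwargs)


class TaskB(object):
    """Hypothetical task that takes no extra parameters."""

    def run(self, **kwargs):
        return ("TaskB", kwargs)


# stands in for the task classes made available through sns_tasks/__init__.py
AVAILABLE_TASKS = {"TaskA": TaskA, "TaskB": TaskB}

# assumed shape of a parsed task list: names in run order, each with optional parameters
task_list = OrderedDict([
    ("TaskB", None),
    ("TaskA", {"threshold": 10}),
])


def run_task_list(task_list, available_tasks):
    """Run each listed task in order, passing its parameters to run()."""
    results = []
    for name, params in task_list.items():
        if name not in available_tasks:
            raise KeyError("Unknown task in task list: {0}".format(name))
        task = available_tasks[name]()
        # a task listed without parameters is run with no extra arguments
        results.append(task.run(**(params or {})))
    return results


print(run_task_list(task_list, AVAILABLE_TASKS))
```

Matching names against a registry of imported classes is why a task must be imported in sns_tasks/__init__.py before it can appear in a task list.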
You can add new analysis task modules to snsxt by following this workflow:
- enter the sns_tasks subdirectory and make a copy of the template:
cd snsxt/sns_tasks
cp _template.py _MyNewTask.py
- edit the new task's custom Python class following the template shown, putting the main logic to run the task in the class's main() method, and setting the run() method as a wrapper around the required parent run method
- make a copy of the config file for the new module:
cp config/template.yml config/MyNewTask.yml
- edit the new YAML config file with the corresponding info for the task (recommended to use Sublime Text or Atom)
- import the module inside the sns_tasks/__init__.py file
- add the new module to a task list to be run
Analysis task modules can have associated report files. These should be R Markdown formatted documents designed to be imported as child documents to the parent report included in snsxt/report. A module-specific report can be added like this:
- set the names of all report file(s) in the config for your task module
- add an entry for your new report and its data input directory in the snsxt/report config file
The new report should now be detected by the parent reporting R Markdown document and included in the final report output.
Unit tests for the various modules included in the program can be run with the test.py script. Individual modules can be tested with their corresponding test_*.py scripts.
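For orientation, a per-module test script could look roughly like the sketch below, which uses Python's standard unittest module. The helper sample_id_from_filename and the test class are hypothetical and exist only to give the tests something to check; they are not part of snsxt.

```python
import unittest


def sample_id_from_filename(filename):
    """Hypothetical helper under test: strip the '.fastq.gz' suffix to get a sample ID."""
    suffix = ".fastq.gz"
    if not filename.endswith(suffix):
        raise ValueError("Not a .fastq.gz file: {0}".format(filename))
    return filename[:-len(suffix)]


class TestSampleID(unittest.TestCase):
    def test_strips_suffix(self):
        self.assertEqual(sample_id_from_filename("Sample1.fastq.gz"), "Sample1")

    def test_rejects_other_files(self):
        with self.assertRaises(ValueError):
            sample_id_from_filename("Sample1.bam")


def run_suite():
    """Load and run the tests in this module, returning the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(TestSampleID)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

A top-level script such as test.py could then collect and run the suites from each module's test file in the same way.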
Designed and tested in Python 2.7
Designed to run on Linux systems, tested under CentOS 6
Requires pandoc version 1.13+ for reporting
sh.py is used as an included dependency.
sns pipeline output is required to run this.
snsxt uses the util and sns_classes libraries as dependencies