Demultiplexing pipeline for Illumina data IN PROGRESS
nf-core/demultiplex is a bioinformatics pipeline for demultiplexing several types of input data from sequencing runs. The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It comes with Docker and Singularity containers, which makes installation straightforward and results highly reproducible.
The sample sheet must follow the format shown below. It adheres to the Illumina standard, with the additional columns DataAnalysisType and ReferenceGenome to ensure that 10X samples are processed correctly. The order of the columns does not matter, but the column names are case-sensitive.
Lane | Sample_ID | User_Sample_Name | index | index2 | Sample_Project | ReferenceGenome | DataAnalysisType |
---|---|---|---|---|---|---|---|
1 | ABC11A2 | U_ABC0_BS_GL_DNA | CGATGT | | PM10000 | Homo sapiens | Whole Exome |
2 | SAG100A10 | SAG100A10 | SI-GA-C1 | | SC18100 | Mus musculus | 10X-3prime |
3 | CAP200A11 | UN1800_AE_6 | iCLIP | | PM18200 | Homo sapiens | Other |
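
The column rule above (case-sensitive names, any order) can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function names are hypothetical, and it assumes the sample rows are already available as plain CSV text with a single header line and that all eight columns from the table are required.

```python
import csv
import io

# Column names copied from the table above; matching is case-sensitive.
REQUIRED_COLUMNS = {
    "Lane", "Sample_ID", "User_Sample_Name", "index", "index2",
    "Sample_Project", "ReferenceGenome", "DataAnalysisType",
}


def missing_columns(header):
    """Return the required column names absent from header, ignoring column order."""
    return sorted(REQUIRED_COLUMNS - set(header))


def read_sample_rows(text):
    """Parse CSV text into row dicts; raise ValueError if a required column is missing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = missing_columns(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)
```

Because the check compares sets of names, a sheet whose columns are reordered passes, while a header such as `lane` instead of `Lane` is reported as missing.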
- Reformatting the input sample sheet
  - The script looks for `iCLIP` in the index column of the sample sheet and collapses the iCLIP samples into one per lane.
  - Splits 10X single cell samples into 10X, 10X-ATAC and 10X-DNA .csv files by searching the sample sheet column DataAnalysisType for `10X-3prime`, `10X-ATAC` and `10X-CNV`.
  - Outputs which processes in the pipeline need to run (only 10X single cell samples, a mix of 10X single cell and non single cell samples, or only non single cell samples).
- Checking the sample sheet for samples that would cause downstream errors, such as:
  - a mix of short and long indexes on the same lane
  - a mix of single and dual indexes on the same lane
- Processes that run only if the sample sheet check finds problems (CONDITIONAL):
  - Creates a new sample sheet with any samples that would cause an error removed, and creates a txt file listing the removed problem samples.
  - Runs `bcl2fastq` on the newly created sample sheet and outputs the Stats.json file.
  - Parses the Stats.json file for the indexes that were in the problem samples list.
  - Rechecks the newly made sample sheet for any errors or problem samples that did not match any indexes in the Stats.json file. If there is still an issue, the pipeline exits at this stage.
- Single cell 10X sample processes (CONDITIONAL):
  Runs Cell Ranger, Cell Ranger ATAC or Cell Ranger DNA, depending on the sample sheet data type.
  NOTE: A config must be created to point to the Cell Ranger genome references.
  - Cell Ranger mkfastq runs only when 10X samples exist. It runs with `CellRanger`, `CellRanger ATAC` or `Cell Ranger DNA`, depending on which sample sheet has been created.
  - Cell Ranger Count runs only when 10X samples exist. It runs `Cell Ranger Count`, `Cell Ranger ATAC Count` or `Cell Ranger DNA CNV`, depending on the output from Cell Ranger mkfastq. 10X reference genomes can be downloaded from the 10X site. When the pipeline is used outside the Crick profile, a new config must be created to point Cell Ranger to the location of these references.
- `bcl2fastq` (CONDITIONAL):
  - Runs on either the original sample sheet, if it had no error-prone samples, or on the new sample sheet created by the extra steps.
  - Runs only when samples remain on the sample sheet after the single cell samples are removed.
  - The arguments passed to bcl2fastq are parameters that can be set on the command line when launching the pipeline. They also account for whether index reads will be written as FastQ files.
- `FastQC` runs on the pooled fastq files from all the conditional processes.
- `FastQ Screen` runs on the pooled results from all the conditional processes.
- `MultiQC` runs on each project's FastQC results.
- `MultiQC_all` runs on all FastQC results produced.
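
As a rough illustration of the first two steps in the list above (routing samples by DataAnalysisType and checking lanes for mixed indexes), the following Python sketch may help. It is not the pipeline's actual script. The function names and the `standard` group label are hypothetical, rows are assumed to be dicts keyed by the sample sheet column names, and "short and long indexes" is interpreted here as index sequences of differing length.

```python
from collections import defaultdict

# DataAnalysisType values named in the steps above, mapped to the output groups.
TENX_TYPES = {"10X-3prime": "10X", "10X-ATAC": "10X-ATAC", "10X-CNV": "10X-DNA"}


def split_by_analysis_type(rows):
    """Group rows into 10X, 10X-ATAC, 10X-DNA and 'standard' (hypothetical label for the rest)."""
    groups = defaultdict(list)
    for row in rows:
        groups[TENX_TYPES.get(row["DataAnalysisType"], "standard")].append(row)
    return dict(groups)


def index_shape(row):
    """Return (index length, is_dual) for one sample row."""
    return len(row["index"]), bool(row.get("index2"))


def lanes_with_mixed_indexes(rows):
    """Return sorted lanes that hold several index lengths or both single and dual indexes."""
    lengths = defaultdict(set)
    duals = defaultdict(set)
    for row in rows:
        length, dual = index_shape(row)
        lengths[row["Lane"]].add(length)
        duals[row["Lane"]].add(dual)
    return sorted(
        lane for lane in lengths
        if len(lengths[lane]) > 1 or len(duals[lane]) > 1
    )
```

In this sketch the index check is meant to be applied to the `standard` group only, since 10X rows carry index set names such as `SI-GA-C1` rather than sequences. Which groups come back non-empty corresponds to the three cases listed above: only 10X samples, a mix, or only non single cell samples.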
The nf-core/demultiplex pipeline comes with documentation, found in the `docs/` directory:
- Installation
- Pipeline configuration
- Running the pipeline
- Output and how to interpret the results
- Troubleshooting
Credits
The nf-core/demultiplex pipeline was written by Chelsea Sawyer of The Bioinformatics & Biostatistics Group for use at The Francis Crick Institute, London.
Many thanks to others who have helped out along the way too, including (but not limited to): @ChristopherBarrington, @drpatelh, @danielecook, @escudem, @crickbabs.