The pipeline is built using Nextflow, a workflow manager that runs tasks across multiple compute infrastructures in a portable manner. It supports the conda package manager and Singularity/Docker containers, which makes installation easier and results highly reproducible.
The objective of the pipeline is to predict tumor-specific neoantigens from both DNA and RNA next-generation sequencing data from patients.
-
HLA typing is performed by seq2HLA (v2.2) for both MHCI and MHCII, based on the paired RNA FASTQ files.
-
Detection of neoantigens is performed by the pVACtools suite (v4.1.1). The pipeline is divided into two parts, one focusing on DNA-based analysis (pVACseq) and the other on fusion events derived from RNAseq data (pVACfuse).
-
MiXCR (v4.5.0) was added to provide a fast analysis of raw T- or B-cell receptor repertoires.
-
Paired RNAseq reads are aligned with STAR (v2.7.6a) against the STAR index, using the --quantMode TranscriptomeSAM option to obtain a transcriptome-based alignment BAM file. Per-gene and per-transcript TPM (transcripts per million) values are then estimated using Salmon (v1.10.2) with the matching Gencode GFF3 and transcript FASTA files.
-
Small somatic variants (SNVs, indels) were first called using GATK Mutect2 (v4.1.8.0).
- Variants were annotated using VEP (ENSEMBL v110.1).
- Both gene (GX) and transcript (TX) expression values were then added using vatools (v5.1.0) and the previously computed expression files.
- RNA depth (RDP) and RNA allelic ratio (RAF) were then added using a combination of bcftools (v1.15.1), GATK SelectVariants (v4.1.9.0) and bam-readcount (v0.8).
-
pVACseq was then run using HLA typing files (for MHCI & MHCII) on the resulting variant file.
- Arriba (v2.4.0) was run on a subset of the original STAR aligned file containing only reads of putative relevance to fusion detection, such as unmapped and clipped reads.
- pVACfuse was then run on the list of filtered fusions of interest, using both HLA typing files.
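As a reading aid for the expression (TPM) and RNA read-count (RDP, RAF) annotations described above, the quantities can be written out explicitly. This is the standard TPM definition and the usual read-fraction definition of an allelic ratio, not a description of the tools' internals. The symbols are notation introduced here: $c_t$ is the number of reads assigned to transcript $t$, $\ell_t$ its effective length, and $a_v$ the number of RNA reads supporting the alternate allele at variant $v$. The gene-level formula assumes that gene values are obtained by summing over the transcripts of a gene.

```latex
\mathrm{TPM}_t = 10^{6} \cdot \frac{c_t / \ell_t}{\sum_{u} c_u / \ell_u},
\qquad
\mathrm{TPM}_g = \sum_{t \in g} \mathrm{TPM}_t,
\qquad
\mathrm{RAF}_v = \frac{a_v}{\mathrm{RDP}_v}
```

By construction, the transcript-level TPM values of one sample sum to $10^{6}$.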
-
sample_plan: CSV file containing one sample per row
-
assembly: the genome assembly for the analysis (example: hg38)
-
genomePath: path containing the different files described in "conf/genomes.config"
-
singularityImagePath: path to singularity images
-
vep_dir_cache: path to the VEP cache downloaded following the Ensembl VEP cache installation instructions (here: species="homo_sapiens" & version="110_GRCh38")
-
vep_plugin_repo: path to the VEP_plugins repository into which Frameshift.pm was downloaded
-
blacklist_tsv: file named "blacklist_${assembly}*.tsv.gz", found in the /database folder of the downloaded Arriba archive
-
proteinGff: file named "protein_domain_${assembly}*.gff3", found in the /database folder of the downloaded Arriba archive
-
mi_license: path to the "mi.license" file needed for MiXCR (free for academic use)
-
tmpdir: path to temporary folder
nextflow run main.nf --samplePlan ${sample_plan} \
--genome ${assembly} \
--genomeAnnotationPath ${genomePath} \
--outDir ${outputDir} \
--singularityImagePath ${sif} \
--vepDirCache ${vep_dir_cache} \
--vepPluginRepo ${vep_plugin_repo} \
--miLicense ${mi_license} \
--tmpdir ${tmpdir} \
-profile singularity,cluster \
-w ${tmp_dir} \
-resume
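Because the command above relies on many shell variables, it can help to check that none of them is empty before launching. The following is a minimal sketch and not part of the pipeline: `check_required` is a hypothetical helper, and the example values are placeholders.

```shell
# Hypothetical helper (not part of the pipeline): print the names of the
# variables passed as arguments that are unset or empty.
check_required() {
  missing=""
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  # Strip the leading space before printing.
  echo "${missing# }"
}

# Placeholder values for illustration only.
sample_plan="sample_plan.csv"
assembly="hg38"
outputDir=""

# Prints the names of the variables that still need a value (here: outputDir).
check_required sample_plan assembly outputDir
```

An empty output means every listed variable has a value; the check does not verify that the paths exist.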
A sample plan is a CSV (comma-separated) file that lists all samples with their biological IDs. The sample plan is expected to contain the following fields (with no header):
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex
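To illustrate the layout, the sketch below writes a one-row sample plan and counts the rows that do not have the expected 12 fields. The sample names and all paths are invented placeholders, `count_bad_rows` is a hypothetical helper, and the row is written without spaces after the commas, which is an assumption about what the pipeline expects.

```shell
# Hypothetical sample and paths, for illustration only.
plan="$(mktemp)"
printf '%s\n' \
  'S1,tumor01,normal01,/data/S1_dna_R1.fastq.gz,/data/S1_dna_R2.fastq.gz,/data/S1_dna.bam,/data/S1_dna.bam.bai,/data/S1.vcf.gz,/data/S1_rna_R1.fastq.gz,/data/S1_rna_R2.fastq.gz,/data/S1_rna.bam,/data/S1_rna.bam.bai' \
  > "$plan"

# Hypothetical helper: count rows of file $1 whose number of
# comma-separated fields differs from $2.
count_bad_rows() {
  awk -F',' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Number of rows that do not have exactly 12 fields.
count_bad_rows "$plan" 12
```

A result of 0 means every row has 12 fields; the check says nothing about whether the files exist.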
The basic steps are: HLAtyping, RNAquant, pVacseq, pVacfuse, mixcr. Using the --step option, they can be run separately (e.g. --step HLAtyping, --step RNAquant or --step mixcr), partially combined (e.g. --step HLAtyping,RNAquant,pVacseq or --step HLAtyping,pVacfuse) or all together (default mode; --step HLAtyping,RNAquant,pVacseq,pVacfuse,mixcr).
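Since the step list is a single comma-separated string, a typo in one name is easy to miss. A minimal user-side sketch that checks a step string against the step names listed above before launching is shown here; `check_steps` is a hypothetical helper, not a pipeline feature, and the pipeline may handle unknown names differently.

```shell
# Step names as listed in this document.
valid_steps="HLAtyping RNAquant pVacseq pVacfuse mixcr"

# Hypothetical helper: print "ok" if every comma-separated name in $1 is a
# known step, otherwise report the first unknown name and return 1.
check_steps() {
  for s in $(printf '%s' "$1" | tr ',' ' '); do
    case " $valid_steps " in
      *" $s "*) ;;
      *) echo "unknown step: $s"; return 1 ;;
    esac
  done
  echo "ok"
}

check_steps "HLAtyping,RNAquant,pVacseq"
```

The check is case-sensitive, matching the spelling used in the examples (for instance pVacseq rather than pVACseq).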
If you only want to get the HLA alleles (MHCI & MHCII), add "--step HLAtyping" to your command line. If you already have the two HLA allele files (MHCI & MHCII), add their full paths to the sample plan as follows:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex, path_to_HLAI_file, path_to_HLAII_file
If you only want to get the transcript- and gene-based expression files (TPM), add "--step RNAquant" to your command line. If you already have the gene-based and transcript-based expression files, add their full paths to the sample plan as follows:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex, path_to_HLAI_file, path_to_HLAII_file, path_to_gene_tpm_file, path_to_transcript_tpm_file
Or, if you still want the pipeline to run the HLAtyping step (--step HLAtyping,RNAquant,pVacseq), leave the two HLA columns empty:
sampleID, sampleName, normalName, path_to_fastqDnaR1, path_to_fastqDnaR2, path_to_sampleDnaBam, path_to_sampleDnaBamIndex, path_to_vcf, path_to_fastqRnaR1, path_to_fastqRnaR2, path_to_sampleRnaBam, path_to_sampleRnaBamIndex,,,path_to_gene_tpm_file,path_to_transcript_tpm_file
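The optional columns are positional, so an empty HLA column must still be kept as an empty field. The sketch below reports, for one sample plan row, whether columns 13-14 (HLA files) and 15-16 (TPM files) are filled. `describe_row` is a hypothetical helper written for this illustration, the row content is a placeholder, and how the pipeline itself interprets empty fields is not shown here.

```shell
# Hypothetical helper: report whether the optional HLA (fields 13-14) and
# TPM (fields 15-16) columns of a comma-separated row are filled.
describe_row() {
  printf '%s\n' "$1" | awk -F',' '{
    hla = ($13 != "" && $14 != "") ? "provided" : "missing"
    tpm = ($15 != "" && $16 != "") ? "provided" : "missing"
    print "HLA=" hla " TPM=" tpm
  }'
}

# Placeholder row: 12 mandatory fields, two empty HLA fields, two TPM files.
describe_row "S1,t,n,a,b,c,d,e,f,g,h,i,,,gene.tsv,tx.tsv"
```

For the placeholder row this prints `HLA=missing TPM=provided`, the situation in which the HLAtyping step still has to be run.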
Run the pipeline on the test dataset, which launches the HLAtyping step:
nextflow run main.nf -profile test,singularity --outDir ${outputDir} --singularityImagePath ${sif} -w ${work_dir}
This pipeline was written by the Institut Curie bioinformatics platform CUBIC (E.Girard, N.Servant). The project was funded by IMMUcan, the integrated European immuno-oncology profiling platform.
For any question, bug or suggestion, please use the issue system or contact the bioinformatics core facility.