viGEN is a bioinformatics pipeline for the exploration of viral RNA in human NGS data. This pipeline can be extended to detect and quantify other microbes in RNA. In this tutorial, we provide and end to end workflow on how this pipeline can be used on an example data file to detect and quantify viruses.
- Some of the links to google drive folders may be changed by google. Please use this link to access all code, and output files to reproduce this example : https://drive.google.com/drive/folders/0B3-883ME4sP3Q3V4RVloa2tSTkE?resourcekey=0-1u6ysvjYt62l2OVJgeX_Zg&usp=sharing
- Due to google drive space restrictions, I am unable to host the input files anymore. Please download the file from here: https://www.ebi.ac.uk/ena/browser/view/SRR1946637
- Please email me at kb472 a_t georgetown.edu if you have any questions
- For this tutorial, we used a sample RNA-seq file from liver cancer from public study SRA http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878 or GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65485 (also available at EBI SRA here: http://www.ebi.ac.uk/ena/data/view/SRR1946637).
- About the dataset: Data consists of Liver cancer patients in China. The authors look for HBV-human integration sites in the dataset of 50 Liver cancer patients in China and 5 adjacent normal tissues. We downloaded the raw reads for one sample SRR1946637 and unzipped them. This sample is from an adjacent normal liver tissue.
A complete summary of the input, and output files from each step in this workflow is available via google drive here .
-
In this tutorial, we use tool RSEM for alignment
-
Install RSEM as shown in https://github.com/bli25ucb/RSEM_tutorial
-
RSEM uses aligner Bowtie1, Bowtie2 or STAR. We chose to use Bowtie1 which is available as pre-compiled executable. For Bowtie1 installation, follow instructions in this manual: http://bowtie-bio.sourceforge.net/manual.shtml#obtaining-bowtie
-
Test RSEM successful installation
./rsem-calculate-expression --version
and./rsem-calculate-expression --help
-
Generate transcriptome reference for RSEM:
RSEM uses a transcriptome reference for alignment. RSEM has a command called
rsem-prepare-reference
where it uses the genome fasta reference file and the GTF file to create the transcriptome reference. We have generated and provided these files for human hg19 reference genome in the RSEM_reference/Rsem_ref_hg19 folder via google drive.If you are interested in using another reference, please follow steps provided here to generate transcriptom reference for your file: https://github.com/bli25ucb/RSEM_tutorial
-
Once RSEM is installed, and the transcriptome reference is ready, run alignment using RSEM on SRA fastq files using following command:
./rsem-calculate-expression --paired-end -p 8 --output-genome-bam --keep-intermediate-files --bowtie-path /Users/username/Documents/software/bowtie/bowtie-1.1.2 --append-names --estimate-rspd --time input/SRR1946637_1.fastq input/SRR1946637_2.fastq Rsem_ref_hg19/Rsem_ref_hg19 output/SRR1946637
Output files: In addition to the aligned BAM file (genome level and transcriptome level), this will generate the unaligned (unmapped) fastq files named SRR1946637_un_1.fq and SRR1946637_un_2.fq. They consist of the reads that did not align to the human reference.
This workflow is part of “Module 1 (filtered human input files)” in Figure 1 in the viGEN manuscript .
-
We were interested in exploring all viruses existing in humans. So we first obtained reference genomes of all known and sequenced human viruses obtained from NCBI (746 viruses as of Sep 2015), and merged them into one file (referred to as the "viral reference file") in fasta file format. Merge all virus fasta file into one big fasta file called
viruses.fa
. For this tutorial, we have provided this file through google drive here. -
We have also indexed the viral reference file, so that these files are ready for alignment tools Bowtie2 (folder name:
virus.bowtie2.refB
) or BWA (folder name:viral.bwa.ref
) through google drive here. -
NCBI also allows to download information/annotation about these viruses from their web site. This information has been provided as Complete_Sequence_info.csv.
-
This pipeline can be extended to detect and quantify other microbes in RNA as long as the microbial sequence is referenced in NCBI. The user has to download the correspdoing
FASTA
sequence files of the microbes of interest from NCBI and concatenate them together to create a single referenceFASTA
file.
In case user is interested in creating a reference index the reference file on their own, this is the command to use: ./bowtie2-build /Path/viruses.fa virus.bowtie2.refB
.
- In this tutorial, we use alignment tool Bowtie2.
- Install Bowtie2 pre-compiled binary file based on instructions in this manual: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#obtaining-bowtie-2
- Perform alignment to generate output in the form of a SAM file. This SAM file contains the reads from various viruses aligned to the viral reference.
- Convert SAM to BAM using Samtools. Example command:
samtools view -bS in.sam > out.bam
- Sort the BAM file coordinate wise. Example command:
samtools sort out.bam > out.sorted
- Use
Samtools idxstats
tool. Example command:samtools idxstats out.sorted.bam > idxstats.txt
. This produces a tab delimited file, with each line consisting of a virus sequence name, sequence length, # mapped reads and # unmapped reads. The number of mapped reads is referred to as genome level counts. Using the genome level counts, we estimated the number of reads that covered the genome, a form of viral load or viral copy number. Viral copy number was defined as:the Number of mapped reads * Read Length / Genome Length
Only those viral species with copy number more than a threshold are selected for the next module.
We have provided this shell script viral.pipeline_public_final.sh that encompasses all of these above steps.
- If you have more than one sample, then it might be useful to concatenate the number of mapped reads (genome level counts) from each sample into a matrix format. This file can then be used for any further analysis.
- We have provided a fully reproducible code in the R programming language in the form of an R markdown file. It contains the code and output for how to merge all the
Samtools idxstats
files. A readable version of this file is here merge.idx.stats.pdf. This code also adds virus annotation using information from Complete_Sequence_info.csv file. - Only those viral species with copy number more than a threshold are selected for the next module. For this tutorial, we decided to use a threshold copy number of 10, so this gives us 19 viral genomes used in the next step. This file is available [here] (https://docs.google.com/spreadsheets/d/16vSWxLeUdiTXBNzudObbEGFNbXZroWznDnEtjhNF784/edit?usp=sharing) and also shown in the table below .
NC id | Genome Length | Name of virus | SRR1946637 (viGEN read count) | Copy number |
---|---|---|---|---|
NC_018464.1 | 927 | Shamonda virus | 82509 | 8900.65 |
NC_003977.1 | 3215 | Hepatitis B virus | 96329 | 2996.24 |
NC_018476.1 | 6895 | Simbu virus | 49784 | 722.03 |
NC_009823.1 | 9711 | Hepatitis C virus genotype 2 | 31470 | 324.07 |
NC_018478.1 | 4417 | Simbu virus | 13745 | 311.18 |
NC_018711.1 | 7195 | Lunk virus NKS-1 | 14251 | 198.07 |
NC_004102.1 | 9646 | Hepatitis C virus | 17922 | 185.8 |
NC_001672.1 | 11141 | Tick-borne encephalitis virus | 16336 | 146.63 |
NC_015521.1 | 7310 | Cutthroat trout virus | 10122 | 138.47 |
NC_022518.1 | 9472 | Human endogenous retrovirus K113 | 9028 | 95.31 |
NC_018382.1 | 6796 | Bat hepevirus | 6232 | 91.7 |
NC_009827.1 | 9628 | Hepatitis C virus genotype 6 | 8438 | 87.64 |
NC_009225.1 | 3245 | Torque teno midi virus 1 | 1581 | 48.72 |
NC_014093.1 | 3253 | Torque teno midi virus 2 | 1111 | 34.15 |
NC_027202.1 | 1904 | Punta Toro virus | 580 | 30.46 |
NC_001526.2 | 7905 | Human papillomavirus type 16 | 1976 | 25 |
NC_002645.1 | 27317 | Human coronavirus 229E | 5080 | 18.6 |
NC_009824.1 | 9456 | Hepatitis C virus genotype 3 | 1476 | 15.61 |
NC_010708.1 | 3621 | Thottapalayam virus | 416 | 11.49 |
- In this step, we use our in-house pipeline to generate gene and CDS counts is for every input BAM file.
- The region level information is extracted from Gene Feature Format files (GFF) files which are available for most viral genomes from NCBI.
- Download GFF files for the top viruses.
- For this tutorial, we have provided the GFF files here un_bowtie_topGff. Not all viruses have GFF files. In our example 18 of the 19 viruses had GFF files for download.
- Input to our in-house pipeline for Gene & CDS quantification: the viral bam files (the BAM file output from previous Bowtie2 step)
- We have provided fully reproducible code in the R programming language in the form of an R markdown file count.regions.in.regions.Rmd in this github repository, that generates these Gene and CDS counts. A readable version of this file is also provided here: https://github.com/ICBI/viGEN/blob/master/count.reads.in.regions.pdf We plan to integrate this code into our Bioconductor package.
- Output :
-
This code produces counts for each region in the GFF file. One file is created for each virus (GFF file) for each sample. Three types of counts are calculated - “in region”, “on boundary” and “in gaps”. This output is available as a “.csv” file and “Rdata” files.
-
Collate all read counts across all samples and across all top viruses to create a matrix. This code is provided as a fully reproducible code in the R programming language in the form of an R markdown file collate.output.files.Rmd. A readable version of this file is also provided here: https://github.com/ICBI/viGEN/blob/master/collate.output.files.pdf. The code calculates total of “in region” and “on boundary” (referred to as “sum”) when collating. This code will also add virus annotation using information from “Complete_Sequence_info.csv” file.
-
Application: These gene and CDS count files can be used to compare case and control groups of interest using popular tools like EdgeR (http://bioconductor.org/packages/edgeR/) or DESeq2 in Bioconductor that can accept raw counts.
-
The viruses that have the highest region counts are shown in table here. You can see various gene/CDS regions of Hepatitis B virus showing up on top with highest counts. This is a verification of the HBV status of the sample.
- Install Varscan2 from here: https://sourceforge.net/projects/varscan/files/
- Run Varscan2 on command line using the following command:
samtools mpileup -B -f /Users/ls483/Documents/SRA.GEO/viral.reference/viruses.fa -d 9999 -Q 17 -q 17 SRR1946637_un.bowtie2.sorted.bam| java -Xmx2g -jar /Users/ls483/Documents/software/varscan2/VarScan.v2.3.9.jar mpileup2cns --output-vcf 1 --min-var-freq 0.01 | awk '/^#/ || $7=="PASS"' > /Users/ls483/Documents/SRA.GEO/output_varscan2/SRR1946637_un.vcf
- This produces a the variants found in viruses in a standard variant call file (VCF) file format.
The VIGEN pipeline is very customizable.
- This pipeline can be extended to detect and quantify other microbes in RNA-seq as long as the sequence is known and available as a FASTA file.
- Download the FASTA files of interest from NCBI , and concaterate them to create the reference FASTA file for this pipeline.
- Build the reference index file
- Create the annotation file (similar to the example CSV file provided in this tutorial)
- The pipeline has been used to detect bacteria from RNA-seq data and has worked well.
- Although not tested, this pipeline could be extended to detect and quantify microbes in DNA-seq.
- Can use BWA or Bowtie2 for alignment of DNA-seq data. And remember to output the unmapped sequences into a separate file.
- The quantification step can only be done at the genome level, NOT at the transcriptome level.
- The variant calling step should also work
Please cite our work
-
Krithika Bhuvaneshwar, Lei Song, Subha Madhavan, and Yuriy Gusev. 'viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors', Frontiers in Microbiology (2018). https://www.ncbi.nlm.nih.gov/pubmed/29922260
-
This review paper has a nice comparison of our pipeline with 7 other well known pipelines (VirusFinder, VirusSeq, DAMIAN, VirTect, virDetect, MetaMap, Kraken) and shows that our pipeline performace was one of the best to detect known viruses in NCBI https://pubmed.ncbi.nlm.nih.gov/35753694/
-
Application of this pipeline on two types of coronaviruses (Pre-print) : https://doi.org/10.1101/2021.05.28.446250
If you are using these samples for testing this pipeline, please remember to cite this dataset from NCBI or EBI
- NCBI SRA http://www.ncbi.nlm.nih.gov/bioproject/PRJNA279878
- EBI SRA http://www.ebi.ac.uk/ena/data/view/SRR1946637
-
To reduce run time in this workflow, users can alternatively try other alignment tools including Kallisto, Bowtie2 etc instred of RSEM. I have found Kallisto to be the fastest in this workflow, and the output to be very similar to RSEM used in this example. Users must ensure to use the appropriate reference index files . And remember to output the unmapped sequences into a separate file.
-
Rmd files are R Markdown documents that contain narrative text and code in a nicely formatted document, that allow users to completely reproduce the code. Along with the R Markdown files, its PDF version have also been provided for users. The R Markdown files and PDF files have the same name and differ only by the file extension (.Rmd for R Markdown and .pdf for PDF files).