In this demo we will explore how to get a taxonomic profile from a metagenomic experiment using PathoScope 2.0. A more in-depth tutorial can be found in the PathoScope repo here
PathoScope is a modular piece of software that will allow you to go all the way from a fastq file to a text file (typically tab-delimited) with columns representing genomes, their proportions, etc.
There are 6 PathoScope modules, however, for this demo we will focus on the three most important ones:
- PathoLib - Allows user to automatically generate custom reference genome libraries for specific scenarios or datasets
- PathoMap - Aligns reads to target reference genome library and removes sequences that align to the filter and host libraries
- PathoID - Reassigns ambiguous reads, identifies microbial strains present in the sample, and estimates proportions of reads from each genome
Once you run your samples through PathoScope, you can easily import the outputfiles into R for downstream exploratory data analysis and statistical inferences.
The only dependencies for PathoScope are Bowtie2 and python 2.7.3 or higher. Make sure that both are in your PATH by issuing something like echo $PATH
PathoScope is now hosted in GitHub so you can easily get it by issuing the following command from the Terminal
git clone https://github.com/PathoScope/PathoScope.git
And PS_demo
git clone https://github.com/ecastron/PS_demo.git
We are going to use data from a study exploring microbiome diversity in oropharingeal swabs from schizophrenia patients and healthy controls. The SRA accession number is SRR1519057
.
This file is probably too big for a demo so I randomly subsampled the reads down to a more manageable size (~40 M to 40 K reads)
- Go ahead and download the data here
- Now you need at least two files, one to be used as target library (where your reads are going to be mapped) and another one to be used as filter library (internal controls, host genome, contaminants, etc. that you want to remove)
As target library, you can use any multi fasta file containing full or draft genomes, or even nucleotide entries from NCBI, and combinations of both. The only condition is that the fasta entries start with the taxonomy ID from NCBI as follows:
Originally:
>gi|40555938|ref|NC_005309.1| Canarypox virus, complete genome
but PathoScope likes:
>ti|44088|gi|40555938|ref|NC_005309.1| Canarypox virus, complete genome
You could do this very easily in PathoLib:
python pathoscope.py LIB -genomeFile my_file.fasta -outPrefix target_library
Alternatively, we provide the entire NCBI nucleotide database already formatted [here] (ftp://pathoscope.bumc.bu.edu/data/nt_ti.fa.gz) (10 GB file). You could also use PathoLib to subsample this big file (50 GB uncompressed) and select only the taxa that you want. For instance, obtaining all the virus entries in nt_ti.fa (virus taxonomy ID = 10239)
python pathoscope.py -LIB python pathoscope.py LIB -genomeFile nt_ti.fa -taxonIds 10239 --subTax -outPrefix virus
Or in order to create a filter library, say all human sequences:
python pathoscope.py -LIB python pathoscope.py LIB -genomeFile nt_ti.fa -taxonIds 9606 --subTax -outPrefix human
However, I'm providing a target and filter library already formatted that you can download here and here. The target library is a collection of genomes from the reference library of the Human Microbiome Project (description here), and the filter library is simply the human genome (hg19). We are also going to use another filter library as well (phix174) to get rid of all the reads mapping to the Illumina internal control sequence that is sometimes added to sequencing experiments.
Once you have your data and target and filter libraries, we are ready to go ahead with the mapping step. For this we use bowtie2 so we will need to tell PathoMap where the bowtie2 indices are. If you don't have bowtie2 indices, not a problem, PathoMap will create them for you. And if your fasta files are larger than 4.6 GB (Bowtie2 limit), not a problem either, PathoMap will split your fasta files and create indices for each one of the resulting files.
If you have fasta files and not bowtie2 indices:
python pathoscope.py MAP -U ES_211.fastq -targetRefFiles HMP_ref_ti_0.fa,HMP_ref_ti_1.fa -filterRefFiles human.fa,phix174.fa -outDir . -outAlign ES_211.sam -expTag DAV_demo
But if you already have Bowtie2 indices (our case), you can issue the following command:
python pathoscope.py MAP -U ES_211.fastq -targetIndexPrefixes HMP_ref_ti_0,HMP_ref_ti_1 -filterIndexPrefixes genome,phix174 -outDir . -outAlign ES_211.sam -expTag DAV_demo
Let's give it a try...
So that should have taken ~3 minutes to run. Now you have a number of things that were printed to the screen as well as files that were created. The summary of the STDOUT is:
Reads Mapped | Library |
---|---|
1053 | HMP_ref_ti_0 |
1132 | HMP_ref_ti_0 |
916 | genome |
0 | phix174 |
And you should have one .sam file per library, plus another file containing the reads mapped to all target libraries (DAV_demo-appendAlign.sam), a fastq file of the reads mapping to all targets (DAV_demo-appendAlign.fq), and the file you most care about: ES_211.sam
The last step in our demo is to obtain a taxonomic profile from ES_211.sam using the read reassignment model implemented in PathoID
python pathoscope.py ID -alignFile ES_211.sam -fileType sam -outDir . -expTag DAV -thetaPrior 1000000
After running the command line above, you should get a tab-delimited file with PathoScope's output, and an updated .sam file representing an alignment after PathoScope's reassignment model was applied.
If you want to see all the output files you should get, check out the output_files directory in the PS_demo repo.
At the top of the file in the first row, there are two fields called "Total Number of Aligned Reads" and "Total Number of Mapped Genomes". They represent the total number of reads that are aligned and the total number of genomes to which those reads align to in the given alignment file.
Columns in the TSV file:
- Genome:
This is the name of the genome found in the alignment file. - Final Guess:
This represent the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) after pathoscope reassignment is performed. - Final Best Hit:
This represent the percentage of reads that are mapped to the genome in Column 1 after assigning each read uniquely to the genome with the highest score and after pathoscope reassignment is performed. - Final Best Hit Read Numbers:
This represent the number of best hit reads that are mapped to the genome in Column 1 (may include a fraction when a read is aligned to multiple top hit genomes with the same highest score) and after pathoscope reassignment is performed. - Final high confidence hits:
This represent the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 50%-100% to this genome and after pathoscope reassignment is performed. - Final low confidence hits:
This represent the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 1%-50% to this genome and after pathoscope reassignment is performed. - Initial Guess:
This represent the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) before pathoscope reassignment is performed. - Initial Best Hit:
This represent the percentage of reads that are mapped to the genome in Column 1 after assigning each read uniquely to the genome with the highest score and before pathoscope reassignment is performed. - Initial Best Hit Read Numbers:
This represent the number of best hit reads that are mapped to the genome in Column 1 (may include a fraction when a read is aligned to multiple top hit genomes with the same highest score) and before pathoscope reassignment is performed. - Initial high confidence hits:
This represent the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 50%-100% to this genome and before pathoscope reassignment is performed. - Initial low confidence hits:
This represent the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 1%-50% to this genome and before pathoscope reassignment is performed.
Let's see an example analysis on data generated in PathoScope here