This repository https://github.com/aryeelab/dna-methylation-tools contains a suite of tools to conduct DNA methylation data analysis. It is maintained by Divy Kangeyan at the Aryee Lab
This platform contains publicly accessible cloud-based preprocessing and quality control pipelines that go from raw data to CpG-level methylation estimates. The technologies covered include Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Hybrid Selection (capture) Bisulfite Sequencing (HSBS). Leveraging the Terra platform allows users to:
- Ensure cross-platform reproducibility of analyses
- Achieve scalability to large whole genome datasets with 100GB+ of raw data per sample, and to single-cell datasets with thousands of cells
- Provide access to best-practice analysis pipelines
- Enable integration and comparison between user-provided data and publicly available data (e.g. TCGA)
Analysis should be run in two successive processes:
- Per-sample preprocessing
- Aggregation and quality control analysis
If you plan to use the WDL workflows in your local computing environment, then this repository should be cloned. If you are using Terra, then you should either clone the Terra workspace aryee-lab/dna-methylation or if you already have your own workspace you can import the method configuration of your interest from aryee-lab/dna-methylation workspace.
FASTQ files and target coverage files can be uploaded to Terra using gsutil (https://pypi.org/project/gsutil/)
Before running the processes, you need to generate participants file and participant_set file. Each line in participant file specify a single sample.
entity:participant_id column specifies the sample name
bs_fastq1 & bs_fastq2 specify the paired fastq files.
If one sample has multiple FASTQ files for different lanes or runs they can be added in the same column with comma separation.
Each line in participant_set file specify a set of samples that should be aggregated and analyzed together.
membership:participant_set_id specify the name of the participant set.
participant_id column specify the names of the participants in the set.
Both of these files are tab separated text files. Examples of these files are shown in Terra_imports subdirectory.
In order to perform alignment and methylation calling choose bismark_rrbs, bismark_wgbs or bismark_hsbs method configuration with appropriate reference genome suffix. As the name indicates bismark_rrbs is for samples that are generated from Reduced Representation Bisulfite Sequencing (RRBS) with Mspl digestion and bismark_wgbs is for data generated from Whole Genome Bisulfite Sequencing (WGBS). bismark_hsbs is for data generated from Hybrid Selection Bisulfite Sequencing (HSBS). These worflows can also combine fastq files from multiple lanes if the samples are sequenced in such a way.
- Upload the fastq files to the Google cloud bucket
- Upload additional files such as target coverage bed file for HSBS sequencing
- In the Terra workspace choose bismark_rrbs, bismark_wgbs or bismark_hsbs method configuration with appropriate reference genome suffix
- Change other parameters according to preference
- Press Launch Analysis in upper right hand corner
- Choose the participants from the list of files
- Click Launch
If you are interested in conducting alignment and methylation calling for an entire participant set. Choose the participant set and in the box named Define expression type this.participants
You can observe the status of the job by going to Monitor tab
r1_fastq & r2_fastq
- Paired end FASTQ files
samplename
- base name for a sample
genome_index
- Reference genome to conduct bisulfite specicfic alignment
n_bp_trim_read1 & n_bp_trim_read1
- Number of bases to trim in read 1 and read 2. This is only specified for wgbs and hsbs workflows. rrbs workflow uses the --rrbs option from trimGalore.
chrom_sizes
- chrom_sizes file in order to generate the BIGWIG file
After the alignment and methylation calling each sample will have their methylation information and metadata stored in HDF5 format
In order to aggregate all of them and obtain the quality control report
- Choose aggregate_bismark_output method configuration with appropriate reference genome suffix
- Change other parameters according to preference
- Save it and press Launch Analysis
- Since root entity type in aggregation step is participant set, you will choose participant_set with participants of your interest
- Finally click Launch
To check the results from any of the work flows, go to Monitor tab, click View in the Status columns and then click the Workflow ID in the bottom of the page.
in_pe_reports_files
- Report files from the bismark alignment
in_covgz_files
- coverage output file
in_mbias_files
- M-bias file
BSGenome_targz
- BSGenome targz file
BSGenome_package
- Name of the BSGenome package
Genome_build
- Name of the genome build, e.g. mm10, hg19 etc.
cpu
- Number of CPUs
disks
- Name of the disk (Depends on the disk size)
memory
- Memory requirement
multicore
- Number of cores to use, especially during alignment
preemptible
- Preemptible option is to use preemptible virtual machine, if it is set to 0 the workflow runs uninterrupted. If it is set to any integer other than zero it may be interpreted that many times. However preemtible option reduces the computing cost significantly.
The per-sample preprocessing WDL workflow input parameters consist primarily of input file names and the Bismark genome index to use. A subset of the outputs from this step (Mbias files and methylated and unmethylated read coverage tables) are passed on as inputs to the subsequent aggregation and QC WDL workflow which in turn produces a QC report, aggregated methylation tables and a Bioconductor (bsseq) object. All tasks specify runtime parameters including disk size, CPU and memory requirements, and the docker image version to be executed.
Quality control analysis is conducted via the bioconductor package scmeth. QC analysis can be done independently with this package given that you have the appropriate objects.
In order to push/pull images from the Google container registry you must first authenticate with Google:
gcloud auth login
You then need to run a one-time setup to configure docker to use Google authentication:
gcloud auth configure-docker
Clone this repository
git clone aryeelab/dna-methylation-tools
cd dna-methylation-tools
If you want to run the examples below, download this small genome index and chrom.sizes file for mm10_chr 19:
cd testdata
wget https://storage.googleapis.com/aryeelab/bismark-index/mm10_chr19/bismark_mm10_chr19.tar.gz
wget https://storage.googleapis.com/aryeelab/chrom.sizes/mm10_chr19.chrom.sizes
Note: On Mac, cromwell can be installed with the following command
brew install cromwell
Users can also use the following command in order to install cromwell from the docker image in this repo.
docker build Docker/cromwell
Edit the file paths for FASTQ files, genome index, chrom.sizes file and monitoring script in the json file according to your directory paths to run the test sample.
{
"call_bismark_pool.r1_fastq": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/testdata/small_01_R1.fastq.gz",
"call_bismark_pool.r2_fastq": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/testdata/small_01_R2.fastq.gz",
"call_bismark_pool.chrom_sizes": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/mm10.chrom.sizes",
"call_bismark_pool.monitoring_script": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/monitor.sh",
"call_bismark_pool.genome_index": "[INSERT ABSOLUTE PATH]/dna-methylation-tools/bismark_mm10_chr19.tar.gz"
}
cromwell run -i ../pipelines/bismark_wgbs/bismark_wgbs.json ../pipelines/bismark_wgbs/bismark_wgbs.wdl
This step is only necessary if you want to combine multiple samples into a
single Bioconductor bsseq
object.
cromwell run ../pipelines/aggregate_bismark_output/aggregate_bismark_output.wdl -i ../pipelines/aggregate_bismark_output/aggregate_bismark.json
We estimated the time and cost for Whole Genome Bisulfite Sequencing (WGBS) sample from TCGA that contains approximately 290 million reads.
Also following plot shows the time & cost estimates for single cell samples in non-preemptible machine
For 10 sample set: samples were sequenced with a median of 741,214 reads (range of 414,443 to 1,825,672)
For 25 sample set: samples were sequenced with a median of 1,044,720 reads (range of 10,507 to 3,644,360)
Plots below are for same sample set but in preemptible machine
We also conducted similar analysis for the aggregation step
We have preprocessed and made available 47 WGBS samples available from TCGA. The raw (FASTQ) data, processed data and workflows are made available in a Terra workspace (See https://portal.firecloud.org/#workspaces/aryee-merkin/TCGA_WGBS_hg19). Processed data can also be found in bsseq format in tcgaWGBSData.hg19 package (https://bioconductor.org/packages/release/data/experiment/html/tcgaWGBSData.hg19.html).
Please use the Github Issues tracker with any issue you face with the platform. Any specific questions or comments, contact me at divyswar01@g.harvard.edu