End to end implementation of the ENCODE whole genome bisulfite sequencing workflow described here. Builds off DNANexus WGBS framework and work from the Myers lab at the HudsonAlpha Institute for Biotechnology. Developed in the Zaitlen Lab, Univeristy California at San Francisco.
Takes raw whole-genome bisulfite sequencing data (fastq files) and returns the number of methylated CpGs found at each site using the Bismark software package (documentation for this can be found here). Performs preliminary methylation analysis based on MethylKit.
Needed to successfuly run bash script implementing workflow
-
Bowtie 2 read aligner:
- Bowtie 2 - used to align raw fastq reads to a genome of choice
-
Bismark:
- Bismark - a software tool developed at the Babraham Institute used to map bisulfite converted reads and determine cytosine methylation states
-
Bisulfite converted genome:
- Raw fasta files Needed by Bowtie to map reads. A few steps must be taken to prepare the genome file for use in Bowtie/Bismark.
- First, the genome must be built/indexed for Bowtie. This must be done just once before running this pipeline for the first time. This built version can be used for all subsequent runs. Documentation from the next-gen analysis software Homer provides an excellent tutorial for how to prepare the genome for bowtie.
- Illumina iGenome provides several pre indexed/built genomes for popular model organisms.
- Next, the genome must be bisulfite converted for use in Bismark. The Bismark documentation provides guidance on this under the bismark_genome_preparation section. This step will probably take several hours.
- First, the genome must be built/indexed for Bowtie. This must be done just once before running this pipeline for the first time. This built version can be used for all subsequent runs. Documentation from the next-gen analysis software Homer provides an excellent tutorial for how to prepare the genome for bowtie.
- Raw fasta files Needed by Bowtie to map reads. A few steps must be taken to prepare the genome file for use in Bowtie/Bismark.
-
Trim Galore:
- Trim Galore adapter trimmer - applies quality filters and adapter trimming to raw fastq files. Also developed at the Babraham Institute. Depends on a bunch of software itself
-
Samtools:
- Samtools - needed for manipulation of bam/sam files generated by Bismark.
This pipeline is a Sun Grid Engine implementation,
meaning that it uses a qsub_submit.sh
script to schedule jobs and manage memory. This script assumes you have paired ended sequencing. Slight modifications can be made to make it usable for single ended sequencing.
qsub_submit.sh
is the only script that needs to be modified to run the pipeline.
The following variables need to be updated in order to run the pipeline:
- -t: the number of samples to be processed
- -l: the time to run the script for (default is maximum time allowed)
- netapp=2G,scratch=2G: locations of scratch directories, depends on your system setup
- GENOME_PATH: the location of your downloaded genome
- BISMARK_PATH: location of the bismark installation
- SAMTOOLS_PATH: location of samtools installation
- TRIMGALORE_PATH: location of trimgalore installation
- BOWTIE_PATH: location of bowtie installation
- input: location of fastq files to be processed
- output: location to write output files to
- fastq_prefix: how to reference and name your fastq files. In the example script,
an SGE TASK ID is used as the prefix to enable easy parallelization. Script finds all fastq files in input location that satisfy
input/prefix.fastq.gz
- scripts: location of bismark shell scripts contained in this repository
################ qsub parameters ###########################
#$ -S /bin/bash
#$ -cwd
#$ -r y
#$ -j y
#$ -l mem_free=15G
#$ -l arch=linux-x64
#$ -l netapp=2G,scratch=2G
#$ -l h_rt=336:00:00
#$ -t 9:10
################## environment ###############################
GENOME_PATH="/ye/zaitlenlabstore/christacaggiano/hg38/"
BISMARK_PATH="/ye/zaitlenlabstore/christacaggiano/Bismark"
SAMTOOLS_PATH="/ye/netapp/jimmie.ye/tools/samtools-1.3"
TRIMGALORE_PATH="/ye/zaitlenlabstore/christacaggiano/trim-galore/TrimGalore-0.4.3"
BOWTIE_PATH="/ye/zaitlenlabstore/christacaggiano/bowtie2-2.3.3"
################ experiment directories ######################
input="/ye/zaitlenlabstore/christacaggiano/cfDNA_project/second_primary_methylation_data_ALS/fastq_files/zaitlenn-"$SGE_TASK_ID"_BC"
output="/zaitlen/netapp/group/christa"
fastq_prefix=$SGE_TASK_ID
scripts="scripts"
################# runs the pipeline ##########################
python wgbs.py $input $output $fastq_prefix $scripts
qsub qsub_submit.sh
11/
# fastq file split into ~7 million lines for easier processing
├── 11.R100
├── 11.R101
├── 11.R102
...
├── 11.R222
├── 11.R223
├── 11.R224
├── 11.R225
├── bam_files
# trimmed bam files with report on trimming
│ ├── 11.R100_val_1_bismark_bt2_pe.bam
│ ├── 11.R100_val_1_bismark_bt2_PE_report.txt
│ ...
│ ├── 11.R125_val_1_bismark_bt2_pe.bam
│ ├── 11.R125_val_1_bismark_bt2_PE_report.txt
# path to put any logs that occur during sequencing
├── log_path
# temporary files. useful for troubleshooting
├── temp_dir
│ ├── 11.R100_trimming_report.txt
│ ├── 11.R100_val_1.fq
│ ├── 11.R102_trimming_report.txt
... ...
│ └── 11.R225_val_2.fq
# bismark methylation files
└── unsortedButMerged_ForBismark_file
├── 11_unsorted_merged.bam # all trimmed bam files merged into one
├── 11_unsorted_merged.deduplicated.bam # deduplicated bam file
├── 11_unsorted_merged.deduplication_report.txt # report on deduplication
# methylation extraction performed by bismark
└── methylation_extraction
├── 11_unsorted_merged.deduplicated.bedGraph.gz # percent methylated for each CpG
├── 11_unsorted_merged.deduplicated.bismark.cov.gz # count methylated and unmethylated for each CpG as well as % methylated
├── 11_unsorted_merged.deduplicated.M-bias.txt # graph on methylation bias
├── 11_unsorted_merged.deduplicated_splitting_report.txt # report on methylation calls
├── CHG_context_11_unsorted_merged.deduplicated.txt # every CHG methylation call
├── CHH_context_11_unsorted_merged.deduplicated.txt # every CHH methylation call
└── CpG_context_11_unsorted_merged.deduplicated.txt # every CpG methylation call