Skip to content

a SGE, python, implementation of the ENCODE consortium whole genome bisulfite sequencing pipeline.

Notifications You must be signed in to change notification settings

christacaggiano/ENCODE_WGBS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ENCODE WGBS WORKFLOW

End to end implementation of the ENCODE whole genome bisulfite sequencing workflow described here. Builds off DNANexus WGBS framework and work from the Myers lab at the HudsonAlpha Institute for Biotechnology. Developed in the Zaitlen Lab, Univeristy California at San Francisco.

Overview


Takes raw whole-genome bisulfite sequencing data (fastq files) and returns the number of methylated CpGs found at each site using the Bismark software package (documentation for this can be found here). Performs preliminary methylation analysis based on MethylKit.

Software Requirements


Needed to successfuly run bash script implementing workflow

  1. Bowtie 2 read aligner:

    • Bowtie 2 - used to align raw fastq reads to a genome of choice
  2. Bismark:

    • Bismark - a software tool developed at the Babraham Institute used to map bisulfite converted reads and determine cytosine methylation states
  3. Bisulfite converted genome:

    • Raw fasta files Needed by Bowtie to map reads. A few steps must be taken to prepare the genome file for use in Bowtie/Bismark.
      1. First, the genome must be built/indexed for Bowtie. This must be done just once before running this pipeline for the first time. This built version can be used for all subsequent runs. Documentation from the next-gen analysis software Homer provides an excellent tutorial for how to prepare the genome for bowtie.
        • Illumina iGenome provides several pre indexed/built genomes for popular model organisms.
      2. Next, the genome must be bisulfite converted for use in Bismark. The Bismark documentation provides guidance on this under the bismark_genome_preparation section. This step will probably take several hours.
  4. Trim Galore:

    • Trim Galore adapter trimmer - applies quality filters and adapter trimming to raw fastq files. Also developed at the Babraham Institute. Depends on a bunch of software itself
      1. Cutadapt - removes unwanted sequences from fastq files
      2. Fastqc - provides quality information on raw fastq files. Useful for generating visualizations.
  5. Samtools:

    • Samtools - needed for manipulation of bam/sam files generated by Bismark.

To Run

This pipeline is a Sun Grid Engine implementation, meaning that it uses a qsub_submit.sh script to schedule jobs and manage memory. This script assumes you have paired ended sequencing. Slight modifications can be made to make it usable for single ended sequencing.

qsub_submit.sh is the only script that needs to be modified to run the pipeline. The following variables need to be updated in order to run the pipeline:

QSUB parameters

  • -t: the number of samples to be processed
  • -l: the time to run the script for (default is maximum time allowed)
  • netapp=2G,scratch=2G: locations of scratch directories, depends on your system setup

Software Environment

  • GENOME_PATH: the location of your downloaded genome
  • BISMARK_PATH: location of the bismark installation
  • SAMTOOLS_PATH: location of samtools installation
  • TRIMGALORE_PATH: location of trimgalore installation
  • BOWTIE_PATH: location of bowtie installation

Location of input/output directories

  • input: location of fastq files to be processed
  • output: location to write output files to
  • fastq_prefix: how to reference and name your fastq files. In the example script, an SGE TASK ID is used as the prefix to enable easy parallelization. Script finds all fastq files in input location that satisfy input/prefix.fastq.gz
  • scripts: location of bismark shell scripts contained in this repository

Sample qsub submit script

################ qsub parameters ###########################
#$ -S /bin/bash                     
#$ -cwd                             
#$ -r y                            
#$ -j y                           
#$ -l mem_free=15G                 
#$ -l arch=linux-x64               
#$ -l netapp=2G,scratch=2G         
#$ -l h_rt=336:00:00
#$ -t 9:10  

################## environment ###############################
GENOME_PATH="/ye/zaitlenlabstore/christacaggiano/hg38/"
BISMARK_PATH="/ye/zaitlenlabstore/christacaggiano/Bismark"
SAMTOOLS_PATH="/ye/netapp/jimmie.ye/tools/samtools-1.3"
TRIMGALORE_PATH="/ye/zaitlenlabstore/christacaggiano/trim-galore/TrimGalore-0.4.3"
BOWTIE_PATH="/ye/zaitlenlabstore/christacaggiano/bowtie2-2.3.3"

################ experiment directories ######################
input="/ye/zaitlenlabstore/christacaggiano/cfDNA_project/second_primary_methylation_data_ALS/fastq_files/zaitlenn-"$SGE_TASK_ID"_BC"
output="/zaitlen/netapp/group/christa"
fastq_prefix=$SGE_TASK_ID
scripts="scripts"

################# runs the pipeline ##########################
python wgbs.py $input $output $fastq_prefix $scripts

Run Command

qsub qsub_submit.sh

Output

Sample output directory

11/
# fastq file split into ~7 million lines for easier processing
├── 11.R100
├── 11.R101
├── 11.R102
...
├── 11.R222
├── 11.R223
├── 11.R224
├── 11.R225
├── bam_files
# trimmed bam files with report on trimming
│   ├── 11.R100_val_1_bismark_bt2_pe.bam
│   ├── 11.R100_val_1_bismark_bt2_PE_report.txt
│   ...
│   ├── 11.R125_val_1_bismark_bt2_pe.bam
│   ├── 11.R125_val_1_bismark_bt2_PE_report.txt
# path to put any logs that occur during sequencing
├── log_path
# temporary files. useful for troubleshooting
├── temp_dir
│   ├── 11.R100_trimming_report.txt
│   ├── 11.R100_val_1.fq
│   ├── 11.R102_trimming_report.txt
... ...
│   └── 11.R225_val_2.fq
# bismark methylation files
└── unsortedButMerged_ForBismark_file
    ├── 11_unsorted_merged.bam # all trimmed bam files merged into one
    ├── 11_unsorted_merged.deduplicated.bam  # deduplicated bam file
    ├── 11_unsorted_merged.deduplication_report.txt  # report on deduplication
# methylation extraction performed by bismark
    └── methylation_extraction
        ├── 11_unsorted_merged.deduplicated.bedGraph.gz # percent methylated for each CpG
        ├── 11_unsorted_merged.deduplicated.bismark.cov.gz # count methylated and unmethylated for each CpG as well as % methylated
        ├── 11_unsorted_merged.deduplicated.M-bias.txt  # graph on methylation bias
        ├── 11_unsorted_merged.deduplicated_splitting_report.txt  # report on methylation calls
        ├── CHG_context_11_unsorted_merged.deduplicated.txt  # every CHG methylation call
        ├── CHH_context_11_unsorted_merged.deduplicated.txt # every CHH methylation call
        └── CpG_context_11_unsorted_merged.deduplicated.txt # every CpG methylation call 

About

a SGE, python, implementation of the ENCODE consortium whole genome bisulfite sequencing pipeline.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published