This User Guide outlines how reStrainingOrder works and gives details for each step.
reStrainingOrder should work with most types of Illumina sequencing reads. More specifically, we have tested it with ChIP- and Input-seq, RNA-seq as well as different tpyes of Bisulfite-seq (WGBS, PBAT). Aligners that were shown to work well with the N-masked genome approach include Bowtie2
, HISAT2
, STAR
and Bismark
.
Just download the latest version under releases, and extract the tar archive into a folder. Done.
tar xzf reStrainingOrder_v0.X.Y.tar.gz
reStrainingOrder requires a working version of Perl and Samtools to be installed on your machine. It is assumed that the samtools executable is in your PATH
unless the path is specified manually with:
--samtools_path </../../samtools>
While the genome preparation is not very resource hungry, the alignment and scoring part are a bit more demanding. Assuming single core operation, alignments to the multi-strain genome typically take up to 5GB of RAM for Bowtie2 or HISAT2, Bismark alignments may need between 12 and 18GB (directional/PBAT or non-directional alignments). The scoring part reStrainingOrder
currently takes ~20GB of RAM. For future versions we may look into reducing this memory footprint somewhat.
We would like to hear your comments or suggestions! Please e-mail me here!
This is a one-off process.
reStraining
is designed to read in a variant call file from the Mouse Genomes Project (download e.g. from The Mouse Genomes Project. It now assumes the GRCm39 mouse genome build by default, and uses the latest SNP annotation file (v8: mgp_REL2021_snps.vcf.gz); the previous version v5 (mgp.v5.merged.snps_all.dbSNP142.vcf.gz
) is no longer supported. reStraining
generates a new genome version where all positions found as a SNP in any of the strains (currently >50 different ones) are masked by the ambiguity nucleobase N
(N-masking). The entire process of filtering through ~80 million SNP positions and preparing the N-masked genome typically takes a few hours and requires some 6GB of memory.
If you don't have the mouse genome files already, you can download them from Ensembl, e.g. with a command like this:
wget ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/*dna.chromosome.*
Here is a sample command for the genome preparation step:
reStraining --vcf mgp_REL2021_snps.vcf.gz --reference Genomes/Mouse/GRCm39/
This command:
- creates a folder for the new N-masked genome
- produces a high confidence SNP matrix for chromosome 1
- creates a folder to store the SNP information per chromosome
- generates a SNP filtering and genome preparation report
N-masked genome folder
This folder (called MGP_strains_N-masked
) and its FastA contents are vital for subsequent steps. For sample commands to index the new N-masked sequence files please see below.
Chromosome 1 matrix file
The genome preparation command writes out a matrix file for chromosome 1 only (called MGPv8_SNP_matrix_chr1.txt.gz
), which is in the following format:
Chromosome Position REF ALT 129P2_OlaHsd 129S1_SvImJ 129S5SvEvBrd A_J AKR_J B10.RIII BALB_cByJ BALB_cJ BTBR_T+_Itpr3tf_J BUB_BnJ C3H_HeH C3H_HeJ C57BL_10J C57BL_10SnJ C57BL_6NJ C57BR_cdJ C57L_J C58_J CAST_EiJ CBA_J CE_J CZECHII_EiJ DBA_1J DBA_2J FVB_NJ I_LnJ JF1_MsJ KK_HiJ LEWES_EiJ LG_J LP_J MAMy_J MOLF_EiJ NOD_ShiLtJ NON_LtJ NZB_B1NJ NZO_HlLtJ NZW_LacJ PL_J PWK_PhJ QSi3 QSi5 RF_J RIIIS_J SEA_GnJ SJL_J SM_J SPRET_EiJ ST_bJ SWR_J WSB_EiJ ZALENDE_EiJ
1 3050050 C G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 3050069 C T 1 1 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 0 0 0 1 0 0 1 1 0 1
1 3050115 G A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 3050118 G A 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
A score of 0 for a strain indicates that a given strain has the REF
base at this position, a call of 1 means that it contains the ALT
base with high confidence. This matrix file is used as input for the SNP scoring process (reStrainingOrder, see below ).
The matrix is written out for a single chromosome only to use less memory in the scoring process. In theory one could use any other chromosome as well (or even the whole genome, but with 70M positions this would be challenging...!). This is the SNP filtering summary:
SNP position summary for all MGP strains (based on mouse genome build GRCm39)
===========================================================================
Positions read in total: 6,449,162
Positions skipped because the REF/ALT bases were not well defined: 84,827
Positions discarded as no strain had a high confidence call: 698,557
Positions printed to THE MATRIX in total: 5,665,777
Please note: that only positions that have a single REF/ALT
genotype were considered (i.e. positions with several ALT positions for different strains (e.g. REF: A
, ALT: C,T
) were skipped for simplicity. Also, positions where the reference sequence did not have a DNA base or positions with no high confidence SNP call in any of the strains were skipped entirely.
In total, the chr1 matrix file contains ~5.6 million positions that were of high quality in one or more strains.
SNP folder
The folder SNPs_directory
stores files containing SNPs per chromosome. The files are in this format (similar to the chr1 matrix file but without header, shown here for chr11):
>11
11 3100106 G C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
11 3100127 A C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
11 3100380 C A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
If you have no further use for these files they may be deleted afterwards to save disk space (they were used for the N-masking process and have already served their purpose).
SNP filtering and genome preparation reports
These files are intended for record keeping purposes.
This is again a one-off process.
Here are sample commands for some popular aligners using 4 cores each (just copy-paste). Depending on your resources this process may take up to a few hours.
Bowtie2:
bowtie2-build --threads 4 chr10.N-masked.fa,chr11.N-masked.fa,chr12.N-masked.fa,chr13.N-masked.fa,chr14.N-masked.fa,chr15.N-masked.fa,chr16.N-masked.fa,chr17.N-masked.fa,chr18.N-masked.fa,chr19.N-masked.fa,chr1.N-masked.fa,chr2.N-masked.fa,chr3.N-masked.fa,chr4.N-masked.fa,chr5.N-masked.fa,chr6.N-masked.fa,chr7.N-masked.fa,chr8.N-masked.fa,chr9.N-masked.fa,chrMT.N-masked.fa,chrX.N-masked.fa,chrY.N-masked.fa MGP.N-masked
HISAT2:
hisat2-build --threads 4 chr10.N-masked.fa,chr11.N-masked.fa,chr12.N-masked.fa,chr13.N-masked.fa,chr14.N-masked.fa,chr15.N-masked.fa,chr16.N-masked.fa,chr17.N-masked.fa,chr18.N-masked.fa,chr19.N-masked.fa,chr1.N-masked.fa,chr2.N-masked.fa,chr3.N-masked.fa,chr4.N-masked.fa,chr5.N-masked.fa,chr6.N-masked.fa,chr7.N-masked.fa,chr8.N-masked.fa,chr9.N-masked.fa,chrMT.N-masked.fa,chrX.N-masked.fa,chrY.N-masked.fa MGP.N-masked
Bismark:
bismark_genome_preparation --verbose --parallel 2 .
Here I will only briefly mention a few aspects that affect the alignment step (in no particular order):
-
Reads should be adapter- and quality-trimmed before aligning them to the
MGP.N-masked
genome. As we are scoring single bp matches/mismatches later on, sequences should be of good quality. To avoid a "garbage-in-garbage-out" scenario, a single run through Trim Galore or similar tool should suffice. -
Alignments to N-masked genomes take (considerably) longer than to their un-masked counterparts. In the interest of time, it might thus be advisable to use only a subset of the original input FastQ file(s). We have run tests with files down-sampled to 10, 5 or 2 million reads, which all arrived at the same answer. Keep in mind though that reStrainingOrder only assays chromosome 1 which is ~7% of the entire mouse genome. Example: if you started from say merely 1 million reads, only a few (ten-)thousand of them might eventually end up aligning to chromosome 1, and out of those you would then only look at the ones covering N-masked positions... I think our suggestion would be: down-sampling: yes, but not not too far down (maybe 5-20M as a rough guideline?).
-
The aligner you wish to employ needs to be capable of aligning to genomes containing
N
nucleotides. This has already been documented for SNPsplit here, but I will quickly list the most important points again here.
The output format of Bowtie2 is compatible with reStrainingOrder in its default (= end-to-end) mode. Alignments in --local
mode are not supported.
The output format of Bismark files with either Bowtie2 or HISAT2 are generally compatible with reStrainingOrder in their default mode (end-to-end mapping). Alignments in --local
mode are not supported.
DNA or RNA alignment files produced by HISAT2 also work well if you make sure that HISAT2 doesn’t perform soft-clipping. At the time of writing HISAT2 does perform soft-clipping (CIGAR operation: S
) by default, so you need to specify the option --no-softclip
.
Alignment files produced by the Spliced Transcripts Alignment to a Reference (STAR) aligner also should also work well, however a few steps need to be adhered to make this work:
1) Since reStrainingOrder only recognises the CIGAR operations M, I, D and N, alignments need to be run in end-to-end mode and not using local alignments (which may result in soft-clipping). This can be accomplished using the option: --alignEndsType EndToEnd
2) reStrainingOrder requires the MD:Z:
field of the BAM alignment to work out mismatches involving masked N positions. Since STAR doesn’t report the MD:Z:
field by default it needs to be instructed to do so, e.g.: --outSAMattributes NH HI NM MD
3) To save some time and avoid having to sort the reads by name, STAR can be told to leave R1 and R2 following each other in the BAM file using the option: --outSAMtype BAM Unsorted
The other very popular Burrows-Wheeler Aligner (BWA) is unfortunately not compatible with reStrainingOrder processing as was discussed in more detail here. Briefly, the reason for this is that BWA randomly replaces the ambiguity nucleobase N
in the reference by either C
, A
, T
or G
, thereby rendering an N-masked allele-sorting process impossible.
This mode assumes input data has been processed with the bisulfite mapping tool Bismark. reStrainingOrder will run a quick check at the start of a run to see if the file provided appears to be a Bismark file, and set the flags --bisulfite
automatically. Bisulfite (--bisulfite
) mode can also be set manually.
In contrast to the standard mode of using all known SNP positions, SNPs involving C to T transitions may not be used for allele-specific scoring since they might reflect either a SNP or a methylation state. This includes all of the following Reference/ALT combinations:
C/T or T/C for forward strand alignments, and G/A or A/G for reverse strand alignments.
The number of SNP positions that have been skipped because of this bisulfite ambiguity is reported in the report file. These positions may however be used to assign opposing strand alignments since they do not involve C to T transitions directly. For that reason, the bisulfite call processing also extracts the bisulfite strand information from the alignments in addition to the base call at the position involved. For any SNPs involving C positions that are not C to T SNPs both methylation states, i.e. C and T, are allowed to match the C position.
This step carries out the following tasks:
-
read and store matrix of high confidence SNP positions on chromosome 1 (
MGPv8_SNP_matrix_chr1.txt.gz
) -
read BAM file, identify reads overlapping genomic N-masked positions, and store frequency of detected bases at N-masked positions (
REF
/ALT
/OTHER
). This step discriminates between standard genomic or bisulfite converted reads (C/T positions cannot be used under certain conditions, see above)
Once the BAM file has finished processing, reStrainingOrder
calculates the following statistics:
- Pure strain compatibility scores
- All pairwise hybrid combination compatibility scores
- Allele-ratios for each hybrid combination
A sample command could look like this:
reStrainingOrder --snp MGPv8_SNP_matrix_chr1.txt.gz Spretus_10M_bowtie2.bam
The reStrainingOrder
command above generates a number of output files:
Spretus_10M_bowtie2.reStrainingOrder_report.txt
This file contains some general run statistics, e.g. the number reads processed, number of N-masked SNPs covered etc.
Spretus_10M_bowtie2.reStrainingOrder.strain_scores.txt
This tab-delimited text file contains the compatibility scores for potential single (pure) strains, and is in the format:
Strain // Positions covered // Agreeing GT // Disagreeing GT // agreement
The numbers for agreeing/disagreeing genotype (GT) are the sum of all counts from reads overlapping N-masked positions across the entire run. To score as agreeing
, a detected read base needs to match the expected base from the chr1 high-confidence matrix exactly. Any other base is scored as disagreeing
. The agreement is then calculated as a compatibility percentage score.
From what we have seen so far, if the pure strain compatibilty score is greater than ~ 98%, you can be reasonably sure that the strain you are looking at really is a pure strain. If this is the case, you should not look at potential hybrid combinations (as those numbers have to be equal or higher than pure the pure strain compatibility scores (see below)). As a side note, pure strain files tend to result in ~99.6% pure strain compatibility scores, the missing 0.4% are probably a results of mis-mapping events and/or incorrect SNP annotations.
Here is an example of a pure C57BL/6 strain (Black6) strain. If the single-strain compatibility score is > 98%, one doesn't really need to look any further. The allele-ratio (see below) also indicated that the sample is purely Black6.
Spretus_10M_bowtie2.reStrainingOrder.hybrid_scores.txt
This tab-delimited text file contains the compatibility scores for potential hybrid strain combinations, and is in the format:
Potential Hybrid // Agreeing // Disagreeing // Percentage agreement // Strain1 Index // Strain2 Index
To score as agreeing
for a given Strain 1/Strain 2 hybrid combination, a detected read base needs to match either the expected base of Strain 1 or Strain 2, from the chr1 high-confidence matrix. This means that the compatibility scores for potential hybrid combinations are by definition equal to or higher than the pure strain scores. Please do not bother looking at strain combinations if the pure strain compatibility scores already fully explain a sample genetic origin (e.g. >98% pure strain compatibility).
Spretus_10M_bowtie2.reStrainingOrder.hybrid_ratios.txt
This tab-delimited text file contains the allelic ratios for potential hybrid strain combinations, and is in the format:
Strain1 // Strain1 Count // Strain1 Percentage // Strain2 // Strain2 Count // Strain2 Percentage
This statistic can be very informative if you are expecting allelic ratios of 50:50, 25:75 etc. Pure strains should come out with allelic ratios of close to 100:0
Here is an example of a 129S1/CAST hybrid strain (public data taken from this GEO entry, aligned with Bismark/Bowtie2). The Hybrid Strain Compatibility Scores indicate that a 129S1_SvImJ/CAST_EiJ hybrid is > 99% compatible with the data. The allele-ratios between 129 and CAST are almost 50:50.
Spretus_10M_bowtie2.reStrainingOrder.summary_stats.txt
This file simply concatenates all informative stats from the files above. This file is used by reStrainingReport
to generate an HTML report.
Spretus_10M_bowtie2.reStrainingOrder.summary_stats.html
This file is generated by reStrainingReport
. It is called automatically at the end of each run of reStrainingOrder
, but can also be called stand-alone. This step extracts all stats from all *reStrainingOrder.summary_stats.txt
files in a working directory, and generates HTML reports. In its current implementation, the report is limited to the top 30 most likely pure strains/ strain combinations. Again, examples are available C57BL/6 strain (Black6) or 129S1/CAST hybrid.
reStrainingReport
uses plot.ly JS library (distributed with reStrainingOrder
) to render the plots.
reStrainingOrder was written by Felix Krueger at the Babraham Bioinformatics Group, now maintained at Altos Bioinformatics.