diff --git a/README.md b/README.md
index 6816753..4c6f150 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,9 @@ Summarised or piled up base calls of aligned reads to a reference genome.
 - *Column 6*: base qualities encoded such that the error rate is `10 ^ -((ascii value of the character - 33) / 10)`
 - *Columns 7 - 3n*: coverages, reads, and base qualities of *n* pools (3 columns per pool).
 
+### Variant call format (vcf)
+
+Canonical variant calling or genotype data format for individual samples. The [`vcf2sync`](#vcf2sync) utility is expected to work with vcf versions 4.2 and 4.3. See [VCFv4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) and [VCFv4.3](https://samtools.github.io/hts-specs/VCFv4.3.pdf) for details of the format specifications.
+
 ### Sync
 
 An extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
@@ -88,6 +91,10 @@ samtools mpileup \
     -o /output/file.pileup
 ```
 
+### vcf2sync
+
+Convert the most widely used genotype data format, the [variant call format (`*.vcf`)](https://samtools.github.io/hts-specs/VCFv4.3.pdf), into a [synchronised pileup format](#Sync), making use of allele depths to estimate allele frequencies and omitting genotype class information, including genotype likelihoods. This retains just the allele frequency information we want from pools, populations, and even polyploids, where genotype classes can be difficult if not impossible to extract. This utility should be compatible with vcf versions 4.2 and 4.3. An example invocation is sketched after the [impute](#impute) section below.
+
 ### sync2csv
 
 Convert [synchronised pileup format](#Sync) into a matrix ($n$ pools x $p$ alleles across loci) and write into a comma-delimited (csv) file.
@@ -152,10 +159,9 @@ Computes [Tajima's D](https://en.wikipedia.org/wiki/Tajima%27s_D) per sliding (o
 
 Genomewide unbiased determination of the modes of convergent evolution. Per population, significant troughs (selective sweeps) and peaks (balancing selection) are detected and their widths measured. Per population pair, significant deviations from the mean genomewide Fst within the identified significant Tajima's D peaks and troughs are also identified. Narrow Tajima's D troughs/peaks imply *de novo* mutation as the source of the genomic segment under selection, while wider ones imply standing genetic variation as the source. Pairwise Fst significantly higher than the genomewide Fst implies migration of causal variants between populations, significantly lower Fst implies independent evolution within each population, and non-significantly deviated pairwise Fst implies a shared source of the variants under selection.
 
-### In the works
+### impute
 
-- **imputation**: imputation of missing loci
-- **vcf2sync**: converting the most widely used genotype format `vcf` into the succinct `sync` format.
+Impute allele frequencies which were set to missing according to a minimum depth parameter, i.e. `--min-depth-set-to-missing`. You may use the computationally efficient mean value imputation, or the adaptive linkage disequilibrium-based k-nearest neighbour imputation (an extension of [LD-kNNi](https://doi.org/10.1534/g3.115.021667)).
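+
+Below is a minimal usage sketch chaining the two new utilities. The binary name (`poolgen`) and the long flag names (`--fname`, `--output`, `--min-depth-set-to-missing`) are assumptions inferred from the `Args` struct in `src/main.rs` rather than confirmed CLI documentation, and additional required options (e.g. a phenotype file) may apply:
+
+```shell
+# Convert an allele-depth-annotated vcf into a sync file
+poolgen vcf2sync --fname genotypes.vcf --output genotypes.sync
+# Impute missing allele frequencies, first setting loci covered below 10x to missing
+poolgen impute --fname genotypes.sync --min-depth-set-to-missing 10 --output genotypes_imputed.csv
+```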
 
 ## Details
@@ -200,13 +206,11 @@ I have attempted to create a penalisation algorithm similar to elastic-net or gl
 
 ### TODO
 
-[ ] Fst and Tajima's D estimation
-
-[ ] Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
-
-[ ] Simulation of genotype and phenotype data
-
-[ ] Imputation
+- [X] Fst and Tajima's D estimation
+- [X] Imputation
+- [X] Canonical variant call format (vcf) file to sync file conversion
+- [ ] Simulation of genotype and phenotype data
+- [ ] Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
 
 Performs OLS or elastic-net regression to predict missing allele counts per window for each pool with at least one locus with missing data. This imputation method requires at least one pool without missing data across the window. It follows that to maximise the number of loci we can impute, we need to impose a maximum window size equal to the length of the sequencing read used to generate the data, e.g. 100 bp to 150 bp for Illumina reads.
diff --git a/src/base/mod.rs b/src/base/mod.rs
index 24ba64d..069ad76 100644
--- a/src/base/mod.rs
+++ b/src/base/mod.rs
@@ -1,7 +1,8 @@
-pub use self::{helpers::*, phen::*, pileup::*, structs_and_traits::*, sync::*};
+pub use self::{helpers::*, phen::*, pileup::*, structs_and_traits::*, sync::*, vcf::*};
 mod helpers;
 mod phen;
 mod pileup;
 mod structs_and_traits;
 mod sync;
+mod vcf;
diff --git a/src/base/pileup.rs b/src/base/pileup.rs
index f6ec660..6fb79e0 100644
--- a/src/base/pileup.rs
+++ b/src/base/pileup.rs
@@ -235,9 +235,8 @@ impl Filter for PileupLine {
     }
 
     /// Filter `PileupLine` by:
-    /// 1. removing nucleotide reads with counts less than `minimum coverage`, and read qualities less than `minimum quality`
-    /// 2. removing allele/s if the minor allele frequency is less than `min_allele_frequency`
-    /// 3. removing the entire locus if the locus is fixed, i.e. only 1 allele was found or retained after previous filterings
+    /// - removing the entire locus if the locus is fixed, i.e. only 1 allele was found or retained after filtering
+    /// Note that we are not removing alleles per locus if they fail the minimum allele frequency threshold, only if all alleles fail this threshold, i.e. when the locus is close to being fixed
     fn filter(&mut self, filter_stats: &FilterStats) -> io::Result<&mut Self> {
         // First, make sure we have the correct number of expected pools as the phenotype file
         // TODO: Make the error pop out because it's currently being consumed as None in the only function calling it below.
diff --git a/src/base/structs_and_traits.rs b/src/base/structs_and_traits.rs
index b24452e..3731651 100644
--- a/src/base/structs_and_traits.rs
+++ b/src/base/structs_and_traits.rs
@@ -12,6 +12,14 @@ pub struct FilePileup {
     pub pool_names: Vec<String>,
 }
 
+/// The alternative entry point genotype file struct, i.e. describing the genotype data in vcf format
+/// Note: Make sure to annotate the vcf file with allele depths, e.g. `bcftools mpileup -a AD ...`
+/// - `filename` - filename of the vcf file (`*.vcf` or `*.vcf.gz`)
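+/// For example, an illustrative sketch (the reference, BAM list, and caller options are placeholders to adapt to your dataset):
+/// `bcftools mpileup -a AD -f reference.fa -b bam_list.txt | bcftools call -mv -Ov -o genotypes.vcf`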
"sync2csv", "fisher_exact_test", "chisq_test", "fst", "heterozygosity, "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "genomic_prediction_cross_validation"" @@ -78,6 +86,18 @@ pub struct PileupLine { pub read_qualities: Vec>, // utf8 base quality codes which can be transformed into bases error rate as 10^(-(u8 - 33)/10) } +/// A line of a vcf file corresponding to a single locus across all the pools +/// We are interested in extracting allele counts. +/// We are not interested in the genotype calls and their corresponding likelihoods. +#[derive(Debug, Clone, PartialEq)] +pub struct VcfLine { + pub chromosome: String, // chromosome or scaffold name + pub position: u64, // position in number of bases + pub reference_allele: char, // reference allele + pub alternative_alleles: Vec, // vector of alternative alleles + pub allele_depths: Vec>, // across samples average utf8 base quality codes which can be transformed into bases error rate as 10^(-(u8 - 33)/10) +} + /// Allele counts at a locus across pools #[derive(Debug, Clone, PartialEq)] pub struct LocusCounts { diff --git a/src/imputation/adaptive_ld_knn_imputation.rs b/src/imputation/adaptive_ld_knn_imputation.rs index 0618df8..4ab9f52 100644 --- a/src/imputation/adaptive_ld_knn_imputation.rs +++ b/src/imputation/adaptive_ld_knn_imputation.rs @@ -314,7 +314,13 @@ impl GenotypesAndPhenotypes { // Set missing coverages to infinity to mark imputed data for j in 0..self.coverages.ncols() { // Mark only the imputed loci, i.e. loci which were not completely missing across all pools - let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum}); + let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| { + if x.is_nan() == false { + sum + 1 + } else { + sum + } + }); if n_non_missing > 0 { for i in 0..self.coverages.nrows() { if self.coverages[(i, j)].is_nan() { diff --git a/src/imputation/filtering_missing.rs b/src/imputation/filtering_missing.rs index 03a093f..cbe8cad 100644 --- a/src/imputation/filtering_missing.rs +++ b/src/imputation/filtering_missing.rs @@ -6,8 +6,10 @@ use crate::base::*; impl GenotypesAndPhenotypes { pub fn missing_rate(&mut self) -> io::Result { let (n, l) = self.coverages.dim(); - let sum = self.coverages.fold(0,|sum, &x| if x.is_nan(){sum + 1}else{sum}); - Ok(sensible_round(sum as f64 * 100.0 / ((n*l) as f64), 2)) + let sum = self + .coverages + .fold(0, |sum, &x| if x.is_nan() { sum + 1 } else { sum }); + Ok(sensible_round(sum as f64 * 100.0 / ((n * l) as f64), 2)) } pub fn set_missing_by_depth( diff --git a/src/imputation/mean_imputation.rs b/src/imputation/mean_imputation.rs index 361a556..b47d200 100644 --- a/src/imputation/mean_imputation.rs +++ b/src/imputation/mean_imputation.rs @@ -42,7 +42,13 @@ impl GenotypesAndPhenotypes { // Set missing coverages to infinity to mark imputed data for j in 0..self.coverages.ncols() { // Mark only the imputed loci, i.e. 
+
 /// Allele counts at a locus across pools
 #[derive(Debug, Clone, PartialEq)]
 pub struct LocusCounts {
diff --git a/src/imputation/adaptive_ld_knn_imputation.rs b/src/imputation/adaptive_ld_knn_imputation.rs
index 0618df8..4ab9f52 100644
--- a/src/imputation/adaptive_ld_knn_imputation.rs
+++ b/src/imputation/adaptive_ld_knn_imputation.rs
@@ -314,7 +314,13 @@ impl GenotypesAndPhenotypes {
         // Set missing coverages to infinity to mark imputed data
         for j in 0..self.coverages.ncols() {
             // Mark only the imputed loci, i.e. loci which were not completely missing across all pools
-            let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum});
+            let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| {
+                if x.is_nan() == false {
+                    sum + 1
+                } else {
+                    sum
+                }
+            });
             if n_non_missing > 0 {
                 for i in 0..self.coverages.nrows() {
                     if self.coverages[(i, j)].is_nan() {
diff --git a/src/imputation/filtering_missing.rs b/src/imputation/filtering_missing.rs
index 03a093f..cbe8cad 100644
--- a/src/imputation/filtering_missing.rs
+++ b/src/imputation/filtering_missing.rs
@@ -6,8 +6,10 @@ use crate::base::*;
 impl GenotypesAndPhenotypes {
     pub fn missing_rate(&mut self) -> io::Result<f64> {
         let (n, l) = self.coverages.dim();
-        let sum = self.coverages.fold(0,|sum, &x| if x.is_nan(){sum + 1}else{sum});
-        Ok(sensible_round(sum as f64 * 100.0 / ((n*l) as f64), 2))
+        let sum = self
+            .coverages
+            .fold(0, |sum, &x| if x.is_nan() { sum + 1 } else { sum });
+        Ok(sensible_round(sum as f64 * 100.0 / ((n * l) as f64), 2))
     }
 
     pub fn set_missing_by_depth(
diff --git a/src/imputation/mean_imputation.rs b/src/imputation/mean_imputation.rs
index 361a556..b47d200 100644
--- a/src/imputation/mean_imputation.rs
+++ b/src/imputation/mean_imputation.rs
@@ -42,7 +42,13 @@ impl GenotypesAndPhenotypes {
         // Set missing coverages to infinity to mark imputed data
         for j in 0..self.coverages.ncols() {
             // Mark only the imputed loci, i.e. loci which were not completely missing across all pools
-            let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum});
+            let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| {
+                if x.is_nan() == false {
+                    sum + 1
+                } else {
+                    sum
+                }
+            });
             if n_non_missing > 0 {
                 for i in 0..self.coverages.nrows() {
                     if self.coverages[(i, j)].is_nan() {
@@ -151,7 +157,7 @@ pub fn impute_mean(
     let out = genotypes_and_phenotypes
         .write_csv(filter_stats, keep_p_minus_1, out, n_threads)
         .unwrap();
-
+
     Ok(out)
 }
diff --git a/src/main.rs b/src/main.rs
index dc29803..d965fe5 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -27,7 +27,7 @@ use popgen::*;
     long_about = "Quantitative and population genetics analyses using pool sequencing data: trying to continue the legacy of the now unmaintained popoolation2 package with the memory safety of Rust."
 )]
 struct Args {
-    /// Analysis to perform (i.e. "pileup2sync", "sync2csv", "fisher_exact_test", "chisq_test", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "ridge_iter", "genomic_prediction_cross_validation", "fst", "heterozygosity", "watterson_estimator", "tajima_d", "gudmc", "impute")
+    /// Analysis to perform (i.e. "pileup2sync", "vcf2sync", "sync2csv", "fisher_exact_test", "chisq_test", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "ridge_iter", "genomic_prediction_cross_validation", "fst", "heterozygosity", "watterson_estimator", "tajima_d", "gudmc", "impute")
     analysis: String,
     /// Filename of the input pileup or synchronised pileup file (i.e. *.pileup, *.sync, *.syncf, or *.syncx)
     #[clap(short, long)]
@@ -204,6 +204,19 @@ fn main() {
                 base::pileup_to_sync,
             )
             .unwrap();
+    } else if args.analysis == String::from("vcf2sync") {
+        // VCF INPUT
+        let file_vcf = base::FileVcf {
+            filename: args.fname,
+        };
+        output = file_vcf
+            .read_analyse_write(
+                &filter_stats,
+                &args.output,
+                &args.n_threads,
+                base::vcf_to_sync,
+            )
+            .unwrap();
     } else {
         // SYNC INPUT
         let file_sync = base::FileSync {