drafted vcf2sync
jeffersonfparil committed Nov 3, 2023
1 parent 326c28a commit 5ecce9b
Showing 8 changed files with 71 additions and 20 deletions.
24 changes: 14 additions & 10 deletions README.md
@@ -48,6 +48,9 @@ Summarised or piled up base calls of aligned reads to a reference genome.
- *Column 6*: base qualities encoded as ASCII characters, where the base error rate is `10 ^ -((ascii value of the character - 33) / 10)` (decoded in the short sketch below this list)
- *Columns 7 - 3n*: coverages, reads, and base qualities of *n* pools (3 columns per pool).
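For illustration, the quality decoding above in a few lines of Rust (a standalone sketch, not part of the crate):

```rust
/// Decode a Phred+33 base quality character into its error probability,
/// i.e. 10^(-(ascii value - 33) / 10).
fn base_error_rate(q: char) -> f64 {
    10f64.powf(-f64::from(q as u8 - 33) / 10.0)
}

fn main() {
    // 'I' has ascii value 73, so the error rate is 10^(-(73 - 33)/10) = 1e-4.
    assert!((base_error_rate('I') - 1e-4).abs() < 1e-12);
}
```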
### Variant call format (vcf)
Canonical variant calling or genotype data format for individual samples. The [`vcf2sync`](#vcf2sync) utility is expected to work with vcf versions 4.2 and 4.3. See [VCFv4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) and [VCFv4.3](https://samtools.github.io/hts-specs/VCFv4.3.pdf) for the format specifications.
### Sync
An extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
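For illustration only, a tab-delimited sync file with two pools might look like this (each pool column holds A:T:C:G:N:del counts following popoolation2; the pool names in the header are hypothetical):

```
#chrom	pos	ref	pool1	pool2
chr1	1042	A	7:0:0:0:0:0	5:2:0:0:0:0
chr1	1107	T	0:6:0:1:0:0	0:7:0:0:0:0
```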
@@ -88,6 +91,10 @@ samtools mpileup \
-o /output/file.pileup
```
### vcf2sync
Convert the most widely used genotype data format, [variant call format (`*.vcf`)](https://samtools.github.io/hts-specs/VCFv4.3.pdf), into a [synchronised pileup format](#Sync), making use of allele depths to estimate allele frequencies and omitting genotype class information, including genotype likelihoods. This retains just the allele frequency information we want from pools, populations, and even polyploids, where genotype classes can be difficult, if not impossible, to extract. This utility should be compatible with vcf versions 4.2 and 4.3.
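As a minimal sketch of the core idea (the helper below is hypothetical, not the crate's API): within each pool, an allele's frequency is its depth divided by the pool's total depth, and pools with zero total depth are treated as missing:

```rust
/// Convert per-pool allele depths (the vcf AD field: reference allele first,
/// then the alternative alleles) into per-pool allele frequencies.
/// Pools with zero total depth yield None, i.e. missing data.
fn allele_frequencies(allele_depths: &[Vec<u64>]) -> Vec<Option<Vec<f64>>> {
    allele_depths
        .iter()
        .map(|depths| {
            let total: u64 = depths.iter().sum();
            if total == 0 {
                None
            } else {
                Some(depths.iter().map(|&d| d as f64 / total as f64).collect())
            }
        })
        .collect()
}
```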
### sync2csv
Convert the [synchronised pileup format](#Sync) into a matrix ($n$ pools x $p$ alleles across loci) and write it to a comma-delimited (csv) file.
@@ -152,10 +159,9 @@ Computes [Tajima's D](https://en.wikipedia.org/wiki/Tajima%27s_D) per sliding (o
Genomewide unbiased determination of the modes of convergent evolution. Per population, significant troughs (selective sweeps) and peaks (balancing selection) are detected and their widths measured. Per population pair, significant deviations from the genomewide mean Fst within the identified significant Tajima's D peaks and troughs are also identified. Narrow Tajima's D troughs/peaks imply *de novo* mutation as the source of the genomic segment under selection, while wider ones imply standing genetic variation as the source. Pairwise Fst significantly higher than the genomewide Fst implies migration of causal variants between populations, significantly lower implies independent evolution within each population, and non-significant deviation implies a shared source of the variants under selection.
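For illustration, the peak/trough detection can be reduced to finding runs of windows past a significance threshold and measuring their widths (a hedged sketch; the crate's actual tests and thresholds may differ):

```rust
/// Find contiguous runs of windows whose Tajima's D falls below `threshold`
/// (candidate selective sweeps) and return their (start, end) window indices;
/// a run's width is end - start + 1 windows. Peaks are the mirror case.
fn find_troughs(tajima_d: &[f64], threshold: f64) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &d) in tajima_d.iter().enumerate() {
        match (d < threshold, start) {
            (true, None) => start = Some(i), // a trough begins
            (false, Some(s)) => {
                runs.push((s, i - 1)); // a trough ends
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        runs.push((s, tajima_d.len() - 1)); // trough runs to the end
    }
    runs
}
```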
-### In the works
-- **imputation**: imputation of missing loci
-- **vcf2sync**: converting the most widely used genotype format `vcf` into the succinct `sync` format.
+### impute
+Impute allele frequencies which were set to missing according to a separate minimum depth parameter, i.e. `--min-depth-set-to-missing`. You may use the computationally efficient mean value imputation, or adaptive linkage disequilibrium-based k-nearest neighbour imputation (an extension of [LD-kNNi](https://doi.org/10.1534/g3.115.021667)).
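A minimal sketch of the mean value option, assuming (as in the crate's coverage matrices) that missing entries are encoded as NaN in a pools x loci matrix:

```rust
use ndarray::Array2;

/// Replace each missing (NaN) frequency with the mean of the non-missing
/// values at the same locus; loci missing across all pools are left alone.
fn mean_impute_sketch(freqs: &mut Array2<f64>) {
    for j in 0..freqs.ncols() {
        let (sum, n) = freqs
            .column(j)
            .iter()
            .filter(|x| !x.is_nan())
            .fold((0.0, 0usize), |(s, n), &x| (s + x, n + 1));
        if n == 0 {
            continue; // completely missing locus: nothing to impute from
        }
        let mean = sum / n as f64;
        for i in 0..freqs.nrows() {
            if freqs[(i, j)].is_nan() {
                freqs[(i, j)] = mean;
            }
        }
    }
}
```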
## Details
@@ -200,13 +206,11 @@ I have attempted to create a penalisation algorithm similar to elastic-net or gl
### TODO
-[ ] Fst and Tajima's D estimation
-[ ] Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
-[ ] Simulation of genotype and phenotype data
-[ ] Imputation
+- [X] Fst and Tajima's D estimation
+- [X] Imputation
+- [X] Canonical variant call format (vcf) file to sync file conversion
+- [ ] Simulation of genotype and phenotype data
+- [ ] Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
Performs OLS or elastic-net regression to predict missing allele counts per window for each pool with at least one locus with missing data. This imputation method requires at least one pool without missing data across the window. It follows that to maximise the number of loci we can impute, we need to impose a maximum window size equal to the length of the sequencing read used to generate the data, e.g. 100 bp to 150 bp for Illumina reads.
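For illustration, the OLS step on the pools without missing data might be set up via the normal equations (a standalone sketch with naive Gaussian elimination, not the crate's implementation, which pairs OLS with elastic-net and per-window logic):

```rust
use ndarray::{Array1, Array2};

/// Ordinary least squares: solve (X^T X) b = X^T y for the coefficients b.
fn ols(x: &Array2<f64>, y: &Array1<f64>) -> Array1<f64> {
    solve(x.t().dot(x), x.t().dot(y))
}

/// Naive Gaussian elimination with partial pivoting; fine for the small
/// per-window systems sketched here, but not numerically robust in general.
fn solve(mut a: Array2<f64>, mut b: Array1<f64>) -> Array1<f64> {
    let n = b.len();
    for k in 0..n {
        // pivot on the largest remaining entry in column k
        let mut p = k;
        for i in (k + 1)..n {
            if a[(i, k)].abs() > a[(p, k)].abs() {
                p = i;
            }
        }
        if p != k {
            for j in 0..n {
                let tmp = a[(k, j)];
                a[(k, j)] = a[(p, j)];
                a[(p, j)] = tmp;
            }
            b.swap(k, p);
        }
        // eliminate column k below the pivot
        for i in (k + 1)..n {
            let f = a[(i, k)] / a[(k, k)];
            for j in k..n {
                a[(i, j)] -= f * a[(k, j)];
            }
            b[i] -= f * b[k];
        }
    }
    // back-substitution
    let mut x = Array1::<f64>::zeros(n);
    for i in (0..n).rev() {
        let s: f64 = ((i + 1)..n).map(|j| a[(i, j)] * x[j]).sum();
        x[i] = (b[i] - s) / a[(i, i)];
    }
    x
}
```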
3 changes: 2 additions & 1 deletion src/base/mod.rs
@@ -1,7 +1,8 @@
-pub use self::{helpers::*, phen::*, pileup::*, structs_and_traits::*, sync::*};
+pub use self::{helpers::*, phen::*, pileup::*, structs_and_traits::*, sync::*, vcf::*};

mod helpers;
mod phen;
mod pileup;
mod structs_and_traits;
mod sync;
mod vcf;

Check failure on line 8 in src/base/mod.rs (GitHub Actions / Compile): file not found for module `vcf`
5 changes: 2 additions & 3 deletions src/base/pileup.rs
@@ -235,9 +235,8 @@ impl Filter for PileupLine {
}

/// Filter `PileupLine` by:
-/// 1. removing nucleotide reads with counts less than `minimum coverage`, and read qualities less than `minimum quality`
-/// 2. removing allele/s if the minor allele frequency is less than `min_allele_frequency`
-/// 3. removing the entire locus if the locus is fixed, i.e. only 1 allele was found or retained after previous filterings
+/// - removing the entire locus if the locus is fixed, i.e. only 1 allele was found or retained after filterings
+/// Note that we are not removing alleles per locus if they fail the minimum allele frequency threshold, only if all alleles fail this threshold, i.e. when the locus is close to being fixed
fn filter(&mut self, filter_stats: &FilterStats) -> io::Result<&mut Self> {
// First, make sure we have the correct number of expected pools as in the phenotype file
// TODO: Make the error propagate out; it's currently being consumed as None in the only function calling it below.
20 changes: 20 additions & 0 deletions src/base/structs_and_traits.rs
@@ -12,6 +12,14 @@ pub struct FilePileup {
pub pool_names: Vec<String>,
}

/// The alternative entry point genotype file struct, i.e. describing the genotype data in vcf format
/// Note: make sure to annotate the vcf file with allele depths, e.g. `bcftools mpileup -a AD ...`
/// - `filename` - filename of the vcf file (`*.vcf` or `*.vcf.gz`)
#[derive(Debug, Clone)]
pub struct FileVcf {
pub filename: String,
}

/// The main genotype file struct used for most of the analyses
/// - `filename` - filename of the synchronised pileup file (`*.sync`)
/// - `test` - name of statistical test, i.e. "sync2csv", "fisher_exact_test", "chisq_test", "fst", "heterozygosity", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "genomic_prediction_cross_validation"
@@ -78,6 +86,18 @@ pub struct PileupLine {
pub read_qualities: Vec<Vec<u8>>, // utf8 base quality codes which can be transformed into base error rates as 10^(-(u8 - 33)/10)
}

/// A line of a vcf file corresponding to a single locus across all the pools
/// We are interested in extracting allele counts.
/// We are not interested in the genotype calls and their corresponding likelihoods.
#[derive(Debug, Clone, PartialEq)]
pub struct VcfLine {
pub chromosome: String, // chromosome or scaffold name
pub position: u64, // position in number of bases
pub reference_allele: char, // reference allele
pub alternative_alleles: Vec<char>, // vector of alternative alleles
pub allele_depths: Vec<Vec<u64>>, // allele depths per pool: the reference allele first, followed by the alternative alleles
}

/// Allele counts at a locus across pools
#[derive(Debug, Clone, PartialEq)]
pub struct LocusCounts {
8 changes: 7 additions & 1 deletion src/imputation/adaptive_ld_knn_imputation.rs
@@ -314,7 +314,13 @@ impl GenotypesAndPhenotypes {
// Set missing coverages to infinity to mark imputed data
for j in 0..self.coverages.ncols() {
// Mark only the imputed loci, i.e. loci which were not completely missing across all pools
-let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum});
+let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| {
+    if x.is_nan() == false {
+        sum + 1
+    } else {
+        sum
+    }
+});
if n_non_missing > 0 {
for i in 0..self.coverages.nrows() {
if self.coverages[(i, j)].is_nan() {
6 changes: 4 additions & 2 deletions src/imputation/filtering_missing.rs
@@ -6,8 +6,10 @@ use crate::base::*;
impl GenotypesAndPhenotypes {
pub fn missing_rate(&mut self) -> io::Result<f64> {
let (n, l) = self.coverages.dim();
-let sum = self.coverages.fold(0,|sum, &x| if x.is_nan(){sum + 1}else{sum});
-Ok(sensible_round(sum as f64 * 100.0 / ((n*l) as f64), 2))
+let sum = self
+    .coverages
+    .fold(0, |sum, &x| if x.is_nan() { sum + 1 } else { sum });
+Ok(sensible_round(sum as f64 * 100.0 / ((n * l) as f64), 2))
}

pub fn set_missing_by_depth(
10 changes: 8 additions & 2 deletions src/imputation/mean_imputation.rs
@@ -42,7 +42,13 @@ impl GenotypesAndPhenotypes {
// Set missing coverages to infinity to mark imputed data
for j in 0..self.coverages.ncols() {
// Mark only the imputed loci, i.e. loci which were not completely missing across all pools
-let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum});
+let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| {
+    if x.is_nan() == false {
+        sum + 1
+    } else {
+        sum
+    }
+});
if n_non_missing > 0 {
for i in 0..self.coverages.nrows() {
if self.coverages[(i, j)].is_nan() {
@@ -151,7 +157,7 @@ pub fn impute_mean(
let out = genotypes_and_phenotypes
.write_csv(filter_stats, keep_p_minus_1, out, n_threads)
.unwrap();

Ok(out)
}

Expand Down
15 changes: 14 additions & 1 deletion src/main.rs
@@ -27,7 +27,7 @@ use popgen::*;
long_about = "Quantitative and population genetics analyses using pool sequencing data: trying to continue the legacy of the now unmaintained popoolation2 package with the memory safety of Rust."
)]
struct Args {
-/// Analysis to perform (i.e. "pileup2sync", "sync2csv", "fisher_exact_test", "chisq_test", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "ridge_iter", "genomic_prediction_cross_validation", "fst", "heterozygosity", "watterson_estimator", "tajima_d", "gudmc", "impute")
+/// Analysis to perform (i.e. "pileup2sync", "vcf2sync", "sync2csv", "fisher_exact_test", "chisq_test", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "ridge_iter", "genomic_prediction_cross_validation", "fst", "heterozygosity", "watterson_estimator", "tajima_d", "gudmc", "impute")
analysis: String,
/// Filename of the input pileup or synchronised pileup file (i.e. *.pileup, *.sync, *.syncf, or *.syncx)
#[clap(short, long)]
@@ -204,6 +204,19 @@ fn main() {
base::pileup_to_sync,
)
.unwrap();
} else if args.analysis == String::from("vcf2sync") {
// VCF INPUT
let file_vcf = base::FileVcf {
filename: args.fname,
};
output = file_vcf
.read_analyse_write(

Check failure on line 213 in src/main.rs (GitHub Actions / Compile): no method named `read_analyse_write` found for struct `structs_and_traits::FileVcf` in the current scope
&filter_stats,
&args.output,
&args.n_threads,
base::vcf_to_sync,

Check failure on line 217 in src/main.rs (GitHub Actions / Compile): cannot find value `vcf_to_sync` in module `base`
)
.unwrap();
} else {
// SYNC INPUT
let file_sync = base::FileSync {
