drafted vcf2sync
jeffersonfparil committed Nov 3, 2023
1 parent 326c28a commit 5ecce9b
Showing 8 changed files with 71 additions and 20 deletions.
24 changes: 14 additions & 10 deletions README.md
@@ -48,6 +48,9 @@ Summarised or piled up base calls of aligned reads to a reference genome.
- *Column 6*: base qualities encoded as ASCII characters, where the base error rate is `10 ^ -((ascii value of the character - 33) / 10)` (decoded in the short sketch below this list)
- *Columns 7 - 3n*: coverages, reads, and base qualities of *n* pools (3 columns per pool).
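For illustration, the quality decoding above in a few lines of Rust (a standalone sketch, not part of the crate):

```rust
/// Decode a Phred+33 base quality character into its error probability,
/// i.e. 10^(-(ascii value - 33) / 10).
fn base_error_rate(q: char) -> f64 {
    10f64.powf(-f64::from(q as u8 - 33) / 10.0)
}

fn main() {
    // 'I' has ascii value 73, so the error rate is 10^(-(73 - 33)/10) = 1e-4.
    assert!((base_error_rate('I') - 1e-4).abs() < 1e-12);
}
```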
### Variant call format (vcf)
Canonical variant calling or genotype data format for individual samples. The [`vcf2sync`](#vcf2sync) utility is expected to work with vcf versions 4.2 and 4.3. See [VCFv4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) and [VCFv4.3](https://samtools.github.io/hts-specs/VCFv4.3.pdf) for the format specifications.
### Sync
An extension of [popoolation2's](https://academic.oup.com/bioinformatics/article/27/24/3435/306737) sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
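For illustration only, a tab-delimited sync file with two pools might look like this (each pool column holds A:T:C:G:N:del counts following popoolation2; the pool names in the header are hypothetical):

```
#chrom	pos	ref	pool1	pool2
chr1	1042	A	7:0:0:0:0:0	5:2:0:0:0:0
chr1	1107	T	0:6:0:1:0:0	0:7:0:0:0:0
```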
@@ -88,6 +91,10 @@ samtools mpileup \
-o /output/file.pileup
```
### vcf2sync
Convert the most widely used genotype data format, [variant call format (`*.vcf`)](https://samtools.github.io/hts-specs/VCFv4.3.pdf), into a [synchronised pileup format](#Sync), making use of allele depths to estimate allele frequencies and omitting genotype class information, including genotype likelihoods. This retains just the allele frequency information we want from pools, populations, and even polyploids, where genotype classes can be difficult, if not impossible, to extract. This utility should be compatible with vcf versions 4.2 and 4.3.
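As a minimal sketch of the core idea (the helper below is hypothetical, not the crate's API): within each pool, an allele's frequency is its depth divided by the pool's total depth, and pools with zero total depth are treated as missing:

```rust
/// Convert per-pool allele depths (the vcf AD field: reference allele first,
/// then the alternative alleles) into per-pool allele frequencies.
/// Pools with zero total depth yield None, i.e. missing data.
fn allele_frequencies(allele_depths: &[Vec<u64>]) -> Vec<Option<Vec<f64>>> {
    allele_depths
        .iter()
        .map(|depths| {
            let total: u64 = depths.iter().sum();
            if total == 0 {
                None
            } else {
                Some(depths.iter().map(|&d| d as f64 / total as f64).collect())
            }
        })
        .collect()
}
```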
### sync2csv
Convert the [synchronised pileup format](#Sync) into a matrix ($n$ pools x $p$ alleles across loci) and write it to a comma-delimited (csv) file.
@@ -152,10 +159,9 @@ Computes [Tajima's D](https://en.wikipedia.org/wiki/Tajima%27s_D) per sliding (o
Genomewide unbiased determination of the modes of convergent evolution. Per population, significant troughs (selective sweeps) and peaks (balancing selection) are detected and their widths measured. Per population pair, significant deviations from the genomewide mean Fst within the identified significant Tajima's D peaks and troughs are also identified. Narrow Tajima's D troughs/peaks imply *de novo* mutation as the source of the genomic segment under selection, while wider ones imply standing genetic variation as the source. Pairwise Fst significantly higher than the genomewide Fst implies migration of causal variants between populations, significantly lower implies independent evolution within each population, and non-significant deviation implies a shared source of the variants under selection.
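For illustration, the peak/trough detection can be reduced to finding runs of windows past a significance threshold and measuring their widths (a hedged sketch; the crate's actual tests and thresholds may differ):

```rust
/// Find contiguous runs of windows whose Tajima's D falls below `threshold`
/// (candidate selective sweeps) and return their (start, end) window indices;
/// a run's width is end - start + 1 windows. Peaks are the mirror case.
fn find_troughs(tajima_d: &[f64], threshold: f64) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &d) in tajima_d.iter().enumerate() {
        match (d < threshold, start) {
            (true, None) => start = Some(i), // a trough begins
            (false, Some(s)) => {
                runs.push((s, i - 1)); // a trough ends
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        runs.push((s, tajima_d.len() - 1)); // trough runs to the end
    }
    runs
}
```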
-### In the works
-- **imputation**: imputation of missing loci
-- **vcf2sync**: converting the most widely used genotype format `vcf` into the succinct `sync` format.
+### impute
+Impute allele frequencies which were set to missing according to a separate minimum depth parameter, i.e. `--min-depth-set-to-missing`. You may use the computationally efficient mean value imputation, or adaptive linkage disequilibrium-based k-nearest neighbour imputation (an extension of [LD-kNNi](https://doi.org/10.1534/g3.115.021667)).
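A minimal sketch of the mean value option, assuming (as in the crate's coverage matrices) that missing entries are encoded as NaN in a pools x loci matrix:

```rust
use ndarray::Array2;

/// Replace each missing (NaN) frequency with the mean of the non-missing
/// values at the same locus; loci missing across all pools are left alone.
fn mean_impute_sketch(freqs: &mut Array2<f64>) {
    for j in 0..freqs.ncols() {
        let (sum, n) = freqs
            .column(j)
            .iter()
            .filter(|x| !x.is_nan())
            .fold((0.0, 0usize), |(s, n), &x| (s + x, n + 1));
        if n == 0 {
            continue; // completely missing locus: nothing to impute from
        }
        let mean = sum / n as f64;
        for i in 0..freqs.nrows() {
            if freqs[(i, j)].is_nan() {
                freqs[(i, j)] = mean;
            }
        }
    }
}
```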
## Details
@@ -200,13 +206,11 @@ I have attempted to create a penalisation algorithm similar to elastic-net or gl
### TODO
-[ ] Fst and Tajima's D estimation
-[ ] Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
-[ ] Simulation of genotype and phenotype data
-[ ] Imputation
+- [X] Fst and Tajima's D estimation
+- [X] Imputation
+- [X] Canonical variant call format (vcf) file to sync file conversion
+- [ ] Simulation of genotype and phenotype data
+- [ ] Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
Performs OLS or elastic-net regression to predict missing allele counts per window for each pool with at least one locus with missing data. This imputation method requires at least one pool without missing data across the window. It follows that to maximise the number of loci we can impute, we need to impose a maximum window size equal to the length of the sequencing read used to generate the data, e.g. 100 bp to 150 bp for Illumina reads.
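For illustration, the OLS step on the pools without missing data might be set up via the normal equations (a standalone sketch with naive Gaussian elimination, not the crate's implementation, which pairs OLS with elastic-net and per-window logic):

```rust
use ndarray::{Array1, Array2};

/// Ordinary least squares: solve (X^T X) b = X^T y for the coefficients b.
fn ols(x: &Array2<f64>, y: &Array1<f64>) -> Array1<f64> {
    solve(x.t().dot(x), x.t().dot(y))
}

/// Naive Gaussian elimination with partial pivoting; fine for the small
/// per-window systems sketched here, but not numerically robust in general.
fn solve(mut a: Array2<f64>, mut b: Array1<f64>) -> Array1<f64> {
    let n = b.len();
    for k in 0..n {
        // pivot on the largest remaining entry in column k
        let mut p = k;
        for i in (k + 1)..n {
            if a[(i, k)].abs() > a[(p, k)].abs() {
                p = i;
            }
        }
        if p != k {
            for j in 0..n {
                let tmp = a[(k, j)];
                a[(k, j)] = a[(p, j)];
                a[(p, j)] = tmp;
            }
            b.swap(k, p);
        }
        // eliminate column k below the pivot
        for i in (k + 1)..n {
            let f = a[(i, k)] / a[(k, k)];
            for j in k..n {
                a[(i, j)] -= f * a[(k, j)];
            }
            b[i] -= f * b[k];
        }
    }
    // back-substitution
    let mut x = Array1::<f64>::zeros(n);
    for i in (0..n).rev() {
        let s: f64 = ((i + 1)..n).map(|j| a[(i, j)] * x[j]).sum();
        x[i] = (b[i] - s) / a[(i, i)];
    }
    x
}
```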
3 changes: 2 additions & 1 deletion src/base/mod.rs
@@ -1,7 +1,8 @@
-pub use self::{helpers::*, phen::*, pileup::*, structs_and_traits::*, sync::*};
+pub use self::{helpers::*, phen::*, pileup::*, structs_and_traits::*, sync::*, vcf::*};

mod helpers;
mod phen;
mod pileup;
mod structs_and_traits;
mod sync;
mod vcf;

Check failure on line 8 in src/base/mod.rs (GitHub Actions / Compile): file not found for module `vcf`
5 changes: 2 additions & 3 deletions src/base/pileup.rs
@@ -235,9 +235,8 @@ impl Filter for PileupLine {
}

/// Filter `PileupLine` by:
-/// 1. removing nucleotide reads with counts less than `minimum coverage`, and read qualities less than `minimum quality`
-/// 2. removing allele/s if the minor allele frequency is less than `min_allele_frequency`
-/// 3. removing the entire locus if the locus is fixed, i.e. only 1 allele was found or retained after previous filterings
+/// - removing the entire locus if the locus is fixed, i.e. only 1 allele was found or retained after filterings
+/// Note that we are not removing alleles per locus if they fail the minimum allele frequency threshold, only if all alleles fail this threshold, i.e. when the locus is close to being fixed
fn filter(&mut self, filter_stats: &FilterStats) -> io::Result<&mut Self> {
// First, make sure we have the correct number of expected pools as in the phenotype file
// TODO: Make the error propagate out; it's currently being consumed as None in the only function calling it below.
20 changes: 20 additions & 0 deletions src/base/structs_and_traits.rs
@@ -12,6 +12,14 @@ pub struct FilePileup {
pub pool_names: Vec<String>,
}

/// The alternative entry point genotype file struct, i.e. describing the genotype data in vcf format
/// Note: make sure to annotate the vcf file with allele depths, e.g. `bcftools mpileup -a AD ...`
/// - `filename` - filename of the vcf file (`*.vcf` or `*.vcf.gz`)
#[derive(Debug, Clone)]
pub struct FileVcf {
pub filename: String,
}

/// The main genotype file struct used for most of the analyses
/// - `filename` - filename of the synchronised pileup file (`*.sync`)
/// - `test` - name of statistical test, i.e. "sync2csv", "fisher_exact_test", "chisq_test", "fst", "heterozygosity", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "genomic_prediction_cross_validation"
@@ -78,6 +86,18 @@ pub struct PileupLine {
pub read_qualities: Vec<Vec<u8>>, // utf8 base quality codes which can be transformed into base error rates as 10^(-(u8 - 33)/10)
}

/// A line of a vcf file corresponding to a single locus across all the pools
/// We are interested in extracting allele counts.
/// We are not interested in the genotype calls and their corresponding likelihoods.
#[derive(Debug, Clone, PartialEq)]
pub struct VcfLine {
pub chromosome: String, // chromosome or scaffold name
pub position: u64, // position in number of bases
pub reference_allele: char, // reference allele
pub alternative_alleles: Vec<char>, // vector of alternative alleles
pub allele_depths: Vec<Vec<u64>>, // allele depths per pool: the reference allele first, followed by the alternative alleles
}

/// Allele counts at a locus across pools
#[derive(Debug, Clone, PartialEq)]
pub struct LocusCounts {
8 changes: 7 additions & 1 deletion src/imputation/adaptive_ld_knn_imputation.rs
@@ -314,7 +314,13 @@ impl GenotypesAndPhenotypes {
// Set missing coverages to infinity to mark imputed data
for j in 0..self.coverages.ncols() {
// Mark only the imputed loci, i.e. loci which were not completely missing across all pools
-let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum});
+let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| {
+    if x.is_nan() == false {
+        sum + 1
+    } else {
+        sum
+    }
+});
if n_non_missing > 0 {
for i in 0..self.coverages.nrows() {
if self.coverages[(i, j)].is_nan() {
6 changes: 4 additions & 2 deletions src/imputation/filtering_missing.rs
@@ -6,8 +6,10 @@ use crate::base::*;
impl GenotypesAndPhenotypes {
pub fn missing_rate(&mut self) -> io::Result<f64> {
let (n, l) = self.coverages.dim();
-let sum = self.coverages.fold(0,|sum, &x| if x.is_nan(){sum + 1}else{sum});
-Ok(sensible_round(sum as f64 * 100.0 / ((n*l) as f64), 2))
+let sum = self
+    .coverages
+    .fold(0, |sum, &x| if x.is_nan() { sum + 1 } else { sum });
+Ok(sensible_round(sum as f64 * 100.0 / ((n * l) as f64), 2))
}

pub fn set_missing_by_depth(
10 changes: 8 additions & 2 deletions src/imputation/mean_imputation.rs
@@ -42,7 +42,13 @@ impl GenotypesAndPhenotypes {
// Set missing coverages to infinity to mark imputed data
for j in 0..self.coverages.ncols() {
// Mark only the imputed loci, i.e. loci which were not completely missing across all pools
-let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| if x.is_nan()==false{sum + 1}else{sum});
+let n_non_missing = self.coverages.select(Axis(1), &[j]).fold(0, |sum, &x| {
+    if x.is_nan() == false {
+        sum + 1
+    } else {
+        sum
+    }
+});
if n_non_missing > 0 {
for i in 0..self.coverages.nrows() {
if self.coverages[(i, j)].is_nan() {
@@ -151,7 +157,7 @@ pub fn impute_mean(
let out = genotypes_and_phenotypes
.write_csv(filter_stats, keep_p_minus_1, out, n_threads)
.unwrap();

Ok(out)
}

Expand Down
15 changes: 14 additions & 1 deletion src/main.rs
@@ -27,7 +27,7 @@ use popgen::*;
long_about = "Quantitative and population genetics analyses using pool sequencing data: trying to continue the legacy of the now unmaintained popoolation2 package with the memory safety of Rust."
)]
struct Args {
-/// Analysis to perform (i.e. "pileup2sync", "sync2csv", "fisher_exact_test", "chisq_test", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "ridge_iter", "genomic_prediction_cross_validation", "fst", "heterozygosity", "watterson_estimator", "tajima_d", "gudmc", "impute")
+/// Analysis to perform (i.e. "pileup2sync", "vcf2sync", "sync2csv", "fisher_exact_test", "chisq_test", "pearson_corr", "ols_iter", "ols_iter_with_kinship", "mle_iter", "mle_iter_with_kinship", "gwalpha", "ridge_iter", "genomic_prediction_cross_validation", "fst", "heterozygosity", "watterson_estimator", "tajima_d", "gudmc", "impute")
analysis: String,
/// Filename of the input pileup or synchronised pileup file (i.e. *.pileup, *.sync, *.syncf, or *.syncx)
#[clap(short, long)]
@@ -204,6 +204,19 @@ fn main() {
base::pileup_to_sync,
)
.unwrap();
} else if args.analysis == String::from("vcf2sync") {
// VCF INPUT
let file_vcf = base::FileVcf {
filename: args.fname,
};
output = file_vcf
.read_analyse_write(

Check failure on line 213 in src/main.rs (GitHub Actions / Compile): no method named `read_analyse_write` found for struct `structs_and_traits::FileVcf` in the current scope
&filter_stats,
&args.output,
&args.n_threads,
base::vcf_to_sync,

Check failure on line 217 in src/main.rs (GitHub Actions / Compile): cannot find value `vcf_to_sync` in module `base`
)
.unwrap();
} else {
// SYNC INPUT
let file_sync = base::FileSync {
