This repository contains scripts to:
- Generate reference variant sets for all genes with VARITY predictions
- Add additional annotations to CSV files with log-likelihood ratios (LLRs) at 0.1 VARITY score intervals
- Cluster genes based on inheritance and functional activity, and attempt to build a naive bayes model to classify genes based on their LLR curves
To generate reference variant sets for genes with VARITY predictions, you will need:
- A ClinVar variant summary file (which can be found at https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/)
- A GnomAD variant summary file (which can be found at https://gnomad.broadinstitute.org/downloads/)
- A list of VARITY supported IDs (from http://varity.varianteffect.org/)
- All VARITY predictions (from http://varity.varianteffect.org/)
NOTE: The GnomAD file must be filtered to speed up the time it takes to make reference sets. To do this:
- Change the name of gnomad_file to the path of your input GnomAD variant summary file and filtered_gnomad_file to your desired output name
- Run
python filter_gnomad.py
You can run the script varity_LLRs.py to generate reference sets.
Required arguments:
--clinvar Path to ClinVar variant summary file
--gnomad Path to filtered GnomAD variant summary file
--varity_ids Path to list of VARITY supported IDs
--varity_predictions Path to VARITY predictions
-o, --output Output directory where reference sets should be written
Optional arguments:
--num_hom GnomAD homozygote threshold (default: 1)
--min_ref Minimum number of reference pathogenic and benign variants (default: 11)
--maf_cutoff MAF threshold to be considered benign (default: 0.001)
Example usage:
python varity_LLRs.py --clinvar data/variant_summary.txt --gnomad data/gnomad_frequencies.vcf.gz --varity_ids data/varity_supported_ids.txt --varity_predictions data/varity_all_predictions.txt --output reference_sets --num_hom 2
Once the reference sets are produced, you can modify the calcLLR.R script within TileseqMave to output LLRs at 0.1 VARITY score intervals into a single output called LLR_intervals.csv. To do so, add the following code to the end of calcLLR.R and update TileseqMave locally.
interval_scores = sapply(seq(0,1,0.1), llrObj$llr)
cat(paste(strsplit(outprefix, "_")[[1]][1], ","), file="LLR_intervals.csv", append=TRUE)
cat(interval_scores, file="LLR_intervals.csv", sep=",", append=TRUE)
cat("\n", file="LLR_intervals.csv", append=TRUE)
You can then write a quick script such as the one below to run this for all genes with reference sets:
REFDIR=path_to_directory_with_reference_sets
REFS=($(ls $REFDIR | grep reference.csv))
cd $REFDIR
for REF in ${REFS[@]}
do
GENE=$(echo $REF | sed 's/_reference.csv//')
MAP=$(echo $REF | sed 's/_reference.csv/_map.csv/')
tsm calcLLR $MAP $REF --kernel gaussian #can change parameters here as needed
done
Additional information can be added to LLR_intervals.csv, such as the mode of inheritance of diseases associated with genes in the file or functional annotations for each gene. Two scripts have been written to add such annotations: add_functional_annotations.py and add_inheritance_annotations.py.
Example add_inheritance_annotations.py usage:
python add_inheritance_annotations.py -i path_to_input_csv -g path_to_gencc_file -o output_filename
- path_to_input_csv refers to the path of LLR_intervals.csv
- path_to_gencc_file refers to a GenCC TSV annotation file (found here https://search.thegencc.org/download)
NOTE: Modify the GenCC file to only include the 'gene_symbol', 'disease_title', 'classification_title' and 'moi_title' columns
- output_filename refers to the desired output filename
Example add_inheritance_annotations.py usage:
python add_functional_annotations.py -i path_to_input_csv -g path_to_david_annotation_cluster_file -o output_filename
- path_to_input_csv refers to the path of LLR_intervals.csv
- path_to_david_annotation_cluster_file refers to the DAVID functional annotation output file (can be made here: https://david.ncifcrf.gov/tools.jsp)
- output_filename refers to the desired output filename
The script clusterLLR.R takes LLR_intervals.csv and will attempt to cluster the genes based on their LLR curves and output PCA and UMAP plots
Additionally, if provided with an LLR_intervals.csv file with additional annotations, it attempts to build a classification model using naive bayes to see if VARITY performs better for certain subgroups of genes (ex. inheritance, different functional sub classes of genes). You can change the input LLR_intervals.csv file directly within the script.