The purpose of this project is to understand the effect of ancestry and admixture on autozygosity and disease burden. This is done by analyzing the distribution of disease causing variants in admixed Latin American Populations.
├── README.md
├── PCA/
├── plinkCommandsPCA.sh
├── prePCAprocess.sh
├── admixtureEntropy/
├── calculateAdmixtureEntropy.pl
├── calculateAdmixtureEntropy_unk.pl
├── autozygosity/
├── localAncestryMapping
├── 1kgpToAndyMapping.pl
├── calculateAncestryROH.pl
├── mapROHtoLocalAncestry.pl
├── NROH_SROH
├── roh.pl
├── regressionModels
├── exponentialRegression_makeplot_singlemodel.R
├── linearRegression_makeplot_singlemodel.R
├── linearRegressionModel_allPoints.R
├── linearRegressionModel.R
├── modelSelection.R
├── nonLinearRegressionModel_allPoints.R
├── dataHandling/
├── data_cleaning/
├── cleanHGMD.pl
├── data_merging/
├── addPop.pl
├── addPopID.pl
├── addPop_noHead.pl
├── joinFilesOnID.pl
├── mapAndyIds.pl
├── merge2VCFs.pl
├── mergeHGMD.pl
├── sampleID_popID.pl
├── data_subsets/
├── getHgmdSubset.pl
├── getSamplesOfInterestVCF.pl
├── hgmdSubset.pl
├── makeVCFsubset_intersect.pl
├── makeVCFsubsets.pl
├── vcfSubset_snpFreq_perIndividual.pl
├── ontology/
├── ontology_chrPos.pl
├── ontology_countSnps.pl
├── ontology_normalize.pl
├── ontology_snpFreq_nta.pl
├── ontology_snpFreq.pl
├── ontology.txt
├── ontology_zscores.pl
├── resultAnalysis/
├── ancestryComparison.pl
├── aswPelComparison.pl
├── plots/
├── PCA.html
├── PCA_NTA.html
├── generatePlots_boxplot.R
├── makePCA.R
├── ontologyPlots.R
├── pca_dataPruning.sh
├── snpFrequency/
├── individualsWithDiseaseCausingVariants.pl
├── snpFrequency_perIndividual/
├── chrPos_perIndividual_homozygous.pl
├── snpFreqPerIndividual_forPlot.pl
├── snpFreq_perIndividual.pl
├── snpFrequency_perPopulation/
├── getIntersect.pl
├── snpFrequency_joinAll.pl
├── snpFrequency_singlePop.pl
All scripts are thoroughly described in the Script Description section of the README. This section explains the general project protocol.
Stenson, Peter D., et al. "Human gene mutation database (HGMD®): 2003 update." Human mutation 21.6 (2003): 577-581.
The Human Gene Mutation Database used is located in the genomeTrax mysql database '/data/db/mysql/genomeTrax', and is entered as 'hgmd_2016'.
- cleanHGMD.pl
- awk 'BEGIN {FS = "\t"}{IGNORECASE = 1}; {if($9 ~ /autosomal recessive/ && $7 ~ /DM/ && $8 ~ /[0-9]/ && $8 >= 0.15){print $0}}' hgmd.txt > hgmd_subset.txt
- Download data of interest from the HGMD. Only entries from HG19 and considered of High Confidence were used.
- Remove entries that contain indels and entries for the X and Y chromosomes, and removes all duplicated lines.
- Make subset of interest: polyPhen2 score > 0.15, variant Type = DM, and mode of inheritance must be autosomal recessive
/data/home/bsaldana3/projects/diseaseBurden/base_data/hgmd/hgmd_complete_clean.txt
The subset with polyphen2 scores, variant type, and autsomal recessive was too stringent, at the moment we are only using variant Type = DM and and mode of inheritance must be autosomal recessive. More details about the SQL query and output file are given in the Script Description section of README
Siva, Nayanah. "1000 Genomes project." (2008): 256. Mallick, Swapan, et al. "The Simons genome diversity project: 300 genomes from 142 diverse populations." Nature 538.7624 (2016): 201.
- The VCF files from the 1KGP used for this project were located in the '/data/home/data/100gp/vcfgz/' directory in the Jordan Lab 'The Beast' server, and VCF files used were the ones with this title: 'ALL.chr*.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'.
- The VCF file from the SGDP used for this project was located in the '/data/shashwat/SGDP_data/' directory in the Jordan Lab 'The Beast' server, and VCF files used were the ones with this title: 'mergedSimple_sgdp.vcf.gz'.
The initial VCF files contained all 1KGP individuals and were divided into different chromosomes. So I made a single VCF with only the samples of interest.
- getSamplesOfInterestVCF.pl # To make subsets from Chromosome separated multi VCFs
- cat # concatenate results from getSamplesOfInterestVCF.pl
We needed multi-VCF files for the following 1KGP populations: ACB, ASW, CLM, MXL, PEL, PUR, GRB, IBS, ESN, YRI Ideally there should only be one VCF file containing all samples of interest. Currenly some scripts work with multiple files, others with a single file.
This part of the project was done by Dr. Andrew Conley.
- calculateAdmixtureEntropy.pl
- calculateAdmixtureEntropy_unk.pl
The script calculates de admixture entropy of all individuals using the following formula:
- plink --vcf path/to/vcfFile.vcf --homozyg
- akw 'BEGIN{OFS="\t"}{print $2, $7, $8}' plink.hom
- roh.pl
Plink is used to calculate the Runs of Homozygosity from a vcf file. The plink output file sometimes is missing fields, so the awk command simply gets the fields needed for NROH and SROH analysis, so that the roh.pl script does not have to account for missing fields. roh.pl calculates the NROH and SROH per individual.
- all scripts in the autozygosity/regressionModels directory
The purpose of this analysis was to attempt to find a statistically significant relationship between Ancestry and Admixture and Autozygosity. Both linear and non-linear models were built.
- 1kgpToAndyMapping.pl (map 1kgp IDs to Local Ancestry IDs)
- calculateAncestryROH.pl (calculate SROH of populations, per local ancestry)
- mapROHtoLocalAncestry.pl
First you must map the 1KGP IDs to the IDs in Andy's Local Ancestry results. Then you can run the mapROHtoLOcalAncestry,pl script to map each ROH to a local ancestry. If no single ancestry maps to the ROH, then UNK is assigned to that ROH. After the ROHs have been mapped, you can use the calculateAncestryROH.pl script to calculate the total SROH of each population by the local ancestry it maps to.
- HGMD Variants Per Individual
- Individuals With Disease Causing Variants
- Disease Causing Variants per Individual
- SNP Frequency per Population
- Disease Ontology Curation
- Disease Ontology VCF subsets
- Disease Ontology Frequency Calculation
- Disease Ontology Trait/Disease to Geomic Position
- Disease Ontology Result Normalization
- Disease Ontology Statistical Analysis
- Disease Ontology Finding Categories of Interest
- Disease Ontology Plots
- snpFreq_perIndividual.pl
- snpFreqPerIndividual_forPlot.pl
- generatePlots_boxplots.R
The script counts the number of variants that are present in the HGMD from each individual. The script compares the position of each variant described in the multi-VCF files with the data in the HGMD database. If the disease-causing allele of the variant is present in the individual's genome, it is counted in the following way:
- Homozygous absent: 0|0 --> +0
- Heterozygous present: 0|1 or 1|0 --> +0
- Homozygous present: 1|1 --> +1
The script outputs one individual per row, with the total sum of disease causing variants present in the individual's genome. If there is incongruence with the number of typed-positions per individual, the total sum must be divided by the total-typed positions. We did not have to do that since our data was consistent and used vcf subsets.
Older versions of the script calculated Homozygous present as (+2), and Heterozygous present as (+1). Look at the last for loop to check if the variable $homozygous is being divided by 2 (as is should be). snpFreqPerIndividual_forPlot.pl simply organizes the data in a way that makes it easier to make the type of plot that we wanted.
- individualsWithDiseaseCausingVariants.pl
The script outputs all variants present in both the VCF files and the HGMD, one variant per row, and next to the variant information is a list of the individuals that have the disease-causing allele of the variant (heterozygous or homozygous).
- chrPos_perIndividual_homozygous.pl
This script outputs a list of variants chr:pos that each individual has present in their genome in homozygous form. Used for making subsets in the past, this script is no longer used, but I have kept it incase we need this information.
- snpFrequency_singlePop.pl
- getIntersect.pl
- snpFrequency_joinAll.pl
The snpFrequency_singlePop.pl scripts counts the number of individuals per populaiton that have the disease-causing allele of the variant present in their genome in homozygous form. The sum is calculated in the following way:
- Homozygous present: 1|1 --> +1 The scripts outputs all variants present in both the VCF files and the HGMD, one variant per row, and next to the variant information is sum of the frequency of each variant per population.
This used to be a single script. But since we are using data from the 1KGP and SGDP we wanted to avoid accidentally counting extra variants for the different populations.
An ontology was created by classifying the traits/diseases in the HGMD into hierarchical categories, by hand. No script, sorry.
- vcfSubset_snpFreq_perIndividual.pl
This script makes subsets of the VCF with diseases that are only present in the 'Disease Risk' category of the Disease Ontology or 'Disease Protection' category of the Disease Ontology. This is helpful for faster frequency calculations or if we need to re-do the analysis.
- ontology_snpFreq.pl
The script first appends the frequency of each trait/disease per population to the ontology, and then sums up the totals per ontology category. The frequency is taken from the output of the snpFrequency.pl script.
- ontology_chrPos.pl
The script first appends the genomic position of variant associated with every HGMD trait/disease, in the chromosome:position format, and then makes a list of all variants per ontology category.
- ontology_normalize.pl
The script normalizes the results of ontology_snpFreq.pl by the population size and by the number of SNPs per ontology category.
- ontology_zscores.pl
The script calculates the z-scores of every ontology category. The population standard deviation is used for this calculation.
- ancestryComparison.pl
- aswPelComparison.pl
The scripts compares the results of the ontology based on ancestral populations and admixed populations. of the ancestral populations, and if there is a significant difference, it will output the categories of interest. The "significance" parameter is inuted by the user (percent difference). We decided to compare ancestral populations to avoid imposing our hypothesis onto the results.
- ontologyPlots.R
The script generates column (bar) plots for the "categories of interest" in the ontology.
SELECT
ngs_feature_2016.chromosome,
ngs_feature_2016.feature_start,
ngs_feature_2016.rsid,
hgmd_2016.ref,
hgmd_2016.alt,
hgmd_2016.disease,
hgmd_2016.variantType,
dbnsfp.dbNSFP_Polyphen2_score,
orpha.orpha_inheritance
FROM ngs_feature_2016
INNER JOIN hgmd_2016 ON ngs_feature_2016.ngs_feature_no=hgmd_2016.feature_no
INNER JOIN dbnsfp ON ngs_feature_2016.ngs_feature_no=dbnsfp.feature_no
INNER JOIN orpha ON hgmd_2016.ensembl=orpha.ensembl
WHERE ngs_feature_2016.genome='hg19'
AND hgmd_2016.confidence='high'
INTO OUTFILE 'path/to/hgmd.txt';
Chromosome | Position | RSID | Ref | Alt | Disease/Trait | variantType | PolyPhen2_score | mode_of_inheritance |
---|
You can filter out what you do not want (variantType=DM, polyPhen_score>0.15, inheritance=autosomal recessive) with an awk command.
./cleanHGMD.pl /path/to/hgmd.txt /path/to/outfile.txt
HGMD input file must be in the format depicted in the section above
Output will be in the same format but with less lines
zcat file.vcf.gz | ./getSamplesOfInterestVCF.pl listOfSampleIDs.txt vcf_outfile.txt
The script will read the output of zcat from the command line, so no need to sirectly input a vcf file. You will need to make a list of the sample IDs that correspond to the headder of the VCF file.
A VCF file with only the columns of the samples of interest.
./snpFreq_perIndividual.pl /path/to/VCFfile.vcf /path/to/hgmd.txt /path/to/output.txt
VCF file must be a tab-separated multi-VCF file, one variant per line, array of individuals with a '|' separating alleles HGMD input file must be in the format depicted in the section above
SAMPLE_ID | HETEROZYGOUS | HOMOZYGOUS | TOTAL_HGMD_VARIANTS | TOTAL_TYPED_POSITIONS |
---|
-
SAMPLE_ID: The ID of the individual in the VCF file
-
HETEROZYGOUS: total disease-causing HGMD variants present in the individual in heterozygous form calculation: R|A = +1 , A|R = +1
R = Reference Allele (usually depicted as 0)
A = Alternate Allele (usually depicted as 1) -
HOMOZYGOUS: total disease-causing HGMD variants present in the individual in homozygous form calculation: A|A = +1
R = Reference Allele (usually depicted as 0)
A = Alternate Allele (usually depicted as 1) -
TOTAL_HGMD_VARIANTS: total number of disease-causing variants present, no matter how they are present calculation: R|A = +1 , A|R = +1 , A|A = +2
R = Reference Allele (usually depicted as 0)
A = Alternate Allele (usually depicted as 1) -
TOTAL_TYPED_POSITIONS: All positions in the VCF file that were typed, per individual calculation: X|X = +2 , X|. = +1 , .|X = +1
X = Anything, except a blank
./individualsWithDiseaseCausingVariants.pl path/to/file.vcf path/to/outfile.txt path/to/hgmd.txt
- VCF file (one population at a time)
- Outfile name
- HGMD file
CHROMOSOME | POSITION | RSID | REF | ALT | LIST OF SAMPLE IDS |
---|
- LIST OF SAMPLE IDS: the list of individuals that have the variant present
./chrPos_perIndividual_homozygous.pl path/to/vcfFile.vcf path/to/hgmd.txt path/to/outfile.txt
- VCF file
- HGMD text file
SAMPLE_ID | LIST_OF_CHR:POS |
---|
./snpFrequency_singlePop.pl path/to/file.vcf path/to/outfile.txt path/to/hgmd.txt
- VCF file (one population at a time)
- Outfile name
- HGMD file
CHROMOSOME | POSITION | RSID | REF | ALT | DISEASE | FREQUENCY |
---|
Script to get the intersect of two files in order to be able to join them with the next script.
./snpFrequency_joinAll.pl path/to/fileList.txt path/to/VCFintersect.txt path/to/outfile.txt
- File list: a list of the files containing the results of snpFrequenc_singlePop (including paths)
- Outfile name
CHROMOSOME | POSITION | RSID | REF | ALT | DISEASE | POPULATION 1 | POPULATION 2 | ... |
---|
- POPULATION X: one column per population, the column contains the frequency of each variant
./ontology_snpFreq.pl path/to/snpFrequency_joinAll_results.txt path/to/ontology.txt path/to/outfile.txt
- Results of the snpFrequency_joinAll.txt script
- Ontology text file
- Outfile name
KEY | DISEASE | POPULATION 1 | POPULAITON 2 | ... |
---|
- KEY: The ontology key that corresponds to the disease
- DISEASE: Disease/category name
- POPULATION X: Frequency of each population of the ontology disease/category
./ontology_chrPos.pl path/to/snpFrequency_joinAll_results.txt path/to/ontology.txt path/to/outfile.txt
- Results of the snpFrequenct_joinAll.txt script
- Ontology text file
- Outfile name
KEY | DISEASE | CHR:POS |
---|
- KEY: The ontology key that corresponds to the disease
- DISEASE: Disease/category name
- CHR:POS: Lis of chr:pos that correspond to each disease or disease category in the ontology
Before running this script you must make two subsets of the result file of snpFrequency_joinAll.pl script. One subset of the variants that are in the Disease Risk category of the ontology and another for the variants in the Disease Protection category. You can use the result file of the ontology_chrPos.pl script.
./vcfSubset_snpFreq_perIndividual.pl path/to/vcfFile.vcf path/to/snpFrequency_joinAll/subset path/to/outfile.txt
- VCF file
- Subset of snpFrequency_joinAll result
VCF file containing only entries in the disease ontology Disease categories.
./diseaseBurden_shared/ancestryComparison.pl path/to/normalized/ontology_snpFreq.txt difference path/to/outFile.txt
- Result of the ontology_snpFreq.pl (normalized by population size)
- desired difference
Ontology lines where the ancestry difference is larger than the inputed difference
The ontology categories of interest are hard coded in the script.
- make sure all scripts used are described on README
- Create script to find categories of interest at least two nodes above the terminal nodes