Skip to content

Latest commit

 

History

History
1143 lines (798 loc) · 52.1 KB

README.md

File metadata and controls

1143 lines (798 loc) · 52.1 KB

Outlier Analysis Workshop

There are lots of interesting patterns that you can extract from genetic variant data. This can include patterns of linkage, balancing selection, or even inbreeding signals. One of the most common ones is to try find sites on the genome that are under selection. The following vignette will take you through the basics of genetic selection analysis.

The project has been funded by the AES ERC Networking Grant Scheme and GSA.

This workshop is a mixture of Bash and R. A version of this workshop that has been adapted to run directly on New Zealand eScience Infrastructure (NeSI) is available here. The code is currently set up for running on Pawsey, and where the code indicates R please run the lines in R , while q() indicates that you should leave the R environment.

Old schedule (used in previous workshops)

Day 1
9:00am Introduction Slides
9:30am Download data
10:00am Morning Tea
10:15am PCAdapt
12:00pm Lunch
1:00pm VCFtools
2:00 Afternoon tea
2:15pm VCFtool continued and setup for Bayescan and Baypass

Day 2
9:00am Bayescan
10:00am Morning Tea
10:15am Bayescan continued & Bayepass
12:00pm Lunch
1:00pm Baypass
2:00pm Afternoon tea
2:15pm Compiling results, group discussiona and metanalysis contribution

List of bash commands used in this workshop

mkdir — Create a directory
cd — Change directory
cp — copy files and directories
ls — List directory contents
head — Read the start of a file
less — view the contents of a text file
grep — search
| — Pipe
cut — cut line parts from specified files
awk — pattern scanning and processing
wc — word count
tr — deletes or substitutes characters
sed — modifies lines
echo — Prints text to the terminal window
cat — Read a file, create a file, and concatenate files
comm — returns three columns depending on the uniqueness of the data

A brief introduction to Genetic Outlier and Association Analysis

When we look through a genome to try to find loci that are under selection, common apporaches are outlier analysis and association analyses. For outlier analysis we are looking for sites in the genome that demonstrate significantly higher or lower within- or among-population genetic differentiation than expected under neutrality. Outlier analysis requires just knowledge of the genetics of your samples (plus occasionally sample metadata, for example, population groupings) and tries to find loci that behave very differently from the underlying patterns across the genome (with the assumption being that the rest of the genome represents patterns of neutral genetic diveristy)A. Meanwhile association analysis requires some sort of covariate data, and tests whether there are any genetic variants statistically associated with this extra data (you may have heard the term GWAS). This covariate data can come in the form of phenotype data (e.g., morphology, disease status, physiology measures) or could be spatial (e.g., environmental, climate). Association tests look for sites in the genome where the presence or absence of a variant is highly correlated with the values in the co-variate data, usually through some regression-type analysis.

Throughout this vignette, I will collectively refer to outlier and association analysis as selection analysis. There are a lot of programs that exist currently. You can find a very long (though not exhaustive) list here.

This vignette will start by covering some very simple outlier analyses:

  • Within group outliers: PCAdapt can detect genetic marker outliers without having populationB designations using a Principle Component Analysis (PCA) approach.
  • Between group outliers (pairwise): FST outlier analysis is an approach that uses pairwise comparisons between two populations and the fixation index metric to assess each genetic marker. This approach does not provide any correction for population strucutre.

Next, we will conduct some more advanced outlier analyses:

  • Between group outliers (3+ groups): Bayescan looks for differences in allele frequencies between populations to search for outliers using bayesian statistics. This approach does not provide any correction for population strucutre.
  • Between group outliers (3+ groups) and association analysis: Baypass elaborates on the bayenv model (another popular association analysis program) and allows you to conduct many different types of genetic outlier and genetic association tests, while attempting to account for population strucutre.

We will cover the pre-processing of program-specific input files, how to run the programs, how to visualise the output, and in some cases we'll need to take extra steps to map the genetic markers of interest back to the SNP data.

🔰 Reduced representation versus whole genome sequencing (WGS)

Outlier analysis is often done on reduced representation data. It is important to remember how your genome coverage (the number of genome variant sites / the genome length C) will affect your results and interpretation. Often with WGS data, you will see well -resolved 'peaks' with a fairly smooth curve of points leading up to it on either side. From this, we often infer that the highest point is the genetic variant of interest and the sites on either side of the peak exhibit signals of selection because they reside close to, and thus are linked to, the true variant of interest. However, consider that even in WGS data, unless we have every single genetic variant represented (which may not be the case, depending on our variant calling and filtering parameters), it is possible that the genetic variant of interest that we have identified is not the main one, but is simply a neighboring SNP close to one that is not represented in the data. This problem becomes even more relevant with reduced representation sequencing (RRS), for which the genome coverage may be extremely patchy C. Thus with all outlier analysis, but especially so for those using RRS data, remember that your flagged outliers are not exhaustive and may themselves only be liked to the variant that is truly under selection.

Manhattan plot of the association between Fst and loci along a genome. Manhattan plot example from https://doi.org/10.1111/mec.17195

Define your working directory for this project, and the VCF file location:

mkdir /home/ubuntu/outlier_analysis
DIR=/home/ubuntu/outlier_analysis
cd $DIR

Our data tree will look like this:

outlier_analysis/
├── analysis
│ ├── bayescan
│ ├── baypass
│ ├── pcadapt
│ ├── summary
│ └── vcftools_fst
├── data
├── programs
└── workshop_material

So let's set up our directories to match this

mkdir -p {analysis/{bayescan,baypass,pcadapt,summary,vcftools_fst},data,programs,workshop_material}

Project data

The data provided in this workshop contains 5007 SNPs loci for 39 individuals (13 individuals each from 3 different locations). This data has some missingness (i.e., missing SNP calls).

There is also a metadata file containing the individual's unique IDs, assigned populations, and a wingspan measurement for each individual.

Let's grab this data from the project's git repository, place the data files into our data directory, and define the environmental variables VCF and METADATA with the locations of the genetic variant and metadata files respectively.

# Enter data directory
cd $DIR/workshop_material

# Download required files to data directory
git clone https://github.com/katarinastuart/Ev1_SelectionMetaAnalysis.git

#place the VCF and METADATA files into our data folder
cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/* $DIR/data

# Set environment variables
VCF=$DIR/data/starling_3populations.recode.vcf
METADATA=$DIR/data/starling_3populations_metadata.txt

# Check that this has worked
head $VCF
head $METADATA

Working with your own data

Alternatively, you can also use your own data for this workshop. If so, it is a good idea to thin your SNP dataset down to roughly 5,000 SNPs to ensure compute times are not too long. If you have more than 50 individuals, you may also want to reduce this. If you would like to do this, place your genetic variant and metadata file in the data directory and define VCF and METADATA based on their names.

🔰 Filtering your own data

Remember to apply the usual population genomic filters to your data before you proceed with the data thinning/subsetting. Exactly what type of data filtering is needed will depend on your data, however some standard filtering steps are to 1) filter out genotype calls based on minimum quality, and a minimum and maximum depth, 2) filter out individuals (columns) and SNPs (rows) that have high (50% or more) missingness, 3) minor allele frequency or minor allele count. For many of these types of analysis, I would not thin the SNP data for linkage disequilibrium (LD) or physical distance, as clusters of linked SNPs are useful and will help you spot interesting outlier regions in the genome. One exception to this though is PCAdapt, for which you may want to consider thinning based on LD.

If you are working on your own data: checking metadata file order

Check that your VCF file and metadata file have individuals in the same order - it will make your future work a lot easier.

module load quay.io/biocontainers/bcftools/1.17--h3cc50cf_1/module

bcftools query -l $VCF > sample_ordering.txt

#also check that you have SNP IDs. If not, we can add them
bcftools annotate --set-id +'%CHROM\_%POS' your_filename_prefix.recode.vcf -o your_filename_prefix_ID.recode.vcf

R

setwd("/home/ubuntu/outlier_analysis/data")

sample_ordering <- read.table("sample_ordering.txt", sep="\t", header=FALSE)
colnames(sample_ordering) <- c("sampleID")

metadata <- read.table("your_filename_prefix_metadata.txt", sep="\t", header=TRUE)

#find the columns with the individual IDs in them, and merge 
reordered <- merge(sample_ordering, metadata, by.x = "sampleID", by.y = "column1", sort=FALSE) 

write.table(reordered, file = "your_filename_prefix_metadata_reordered.txt", sep="\t", quote = FALSE, row.names=FALSE, col.names=FALSE)

If you are working on your own data: setting environmental variables

You will need to set you own data files as the VCF and METADATA environmental variables. If you sent me your two data files, you can find them at the below address - just make sure to update the name of the files so that it matches the one you sent me ahead of time!

# Copy your files to the the data folder
cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/learner_data/your_filename_prefix* $DIR/data

# Set environment variables
VCF=$DIR/data/your_filename_prefix.recode.vcf
METADATA=$DIR/data/your_filename_prefix_metadata.txt

# Check that this has worked
head $VCF
head $METADATA

Across this workshop, we will need the genetic data to be in several different formats. Let's prepare that now. First we convert the VCF to PLINK, and then to BED.

# Enter data directory
cd $DIR/data

# Load modules
module load quay.io/biocontainers/vcftools/0.1.15--he941832_2/module
module load quay.io/biocontainers/plink/1.90b6.21--hec16e2b_2/module

# Convert VCF to PLINK
vcftools --vcf $VCF --plink --out starling_3populations.plink

# Convert PLINK to BED
plink --file starling_3populations.plink --make-bed --noweb --out starling_3populations

PCAdapt

PCAdapt uses an ordination approach to find sites in a data set that are outliers with respect to background population structure. The PCAdapt manual is available here, and a lot of the code below has been taken from this manual. If you want to explore this program more I highly recommend going through the full tutorial.

Citation: Privé, F., Luu, K., Vilhjálmsson, B. J., & Blum, M. G. B. (2020). Performing Highly Efficient Genome Scans for Local Adaptation with R Package pcadapt Version 4. Mol Biol Evol, 37(7), 2153-2154. https://doi.org/10.1093/molbev/msaa053

First, let's install PCAdapt and set your working directory.

R

library("pcadapt")

setwd("/home/ubuntu/outlier_analysis/analysis/pcadapt/")

Now let's load in the data - PCAdapt uses bed file types.

starling_bed <- "/home/ubuntu/outlier_analysis/data/starling_3populations.bed"
starlings_pcadapt <- read.pcadapt(starling_bed, type = "bed")

Produce K-plot.

starlings_pcadapt_kplot <- pcadapt(input = starlings_pcadapt, K = 20)
pdf("pcadapt_starlings_kplot.pdf")
plot(starlings_pcadapt_kplot, option = "screeplot")
dev.off()

k plot

A K value of 3 is most appropriate, as this is the value of K after which the curve starts to flatten out more. This means we have identified the PC's that capture population structure.

starlings_pcadapt_pca <- pcadapt(starlings_pcadapt, K = 3)
summary(starlings_pcadapt_pca)

✔️ output
        Length Class Mode
scores   117 -none- numeric
singular.values   3 -none- numeric
loadings   15021 -none- numeric
zscores   15021 -none- numeric
af   5007 -none- numeric
maf   5007 -none- numeric
chi2.stat   5007 -none- numeric
stat   5007 -none- numeric
gif   1 -none- numeric
pvalues   5007 -none- numeric
pass   4610 -none- numeric

Investigate axis projections, to check that below the elbow or bend in the K plot we have no more population structure.

poplist.names <- read.delim("/home/ubuntu/outlier_analysis/data/starling_3populations_metadata.txt", header=FALSE)[,2]
print(poplist.names)

pdf("pcadapt_starlings_projection1v2.pdf")
plot(starlings_pcadapt_kplot, option = "scores", i = 1, j = 2, pop = poplist.names)
dev.off()

pdf("pcadapt_starlings_projection5v7.pdf")
plot(starlings_pcadapt_kplot, option = "scores", i = 5, j = 7, pop = poplist.names)
dev.off()

Ignore the warning:

❗ Use of df$Pop is discouraged. Use Pop instead.

projection axis1 axis2 projection axis6 axis7

🔰 No population groupings? No problem! Colouring them by latitude or longitude may also help check that you have captured the population structure.

Investigate Manhattan and Q-Qplot.

🔰 Manhattan plots are a way to visualize the GWAS (genome-wide association study) p-values (or other statistical values) at each SNP locus along the genome

🔰 Q-Qplots plots are a quick way to check if your residuals are normally distributed, and to see what the tail of outliers look like. Check out more information here.

pdf("pcadapt_starlings_manhattan.pdf")
plot(starlings_pcadapt_pca, option = "manhattan")
dev.off()

pdf("pcadapt_starlings_qqplot.pdf")
plot(starlings_pcadapt_pca, option = "qqplot")
dev.off()

Manhattan Q-Qplot

Plotting and adjusting the p-values

starling_pcadapt_pvalues <- as.data.frame(starlings_pcadapt_pca$pvalues)

library("ggplot2")

pdf("pcadapt_starlings_pvalues.pdf")
hist(starlings_pcadapt_pca$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
dev.off()

pvals

starlings_pcadapt_padj <- p.adjust(starlings_pcadapt_pca$pvalues,method="bonferroni")
alpha <- 0.1
outliers <- which(starlings_pcadapt_padj < alpha)
length(outliers)

write.table(outliers, file="starlings_pcadapt_outliers.txt") 

✔️ Output
  [1] 3

After this, we will jump out of R and back into the command line by using the following command:

q()

Mapping Outliers: PCAdapt

Find the SNP ID of the outlier variants.

cd $DIR/analysis

The first thing we will do is create a list of SNPs in VCF, and then assign line numbers that can be used to find matching line numbers in outliers (SNP IDs are lost in PCadapt & Bayescan, line numbers are used as signifiers).

We create this in the analysis directory because we will use it for more than just mapping the outlier SNPs for PCAdapt, we will also need it on day 2 for BayeScan and BayPass.

grep -v "^#" $VCF | cut -f1-3 | awk '{print $0"\t"NR}' > starling_3populations_SNPs.txt

Now let us jump back into the pcadapt directory to continue working with our outliers. We select column 2 of the outlier file using the AWK command, which contains the number of outliers.

cd $DIR/analysis/pcadapt
awk '{print $2}' starlings_pcadapt_outliers.txt > starlings_pcadapt_outliers_numbers.txt

Make a list of outlier SNPS ID's.

awk 'FNR==NR{a[$1];next} (($4) in a)' starlings_pcadapt_outliers_numbers.txt ../starling_3populations_SNPs.txt   | cut -f3 > pcadapt_outlierSNPIDs.txt
head pcadapt_outlierSNPIDs.txt

✔️ Output

230955:72:-
238881:46:+
286527:46:-

VCFtools windowed Fst

The VCFTools manual is available here.

Citation: Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., & Genomes Project Analysis, G. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158. https://doi.org/10.1093/bioinformatics/btr330

Fst outliers help us to identify SNPs that behave abnormally in pairwise comparisons between populations.

We first need to use our metadata file (currently defined by the environmental variable METADATA) to make three individual files containing only the list of individuals in each population. We can do this by subsetting our sample metadata file, using the grep command to select lines that match each population's name, and then using awk to keep only the first column of metadata, i.e., the sample names.

# Load modules
module load quay.io/biocontainers/vcftools/0.1.15--he941832_2/module
# Enter data directory
cd $DIR/data

# Subset metadata
grep "Lemon" $METADATA | awk '{print $1}' > individuals_Lemon.txt
grep "War" $METADATA | awk '{print $1}' > individuals_War.txt
grep "Nowra" $METADATA | awk '{print $1}' > individuals_Nowra.txt

Now we can pick two populations to compare. Let's work with Lemon (short for Lemon Tree, QLD, AU) and War (short for Warnambool, VIC, AU) and perform an SNP-based Fst comparison.

cd $DIR/analysis/vcftools_fst

vcftools --vcf $VCF --weir-fst-pop $DIR/data/individuals_Lemon.txt --weir-fst-pop $DIR/data/individuals_War.txt --out lemon_war

head -n 5 lemon_war.weir.fst

✔️ Output
  CHROM POS WEIR_AND_COCKERHAM_FST
starling4 107735 0.160891
starling4 137462 -0.0805785
starling4 151332 0.0524489
starling4 227887 0.0569961

The important column is column 3: WEIR_AND_COCKERHAM_FST, from Weir and Cockerham’s 1984 publication.

wc -l lemon_war.weir.fst 

✔️ Output
  5008

Notice how there are as many lines as there are SNPs in the data set, plus one for a header. It is always a good idea to check your output to ensure everything looks as expected!

Next, instead of calculating pairwise population differentiation on an SNP-by-SNP basis, we will use a sliding window approach. The --fst-window-size 50000 refers to the window size of the genome (in base pairs) in which we are calculating one value: all SNPs within this window are used to calculate Fst. The --fst-window-step option indicates how many base pairs the window is moving down the genome before calculating Fst for the second window, then the third, and so on. In genetal 50000 bp is a good size of sliding window, and the absolutely maximum size of sliding windows in my option should be 1 Mb, but even this window size is very big.

Warning about sliding windows
  These sliding windows only work on ordered SNPs on the same chromosome/scaffold/contig. If your data is not set up like this (i.e., all your SNPs are on a single pseudo-chromosome), then this method is not appropriate for your data, as it will make an assumption about where the SNPs are located with respect to one another.

vcftools --vcf $VCF --fst-window-size 50000 --fst-window-step 10000 --weir-fst-pop $DIR/data/individuals_Lemon.txt --weir-fst-pop $DIR/data/individuals_War.txt --out lemon_war

head -n 5 lemon_war.windowed.weir.fst

✔️ Output
  CHROM BIN_START BIN_END N_VARIANTS WEIGHTED_FST MEAN_FST
starling4 60001 110000 1 0.160891 0.160891
starling4 70001 120000 1 0.160891 0.160891
starling4 80001 130000 1 0.160891 0.160891
starling4 90001 140000 2 -0.00374291 0.0401563

Notice the output is different.

wc -l lemon_war.windowed.weir.fst 

✔️ Output
  10838

Notice the line count is different from the SNP-based Fst comparison; there are more lines in the sliding window-based Fst comparison. This is because there are more sliding windows across the chromosome in this data set than there are SNPs. Consider which of these steps is better for your data: the sliding window approach might not be the best in low-density SNP datasets like this one. But we will proceed with the sliding window analysis so that we can demonstrate the code needed to identify the SNPs within outlier windows.

Now let us plot the Fst across the chromosome. To do this, we will add line numbers on our Fst file that will be used to order the Fst measurements across the x-axis of our Manhattan plot.

🔰 X-axis values in the following plot are done using each outlier window's line number, as they are in order along the genome. Outlier windows are equally spaced, so line numbers are sufficient to capture the patterns along the genome. Consider that if you are plotting Fst values for SNPs (rather than windows), they may not be equally spaced along the genome, so SNP position may need to be used to make your Manhattan plots.

awk '{print $0"\t"NR}' ./lemon_war.windowed.weir.fst  > lemon_war.windowed.weir.fst.edit

R

library("ggplot2")

setwd("/home/ubuntu/outlier_analysis/analysis/vcftools_fst")

windowed_fst <- read.table("lemon_war.windowed.weir.fst.edit", sep="\t", header=TRUE)
str(windowed_fst)

quantile(windowed_fst$WEIGHTED_FST, probs = c(.95, .99, .999))

✔️ Output
  95% 99% 99.9%
0.1948850 0.3501600 0.5741306

Choose the quantile threshold above which SNPs will be classified as outliers. In the code block below, we chose 99% (i.e., the top 1% of SNP windows).

pdf("fst_starlings_windowed.pdf", width=10, height=5)
ggplot(windowed_fst, aes(x=X1, y=WEIGHTED_FST)) + 
  geom_point() + 
  geom_hline(yintercept=0.35, linetype="dashed", color = "red")+
  labs(x = "Window Number") +
  theme_classic()

dev.off()

q()

Windowed Fst

Finally, we will generate a list of outlier SNP IDs. We do this by selecting all SNPs located in the outlier windows.

cd $DIR/analysis/vcftools_fst
cat lemon_war.windowed.weir.fst.edit | awk '$5>0.3501600' > lemon_war.windowed.outliers
wc -l lemon_war.windowed.outliers

✔️ Output
  107 lemon_war_fst.windowed.outliers

#Identify the regions of the genome that were found to be outliers and subset the VCF file
cut -f1-3 lemon_war.windowed.outliers > lemon_war.windowed.outliers.bed 
vcftools --vcf $VCF --bed lemon_war.windowed.outliers.bed --out fst_outliers --recode

#Create a list of outlier SNP IDs
grep -v "#" fst_outliers.recode.vcf | awk '{print $3}' > vcftoolsfst_outlierSNPIDs.txt
wc -l vcftoolsfst_outlierSNPIDs.txt

✔️ Output
  61

We have a total of 61 outlier SNPs locate across 107 outlier SNP windows.

Bayescan

The BayeScan manual is available here.

Citation: Foll, M., & Gaggiotti, O. (2008). A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: A Bayesian perspective. Genetics, 180(2), 977-993. https://doi.org/10.1534/genetics.108.092221

BayeScan identifies outlier SNPs based on allele frequencies.

Prepare your terminal session:Since you are likely using a new terminal session today, you will need to set the environment variables again to use them. Copy and paste the following command into your terminal to ensure you reference the correct variables. Remember, if you are using your own data these commands will be a bit different, as they will point to a different data and metadata file.

DIR=/home/ubuntu/outlier_analysis
VCF=$DIR/data/starling_3populations.recode.vcf
METADATA=$DIR/data/starling_3populations_metadata.txt

First, we will need to convert out VCF to the Bayescan format. To do this we will use the genetic file conversion program called PGDspider.

We also need to create a new populations metadata file containing individual names in column 1 and population names in column 2.

cd $DIR/data
cut -f1,2 $METADATA > starling_3populations_metadata_INDPOP.txt

Now, navigate to the directory where we will run our Bayescan analysis.

cd $DIR/analysis/bayescan

We now run PGDSpider in two steps: first we convert the VCF file to the PGD format, second convert from PGD format to Bayescan format. To do this we will need to create a SPID file, which we will call VCF_PGD.spid using the nano command.

nano VCF_PGD.spid

Into the VCG_PGD.spid file, copy and paste the code below. On the line that starts with VCF_PARSER_POP_FILE_QUESTION, replace the example location with the location of your metadata file.

Write the following into VCF_PGD.spid, making sure to not create any accidental white spaces in our file.

# VCF Parser questions 
PARSER_FORMAT=VCF
# Only output SNPs with a phred-scaled quality of at least: 
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=/home/ubuntu/outlier_analysis/data/starling_3populations_metadata_INDPOP.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=false
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false

# PGD Writer questions
WRITER_FORMAT=PGD

🔰 Writing out files with nano

Once you have copied and pasted the above, press Ctrl + O, then press Enter to save your file. Finally, exit the program using Ctrl + X.

Run the two step conversion.

#load module
module load quay.io/biocontainers/pgdspider/2.1.1.5--hdfd78af_1/module

#run the step 1 of the conversion
PGDSpider2-cli -inputfile $VCF -inputformat VCF -outputfile starling_3populations.pgd -outputformat  PGD -spid VCF_PGD.spid 

#run the step 2 of the conversion
PGDSpider2-cli -inputfile starling_3populations.pgd -inputformat PGD -outputfile starling_3populations.bs -outputformat GESTE_BAYE_SCAN

Let us have a quick look at what the input file looks like.

head starling_3populations.bs

✔️ Output
[loci]=5007

[populations]=3

[pop]=1
1 12 2 9 3
2 20 2 11 9
3 18 2 15 3
4 20 2 0 20
5 22 2 2 20

So for each population, we have a note of how many REF and ALT alleles we have at each genomic variant position.

🔰 An important note about additive genetic variance: It is important to bear in mind how the input genetic data for outlier or association models is being interpreted by the model. When dealing with many of these models (and input genotype files), the assumption is that SNP effects are additive. This can be seen from, for example, the way we encode homozygous reference allele, heterozygous, and homozygous alternate allele as "0", "1", and "2" respectively in a BayPass input genofile. For the diploid organism (with two variant copies for each allele), one copy of a variant (i.e., heterozygous) is assumed to have half the effect of having two copies. However, what if the locus in question has dominance effects? This would mean the heterozygous form behaves the same as the homozygous dominant form, and it would be more appropriate to label these instead as "0", "0", "1". But with thousands, if not millions, of (most likely) completely unknown variants in a dataset, how can we possibly know? The answer is we cannot. Most models assume additive effects since this is the simplest assumption. However, by not factoring in dominance effects, we could be missing many important functional variants, as Reynolds et al. 2021 demonstrates. Genomics is full of caveats and pitfalls. While it provides new directions for exploration, it can also be overwhelming. Remember, your selection analysis does not have to be exhaustive. Just make sure it is fit for purpose within your study design. There is so much going on in just one genome; there is no way you can analyze everything in one go.

Now let us set Bayescan to run. Currently, everything is set to default. Read the manual to understand what the arguments/flags mean and how to refine them if needed.

#load bayescan
module load quay.io/biocontainers/bayescan/2.0.1--h9f5acd7_4

#run bayescan. 
bayescan2 ./starling_3populations.bs -od ./ -threads 2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10

Ordinarily we would run this as a slurm script for about 1 hr, but for today's workshop we have prebaked files located in /home/ubuntu/outlier_analysis/backup_files/. Let's copy them over into our Bayescan analysis directory.

cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/backup_files/starling_3population_AccRte.txt .
cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/backup_files/starling_3population_Verif.txt .
cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/backup_files/starling_3population_fst.txt .
cp $DIR/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/backup_files/starling_3population.sel .
If you are working on your own data

You will need to execute this in a screen, use the below instructions to help you create your screen for running this Bayescan analysis.

#create screen 
screen -S bayescan

#exit screen without deleting it
Ctrl+a d

#list all the screens available in your environment
screen -ls

#reconnect to the screen you just created
screen -r bayescan

#now let's navigate to your directory and run the Bayescan command
cd /home/ubuntu/outlier_analysis/analysis/bayescan
module load quay.io/biocontainers/bayescan/2.0.1--h9f5acd7_4
bayescan2 ./starling_3populations.bs -od ./ -threads 2 -n 5000 -thin 10 -nbp 20 -pilot 5000 -burn 50000 -pr_odds 10

Identify outliers:

R
library(ggplot2)
setwd("/home/ubuntu/outlier_analysis/analysis/bayescan")
source("/home/ubuntu/outlier_analysis/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/backup_files/plot_R.r")
outliers.bayescan <- plot_bayescan("starling_3population_fst.txt", FDR = 0.05)
outliers.bayescan
write.table(outliers.bayescan, file = "bayescan_outliers.txt")

✔️ Output
  $outliers
[1] 333 395 1367 2376 3789
$nb_outliers
[1] 5

If you are working on your own data

You will need to source a different plot_R code, which matches the version of Bayescan that is on Pawsey. The dummy data files were generated on version 2.1, and the version on Pawsey is 2.0.1.

source("/home/ubuntu/outlier_analysis/workshop_material/Ev1_SelectionMetaAnalysis/workshop_files/backup_files/plot_R_v2.0.1.r")

And finally, let's do a quick check of convergence. For more information please refer to this documentation.

library(coda)
chain<-read.table("starling_3population.sel",header=TRUE)
chain<-chain[-c(1)]
chain<-mcmc(chain,thin=10)
plot(chain)
heidel.diag(chain, eps=0.1, pvalue=0.05)

q()

Mapping Outliers

cd $DIR/analysis/bayescan

SNP IDs are lost in BayeScan, line numbers are used as signifiers. We have previously created a list of SNPs in VCF and line numbers, which can be found at $DIR/analysis/starling_3populations_SNPs.txt which we will now reuse to generate a list of the outlier SNPS. We will also grab the line numbers from the BayeScan outliers output.

awk '{print $2}' bayescan_outliers.txt > bayescan_outliers_numbers.txt

Create a list of outlier SNPs by matching the values in column 1 of the outliers list with those in column 4 of the entire SNP data list.

awk 'FNR==NR{a[$1];next} (($4) in a)' bayescan_outliers_numbers.txt ../starling_3populations_SNPs.txt   | cut -f3 > bayescan_outlierSNPIDs.txt

Create a Bayescan log-plot and color the outliers in a different color.

R
library(ggplot2)
library(dplyr)
setwd("/home/ubuntu/outlier_analysis/analysis/bayescan")

bayescan.out<- read.table("starling_3population_fst.txt", header=TRUE)
bayescan.out <- bayescan.out %>% mutate(ID = row_number())
bayescan.outliers<- read.table("bayescan_outliers_numbers.txt", header=FALSE)
outliers.plot <- filter(bayescan.out, ID %in% bayescan.outliers[["V1"]])

png("bayescan_outliers.pdf", width=6, height=3.5)

ggplot(bayescan.out, aes(x = log10.PO., y = alpha)) +
  geom_vline(
    xintercept = 0,
    linetype = "dashed",
    color = "black",
    size = 3
  ) +
  geom_point(
    size = 5
  ) +
  geom_point(
    aes(x = log10.PO., y = alpha),
    data = outliers.plot,
    col = "red",
    fill = "red",
    size = 5
  ) +
  scale_x_continuous(limits = c(-1.3, 3.5)) +
  theme_classic(base_size = 18) +
  theme(axis.text = element_text(size = 18),
        axis.title = element_text(size = 22, face = "bold"))

dev.off()

q()

Windowed Fst

One final thing to consider - for this analysis, we have looked for outliers across three populations. How do we know in which populations the allele frequencies have changed/are different? It might be useful to look at or plot the allele frequencies of outlier loci across different populations in order to better interpret how selection may be driving these patterns of changing allele frequencies.

BayPass

The BayPass manual can be found here.

Citation: Gautier, M. (2015). Genome-wide scan for adaptive divergence and association with population-specific covariates. Genetics, 201(4), 1555-1579. https://doi.org/10.1534/genetics.115.181453

BayPass requires that the allele frequency data be on a population, not an individual basis. The genotyping data file is organized as a matrix with nsnp rows and 2 ∗ npop columns. The row field separator is a space. More precisely, each row corresponds to one marker, and the number of columns is twice the number of populations because each pair of numbers corresponds to each allele (or read counts for PoolSeq experiment) counts in one population.

To generate this population gene count data, we will work with the PLINK file. First, we have to fix the individual ID and population labels, as PLINK has pulled these directly from the VCF, which has no population information. We aim for population in column 1 and individual ID in column 2.

cd $DIR/data

#reorder the columns of the metadata
awk '{print $2,"\t",$1}' $METADATA > starling_3populations_metadata_POPIND.txt

cd $DIR/analysis/baypass

PLINK=$DIR/data/starling_3populations.plink.ped

#remove first 2 columns
cut -f 3- $PLINK > x.delete

#paste the new POP and IND columns on the front of the old ped file
paste $DIR/data/starling_3populations_metadata_POPIND.txt x.delete > starling_3populations.plink.ped
rm x.delete 

#copy over the other plink files that we need
cp $DIR/data/starling_3populations.plink.map .
cp $DIR/data/starling_3populations.plink.log .

Let's take a quick look at both the original PLINK ped file, and the new one. Do you notice anything worrying? We have manually manipulated our file, and by not checking that the ordering of individuals in our metadata file is in the same as the ordering of individuals in the .ped file, we have accidentally misassigned genotype information to the wrong sample and population IDs! This is something you should be very careful of when manually manipulating data. If you want to fix this up in your own data, you can follow the instructions above under 'If you are working on your own data: checking metadata file order', but for now we will proceed with this slightly wrong data as the code is the same.

#original
less $PLINK

#new
less starling_3populations.plink.ped

Run the population-based allele frequency calculations.

#load module
module load quay.io/biocontainers/plink/1.90b6.21--hec16e2b_2/module

#calculate allele frequencies for each of the three populations
plink --file starling_3populations.plink --allow-extra-chr --freq counts --family --out starling_3populations

Manipulate file so it has BayPass format, numbers set for PLINK output file, and population number for column count.

tail -n +2 starling_3populations.frq.strat | awk '{ $9 = $8 - $7 } 1' | awk '{print $7,$9}' | tr "\n" " " | sed 's/ /\n/6; P; D' > starling_3populations_baypass.txt
If you are working on WGS data this might run slowly

Try this alternate code that was made by Kamolphat Atsawawaranunt.

npop2=22 # number of pop times 2
tail -n +2 myna.plink.frq.strat.reordered | awk '{ $9 = $8 - $7 } 1' | awk '{print $7,$9}' | tr " " "\n" | awk -v pp=${npop2} '{if (NR % pp == 0){a=a $0"";print a; a=""} else a=a $0" "}' > ${bayp}
${bayp} is the path to the output BayPass file.

Now we can run Baypass. It should run for about 5 minutes.

#load module
module load quay.io/biocontainers/baypass/2.31--h1c9e865_2

cd $DIR/analysis/baypass

#run BayPass
g_baypass -npop 3 -gfile ./starling_3populations_baypass.txt -outprefix starling_3populations_baypass -nthreads 4

🔰 Trying to run Bayescan and Baypass on WGS data?

These programs take a while to run, and in order to do analysis on millions of SNPs the best approach is to break your dataset down into many smaller ones, then combine the results. The authors of Baypass demonstrate such an approach in their paper on The Genomic Basis of Color Pattern Polymorphism in the Harlequin Ladybird.

The posterior mean XtX statistic column (M_XtX) in the 'pi_xtx.out' file contains a metric that captures the level of genetic differentiation for each loci. Newer versions of Baypass (v2.41 onwards) also output p-values that can be used as a way of determining XtX outliers (and can be multiple test corrected like what we did for PCAdapt). However, another way of personalising your threshold above which you will consider SNPs an outlier is to find a threshold based on neutral genetic patterns specific to your data set. For this we use the population structure covariance matrix (omega) produced by the first Baypass run, and we simulate a neutral SNP dataset for your population's set of parameters. We then calculate a percentile threshold based on this neutral data, and apply it back to our real data.

For this, we move to R to make the simulated data using some of the Baypass utilities. First, let us quickly download the utilities we need.

cd $DIR/programs
git clone https://github.com/andbeck/BayPass.git

Now let us generate some simulated data based on the population structure covariance matrix (omega) parameters calculated from our genetic data.

R
setwd("/home/ubuntu/outlier_analysis/analysis/baypass")
source("/home/ubuntu/outlier_analysis/programs/BayPass/baypass_utils.R")

library("ape")

library("mvtnorm")

omega <- as.matrix(read.table("starling_3populations_baypass_mat_omega.out"))

pi.beta.coef <- read.table("starling_3populations_baypass_summary_beta_params.out", header = TRUE)

bta14.data <- geno2YN("starling_3populations_baypass.txt")

simu.bta <- simulate.baypass(omega.mat = omega, nsnp = 5000, sample.size = bta14.data$NN, beta.pi = pi.beta.coef$Mean, pi.maf = 0, suffix = "btapods")

q()

We now have the simulated genetic data. We can find the XtX statistic threshold above which we will consider genetic sites an outlier as based on neutral patterns of genetic differentiation in our data. For this we rerun Baypass on our similated data.

cd $DIR/analysis/baypass

g_baypass -npop 3 -gfile G.btapods -outprefix G.btapods -nthreads 2

Now we obtain a M_XtX threshold that is based of neutral SNPs.

R
setwd("/home/ubuntu/outlier_analysis/analysis/baypass")
source("/home/ubuntu/outlier_analysis/programs/BayPass/baypass_utils.R")
library("ape")

library("corrplot")

pod.xtx <- read.table("G.btapods_summary_pi_xtx.out", header = T)

We compute the 1% threshold for the simulated neutral data.

pod.thresh <- quantile(pod.xtx$M_XtX ,probs = 0.99)
pod.thresh

q()

✔️ Output
  6.258372

Your values may be slightly different as the simulated data will not be identical.

Next, we filter the output of the original Baypass run for outlier SNPs, by finding those that are above this threshold.

cat starling_3populations_baypass_summary_pi_xtx.out | awk '$4>6.258372 ' > baypass_outliers.txt

SNP IDs are lost in BayPass, line numbers are used as signifiers. We have previously created a list of SNPs in VCF and line numbers, which can be found at $DIR/analysis/starling_3populations_SNPs.txt which we will now reuse to generate a list of the outlier SNPS.

awk 'FNR==NR{a[$1];next} (($4) in a)' baypass_outliers.txt $DIR/analysis/starling_3populations_SNPs.txt | cut -f3 > baypass_outlierSNPIDs.txt
wc -l baypass_outlierSNPIDs.txt

✔️ Output
  38

Now let's find SNPs that are statistically associated with wingspan. To do this, we have to go back to the metadata and compute the average wingspan of each of our populations and place them in a file.

R
setwd("/home/ubuntu/outlier_analysis/analysis/baypass")
metadata <- read.table("/home/ubuntu/outlier_analysis/data/starling_3populations_metadata.txt", sep="\t", header=FALSE)
str(metadata)
pop_metadata <- aggregate(V3 ~ V2, data = metadata, mean)

# Check mean wingspan
pop_metadata[, 2]
write(pop_metadata[,2], "pop_mean_wingspan.txt", ncolumns = length(pop_metadata[,2]))

q()

✔️ Output
  14.89805 19.63306 22.09655

Now we can run the third and final BayPass job, which will let us know which SNPs are statistically associated with wingspan.

g_baypass -npop 3 -gfile starling_3populations_baypass.txt -efile pop_mean_wingspan.txt -scalecov -auxmodel -nthreads 4 -omegafile starling_3populations_baypass_mat_omega.out -outprefix starling_3populations_baypass_wing

Next we plot the outliers. We are choosing a BF threshold of 20 dB, which indicates "Strong evidence for alternative hypothesis."

R

library(ggplot2)

setwd("/home/ubuntu/outlier_analysis/analysis/baypass")

covaux.snp.res.mass <- read.table("starling_3populations_baypass_wing_summary_betai.out", header = T)
covaux.snp.xtx.mass <- read.table("starling_3populations_baypass_summary_pi_xtx.out", header = T)

pdf("Baypass_plots.pdf")
layout(matrix(1:3,3,1))
plot(covaux.snp.res.mass$BF.dB.,xlab="Mass",ylab="BFmc (in dB)")
abline(h=20, col="red")
plot(covaux.snp.res.mass$M_Beta,xlab="SNP",ylab=expression(beta~"coefficient"))
plot(covaux.snp.xtx.mass$M_XtX, xlab="SNP",ylab="XtX corrected for SMS")
dev.off()

q()

Baypass output

Finally, let's generate the list of phenotype-associated SNP IDs.

cat starling_3populations_baypass_wing_summary_betai.out | awk '$6>20' > starling_3populations_baypass_wing_BF20.txt

wc -l starling_3populations_baypass_wing_BF20.txt

✔️ Output
  48

Filter the data sets for SNPS above BFmc threshold. These are out outlier SNPs that are associated with wingspan.

awk 'FNR==NR{a[$2];next} (($4) in a)' starling_3populations_baypass_wing_BF20.txt ../starling_3populations_SNPs.txt | cut -f3 > baypass_wingspan_outlierSNPIDs.txt

comm -12 <(sort baypass_wingspan_outlierSNPIDs.txt) <(sort baypass_outlierSNPIDs.txt) > double_outliers.txt

wc -l double_outliers.txt

✔️ Output
  18

Comparing outlier overlap

Now we will make an UpSet plot to compare the overlap of outliers detected over our different methods.

cd $DIR/analysis/summary
ln -s $DIR/analysis/pcadapt/pcadapt_outlierSNPIDs.txt .
ln -s $DIR/analysis/vcftools_fst/vcftoolsfst_outlierSNPIDs.txt .
ln -s $DIR/analysis/bayescan/bayescan_outlierSNPIDs.txt .
ln -s $DIR/analysis/baypass/baypass_outlierSNPIDs.txt .
ln -s $DIR/analysis/baypass/baypass_wingspan_outlierSNPIDs.txt .

Now we have a copy of all the SNP IDs for each of our outlier analyses, let's use the R package UpSetR to plot the overlap.

R
setwd("/home/ubuntu/outlier_analysis/analysis/summary")

pcadapt <- scan("pcadapt_outlierSNPIDs.txt", what = "", quiet = TRUE)
vcftools <- scan("vcftoolsfst_outlierSNPIDs.txt", what = "", quiet = TRUE)
bayescan <- scan("bayescan_outlierSNPIDs.txt", what = "", quiet = TRUE)
baypass <- scan("baypass_outlierSNPIDs.txt", what = "", quiet = TRUE)
baypass_wing <- scan("baypass_wingspan_outlierSNPIDs.txt", what = "", quiet = TRUE)  

all_outliers <- list(PCAdapt = pcadapt, VCFtools = vcftools, Bayescan = bayescan, Baypass = baypass, BaypassWing = baypass_wing)

library(UpSetR)

pdf("All_outliers_upsetplot.pdf")
upset(
  data = fromList(all_outliers), 
  order.by = "freq", 
  empty.intersections = "on", 
  point.size = 3.5, 
  line.size = 2, 
  mainbar.y.label = "Outlier Count", 
  sets.x.label = "Total Outliers", 
  text.scale = c(1.3, 1.3, 1, 1, 2, 1.3), 
  number.angles = 30, 
  nintersects = 11
) 
dev.off() 

q()

upset plot of outlier overlaps

Let's have a discussion about the overlap between these five outlier groups. Many of these programs were asking different questions from one another. PCAdapt is looking for genome-wide outliers, with no population identities assigned. VCFTools looked at outliers between two of our three populations, while Bayescan looked at all three, and also had multiple test correcting and so is quite strict. Baypass was looking at the same three populations as Bayescan, but tries to account for population structure and also we applied a quartile threshold method for determining outliers. Finally, using Baypass for association analysis with wingspan meant that we are not identifying loci different between population, but rather loci associated with morphology.

If you want to get really fancy, you may even want to plot your variants at their location around your genome in a circle plot!

Workshop End discussion and Q&A

A brief period of group discussion on one of the days about research question framing and grant integration - refer to the slides if you need some ideas. Below I will curate a list of useful papers that demonstrate some cool things you can do with your outier loci.

Trying to account for strong population structure while performing environmental associations with your genetic data (GEA)?
Try using dummy variables to identify false positives.

Trying to run lots of SNPs and it is taking too long?
Try splitting your data into many small subsets.

Trying to combine lists of ourliers obtained from different approaches in a statistical way?
Try p-value geometric mean calculations.

Trying to see if anyone with a dataset like yours has run a specific program?
Search for the paper title of your prefered program in google scholar, and click the 'Cited by XXX'. Then, click the box 'Search within citing articles' and look up your study keywords in the literature that has cited that program, and you may find some relevant ones. Even better, if it is a recent paper in a journal that promotes open science (e.g., Molecular Ecology) the authors may have made their code publically available.

Funding

ScreenShot

Thank you to the AES ERC Networking Grant Scheme and GSA for funding this project.

Thank you to the New Zealand eScience Infrastructure (NeSI) for helping to facilitate the New Zealand workshops.

Thank you to the Australian BioCommons and Pawsey for helping to facilitate the Australian workshops.

Footnotes

A it is very important then to account for any population substructure. There are many different ways to approach this: refer to introduction slides for some guidance.
B I will say population for simplicity throughout this vignette. However, equally we can test for differences between sample sites, subpopulations, and other types of groupings. What counts as one 'group' of organisms will be dependent on your study system or study question.
C You may also want to consider linkage blocks.