NOTE: This project is no longer actively maintained.
SeleDiff implements a probabilistic method for estimating and testing selection (coefficient) differences between populations [1].

- If you have any problems, please feel free to contact xinhuang.res@gmail.com or open an issue in this repository.
- If you would like to reproduce our simulation, please check the code in ./appendix.
- If you are interested in contributing to SeleDiff, please feel free to clone and modify it. You should include unit tests for your modified code. You can also edit build.gradle to include new dependencies. After your modification, please send a GitHub Pull Request with a clear list of what you've done.
- For more details, please see the manual in ./docs.
To install SeleDiff, you should first install Java SE Development Kit 8 or OpenJDK 8.
On Linux/Mac, you can open the terminal and clone SeleDiff using git:
> git clone https://github.com/xin-huang/SeleDiff
Then you can enter the SeleDiff directory and use gradlew to install SeleDiff:
> cd ./SeleDiff
> ./gradlew build
> ./gradlew install
The runnable SeleDiff is in ./build/install/SeleDiff/bin/. You can add this directory to your PATH environment variable with:
> export PATH="/path/to/SeleDiff/build/install/SeleDiff/bin/":$PATH
You can get help information by typing:
> SeleDiff
You can use gradlew to remove SeleDiff:
> ./gradlew clean
On Windows, you can download the latest release. Please make sure your JAVA_HOME environment variable correctly points to your JDK directory. After downloading and extracting the release, open cmd, enter the SeleDiff directory, and use gradlew.bat to build and install SeleDiff:
> cd /path/to/SeleDiff
> gradlew.bat build
> gradlew.bat install
Then run SeleDiff.bat in ./build/install/SeleDiff/bin/:
> cd ./build/install/SeleDiff/bin/
> SeleDiff.bat
You can use gradlew.bat to remove SeleDiff:
> cd /path/to/SeleDiff
> gradlew.bat clean
SeleDiff contains two sub-commands:

- compute-var for estimating variances of Ω [1], which are required for the compute-diff command;
- compute-diff for estimating selection differences among loci.
SeleDiff assumes bi-allelic genetic data and will not perform any checks on this assumption. All input files can be compressed with gzip.
SeleDiff accepts genetic data in EIGENSTRAT format as input. EIGENSOFT provides several functions to convert other formats to EIGENSTRAT format.
SeleDiff also accepts genetic data in VCF format as input, and assumes that the genotypes of each individual are encoded with 0 and 1. Because the VCF format contains no population information for individuals, users should provide an additional file in EIGENSTRAT IND format.
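As a rough illustration of the additional file, the sketch below builds lines in the standard EIGENSTRAT IND layout (sample ID, sex, population label, separated by whitespace). The sample IDs and the `format_ind_lines` helper are hypothetical, and whether SeleDiff requires the lines to follow the sample order of the VCF header is an assumption that this README does not confirm.

```python
def format_ind_lines(sample_to_population, sex="U"):
    """Build EIGENSTRAT IND lines: sample ID, sex (M/F/U), population label."""
    return [
        f"{sample}\t{sex}\t{population}"
        for sample, population in sample_to_population.items()
    ]

# Hypothetical sample IDs; dicts preserve insertion order, so list the samples
# in the order you want them written (assumed here to mirror the VCF header).
samples = {"sample1": "YRI", "sample2": "CEU", "sample3": "CHS"}
ind_lines = format_ind_lines(samples)
print("\n".join(ind_lines))
```

The resulting lines can be written to a file and passed through the --ind option alongside the VCF input.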
The Var file is the output file from the first sub-command compute-var, which stores variances of pairwise Ω.
Unlike He et al. (2015), SeleDiff does not divide Ω by generation times, in order to reduce floating-point rounding errors.
When estimating Ω, SeleDiff uses SNPs that are not fixed in any population.
When using the sub-command compute-diff to estimate selection differences, SeleDiff uses the --var option to accept a SPACE delimited file without a header that specifies variances of Ω between populations.
YRI CEU 1.547660
YRI CHS 1.639591
CEU CHS 0.989241
The first two columns are the population IDs, and the third column is the variance of Ω between the two populations.
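A minimal sketch of reading this three-column layout is shown below. The `read_pairwise_values` helper is hypothetical and not part of SeleDiff, and treating a population pair as unordered (so that YRI CEU and CEU YRI refer to the same entry) is an assumption made for illustration. The same layout is used by the file passed to --time.

```python
import io

def read_pairwise_values(handle):
    """Parse 'POP1 POP2 VALUE' lines into a dict keyed by an unordered population pair."""
    values = {}
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 fields, got {len(fields)}")
        pop1, pop2, value = fields
        values[frozenset((pop1, pop2))] = float(value)
    return values

# The example variances listed above, read from an in-memory file.
example = io.StringIO("YRI CEU 1.547660\nYRI CHS 1.639591\nCEU CHS 0.989241\n")
variances = read_pairwise_values(example)
print(variances[frozenset(("CEU", "YRI"))])
```

For a real file, replace the in-memory example with `open(path)`.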
When using the sub-command compute-diff to estimate selection differences, SeleDiff uses the --time option to accept a SPACE delimited file without a header that specifies divergence times between two populations.
YRI CEU 5000
YRI CHS 5000
CEU CHS 3000
The first two columns are the population IDs, and the third column is the divergence time between the two populations (in generations in the example below).
The output file from SeleDiff is TAB delimited. The first row is a header that describes the meaning of each column.
Column | Column Name | Description |
---|---|---|
1 | SNP ID | The name of a SNP |
2 | Ref | The reference allele |
3 | Alt | The alternative allele |
4 | Population1 | The first population ID |
5 | Population2 | The second population ID |
6 | Selection difference | The selection difference between the first and second populations |
7 | Std | The standard deviation of the selection difference |
8 | Lower bound of 95% CI | Lower bound of 95% confidence interval of the selection difference |
9 | Upper bound of 95% CI | Upper bound of 95% confidence interval of the selection difference |
10 | Delta | The delta statistic for selection difference |
11 | p-value | The p-value of the delta statistic |
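Because the output is TAB delimited with a header row, it can be read by column name. The sketch below is an illustration only: it assumes the header labels in the file match the Column Name entries in the table above, and it uses a reduced excerpt (a subset of columns) built from the worked example later in this document rather than a complete result file. The `significant_rows` helper is hypothetical.

```python
import csv
import io

def significant_rows(handle, alpha=0.05):
    """Return rows of a TAB-delimited result table whose p-value is below alpha."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["p-value"]) < alpha]

# Reduced excerpt: only some of the 11 columns, with header labels assumed
# to match the column names documented above.
excerpt = io.StringIO(
    "SNP ID\tPopulation1\tPopulation2\tSelection difference\tStd\tDelta\tp-value\n"
    "rs1800407\tYRI\tCEU\t-0.000773\t0.000380\t4.129\t0.042154\n"
    "rs1800407\tYRI\tCHS\t-0.000336\t0.000393\t0.731\t0.392559\n"
    "rs12913832\tCEU\tCHS\t0.002372\t0.000433\t30.062\t0.000000\n"
)
hits = significant_rows(excerpt)
for row in hits:
    print(row["SNP ID"], row["Population1"], row["Population2"], row["p-value"])
```

For a real result file, pass `open(path, newline="")` instead of the in-memory excerpt.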
Here is an example to show how SeleDiff estimates and tests selection differences between populations. Four populations (YRI, CEU, CHB, CHD) from HapMap3 (release 3) were extracted. CHB and CHD were merged into one population called CHS. PLINK 1.7 was used to remove correlated individuals, SNPs with minor allele frequencies less than 0.05, and SNPs in strong linkage disequilibrium. These genome-wide data are stored in ./examples/data/example.geno and used for estimating variances of Ω.
Two alternative alleles (rs1800407 and rs12913832) associated with blue eyes were identified in the genes HERC2 and OCA2 [2]. These candidate data are stored in ./examples/data/example.candidates.geno and used for estimating selection differences of these SNPs between populations.
The allele counts in our example data are summarized below.
SNP ID | Population | Reference Allele Count | Alternative Allele Count |
---|---|---|---|
rs1800407 | YRI | 290 | 0 |
rs1800407 | CEU | 207 | 17 |
rs1800407 | CHS | 486 | 4 |
rs12913832 | YRI | 294 | 0 |
rs12913832 | CEU | 47 | 177 |
rs12913832 | CHS | 491 | 1 |
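To make the contrast between populations easier to see, the counts above can be turned into alternative allele frequencies. This is plain arithmetic on the table, not a SeleDiff step.

```python
# Allele counts copied from the table above: (reference count, alternative count).
counts = {
    ("rs1800407", "YRI"): (290, 0),
    ("rs1800407", "CEU"): (207, 17),
    ("rs1800407", "CHS"): (486, 4),
    ("rs12913832", "YRI"): (294, 0),
    ("rs12913832", "CEU"): (47, 177),
    ("rs12913832", "CHS"): (491, 1),
}

# Alternative allele frequency = alternative count / total allele count.
frequencies = {
    key: alt / (ref + alt) for key, (ref, alt) in counts.items()
}

for (snp, population), frequency in frequencies.items():
    print(f"{snp}\t{population}\t{frequency:.3f}")
```

Both alternative alleles are absent in YRI and rare in CHS, while rs12913832 reaches a frequency of about 0.79 in CEU.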
We assume the divergence times of YRI-CEU and YRI-CHS are both 5000 generations, while the divergence time of CEU-CHS is 3000 generations. This information is stored in ./examples/data/example.time.
First, we estimate variances of Ω using the sub-command compute-var:
> SeleDiff compute-var --geno ./examples/data/example.geno \
--ind ./examples/data/example.ind \
--snp ./examples/data/example.snp \
--output ./examples/results/example.geno.var
To estimate selection differences of the candidates, we use the sub-command compute-diff:
> SeleDiff compute-diff --geno ./examples/data/example.candidates.geno \
--ind ./examples/data/example.candidates.ind \
--snp ./examples/data/example.candidates.snp \
--var ./examples/results/example.geno.var \
--time ./examples/data/example.time \
--output ./examples/results/example.candidates.geno.results
The results are stored in ./examples/results/example.candidates.geno.results. The main results are shown below.
SNP ID | Population1 | Population2 | Selection difference | Std | delta | p-value |
---|---|---|---|---|---|---|
rs1800407 | YRI | CEU | -0.000773 | 0.000380 | 4.129 | 0.042154 |
rs1800407 | YRI | CHS | -0.000336 | 0.000393 | 0.731 | 0.392559 |
rs1800407 | CEU | CHS | 0.000728 | 0.000377 | 3.730 | 0.053443 |
rs12913832 | YRI | CEU | -0.001541 | 0.000378 | 16.583 | 0.000047 |
rs12913832 | YRI | CHS | -0.000117 | 0.000415 | 0.080 | 0.777297 |
rs12913832 | CEU | CHS | 0.002372 | 0.000433 | 30.062 | 0.000000 |
From the results, we can see that the selection coefficient of rs12913832 in CEU is significantly larger than that in YRI or CHS, which indicates that rs12913832 is under directional selection in CEU. The selection coefficient of rs1800407 in CEU is also larger than that in YRI or CHS, but the difference is only marginally significant.
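The reported numbers can be sanity-checked with a few lines of arithmetic. The sketch below assumes that the delta statistic is the squared ratio of the selection difference to its standard deviation and that its p-value comes from a chi-square distribution with one degree of freedom. This README does not state either point, but both are consistent with the table above up to rounding of the printed values.

```python
import math

# Rows copied from the results table: difference, std, reported delta, reported p-value.
rows = [
    ("rs1800407", "YRI", "CEU", -0.000773, 0.000380, 4.129, 0.042154),
    ("rs1800407", "YRI", "CHS", -0.000336, 0.000393, 0.731, 0.392559),
    ("rs1800407", "CEU", "CHS", 0.000728, 0.000377, 3.730, 0.053443),
    ("rs12913832", "YRI", "CEU", -0.001541, 0.000378, 16.583, 0.000047),
    ("rs12913832", "YRI", "CHS", -0.000117, 0.000415, 0.080, 0.777297),
    ("rs12913832", "CEU", "CHS", 0.002372, 0.000433, 30.062, 0.000000),
]

def chi_square_1df_p_value(statistic):
    """Upper-tail probability of a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

checks = []
for snp, pop1, pop2, diff, std, delta, p_value in rows:
    recomputed_delta = (diff / std) ** 2                 # assumed definition of delta
    recomputed_p = chi_square_1df_p_value(delta)         # assumed reference distribution
    checks.append((recomputed_delta, delta, recomputed_p, p_value))
    print(f"{snp} {pop1}-{pop2}: delta {recomputed_delta:.3f} vs {delta:.3f}, "
          f"p {recomputed_p:.6f} vs {p_value:.6f}")
```

Small discrepancies in the recomputed delta are expected, because the selection difference and standard deviation are printed with only six decimal places.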
Please refer to our previous study [1] for a more comprehensive working example using the HapMap3 dataset.