BGLR for the Julia Language

The BGLR Julia package implements Bayesian shrinkage and variable selection methods for high-dimensional regressions.

The design is inspired by the BGLR R-package (BGLR-R). Over time we aim to implement a set of methods similar to those implemented in the R version. The R version is highly optimized through the use of compiled C code, which gives BGLR-R much higher computational speed than can be obtained with R code alone.

By developing BGLR for Julia (BGLR-Julia) we seek to: (i) reach the community of Julia users, (ii) achieve performance similar to or better than that of BGLR-R (a challenge, because BGLR-R is highly optimized and makes intensive use of C and BLAS routines), (iii) enable users to run BGLR with memory-mapped arrays as well as RAM arrays, and (iv) capitalize on some of the multi-core computing capabilities offered by Julia.
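Goal (iii) refers to memory-mapped arrays: a genotype matrix too large for RAM can be mapped from disk and indexed like an ordinary array. The sketch below is an illustration only (the file, its contents, and the dimensions are made up here), using the `Mmap` standard library of current Julia rather than the older syntax of the examples further down:

```julia
using Mmap

# Write a small demo matrix to disk, then memory-map it.
# (In practice the file would be a large pre-existing genotype matrix.)
fname = tempname()
open(fname, "w") do io
    write(io, collect(1.0:6.0))    # a 2 x 3 Float64 matrix, column-major
end

io = open(fname, "r")
X = Mmap.mmap(io, Matrix{Float64}, (2, 3))  # no copy into RAM
# X behaves like an ordinary Array; pages are read from disk on demand
lastval = X[2, 3]
close(io)
```

Because the operating system pages the data in lazily, such an `X` can in principle be passed wherever a dense marker matrix is expected.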

Funding of BGLR-R and BGLR-Julia was provided by NIH (R01 GM101219).

Authors: Gustavo de los Campos (gustavoc@msu.edu) and Paulino Perez-Rodriguez (perpdgo@gmail.com)

Installing BGLR-Julia

  Pkg.rm("BGLR")
  Pkg.clone("https://github.com/gdlc/BGLR.jl")

Data sets

The examples presented below use the wheat and mice data sets provided with the BGLR-R package. To load these data into the computing environment you can use the following code.

Wheat data set

 # Pedigree-relationship matrix
  A=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.A.csv");header=true)[1];
 # Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 # Phenotypes
  Y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1];

Examples

Genomic BLUP

The example below fits the so-called G-BLUP model.
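In G-BLUP the phenotype vector is modeled as y = 1μ + u + e with u ~ N(0, σ²u·G), where G = XX'/p is the genomic relationship matrix computed from centered and scaled markers. A quick numerical check of this construction in current Julia (explicit centering and scaling replaces the older `scale` function used below; the dimensions are toy values):

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(1)
n, p = 50, 200
X = Float64.(rand(0:2, n, p))                  # toy allele-count matrix
X = (X .- mean(X, dims=1)) ./ std(X, dims=1)   # center and scale each marker
G = X * X' ./ p                                # genomic relationship matrix

# G is symmetric, and with unit (corrected) column variances
# its diagonal averages (n-1)/n
@assert isapprox(G, G')
@assert isapprox(mean(diag(G)), (n - 1) / n)
```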

 using BGLR

# Reading data
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

# Computing G-Matrix
  n,p=size(X);
  X=scale(X);
  G=X*X';
  G=G./p;

# Fitting the model
  # Note: y is the phenotype vector;
  #       ETA is a dictionary used to specify the terms of the regression.
  #       In this case ETA has only one term:
  #       RKHS(K=G) defines a random effect with covariance matrix G.
  
  fm=bglr( y=y, ETA=Dict("mrk"=>RKHS(K=G)));
  
## Retrieving estimates and predictions
  fm.varE # posterior mean of error variance
  fm.yHat # predictions
  fm.ETA["mrk"].var # variance of the random effect

Bayesian Ridge Regression

 using BGLR
 
 # Reading Data 
   X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
   y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];
  
  # Bayesian Ridge Regression
   ETA=Dict("mrk"=>BRR(X))
   fm=bglr(y=y,ETA=ETA);
  
  ## Retrieving estimates and predictions
  fm.varE # posterior mean of error variance
  fm.yHat # predictions
  fm.ETA["mrk"].var # variance of the random effect associated with the markers

Bayesian LASSO

#Bayesian LASSO
 using BGLR
 using Gadfly

#Reading Data
 X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

# Bayesian LASSO
 ETA=Dict("mrk"=>BL(X))
 fm=bglr(y=y,ETA=ETA);

#Plots
 plot(x=fm.y,y=fm.yHat)

BayesA

#BayesA
  using BGLR
  using Gadfly

# Reading Data
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

# BayesA
  ETA=Dict("mrk"=>BayesA(X))
  fm=bglr(y=y,ETA=ETA);

#Plots
  plot(x=fm.y,y=fm.yHat)

BayesB

#BayesB
  using BGLR
  using Gadfly

# Reading Data
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

# BayesB
  ETA=Dict("mrk"=>BayesB(X))
  fm=bglr(y=y,ETA=ETA);

#Plots
  plot(x=fm.y,y=fm.yHat)

Parametric Shrinkage and Variable Selection

##########################################################################################
# Example 1 of  BGLR
# Box 6 (simulation)
##########################################################################################

  using BGLR
  using Distributions
  using Gadfly


# Reading data (markers are in BED binary format, http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml).
  X=read_bed(joinpath(Pkg.dir(),"BGLR/data/mice.X.bed"),1814,10346);

# Simulation

  srand(456);
  n,p=size(X);
  X=scale(X);

  h2=0.5;
  nQTL=10;
  whichQTL=[517,1551,2585,3619,4653,5687,6721,7755,8789,9823];
  b0=zeros(p);
  b0[whichQTL]=rand(Normal(0,sqrt(h2/nQTL)),10);
  signal=X*b0;
  error=rand(Normal(0,sqrt(1-h2)),n);
  y=signal+error;

##########################################################################################
# Example 1 of  BGLR
# Box 7 (fitting models)
##########################################################################################


#Bayesian Ridge Regression
  ETA_BRR=Dict("BRR"=>BRR(X))

#BayesA
  ETA_BA=Dict("BA"=>BayesA(X))

#BayesB
  ETA_BB=Dict("BB"=>BayesB(X))
 
#Fitting models
  fmBRR=bglr(y=y,ETA=ETA_BRR,nIter=10000,burnIn=5000);
  fmBA=bglr(y=y,ETA=ETA_BA,nIter=10000,burnIn=5000);
  fmBB=bglr(y=y,ETA=ETA_BB,nIter=10000,burnIn=5000);

#Plots
  plot(x=fmBB.y,y=fmBB.yHat)

  p1=plot(x=[1:p],y=abs(fmBB.ETA["BB"].post_effects),
          xintercept=whichQTL,
	  Geom.point,
          Geom.vline(color=color("blue")),
          Theme(default_color=color("red"),grid_color=color("white"),
                grid_color_focused=color("white"),highlight_width=0mm,
                default_point_size=1.5pt,
                panel_stroke=color("black")),
          Guide.xticks(ticks=[0,2000,4000,6000,8000,10000]),
          Guide.xlabel("Marker position (order)"),
          Guide.ylabel("|&beta;<sub>j</sub>|"))
  q1=layer(x=whichQTL,y=abs(b0[whichQTL]),Geom.point,Theme(default_color=color("blue"))); 
  append!(p1.layers,q1);

  display(p1)

Fitting models for genetic and non-genetic factors

##########################################################################################
# Example 2 of  BGLR
# Box 8
##########################################################################################

using BGLR
using Gadfly

# Reading data (markers are in BED binary format, http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml).
  X=read_bed(joinpath(Pkg.dir(),"BGLR/data/mice.X.bed"),1814,10346);
  pheno=readcsv(joinpath(Pkg.dir(),"BGLR/data/mice.pheno.csv");header=true);
  varnames=vec(pheno[2]); pheno=pheno[1]
  y=pheno[:,varnames.=="Obesity.BMI"] #column for BMI
  y=convert(Array{Float64,1}, vec(y))
  y=(y-mean(y))/sqrt(var(y))
  
  
# Incidence matrix for sex and litter size using a dummy variable for each level
  male=model_matrix(pheno[:,varnames.=="GENDER"];intercept=false)
  litterSize=model_matrix(pheno[:,varnames.=="Litter"];intercept=false)
  W=hcat(male, litterSize)
  

# Incidence matrix for cage, using a dummy variable for each level
  Z=model_matrix(pheno[:,varnames.=="cage"];intercept=false)

  ETA=Dict("Fixed"=>FixEff(W),
  	   "Cage"=>BRR(Z),
	   "Mrk"=>BL(X))

  srand(456);

  fm=bglr(y=y,ETA=ETA,nIter=105000,burnIn=5000);

#Plots

  #a)
  plot(x=[1:10346],y=fm.ETA["Mrk"].post_effects.^2,
       Geom.point,Geom.line,
       Guide.xticks(ticks=[0,2000,4000,6000,8000,10000]),
       Theme(highlight_width=0mm, panel_stroke=color("black")),
       Guide.xlabel("Marker position (order)"),
       Guide.ylabel("&beta;<sub>j</sub> <sup>2</sup>"),
       Guide.title("Marker Effects"))  

  #b)
  plot(x=fm.y,
       y=fm.yHat,
       Theme(panel_stroke=color("black")),
       Guide.ylabel("yHat"),
       Guide.xlabel("y"),
       Guide.title("Observed vs predicted"))
  #c)
  varE=vec(readdlm("varE.dat")[:,1]);
  plot(x=[1:size(varE)[1]],y=varE,
       yintercept=[mean(varE)],
       Geom.hline(color=color("red"),size=1mm),
       Geom.point,
       Geom.line,
       Guide.xticks(ticks=[0,5000,10000,15000,20000]),
       Theme(panel_stroke=color("black")),
       Guide.xlabel("Sample"),
       Guide.ylabel("&sigma;<sub>e</sub> <sup>2</sup>"),
       Guide.title("Residual Variance"))

  #d)
  lambda=vec(readdlm("Mrk_lambda.dat")[:,1]);
  plot(x=[1:size(lambda)[1]],y=lambda,
       yintercept=[mean(lambda)],
       Geom.hline(color=color("red"),size=1mm),
       Geom.point,
       Geom.line,
       Guide.xticks(ticks=[0,5000,10000,15000,20000]),
       Theme(panel_stroke=color("black")),
       Guide.xlabel("Sample"),
       Guide.ylabel("&lambda;"),
       Guide.title("Regularization Parameter"))

Fitting a pedigree + markers BLUP model

##########################################################################################
# Example 3 of  BGLR
# Box 9
##########################################################################################

using BGLR

# Reading Data 
 #Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 #Phenotypes
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

 #Relationship matrix derived from pedigree
  A=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.A.csv");header=true);
  A=A[1]; #The first component of the tuple has the data

# Computing G-Matrix
  n,p=size(X);
  X=scale(X);
  G=X*X';
  G=G./p;

# Setting the linear predictor
  ETA=Dict("Mrk"=>RKHS(K=G),
           "Ped"=>RKHS(K=A)) 
  
#Fitting the model
  fm=bglr(y=y,ETA=ETA);

Reproducing Kernel Hilbert Spaces Regression with a single kernel

#Reproducing Kernel Hilbert Spaces
#Single kernel methods
#Example 4 of BGLR paper
#Box 10


using BGLR
using Distances
using Gadfly

# Reading Data 
 #Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 #Phenotypes
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];
  
#Computing the distance matrix and then the kernel
#pairwise function computes distance between columns, so we transpose
#the matrix to get distance between rows of X
  n,p=size(X);
  X=scale(X);
  D=pairwise(Euclidean(),X');
  D=(D.^2)./p;
  h=0.25;
  K1=exp(-h.*D);
  
# Kernel regression
  ETA=Dict("mrk"=>RKHS(K=K1));
  fm=bglr(y=y,ETA=ETA);
  
#Plots
  plot(x=fm.y,
       y=fm.yHat,
       Guide.ylabel("yHat"),
       Guide.xlabel("y"),
       Guide.title("Observed vs predicted"))

## Retrieving estimates and predictions
  fm.varE # posterior mean of error variance
  fm.yHat # predictions
  fm.ETA["mrk"].var # variance of the random effect

Reproducing Kernel Hilbert Spaces Regression with Kernel Averaging

#Reproducing Kernel Hilbert Spaces
#Multi kernel methods (Kernel Averaging)
#Example 5 of BGLR paper
#Box 11

using BGLR
using Distances
using Gadfly

# Reading Data 
 #Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 #Phenotypes
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];
  
#Computing the distance matrix and then the kernel
#pairwise function computes distance between columns, so we transpose
#the matrix to get distance between rows of X
  n,p=size(X);
  X=scale(X);
  D=pairwise(Euclidean(),X');
  D=(D.^2)./p;
  
  d=reshape(tril(D),n*n,1);
  d=d[d.>0];
  h=1/median(d);
  h=h.*[1/5,1,5];
  
  K1=exp(-h[1].*D);
  K2=exp(-h[2].*D);
  K3=exp(-h[3].*D);
  
# Kernel regression
  ETA=Dict("Kernel1"=>RKHS(K=K1),
           "Kernel2"=>RKHS(K=K2),
           "Kernel3"=>RKHS(K=K3));
  fm=bglr(y=y,ETA=ETA);
  
#Plots
  plot(x=fm.y,
       y=fm.yHat,
       Guide.ylabel("yHat"),
       Guide.xlabel("y"),
       Guide.title("Observed vs predicted"))

## Retrieving estimates and predictions
  fm.varE # posterior mean of error variance
  fm.yHat # predictions
  fm.ETA["Kernel1"].var # variance of the random effect
  fm.ETA["Kernel2"].var # variance of the random effect
  fm.ETA["Kernel3"].var # variance of the random effect

Prediction in testing data sets using a single training-testing partition

#Assessment of prediction accuracy using a single training-testing partition
#Example 6 of BGLR paper
#Box 12
using BGLR
using StatsBase
using Gadfly

# Reading Data
 #Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 #Phenotypes
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

# Computing G-Matrix
  n,p=size(X);
  X=scale(X);
  G=X*X';
  G=G./p;

#Creating a Testing set
 yNA=deepcopy(y)
 srand(456);
 tst=sample([1:n],100;replace=false)
 yNA[tst]=-999

#Fitting the model
 ETA=Dict("mrk"=>RKHS(K=G))

 fm=bglr(y=yNA,ETA=ETA;nIter=5000,burnIn=1000);

#Correlation in training and testing sets
 trn=(yNA.!=-999);
 rtst=cor(fm.yHat[tst],y[tst]);
 rtrn=cor(fm.yHat[trn],y[trn]);
 rtst
 rtrn

 plot(layer(x=y[trn],
            y=fm.yHat[trn],Geom.point,Theme(default_color=color("blue"))),
      layer(x=y[tst],y=fm.yHat[tst],Geom.point,Theme(default_color=color("red"))),
      Guide.ylabel("yHat"),
      Guide.xlabel("y"),
      Guide.title("Observed vs predicted"))

Prediction in testing data sets based on multiple training-testing partitions

#Assessment of prediction accuracy using multiple training-testing partitions
#Example 7 of BGLR paper
#Box 13 
using BGLR
using StatsBase
using Gadfly

# Reading Data
 #Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 #Phenotypes
  y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

 #Relationship matrix derived from pedigree
  A=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.A.csv");header=true);
  A=A[1]; #The first component of the Tuple

# Simulation parameters
  srand(123);
  nTST=150;
  nRep=100;
  nIter=12000;
  burnIn=2000; 


# Computing G-Matrix
  n,p=size(X);
  X=scale(X);
  G=X*X';
  G=G./p;

# Setting the linear predictors
#Note: if the model is run several times with different missing-value patterns
#but the same dictionaries, the ETA objects are modified in place and become
#corrupted (the variances get very high and the predictions very poor).
#Defining the dictionary inside the call to bglr, as done below, avoids this.
#  H0=Dict("PED"=>RKHS(K=A));
#  HA=Dict("PED"=>RKHS(K=A),
#          "MRK"=>RKHS(K=G));

  COR=zeros(nRep,2);

# Loop over TRN-TST partitions
  for i in 1:nRep
    println("i=",i)
    tst=sample([1:n],nTST;replace=false)
    yNA=deepcopy(y)
    yNA[tst]=-999
    fm=bglr(y=yNA,ETA=Dict("PED"=>RKHS(K=A));nIter=nIter,burnIn=burnIn);
    COR[i,1]=cor(fm.yHat[tst],y[tst]);
    fm=bglr(y=yNA,ETA=Dict("PED"=>RKHS(K=A),"MRK"=>RKHS(K=G));nIter=nIter,burnIn=burnIn);
    COR[i,2]=cor(fm.yHat[tst],y[tst]);
  end

#Plots
plot(layer(x=COR[:,1],y=COR[:,2],Geom.point,Theme(default_color=color("red"))),
     layer(x=[0,0.6],y=[0,0.6],Geom.line,Theme(default_color=color("black"))),
     Guide.xlabel("Pedigree"),
     Guide.ylabel("Pedigree+Markers"),
     Guide.title("E1"))

Modeling heterogeneous error variances

#Heterogeneous variances

using BGLR
using Gadfly

# Reading Data 
   X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
   y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1][:,1];

# Bayesian Ridge Regression
   ETA=Dict("mrk"=>BRR(X))
   
#Grouping
  groups=vec(readdlm(joinpath(Pkg.dir(),"BGLR/data/wheat.groups.csv")));
  groups=convert(Array{Int64,1},groups)

  fm=bglr(y=y,ETA=ETA;groups=groups);
  
  ## Retrieving estimates and predictions
  fm.varE # posterior mean of error variance
  fm.yHat # predictions
  fm.ETA["mrk"].var # variance of the random effect

  plot(x=fm.y,
       y=fm.yHat,
       Guide.ylabel("yHat"),
       Guide.xlabel("y"),
       Guide.title("Observed vs predicted"))

Modeling genotype-by-environment interactions

#Genotype x Environment interaction
#Reaction norm model based on markers
using BGLR
using Gadfly

# Reading Data
 #Markers
  X=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.X.csv");header=true)[1];
 #Phenotypes
  Y=readcsv(joinpath(Pkg.dir(),"BGLR/data/wheat.Y.csv");header=true)[1];

  n,p=size(X);

 #response vector
  y=vec(Y);

 #Environments
  Env=[rep(1;times=n);rep(2;times=n);rep(3;times=n);rep(4;times=n)];

 #genotypes
  g=[1:n;1:n;1:n;1:n]

 #Genomic relationship matrix
  X=scale(X);
  G=X*X';
  G=G./p;

 #Model matrix for environments
  Ze=model_matrix(Env;intercept=false);

 #Model matrix for genotypes
 #in this case is identity because the data is already ordered
  Zg=model_matrix(g;intercept=false);

 #Basic reaction norm model
 #y=Ze*beta_Env+X*beta_markers+u, where u~N(0,(Zg*G*Zg').*(Ze*Ze'))

 #Variance covariance matrix for the interaction
 K1=(Zg*G*Zg').*(Ze*Ze');

 #Linear predictor
  ETA=Dict("Env"=>BRR(Ze),
           "Mrk"=>BRR(X),
           "GxE"=>RKHS(K=K1));

 #Fitting the model
  fm=bglr(y=y,ETA=ETA);


 plot(x=fm.y,
     y=fm.yHat,
     Guide.ylabel("yHat"),
     Guide.xlabel("y"),
     Guide.title("Observed vs predicted"))

BGLR-J Utils (a collection of utility functions)

  • model_matrix
  • read_bed
  • writeln
  • levels
  • nlevels
  • renumber
  • rep
  • table
  • sumsq
  • sumsq_group
  • innersimd
  • my_axpy!
  • scale
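Several of these utilities mimic R counterparts and appear in the examples above, e.g. rep(1;times=n) and model_matrix(Env;intercept=false). The sketch below illustrates the behaviour those calls suggest, under hypothetical names (my_rep, my_model_matrix) so as not to claim the package's actual implementations:

```julia
# Hypothetical re-implementations illustrating the assumed semantics of two
# utilities; the package's actual code may differ.

# rep(x; times=k): repeat a value k times, as used in the GxE example above
my_rep(x; times=1) = fill(x, times)

# model_matrix(f; intercept=false): one 0/1 dummy column per level of f,
# optionally preceded by an intercept column of ones
function my_model_matrix(f; intercept=false)
    levs = unique(f)
    Z = Float64[f[i] == l for i in eachindex(f), l in levs]
    intercept ? hcat(ones(length(f)), Z) : Z
end
```

For example, my_model_matrix(["a","b","a"]) yields a 3 x 2 incidence matrix with one column per level.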
