Lab Website | Build Status | Documentation |
---|---|---|
A Julia package for building genomic prediction models and performing genome-wide association on quantitative traits by inferring additive allelic effects using pool sequencing data (Pool-seq; i.e. allele frequencies).
Install dependencies:
Installation in Julia:
using Pkg
Pkg.add(PackageSpec(url="https://github.com/jeffersonfparil/GWAlpha.jl.git", rev="master"))
using GWAlpha
PoolGPAS(;filename_sync::String, filename_phen::String, maf::Float64=0.001, depth::Int64=1, model::String=["GWAlpha", "MIXED", "GLMNET"][1], filename_random_covariate=nothing, random_covariate::String=["FST", "RELATEDNESS"][1], varcomp_est::String=["ML", "REML"][1], glmnet_alpha::Float64=collect(range(0.0,1.0,step=0.01,))[1], fpr::Float64=0.01, plot::Bool=false)
- filename_sync [String]: filename of the genotype data in synchronized pileup format (.sync)
- filename_phen [String]: filename of the phenotype data in one of two formats:
- .py extension for iterative maximum likelihood estimation i.e.
MODEL="GWAlpha"
, e.g.:
Pheno_name='Phenotype Name';
sig=0.06724693662723039; # standard deviation
MIN=0.0; # minimum phenotype value
MAX=0.424591738712776; # maximum phenotype value
perc=[0.2,0.4,0.6,0.8]; # cummulative pool sizes percentiles excluding the last pool
q=[0.16,0.20,0.23,0.27,0.42]; # phenotype values corresponding to each percentile
- .csv extension for comma-separated headerless pool sizes and corresponding mean phenotypic values, e.g.:
200.0,0.11988952929875112
200.0,0.18030259365994225
200.0,0.21548030739673382
200.0,0.24966378482228616
200.0,0.31328530259365983
- maf [Float64]: minimum allele frequency threshold (default=0.001)
- depth [Int64]: minimum sequencing depth threshold (default=1)
- model [String]: model to use (default="GWAlpha")
- "GWAlpha" - iterative maximum likelihood estimation
- "MIXED" - linear mixed model
- "GLMNET" - elastic-net penalization
- filename_random_covariate [String]: filename of a precomputed headerless square symmetric matrix of pool relatedness (default=nothing)
- random_covariate [String]: type of relatedness matrix to compute if
filename_random_covariate==nothing
(default="FST")- "FST" - pairwise estimation of fixation indices using Pool-seq data using Weir and Cockerham, 1984 method (additionally Hivert et al, 2018 method is also available: see
?GWAlpha.relatedness_module.Fst_pairwise
) - "RELATEDNESS" - simple standardized relatedness matrix
XX'/p
, whereX
is the allele frequency matrix (Pool-seq data) andp
is the number of columns ofX
- "FST" - pairwise estimation of fixation indices using Pool-seq data using Weir and Cockerham, 1984 method (additionally Hivert et al, 2018 method is also available: see
- glmnet_alpha [Float64]: elastic-net penalty (default=0.00 or ridge regression penalty)
- varcomp_est [String]: variance component estimation method
- "ML" - maximum likelihood estimation
- "REML" - restricted maximum likelihood estimation
- glmnet_alpha [Float64]: elastic-net penalty (from 0.00 for ridge regression up to 1.00 for LASSO)
- fpr [Float64]: false positive rate for computing the Bonferroni threshold in significance testing (default=0.01)
- plot [Bool]: generate Manhattan and quantile-quantile (QQ) plots, and save in a portable network graphics (.png) format (default=false)
- Additive allelic effects array (header: CHROM, POS, ALLELE, FREQ, ALPHA, PVALUES, LOD) written into a comma-separated (.csv) file
- "GWAlpha":
string(join(split(filename_sync, ".")[1:(end-1)], '.'), "-GWAlpha-OUTPUT.csv")
- "MIXED":
string(join(split(filename_sync_filtered, ".")[1:(end-1)], '.'), "-", model, varcomp_est, "_", random_covariate, "-OUTPUT.csv")
- "GLMNET":
string(join(split(filename_sync_filtered, ".")[1:(end-1)], '.'), "-", model, "_ALPHA", glmnet_alpha, "-OUTPUT.csv")
- "GWAlpha":
- Random covariate effects vector (headerless: RANDOM_EFFECTS) written into a comma-separated (.csv) file
- "GWAlpha": nothing
- "MIXED":
string(join(split(filename_sync_filtered, ".")[1:(end-1)], '.'), "-", model, "_", random_covariate, "-RANEF-OUTPUT.csv")
- "GLMNET": nothing
- Manhattan and QQ plots in .png format
string(join(split(filename_output_csv, ".")[1:(end-1)], '.'), ".png")
- Parsing, filtering, and relatedness matrix outputs:
- Parsed sync data into a .csv file of allele frequency data (headerless: CHROM, POS, ALLELE, FREQ_POOL1, ..., FREQ_POOLn):
string(join(split(filename_sync, ".")[1:(end-1)], '.'), "_ALLELEFREQ.csv")
- Filtered sync data into a sync file: filename_sync_filtered =
string(join(split(filename_sync, ".")[1:(end-1)], "."), "_MAF", maf, "_DEPTH", depth, ".sync")
- Pairwise pool relatedness (square, symmetric and headerless):
string(join(split(filename_sync_filtered, ".")[1:(end-1)], '.'), "_COVARIATE_", random_covariate, ".csv")
- Parsed sync data into a .csv file of allele frequency data (headerless: CHROM, POS, ALLELE, FREQ_POOL1, ..., FREQ_POOLn):
### Input files:
filename_sync = "test/test.sync"
filename_phen_py = "test/test.py"
filename_phen_csv = "test/test.csv"
### Single-threaded execution:
using GWAlpha
@time OUT_GWAS = GWAlpha.PoolGPAS(filename_sync=filename_sync, filename_phen=filename_phen_py, maf=0.001, depth=10, model="GWAlpha", fpr=0.01, plot=true)
@time OUT_MIXEDML_FST = GWAlpha.PoolGPAS(filename_sync=filename_sync, filename_phen=filename_phen_csv, maf=0.001, depth=10, model="MIXED", random_covariate="FST", fpr=0.01, plot=true)
@time OUT_MIXEDREML_RELATEDNESS = GWAlpha.PoolGPAS(filename_sync=filename_sync, filename_phen=filename_phen_csv, maf=0.001, depth=10, model="MIXED", random_covariate="RELATEDNESS", varcomp_est="REML", fpr=0.01, plot=true)
@time OUT_GLMNET_RIDGE = GWAlpha.PoolGPAS(filename_sync=filename_sync, filename_phen=filename_phen_csv, maf=0.001, depth=10, model="GLMNET", glmnet_alpha=0.00, fpr=0.01, plot=true)
@time OUT_GLMNET_LASSO = GWAlpha.PoolGPAS(filename_sync=filename_sync, filename_phen=filename_phen_csv, maf=0.001, depth=10, model="GLMNET", glmnet_alpha=1.00, fpr=0.01, plot=true)
### Multi-threaded execution (parallel execution only applicable to model=="GWAlpha"):
using Distributed
Distributed.addprocs(length(Sys.cpu_info())-1)
@everywhere using GWAlpha
@time OUT_GWAS = GWAlpha.PoolGPAS(filename_sync=filename_sync, filename_phen=filename_phen_py, maf=0.001, depth=10, model="GWAlpha", fpr=0.01, plot=true)
The GWAlpha model is defined as α = W(μₐₗₗₑₗₑ-μₐₗₜₑᵣₙₐₜᵢᵥₑ)/σᵧ, where:
- μ is the mean of the beta distribution, Beta(θ) where θ={θ₁,θ₂}
- θ is estimated via maximum likelihood L(θ|Q) ∝ πᵢ₌₁₋ₖf(qᵢ|θ)
- Q = {q₁,...,qₖ} is the cumulative sum of allele frequencies across increasing-phenotypic-value-sorted pools where k is the number of pools
- E(allele|θ) = Beta_cdf(yᵢ',θ) - Beta_cdf(yᵢ₋₁',θ), where yᵢ' ∈ Y'
- Y' is the inverse quantile-normalized into phenotype data such that Y' ∈ [0,1]
- W = 2√{E(allele)*(1-E(allele))} is the penalization for low allele frequency
The linear mixed model is defined as y = Xb + Zu + e, where:
- X [n,p] is the centered matrix of allele frequencies
- Z [n,n] is the square symmetric matrix of relatedness
- y [n,1] is the centered vector of phenotypic values
- no intercept is explicitly fitted but implicitly set at the mean phenotypic value as a consequence of centering y
- u ~ N(0, σ²uI)
- e ~ N(0, σ²eI)
- y ~ N(0, V)
- V = (Z (σ²uI) Z') + (σ²eI)
- variance components (σ²e, σ²u) are estimated via maximum likelihood (ML) or restricted maximum likelihood (REML)
- fixed effects (b) are estimated by solving: (X' * V⁻¹) * (X * X' * V⁻¹)⁻¹ * y
- random effects (u) are estimated by solving: (σ²uI) * Z' * V⁻¹ * (y - (X*b))
Empirical p-values were calculated by modelling the additive allelic effects (α) using a normal distribution with mean and variance estimated using maximum likelihood.
Fournier-Level A, Robin C, Balding DJ (2016). GWAlpha: Genome-Wide estimation of additive effects (Alpha) based on trait quantile distribution from pool-sequencing experiments.