PLoS One. 2024 Apr 16;19(4):e0298906.
doi: 10.1371/journal.pone.0298906. eCollection 2024.

Learning epistatic polygenic phenotypes with Boolean interactions

Merle Behr et al. PLoS One.

Abstract

Detecting epistatic drivers of human phenotypes is a considerable challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions, based on a stabilized likelihood ratio test, by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that probabilistically quantify improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline in two case studies using data from the UK Biobank: predicting red hair and multiple sclerosis (MS). In the case of predicting red hair, epiTree recovers known epistatic interactions surrounding MC1R and novel interactions, representing non-linearities not captured by logistic regression models. In the case of predicting MS, a more complex phenotype than red hair, epiTree rankings prioritize novel interactions surrounding HLA-DRB1, a variant previously associated with MS in several populations. Taken together, these results highlight the potential for epiTree rankings to help reduce the design space for follow-up experiments.
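
The contrast drawn above between multiplicative regression terms and Boolean interactions can be made concrete with a minimal sketch. The thresholds, feature values, and function names below are hypothetical and chosen only for illustration; they are not taken from the epiTree pipeline.

```python
def multiplicative_term(a, b):
    """Interaction feature as used in a regression model: the product a * b."""
    return a * b


def boolean_interaction(a, b, t_a, t_b):
    """Hypothetical Boolean interaction: active only if a <= t_a AND b > t_b.

    This is the kind of rule a decision path in a tree encodes; the
    thresholds and the direction of each inequality are assumptions.
    """
    return int(a <= t_a and b > t_b)


# Two hypothetical individuals with identical product a * b = 1.0 ...
low_a_high_b = (0.5, 2.0)
high_a_low_b = (2.0, 0.5)

# ... that a single product term cannot tell apart,
same_product = multiplicative_term(*low_a_high_b) == multiplicative_term(*high_a_low_b)

# but that fall on different sides of the Boolean rule (thresholds t_a = t_b = 1.0).
rule_first = boolean_interaction(*low_a_high_b, t_a=1.0, t_b=1.0)
rule_second = boolean_interaction(*high_a_low_b, t_a=1.0, t_b=1.0)
```

Here the product term assigns both individuals the same value, while the thresholded rule separates them, which is one way a multiplicative term can miss the form of an interaction.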


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Illustration of the two-step procedure, from gene level analysis (A.) to variant level analysis (B.), for extraction of candidate interactions.
A. First row from left to right: from approx. 10^7 variants (a.), tissue-specific transcripts for approx. 10^4 genes (b.) are imputed using the software PrediXcan. Then the epiTree pipeline is applied to extract interactions for the gene expression features. B. Second row from left to right: for all genes that appear in interactions from the first step (f.), variants within 1MB of the start or end of an interacting gene are extracted (g.). Then the epiTree pipeline is applied to extract interactions for the variant features. For both gene- and variant-level analysis, the epiTree pipeline first applies iRF (c./h.) to extract candidate interactions and then calculates PCS p-values for these candidate interactions on hold-out test data (d./i.). This results in the final set of selected gene-level interactions (e.) and variant-level interactions (j.).
Fig 2
Fig 2. Example of the decision tree-based null model (no epistasis), shown in the top row (A.), and the alternative model (epistasis), shown in the bottom row (B.), for gene expression features A = ASIP and B = DEF8, which are compared via the PCS p-values.
The models were fitted using the CART algorithm [33] on the training data. The decimal digits at the tip nodes correspond to the predicted probability of red hair. The percentage at the tip nodes corresponds to the percentage of training observations falling into this tip node.
Fig 3
Fig 3. ROC curves of the iRF prediction model and competitors on hold-out test data.
“Lasso” stands for a logistic regression model with an additive L1 penalty term on the parameter vector, i.e., a lasso type estimator. The lambda tuning parameter was selected via cross validation using the cv.glmnet R function from the glmnet R package. The “ranger” competitor corresponds to the random forest implementation of the R package ranger with default parameters. Left (A.): using the gene expression features to train a model which predicts ‘red hair’. Right (B.): using the variant features to train a model which predicts ‘red hair’. P-values testing for a difference between iRF’s ROC curve and the respective competitors are computed using the R package pROC and function roc.test, using DeLong’s Algorithm [63].
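
As a rough illustration of the quantity a ROC curve summarizes, the area under the curve (AUC) can be computed directly from labels and predicted scores. The sketch below is plain Python using the rank (Mann-Whitney) formulation; it is not the pROC implementation and does not perform DeLong's test. The function name and toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy hold-out data: 1 = phenotype present, 0 = absent (made-up values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs ordered correctly
```

The quadratic pairwise loop is chosen for readability; for test sets of realistic size a sort-based rank computation would be preferable.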
Fig 4
Fig 4. Location of the coding region for the 8 chromosome 16 genes which appear in the stable gene level interactions found by iRF, together with the location of the MC1R gene.
Fig 5
Fig 5. List of stable gene level interactions found by iRF (stability score > 0.5).
The first column shows the prediction error (defined as cross-entropy, see supplementary Section S1.1.6) on the test data of the learned CART models for both no-epistasis (null, orange) and epistasis (alternative, gray). The second column shows the PCS p-value on a -log10 scale, and the numeric value is shown on the far right, up to two significant digits. The black vertical line in the first column shows the prediction error achieved by iRF using all the gene features simultaneously. The chromosome of the respective gene is shown in parentheses. The 8 genes on chromosome 16 all have their coding region in the vicinity of the MC1R gene, as shown in Fig 4. Note that the prediction error of the null model can be less than the prediction error of the alternative model, as they are evaluated on hold-out test data. Whenever this happens, the PCS p-value is 1 by construction.
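
The cross-entropy prediction error, and the idea of comparing a null and an alternative model by bootstrap resampling of the test set, can be sketched as follows. This is a simplified illustration only: it is not the PCS p-value, which the abstract describes as based on a stabilized likelihood ratio test with simulated null and alternative distributions. The function names, the number of resamples, and the toy predictions are assumptions.

```python
import math
import random


def cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy of predicted probabilities against 0/1 labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def bootstrap_no_improvement_rate(labels, p_null, p_alt, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples of the test set in which the
    alternative (epistasis) model does NOT achieve lower cross-entropy
    than the null (no-epistasis) model. Hypothetical helper, not epiTree's.
    """
    rng = random.Random(seed)
    n = len(labels)
    count = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        y = [labels[i] for i in idx]
        err_null = cross_entropy(y, [p_null[i] for i in idx])
        err_alt = cross_entropy(y, [p_alt[i] for i in idx])
        if err_alt >= err_null:
            count += 1
    return count / n_boot


# Toy test set with made-up predictions from the two models.
y_test = [1, 0, 1, 0]
pred_null = [0.5, 0.5, 0.5, 0.5]
pred_alt = [0.9, 0.1, 0.9, 0.1]
rate_better_alt = bootstrap_no_improvement_rate(y_test, pred_null, pred_alt)
rate_identical = bootstrap_no_improvement_rate(y_test, pred_null, pred_null)
```

In this toy case the alternative model is better on every observation, so it wins in every resample; when the two models give identical predictions, the alternative never strictly improves and the rate is 1, mirroring the caption's remark about the case where the null model's error is no larger.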
Fig 6
Fig 6. Same as Fig 5 for the top 10 order-two variant-level interactions.
Fig 7
Fig 7. Response surface for ASIP-TUBB3.
Right plot (e.): smoothed test data shown as colored hexagons, where the color encodes the percentage of red hair among the test data in each hexagon (hexagons are dropped when they contain fewer than 5 data points). Left plots (a.-d.): response surfaces for the fitted models; top (a./b.): logistic regression model; bottom (c./d.): CART-based model; right (b./d.): epistasis model; left (a./c.): non-epistasis model.
Fig 8
Fig 8. Response surface for ASIP-DBNDD1, otherwise as Fig 7.

References

    1. Bateson W. Mendel’s Principles of Heredity. Cambridge Univ. Press; 1909.
    2. Ritchie MD. Finding the Epistasis Needles in the Genome-Wide Haystack. In: Epistasis. Methods in Molecular Biology (Methods and Protocols). vol. 1253. New York: Humana Press; 2015. p. 19–33.
    3. Bell JT, Timpson NJ, Rayner NW, Zeggini E, Frayling TM, Hattersley AT, et al. Genome-Wide Association Scan Allowing for Epistasis in Type 2 Diabetes: 2D GWA Scan of Type 2 Diabetes. Annals of Human Genetics. 2011;75(1):10–19. doi: 10.1111/j.1469-1809.2010.00629.x
    4. Van Steen K, Moore JH. How to Increase Our Belief in Discovered Statistical Interactions via Large-Scale Association Studies? Human Genetics. 2019;138(4):293–305. doi: 10.1007/s00439-019-01987-w
    5. Nag A, McCarthy MI, Mahajan A. Large-Scale Analyses Provide No Evidence for Gene-Gene Interactions Influencing Type 2 Diabetes Risk. Diabetes. 2020;69(11):2518–2522. doi: 10.2337/db20-0224