BMC Genet. 2009 May 27;10:23.
doi: 10.1186/1471-2156-10-23.

Disentangling molecular relationships with a causal inference test

Joshua Millstein et al. BMC Genet.

Abstract

Background: There has been intense effort over the past couple of decades to identify loci underlying quantitative traits as a key step in the process of elucidating the etiology of complex diseases. Recently there has been some effort to coalesce non-biased high-throughput data, e.g. high density genotyping and genome wide RNA expression, to drive understanding of the molecular basis of disease. However, a stumbling block has been the difficult question of how to leverage this information to identify molecular mechanisms that explain quantitative trait loci (QTL). We have developed a formal statistical hypothesis test, resulting in a p-value, to quantify uncertainty in a causal inference pertaining to a measured factor, e.g. a molecular species, which potentially mediates a known causal association between a locus and a quantitative trait.

Results: We treat the causal inference as a 'chain' of mathematical conditions that must be satisfied to conclude that the potential mediator is causal for the trait, where the inference is only as good as the weakest link in the chain. P-values are computed for the component conditions, which include tests of linkage and conditional independence. The Intersection-Union Test, in which a series of statistical tests are combined to form an omnibus test, is then employed to generate the overall test result. Using computer simulated mouse crosses, we show that type I error is low under a variety of conditions that include hidden variables and reactive pathways. We show that power under a simple causal model is comparable to other model selection techniques as well as Bayesian network reconstruction methods. We further show empirically that this method compares favorably to Bayesian network reconstruction methods for reconstructing transcriptional regulatory networks in yeast, recovering 7 out of 8 experimentally validated regulators.
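The combining step described above can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function name `iut_p_value`, the dictionary keys, and the four p-values are hypothetical. It assumes only what the abstract and the Figure 3 caption state, namely that the omnibus result of an Intersection-Union Test is the largest of the component p-values, so the inference is only as strong as its weakest link.

```python
def iut_p_value(component_p_values):
    """Return the Intersection-Union omnibus p-value: the largest component p-value.

    The omnibus null is rejected at level alpha only if every component
    test rejects at level alpha, which is equivalent to max(p) <= alpha.
    """
    p_values = list(component_p_values)
    if not p_values:
        raise ValueError("at least one component p-value is required")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    return max(p_values)


# Hypothetical p-values for four component tests (illustrative numbers only).
component_p = {
    "test_1": 0.001,
    "test_2": 0.020,
    "test_3": 0.0005,
    "test_4": 0.040,
}

omnibus_p = iut_p_value(component_p.values())
weakest_link = max(component_p, key=component_p.get)
print(f"omnibus p-value = {omnibus_p} (set by {weakest_link})")
```

Because the maximum is taken, no multiplicity correction is applied across the component tests: one weak component is enough to block the causal call.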

Conclusion: Here we propose a novel statistical framework in which existing notions of causal mediation are formalized into a hypothesis test, thus providing a standard quantitative measure of uncertainty in the form of a p-value. The method is theoretically and computationally accessible and with the provided software may prove a useful tool in disentangling molecular relationships.

Figures

Figure 1
Estimated distributions of the test statistic, Z*, under the null for the equivalence test of conditional independence between the locus and the trait. An additive effect for a single biallelic locus under a simple independence model was simulated for both the gene and the trait under normally distributed errors. Minor allele frequency = .2. Sample size = 1000. B = 500.
Figure 2
Negative log 10 p-values for the semi-parametric and non-parametric versions of the CIT applied to 10,000 replicate data sets simulated under the independence model. For each replicate of 100 observations the genetic variance for the gene and the trait were each randomly sampled from a uniform distribution ranging from 8 to 32 percent. The gene and trait were normally distributed and a biallelic locus was simulated with allele frequency of .5.
Figure 3
Schematic diagram of the CIT. A) Study subjects are sampled for i) a trait of interest, T (for example, cholesterol or fat mass), ii) a potential mediating factor, G (for example, an mRNA or protein concentration), and iii) genotype at a polymorphic locus, L, that is thought to affect both G and T. B) The four component tests of the CIT are conducted yielding four corresponding p-values. (Plots are shown as a conceptual device, see text for details of the actual tests.) Associations in 1 through 3 but not 4 are consistent with causal mediation. C) The largest of the four p-values becomes the omnibus p-value, the final result of the CIT.
Figure 4
Four causal inference strategies (CC, CIT, BNC, and AIC) were applied to simulated data under five distinct causal models, A-E, shown above. Here a genotype marker at a specific locus is denoted by L, a gene corresponding to measured transcript abundance is denoted by G, and a measured clinical trait is denoted by T. H denotes an unmeasured molecular trait.
Figure 5
Marginal effect sizes (sample R² values) for all six causality scenarios. Causal models are: causal (C), reactive (R), independent (I), hidden variable affecting both traits (H), and no associations between genotypes and traits (Null). R² values are shown for all replicate datasets within each causal scenario.
Figure 6
Type I error and power comparison between causality methods derived from computer-simulated F2 mouse crosses. For each autosome of each replicate cross of N = 1000 total crosses, a clinical trait and potential mediating trait were simulated under a variety of true causal scenarios. For each scenario, a wide range of positive and negative effect sizes were randomly selected for each chromosome of each cross. 'Neighbors' denote chromosome-specific QTL peak pairs. Causal models are: causal (C), reactive (R), independent (I), hidden variable affecting both traits (H), and no associations between genotypes and traits (Null).
Figure 7
Type I error and power comparison between causality methods derived from computer-simulated F2 mouse crosses. For each autosome of each replicate cross of N = 1000 total crosses, a clinical trait and potential mediating trait were simulated under a variety of true causal scenarios. For each scenario, a wide range of positive and negative effect sizes were randomly selected for each chromosome of each cross. 'Neighbors' denote chromosome-specific QTL peak pairs. Causal models are: causal (C), reactive (R), independent (I), hidden variable affecting both traits (H), and no associations between genotypes and traits (Null). Filtering criteria were applied such that only neighbors where both QTL peaks achieved a p-value of .001 or smaller were tested. For the AIC and BNC methods, a bootstrap consistency of .7 was required to accept the causal call. Note that 'power' is estimated ignoring those gene-trait pairs that did not both meet the p-value significance threshold.
Figure 8
Type I error and power comparison between causality methods derived from computer-simulated F2 mouse crosses. For each autosome of each replicate cross of N = 1000 total crosses, a clinical trait and potential mediating trait were simulated under a variety of true causal scenarios. For each scenario, a wide range of positive and negative effect sizes were randomly selected for each chromosome of each cross. 'Neighbors' denote chromosome-specific QTL peak pairs. Causal models are: causal (C), reactive (R), independent (I), hidden variable affecting both traits (H), and no associations between genotypes and traits (Null). Unlike all other results reported here, the causal model was tested using the gene QTL peak marker and the reactive model was tested using the clinical trait QTL peak marker.
Figure 9
Schematic of an eQTL hotspot, a locus identified to affect transcript abundances for many genes. Directly affected are genes in cis, some of which, the 'cis regulators', propagate the 'perturbation' to other genes. Among cis regulated genes are the putative cis regulators identified by BN.full (yellow) as well as targeted in vivo experimentation (red).
Figure 10
CIT reconstructed causal transcriptional regulatory network. Yellow circles indicate putative hotspot regulators from Table 1, and red circles indicate those that have been experimentally validated.
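
The idea behind the conditional independence condition, in the L, G, T notation of the Figure 3 and Figure 4 captions, can be illustrated with a small simulation. This is a hedged sketch rather than the paper's procedure: it uses a plain linear partial correlation instead of the equivalence test mentioned in the Figure 1 caption, and the function names, effect sizes, sample size, and allele frequency are all assumptions chosen for illustration. Under a causal chain L → G → T, the locus and the trait should be approximately uncorrelated once G is adjusted for, whereas under an independent model (L affects G and T separately) the association persists.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after linear adjustment of both for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(model, n=5000, seed=0):
    """Simulate (L, G, T) under a 'causal' or 'independent' model.

    Effect sizes of 1 and unit-variance normal errors are illustrative
    assumptions, not values taken from the paper.
    """
    if model not in ("causal", "independent"):
        raise ValueError("model must be 'causal' or 'independent'")
    rng = random.Random(seed)
    L, G, T = [], [], []
    for _ in range(n):
        # Biallelic locus coded 0/1/2, allele frequency 0.5.
        l = int(rng.random() < 0.5) + int(rng.random() < 0.5)
        g = l + rng.gauss(0, 1)          # L affects G in both models
        if model == "causal":
            t = g + rng.gauss(0, 1)      # L -> G -> T
        else:
            t = l + rng.gauss(0, 1)      # L -> G and L -> T separately
        L.append(l)
        G.append(g)
        T.append(t)
    return L, G, T


for model in ("causal", "independent"):
    L, G, T = simulate(model)
    print(model,
          "marginal r(L,T) =", round(pearson(L, T), 3),
          "partial r(L,T|G) =", round(partial_corr(L, T, G), 3))
```

In both models L and T are marginally associated, so the marginal correlation alone cannot separate them; only the adjusted association distinguishes mediation from independent effects in this simplified setting.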
