Nat Genet. 2016 May;48(5):488-96.
doi: 10.1038/ng.3539. Epub 2016 Apr 4.

Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin

Sean Whalen et al. Nat Genet. 2016 May.

Abstract

Discriminating the gene target of a distal regulatory element from other nearby transcribed genes is a challenging problem with the potential to illuminate the causal underpinnings of complex diseases. We present TargetFinder, a computational method that reconstructs regulatory landscapes from diverse features along the genome. The resulting models accurately predict individual enhancer-promoter interactions across multiple cell lines with a false discovery rate up to 15 times smaller than that obtained using the closest gene. By evaluating the genomic features driving this accuracy, we uncover interactions between structural proteins, transcription factors, epigenetic modifications, and transcription that together distinguish interacting from non-interacting enhancer-promoter pairs. Most of this signature is not proximal to the enhancers and promoters but instead decorates the looping DNA. We conclude that complex but consistent combinations of marks on the one-dimensional genome encode the three-dimensional structure of fine-scale regulatory interactions.

Conflict of interest statement

Competing interests

The authors declare that they have no competing interests.

Figures

Figure 1
Predictive power of promoter proximal genomic features. Ratio of various ChIP-seq signals anchored at the transcription start sites (TSS) of interacting vs non-interacting promoters in K562, along with the log base 2 fold change (L2FC) and p-value corrected for multiple testing (q). All promoters have activating chromatin marks and show transcription. The top row shows expected patterns for promoter-associated marks at the TSS, such as a high ratio of H3K4me3 to H3K4me1. Some of these marks are enriched in interacting promoters, while others such as K4 methylation patterns are not. The second row shows TSS proximal patterns for several proteins associated with chromatin looping. CTCF and RAD21 are enriched at interacting promoters, while transcription factors CUX1 and HCFC1 are enriched and depleted, respectively.
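The two summary statistics named in this caption can be illustrated with a minimal sketch. It assumes the fold change is the log base 2 of the ratio of mean signals, and it uses the Benjamini-Hochberg procedure for the multiple-testing correction; the caption names neither choice, so both are assumptions, and the function names and input values are hypothetical.

```python
import math
from typing import List


def log2_fold_change(interacting: List[float], non_interacting: List[float]) -> float:
    """L2FC of mean signal: interacting vs non-interacting promoters (assumed definition)."""
    mean_interacting = sum(interacting) / len(interacting)
    mean_non_interacting = sum(non_interacting) / len(non_interacting)
    return math.log2(mean_interacting / mean_non_interacting)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q), returned in the input order.

    Assumption: the caption does not state which correction was used.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum of
    # p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Hypothetical signal values and p-values, for illustration only.
example_l2fc = log2_fold_change([4.0, 4.0], [1.0, 1.0])
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A positive L2FC indicates enrichment in interacting promoters and a negative value indicates depletion.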
Figure 2
Binding co-occurrence at enhancers enriches looping interactions. Ratio of CTCF and RAD21 ChIP-seq signals occurring within interacting enhancers vs non-interacting enhancers, anchored at peaks for CTCF, RAD21, and the transcription factors CUX1 and HCFC1 for the K562 cell line. CUX1 and HCFC1 are highly enriched at loop-associated enhancers when co-occurring with CTCF and RAD21. The context-dependence of protein binding is demonstrated by RAD21, which is not enriched at interacting promoters (Figure 1). Note that CTCF and RAD21 are already enriched at their respective peaks within interacting enhancers, but are further enriched when anchored at CUX1 or HCFC1. This visualizes how the co-occurrence of certain transcription factors increases the likelihood of looping interactions beyond CTCF or RAD21 peaks alone, helps interpret the predictive importance estimated by TargetFinder, and can identify novel looping factors.
Figure 3
Figure 4
The TargetFinder pipeline. Features are generated from hundreds of diverse datasets for pairs of enhancers and promoters of expressed genes found to have significant Hi-C interactions (positives), as well as random pairs of enhancers and promoters without significant interactions (negatives). These labeled samples are used to train an ensemble classifier that predicts whether enhancer-promoter pairs from new or held-out samples interact, as well as estimate the importance of each feature for accurate prediction. Classifier predictions are probabilities, and a decision threshold (commonly 0.5 but may be adjusted) converts these to positive or negative prediction labels. This figure excludes selection of minimal predictor sets and evaluation of the accuracy of output predictions using held-out Hi-C interaction data.
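The thresholding step described in this caption can be sketched as follows. The function name, the example probabilities, and the choice to label a probability exactly equal to the threshold as positive are assumptions made for illustration; they are not taken from TargetFinder.

```python
from typing import List


def to_labels(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Convert predicted probabilities to labels: 1 = interacting, 0 = non-interacting.

    Assumption: a probability equal to the threshold counts as positive.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical classifier outputs for five enhancer-promoter pairs.
probabilities = [0.91, 0.48, 0.50, 0.12, 0.77]
default_labels = to_labels(probabilities)                 # threshold 0.5
strict_labels = to_labels(probabilities, threshold=0.8)   # fewer positive calls
```

Raising the threshold trades recall for precision: fewer pairs are called interacting, but those calls carry higher predicted probability.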
Figure 5
TargetFinder performance by cell line, model type, and number of features. (a) Cross-validated performance of TargetFinder predictions for a baseline (random guessing null) model, a linear Support Vector Machine, a single decision tree, and a boosted ensemble of decision trees. Performance is given as a balance of precision and recall (F1), averaging 83% across cell lines and corresponding to a mean FDR of 12%. Ensemble methods utilize complex interactions between features to greatly increase the accuracy of predicted interactions. Performance is also high on a combined cell line comprised of K562, GM12878, HeLa-S3, and IMR90 datasets, with features restricted to datasets shared by all cell lines. (b) Performance of boosted trees using features for enhancers and promoters only (E/P), promoters and extended enhancers (EE/P), and enhancers/promoters plus the window between (E/P/W). (c) Recursive feature elimination (Methods) evaluates predictor subsets of size 1 up to the maximum per cell line and increasing by powers of 2 for computational efficiency. Near optimal performance was achieved using ~16 predictors for lineage-specific models as well as the combined model, while lower but acceptable performance required 8 predictors. The maximum feature subset size shown is 32 to enhance visibility of smaller feature subsets. NHEK lacks a measurement at subset size 32 since it has fewer than 32 total features. (Error bars = s.e.m.)
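The relationship between F1, precision, recall, and FDR in panel (a), and the doubling schedule of subset sizes in panel (c), can be sketched as below. The confusion counts are hypothetical and were chosen only so that the arithmetic lands near the reported averages; they are not data from the paper, and averages across cell lines need not correspond to any single confusion matrix. Restricting the schedule to powers of 2 that do not exceed the feature count is an assumption, as is the feature count of 20.

```python
from typing import Dict, List


def precision_recall_f1_fdr(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, F1, and FDR from confusion counts."""
    predicted_positive = tp + fp
    actual_positive = tp + fn
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    # FDR is the fraction of positive predictions that are wrong (1 - precision).
    fdr = fp / predicted_positive if predicted_positive else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fdr": fdr}


def subset_sizes(max_features: int) -> List[int]:
    """Predictor subset sizes 1, 2, 4, ... not exceeding max_features (assumed schedule)."""
    sizes = []
    size = 1
    while size <= max_features:
        sizes.append(size)
        size *= 2
    return sizes


# Hypothetical counts: 88 true positives, 12 false positives, 24 false negatives.
metrics = precision_recall_f1_fdr(tp=88, fp=12, fn=24)

# A hypothetical cell line with 20 features has no measurement at size 32.
sizes_small = subset_sizes(20)
sizes_full = subset_sizes(32)
```

With these counts the precision is 0.88, so the FDR is 0.12, and the F1 is 176/212, or roughly 0.83.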
Figure 6
Predictive importance of genomic features by region. Predictive importance (Methods) across cell lines and regions. Importance is discretized by quartiles, and grid entries are colored black when a dataset is unavailable in a cell line. The highest average importance is assigned to features in the window region, followed by promoters. Promoter methylation and POLR2A are more important in the combined “4 Lines” classifier (K562/GM12878/HeLa-S3/IMR90) than in individual cell lines. Highly predictive features such as CAGE are available in most but not all cell lines needed for inclusion in the combined model. Certain TFs are available in multiple cell lines but are not universally predictive, such as FOS in the window region. Other TFs are only available in a single cell line but are highly predictive, such as WHSC1 and ZMIZ1 in the window region of K562 and RUNX3 in the window region of GM12878.
Figure 7
Feature values and predictive importance for enhancer, promoter, and window regions. Despite having the lowest feature values, the predictive importance of the window dominates that of enhancer and promoter regions. (Error bars = 1.5 * interquartile range)
Figure 8
Identification of complex interactions between DNA-binding proteins and epigenetic marks. Scatterplot of univariate feature significance (two-sample Kolmogorov-Smirnov test) versus multivariate feature importance (estimated via a boosted trees classifier) for three cell lines. To highlight datasets that are predictive in combination with other features (multivariate) but not predictive alone (univariate), only features with a multivariate rank less than 25 and univariate rank greater than 25 are shown. For example, the lower right corner of K562 shows H2AZ, WHSC1, CUX1, and SUMO2 are among the top 10 predictive features when the co-localization of other proteins is known. H2AZ has similar context-dependent importance in GM12878 and HeLa-S3. Many features predictive in one or more cell lines are not assayed uniformly and thus cannot be included in the combined model (e.g., HCFC1, CUX1, SUMO2).
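The rank filter described in this caption can be sketched as follows. The sketch assumes that rank 1 is the most significant or most important feature and that a larger score means a higher rank (for example, a Kolmogorov-Smirnov statistic for the univariate score and a boosted-trees importance for the multivariate score). The feature names and scores are synthetic placeholders, not values from the paper.

```python
from typing import Dict, List


def rank_descending(scores: Dict[str, float]) -> Dict[str, int]:
    """Assign rank 1 to the largest score, rank 2 to the next, and so on."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def context_dependent_features(
    univariate_scores: Dict[str, float],
    multivariate_scores: Dict[str, float],
    cutoff: int = 25,
) -> List[str]:
    """Features with multivariate rank < cutoff and univariate rank > cutoff.

    The result is ordered by multivariate rank, most important first.
    """
    univariate_rank = rank_descending(univariate_scores)
    multivariate_rank = rank_descending(multivariate_scores)
    selected = [
        name
        for name in multivariate_rank
        if name in univariate_rank
        and multivariate_rank[name] < cutoff
        and univariate_rank[name] > cutoff
    ]
    return sorted(selected, key=multivariate_rank.get)


# Synthetic scores for 40 placeholder features: the univariate ordering is
# the reverse of the multivariate ordering, so some features pass the filter.
univariate = {f"feature_{i}": float(40 - i) for i in range(40)}
multivariate = {f"feature_{i}": float(i + 1) for i in range(40)}
highlighted = context_dependent_features(univariate, multivariate)
```

Features that pass this filter are weak predictors on their own but rank highly once the classifier can combine them with other features.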
