Mol Cell. 2007 Oct 26;28(2):337-50.
doi: 10.1016/j.molcel.2007.09.027.

A universal framework for regulatory element discovery across all genomes and data types

Olivier Elemento et al. Mol Cell. 2007.

Abstract

Deciphering the noncoding regulatory genome has proved a formidable challenge. Despite the wealth of available gene expression data, there currently exists no broadly applicable method for characterizing the regulatory elements that shape the rich underlying dynamics. We present a general framework for detecting such regulatory DNA and RNA motifs that relies on directly assessing the mutual information between sequence and gene expression measurements. Our approach makes minimal assumptions about the background sequence model and the mechanisms by which elements affect gene expression. This provides a versatile motif discovery framework, across all data types and genomes, with exceptional sensitivity and near-zero false-positive rates. Applications from yeast to human uncover putative and established transcription-factor binding and miRNA target sites, revealing rich diversity in their spatial configurations, pervasive co-occurrences of DNA and RNA motifs, context-dependent selection for motif avoidance, and the strong impact of posttranscriptional processes on eukaryotic transcriptomes.

Figures

Figure 1. Examples of dependencies between motifs and expression
(A) FIRE seeks motifs whose pattern of presence and absence across all promoters (or 3′UTRs) is highly informative about the expression profile for the same genes. The left panel presents a simple example for discrete expression data (e.g., a clustering partition). Here, knowing whether the motif is present or absent provides significant information regarding the identity of the cluster to which the gene is assigned. The right panel presents a simple example for continuous data (e.g., expression log-ratios from a single microarray experiment). Again, knowing whether the motif is present or absent provides significant information regarding the differential expression of the corresponding gene. (B) FIRE also uses mutual information to determine whether a predicted motif has a position bias or an orientation bias. If the distance between the motif and the TSS is significantly informative about gene expression (left panel), a position bias is reported. If the occurrences of the motif on one strand (blue triangles in right panel) are significantly informative about gene expression while the occurrences on the other strand (white triangles) are randomly scattered, an orientation bias is reported.
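The dependency described in panel (A) can be illustrated with a small calculation. The following is a minimal sketch, not the FIRE implementation: it assumes a binary presence/absence profile and discrete cluster labels, and it uses a plain plug-in estimate of mutual information. The function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) between two
    equal-length sequences of discrete values."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


# Toy data (hypothetical): one entry per gene.
# 1 = motif present in the promoter, 0 = absent.
motif_present = [1, 1, 1, 0, 0, 0, 0, 0]
cluster_label = [0, 0, 0, 1, 1, 2, 2, 2]  # motif marks cluster 0 exactly

# A profile that is unrelated to the cluster assignment.
unrelated_motif = [1, 0, 1, 0]
unrelated_label = [0, 0, 1, 1]

mi_informative = mutual_information(motif_present, cluster_label)
mi_unrelated = mutual_information(unrelated_motif, unrelated_label)
print(f"informative motif: {mi_informative:.3f} bits")
print(f"unrelated motif:   {mi_unrelated:.3f} bits")
```

In the first toy case the motif profile fully determines whether a gene belongs to cluster 0, so the mutual information equals the entropy of the presence/absence profile; in the second case presence and cluster are independent and the estimate is zero.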
Figure 2. Optimization process for the motif matching the binding site for Rap1 in yeast
Starting from the seed, FIRE gradually finds motif definitions that are more informative about gene expression (upper panel). These more informative motif definitions are more similar to the known binding site for Rap1 (middle panel), as measured using CompareACE (Hughes et al., 2000). They are also more conserved (with respect to S. bayanus, lower panel).
Figure 3. All predicted DNA and RNA motifs for the yeast gene clustering partition
Columns correspond to gene clusters and rows correspond to predicted motifs arranged into putative functional modules. For each cluster, the most significant GO enrichment is shown at the top. The yellow color-map indicates over-representation of a motif in a given cluster; significant over-representation (p<0.05 after Bonferroni correction) is highlighted using red frames. Similarly, the blue color-map and blue frames indicate under-representation. For each motif, we indicate 1) location, i.e., 5′ upstream region or 3′UTR, 2) mutual information (MI) value, 3) Z-score associated with the MI value, calculated using 10,000 randomization tests, 4) robustness score ranging from 0/10 to 10/10 obtained from 10 jack-knife trials of randomly removing 1/3 of the genes and reassessing the statistical significance of the resulting MI values, 5) position bias indicator (“Y” if a position bias is observed), 6) orientation bias indicator, 7) conservation index, 8) seed that gave rise to the motif, and 9) name of the closest known motif in our motif database (with CompareACE score > 0.8). For more details, see the Supplementary Methods section about FIRE p-value heat-maps.
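The Z-score and robustness score in items 3) and 4) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' code: the null distribution is built by shuffling cluster labels, the Z-score uses the mean and population standard deviation of the shuffled MI values, and a jack-knife trial counts as significant when the observed MI exceeds every shuffled value. That significance criterion, the function names, the reduced shuffle counts, and the toy data are all assumptions made for the example.

```python
import math
import random
from collections import Counter
from statistics import mean, pstdev


def mutual_information(xs, ys):
    """Plug-in mutual information (bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def shuffled_mis(presence, labels, n_shuffles, rng):
    """MI values obtained after randomly permuting the labels."""
    shuffled = list(labels)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        values.append(mutual_information(presence, shuffled))
    return values


def mi_z_score(presence, labels, n_shuffles, rng):
    """Z-score of the observed MI against the shuffled-label distribution."""
    observed = mutual_information(presence, labels)
    null = shuffled_mis(presence, labels, n_shuffles, rng)
    sd = pstdev(null)
    return (observed - mean(null)) / sd if sd > 0 else 0.0


def robustness(presence, labels, rng, n_trials=10, n_shuffles=200):
    """Number of jack-knife trials (out of n_trials) that stay significant
    after randomly removing one third of the genes.
    Assumed criterion: observed MI exceeds every shuffled MI."""
    n = len(presence)
    keep_n = n - n // 3
    passed = 0
    for _ in range(n_trials):
        keep = rng.sample(range(n), keep_n)
        sub_p = [presence[i] for i in keep]
        sub_l = [labels[i] for i in keep]
        observed = mutual_information(sub_p, sub_l)
        if observed > max(shuffled_mis(sub_p, sub_l, n_shuffles, rng)):
            passed += 1
    return passed


# Toy data (hypothetical): 60 genes in 3 clusters, motif present in cluster 0.
presence = [1] * 20 + [0] * 40
labels = [0] * 20 + [1] * 20 + [2] * 20

rng = random.Random(0)  # fixed seed so the sketch is reproducible
z = mi_z_score(presence, labels, n_shuffles=1000, rng=rng)
score = robustness(presence, labels, rng)
print(f"Z-score: {z:.1f}, robustness: {score}/10")
```

The caption reports 10,000 randomizations; the sketch uses fewer only to keep the run time short.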
Figure 4. Interactions among all predicted yeast motifs
Each row/column corresponds to a predicted motif. The color-map indicates the degree to which the presence of one motif implies the presence (light color-map) or the absence (dark color-map) of another motif within the same promoter, as quantified by their interaction information (see Supplementary Methods). Very light colors indicate strong positive co-occurrences between pairs of motifs that have further been used to construct putative functional modules, indicated on the figure (and in Figure 3). Very dark colors indicate that the two motifs tend to avoid being present within the same promoter. Statistically significant information values (p<1e-4) that involve homotypic motif pairs are highlighted using blue (DNA-DNA) and pink (RNA-RNA) frames, while those that involve heterotypic pairs (a DNA motif and an RNA motif) are highlighted using green frames. Significant spatial co-localization between pairs of motifs is denoted with ‘+’. For more details, see the Supplementary Methods section about FIRE interaction heat-maps.
Figure 5. Single array analysis in yeast
All informative motifs obtained from a single microarray experiment analyzing the genome-wide response of an MSN2/MSN4 mutant strain treated with 0.3 mM H2O2. Rows correspond to motifs and columns to (equally sized) groups of genes with a similar expression value (log-ratio). One of these motifs (PAC) is associated with down-regulated genes, while the three others (Rpn4, Yap1, Puf3) are associated with different populations of up-regulated genes.
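One way such equally sized groups might be formed from continuous log-ratios is rank-based binning. The sketch below is an assumption about how this could be done, not a description of FIRE's internals; the function name `equal_size_bins` and the sample log-ratios are hypothetical.

```python
def equal_size_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 so that the bins hold
    (nearly) equal numbers of items, ordered from lowest to highest value."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    n = len(values)
    # Indices of the genes sorted by increasing expression value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, gene_index in enumerate(order):
        # Ranks 0..n-1 are mapped onto bins 0..n_bins-1 in equal blocks.
        bins[gene_index] = rank * n_bins // n
    return bins


# Toy log-ratios (hypothetical), one per gene.
log_ratios = [0.5, -2.0, 1.5, -0.1, 3.0, -1.0]
bins = equal_size_bins(log_ratios, n_bins=3)
print(bins)  # bin 0 = most down-regulated third, bin 2 = most up-regulated third
```

Once each gene carries a bin index, the continuous case reduces to the discrete one: the bin index plays the same role as a cluster label when asking how informative a motif's presence is about expression.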
Figure 6. All predicted motifs for the P. falciparum phase dataset
Motifs shown here are highly informative about the P. falciparum expression phase profile, which indicates, for each of ~2,700 periodically expressed genes, the timing of its maximal expression during the parasite’s intra-erythrocytic developmental cycle (Bozdech et al., 2003). Rows correspond to motifs and columns to (equally sized) groups of genes with a similar phase value. A heat-map representing the expression profile of all periodically expressed genes (ordered by phase) is shown at the top. Motifs are sorted according to the phase range in which they are most over-represented.
Figure 7. Predicted DNA and RNA motifs for the human gene clustering partition
FIRE was applied to a clustering partition of 17,390 human genes, clustered based on tissue expression data (Su et al., 2004). The format of this figure is identical to the format of Figure 3. Due to space limitations, only a selection of the predicted motifs is presented. The complete figure is given in Figure S11. Motif names are reported based on the closest known motif in JASPAR or TRANSFAC, with CompareACE score > 0.8. The miRNAs whose 5′ extremity matches 3′UTR elements with high specificity are also reported (see Supplementary Methods).

References

    1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29.
    2. Beer M, Tavazoie S. Predicting gene expression from sequence. Cell. 2004;117:185–198.
    3. Bolognese F, Wasner M, Dohna CL, Gurtner A, Ronchi A, Muller H, Manni I, Mossner J, Piaggio G, Mantovani R, Engeland K. The cyclin B2 promoter depends on NF-Y, a trimer whose CCAAT-binding activity is cell-cycle regulated. Oncogene. 1999;18:1845–1853.
    4. Bozdech Z, Llinas M, Pulliam BL, Wong ED, Zhu J, DeRisi JL. The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol. 2003;1:E5.
    5. Bucher P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol. 1990;212:563–578.