Bioinformatics. 2012 Jan 15;28(2):206-13.
doi: 10.1093/bioinformatics/btr628. Epub 2011 Nov 13.

Discovering transcription factor regulatory targets using gene expression and binding data

Mark Maienschein-Cline et al. Bioinformatics.

Abstract

Motivation: Identifying the target genes regulated by transcription factors (TFs) is the most basic step in understanding gene regulation. Recent advances in high-throughput sequencing technology, together with chromatin immunoprecipitation (ChIP), enable mapping TF binding sites genome wide, but it is not possible to infer function from binding alone. This is especially true in mammalian systems, where regulation often occurs through long-range enhancers in gene-rich neighborhoods, rather than proximal promoters, preventing straightforward assignment of a binding site to a target gene.

Results: We present EMBER (Expectation Maximization of Binding and Expression pRofiles), a method that integrates high-throughput binding data (e.g. ChIP-chip or ChIP-seq) with gene expression data (e.g. DNA microarray) via an unsupervised machine learning algorithm for inferring the gene targets of sets of TF binding sites. Genes selected are those that match overrepresented expression patterns, which can be used to provide information about multiple TF regulatory modes. We apply the method to genome-wide human breast cancer data and demonstrate that EMBER confirms a role for the TFs estrogen receptor alpha, retinoic acid receptors alpha and gamma in breast cancer development, whereas the conventional approach of assigning regulatory targets based on proximity does not. Additionally, we compare several predicted target genes from EMBER to interactions inferred previously, examine combinatorial effects of TFs on gene regulation and illustrate the ability of EMBER to discover multiple modes of regulation.

Availability: All code used for this work is available at http://dinner-group.uchicago.edu/downloads.html.

Figures

Fig. 1.
Schematic of the data preparation and integration steps. Part I: DNA microarray preparation, starting with data normalized by standard procedures, for a hypothetical series of four microarray experiments. (A) Different conditions lead to the definition of different binary behavior dimensions. (B) The behavior of each probe set in each dimension is classified by considering the difference in mean expression levels, relative to the standard deviations (cartoon depicts the mean as a dot, with error bars signifying ±1 SD), as described in the text. Binding sites obtained from standard ChIP-seq or ChIP-chip alignment and peak calling (Part II) are integrated with the gene expression data (Part III). (C) Peaks are assigned potential targets by considering all genes within 100 kb of the binding site. In the figure, this region is given by the brackets around the peak. The orange probe sets, which lie within this region, are assigned as potential targets, and the green probe sets are not. This results in a new data structure of integrated binding and gene expression data. (D) These data, together with the background from (B), are used to (1) identify overrepresented patterns in EMBER, and the resulting expression patterns are used to (2) score probe sets and assign regulatory targets.
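The window-based assignment in panel (C) can be illustrated with a minimal Python sketch. It is not the EMBER code: the function name, the tuple layouts, and the choice to represent each gene and each peak by a single coordinate are assumptions made for illustration. Only the 100 kb window comes from the caption.

```python
from collections import defaultdict

WINDOW_BP = 100_000  # 100 kb window, as stated in the Fig. 1C caption


def assign_potential_targets(peaks, genes, window_bp=WINDOW_BP):
    """Map each peak to the genes within window_bp on the same chromosome.

    peaks: list of (peak_id, chrom, position)
    genes: list of (gene_id, chrom, position); one coordinate per gene
           (for example a TSS) is an assumption of this sketch.
    """
    # Group genes by chromosome so each peak only scans its own chromosome.
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, pos in genes:
        genes_by_chrom[chrom].append((pos, gene_id))

    targets = {}
    for peak_id, chrom, peak_pos in peaks:
        targets[peak_id] = sorted(
            gene_id
            for pos, gene_id in genes_by_chrom[chrom]
            if abs(pos - peak_pos) <= window_bp
        )
    return targets


# Hypothetical toy data, not taken from the paper.
peaks = [("peak1", "chr1", 500_000), ("peak2", "chr2", 50_000)]
genes = [
    ("geneA", "chr1", 450_000),
    ("geneB", "chr1", 610_000),
    ("geneC", "chr1", 595_000),
    ("geneD", "chr2", 400_000),
]
potential = assign_potential_targets(peaks, genes)
print(potential)
```

In this toy input, geneA and geneC fall inside the window around peak1, while geneB (110 kb away) and geneD (350 kb from peak2) do not. A peak can therefore have several potential targets or none, and the later scoring step decides which of them are assigned as regulatory targets.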
Fig. 2.
Bias toward BC1000 genes in EMBER versus proximity. Plotted is the log2 ratio of the fraction of potential target BC1000 genes that are selected as targets, to the total fraction of genes that are selected as targets, for each method.
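The plotted quantity can be written out as a short Python sketch. The function and argument names are hypothetical, and reading the "total fraction" as selected genes divided by all potential target genes is an assumption based on the caption wording.

```python
import math


def log2_bias(selected_in_set, potential_in_set, selected_all, potential_all):
    """log2 of (fraction of potential targets in the gene set that are selected)
    over (fraction of all potential targets that are selected)."""
    frac_in_set = selected_in_set / potential_in_set
    frac_all = selected_all / potential_all
    return math.log2(frac_in_set / frac_all)


# Hypothetical counts, not taken from the paper: 40 of 100 potential-target
# genes in the set are selected, versus 200 of 1000 potential targets overall.
bias = log2_bias(40, 100, 200, 1000)
print(bias)
```

A positive value means the method selects genes from the set at a higher rate than it selects potential targets in general; zero means no bias, and a negative value means the set is depleted among the selected targets.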
Fig. 3.
Distance matrix between different TF binding site group score matrices. Distances are defined by Equation (2), and are between 0 and 1 by construction.
Fig. 4.
Visualization of the score matrices for peak groups [ERα] and [e,α]. The values were first scaled to be between 0 and 1, then multiplied by the relative entropy. Down-regulated classifications (−− and −) are scaled below the horizontal axis, and up-regulated classifications (++ and +) are scaled above. The 0 classification is centered around the horizontal axis. Larger bars are over-represented classifications, and smaller bars are under-represented. For example, in the ‘0 to 24 h’ comparison for ATRA treatment (10th behavior dimension from the left), classifications ++ and −− are over-represented for groups [ERα] and [e,α], and classification 0 is the most under-represented (bar height is zero, so it does not appear). The behavior dimensions along the horizontal axis are in the same order as the list of comparisons in supplementary Fig. 2B.
