PLoS Comput Biol. 2013;9(9):e1003214.
doi: 10.1371/journal.pcbi.1003214. Epub 2013 Sep 5.

The next generation of transcription factor binding site prediction


Anthony Mathelier et al. PLoS Comput Biol. 2013.

Abstract

Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs), which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction, and they do not account for flexible-length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible and can model both position interdependence within TFBSs and variable-length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
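
As an illustration of the position-independence assumption behind PWMs mentioned in the abstract, here is a minimal sketch of a log-odds PWM built from aligned sites. The toy sites, the pseudocount, the uniform background of 0.25, and the function names are assumptions made for this example and are not taken from the paper.

```python
import math

NUCLEOTIDES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        # Count nucleotides at this position, starting from the pseudocount.
        counts = {nt: pseudocount for nt in NUCLEOTIDES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({nt: math.log2((counts[nt] / total) / background)
                    for nt in NUCLEOTIDES})
    return pwm

def score_pwm(pwm, seq):
    """Sum per-position scores; each position contributes independently."""
    return sum(column[nt] for column, nt in zip(pwm, seq))

# Hypothetical toy binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = build_pwm(sites)
print(score_pwm(pwm, "ACGT"), score_pwm(pwm, "TTTT"))
```

Because the score is a plain sum over columns, the contribution of a nucleotide at one position never changes with its neighbours, which is the limitation the TFFMs are designed to address.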

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. HMM schemas.
(A) 1st-order HMM schema used in 1st-order TFFMs, where the first state represents the background and the following states represent the consecutive positions within a TFBS. Each state emits a nucleotide with a probability dependent on the nucleotide emitted previously. (B) HMM schema used in detailed TFFMs, where each state in the 1st-order HMM is decomposed into four states (one per nucleotide). Transition probabilities reflect the emission probabilities of the 1st-order HMM. This decomposition allows the start of a TFBS to depend on the nucleotide emitted by the background states.
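
The idea in (A), that each position emits a nucleotide with a probability conditioned on the previous nucleotide, can be sketched as follows. This is a simplified illustration, not the authors' training procedure: it estimates conditional probabilities by counting over aligned toy sites, ignores the background state, and treats the first position as unconditional. All names and data are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def train_first_order(sites, pseudocount=0.1):
    """Estimate P(first nt) and P(nt at pos | nt at pos-1) from aligned sites."""
    first = {nt: pseudocount for nt in NUCLEOTIDES}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {nt: c / total for nt, c in first.items()}

    conditional = []  # conditional[i][prev][nt] applies to position i + 1
    for pos in range(1, len(sites[0])):
        table = {prev: {nt: pseudocount for nt in NUCLEOTIDES}
                 for prev in NUCLEOTIDES}
        for site in sites:
            table[site[pos - 1]][site[pos]] += 1
        for prev in NUCLEOTIDES:
            row_total = sum(table[prev].values())
            table[prev] = {nt: c / row_total for nt, c in table[prev].items()}
        conditional.append(table)
    return first, conditional

def log_prob(model, seq):
    """Log-probability of seq under the position-specific first-order model."""
    first, conditional = model
    lp = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        lp += math.log(conditional[pos - 1][seq[pos - 1]][seq[pos]])
    return lp

# Toy sites in which position 2 depends entirely on position 1.
model = train_first_order(["AC", "AC", "TG", "TG"])
print(log_prob(model, "AC"), log_prob(model, "AG"))
```

In this toy set, "C" and "G" are equally frequent at the second position, so a position-independent model would score "AC" and "AG" identically, while the first-order model favours "AC".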
Figure 2. Sequence logo representing a TFFM.
(A) Graphical representation of a TFFM constructed for the Hnf4A TF. Each column corresponds to a position within a TFBS. Each row captures the probability of each nucleotide appearing, depending on the nucleotide found at the previous position. The opacity of a cell represents the probability of reaching this cell, given the probability of appearance of the corresponding nucleotide at the previous position (the higher the opacity, the higher the probability). (B) The summary logo compacts the information of the dense logo in (A). (C) Zoom on the dense TFFM logo for positions 10 to 13 (corresponding to the box in (A)). We observe that a “C” is more likely to appear at position 12 if nucleotide “T” was found at position 11, whereas a “T” is more likely to appear at position 12 if nucleotide “G” was found at position 11.
Figure 3. Performance comparison between TFFMs and weight matrices.
For the 96 ChIP-seq data sets obtaining an formula image% for at least one method (using a genomic background), the ratio between the AUC value using a specific model and the best AUC obtained is plotted. The four types of models were used (1st-order TFFM, detailed TFFM, PWM, and DWM). Considering two methods to perform similarly when the AUC ratio is formula image%, we plot at the top of the figure the regions where the weight matrices (WMs) perform best, where the TFFMs perform best, and where they are similar. AUC ratios are ranked from the least to the most favourable to the TFFMs.
Figure 4. Performance comparison between 0-order TFFMs, other TFFMs, and weight matrices.
For the 96 ChIP-seq data sets used in Figure 3 (using genomic background), the ratio between the AUC value using a specific model and the best AUC obtained is plotted. (A) The three types of TFFMs were used (1st-order, detailed, and 0-order TFFMs). AUC ratios are ranked from the least to the most favourable to the 1st-order and detailed TFFMs. We observe that the 1st-order and detailed TFFMs outperform the 0-order TFFMs when discriminating ChIP-seq sequences from genomic background sequences. (B) 0-order TFFMs and WMs were used. AUC ratios are ranked from the least to the most favourable to the 0-order TFFM. We observe that the WMs outperform the 0-order TFFMs when discriminating ChIP-seq sequences from genomic background sequences.
Figure 5. Correlations between prediction scores and ChIP-seq peak scores or binding affinities.
(A) ChIP-seq signal values obtained from ENCODE data sets were compared to prediction values obtained with the four different predictive methods. The distribution of Spearman's correlation values from all data sets is given for 1st-order TFFMs, detailed TFFMs, PWMs, and DWMs. An over-representation of Spearman's correlations around 1 (perfect correlation) is found for the four methods. (B) Pearson correlation between scores obtained using the different predictive methods and DNA-binding affinities from .
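
For readers unfamiliar with the two correlation measures named in this caption, here is a minimal sketch: Pearson's correlation on raw values, and Spearman's correlation as Pearson's correlation on ranks (ties receive their average rank). The example values are hypothetical and are not data from the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical prediction scores and peak signal values.
scores = [0.2, 0.5, 0.9, 0.7]
signals = [10.0, 40.0, 500.0, 90.0]
print(spearman(scores, signals), pearson(scores, signals))
```

Spearman's measure depends only on the ordering of the values, so it reaches 1 whenever higher scores always coincide with higher signals, even if the relationship is not linear.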
Figure 6. ROC curve analysis of JunD ChIP-seq data.
TFFMs allowing a flexible length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs outperform the other models since the corresponding ROC curves are above ROC curves corresponding to other models.
Figure 7. ROC curve analysis of STAT4 and STAT6 ChIP-seq data.
TFFMs allowing a flexible length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs on STAT4 (A) and STAT6 (B) ChIP-seq data. Flexible TFFMs do not perform significantly better than fixed-length TFFMs. DWMs, PWMs, and GLAM2 have lower discriminative power than the TFFMs.
Figure 8. ROC curve analysis of MafK ChIP-seq data.
TFFMs allowing a motif with a flexible edge have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs perform slightly better than fixed-length TFFMs and both outperform the other models.
