The next generation of transcription factor binding site prediction

PLoS Comput Biol. 2013;9(9):e1003214. doi: 10.1371/journal.pcbi.1003214. Epub 2013 Sep 5.

Authors

Anthony Mathelier¹, Wyeth W Wasserman

Affiliation

¹ Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada.

PMID: 24039567
PMCID: PMC3764009
DOI: 10.1371/journal.pcbi.1003214

Abstract

Finding where transcription factors (TFs) bind to DNA is key to deciphering gene regulation at the transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs), which quantitatively score binding motifs based on the nucleotide patterns observed in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction, and they do not account for flexible-length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible and can model both position interdependence within TFBSs and variable-length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that conveys properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
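To make the independence assumption of a basic PWM concrete, the following is a minimal sketch, not the authors' implementation. It builds a log-odds matrix from a handful of aligned sites and scores a sequence by summing one term per position. The toy sites, the pseudocount of 1, the uniform background of 0.25, and the function names `build_pwm` and `score_pwm` are all assumptions made for illustration.

```python
import math

NUCS = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every nucleotide at the pseudocount to avoid log(0).
        counts = {nt: pseudocount for nt in NUCS}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({nt: math.log2((counts[nt] / total) / background) for nt in NUCS})
    return pwm

def score_pwm(pwm, seq):
    """Sum per-position log-odds; no term depends on a neighbouring position."""
    return sum(column[nt] for column, nt in zip(pwm, seq))

# Hypothetical aligned binding sites, chosen only for illustration.
toy_sites = ["CACGTG", "CACGTG", "CACGTT", "CATGTG"]
pwm = build_pwm(toy_sites)
best = score_pwm(pwm, "CACGTG")
worse = score_pwm(pwm, "TTTTTT")
print(best > worse)
```

Because the score is a plain sum over columns, such a matrix cannot express that a nucleotide at one position changes the preference at the next, which is the limitation the abstract attributes to PWMs.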

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. HMM schemas.**
(A) 1st-order HMM schema used in 1st-order TFFMs, where the first state represents the background and the following states represent the consecutive positions within a TFBS. Each state emits a nucleotide with a probability dependent on the nucleotide emitted previously. (B) HMM schema used in detailed TFFMs, where each state in the 1st-order HMM is decomposed into four states (one per nucleotide). Transition probabilities reflect the emission probabilities of the 1st-order HMM. This schema allows the start of a TFBS to depend on the nucleotide emitted by the background states.
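The idea that each position emits a nucleotide with a probability conditioned on the previous nucleotide can be illustrated with a simplified sketch. This is a position-specific first-order Markov chain estimated by counting over aligned sites. It is not the HMM of the figure: it has no background state and no Baum-Welch training. The function names, the pseudocount, and the toy sites are assumptions made for illustration.

```python
import math

NUCS = "ACGT"

def train_first_order(sites, pseudocount=1.0):
    """Estimate P(first nucleotide) and P(nt at i | nt at i-1) from aligned sites."""
    length = len(sites[0])
    start = {nt: pseudocount for nt in NUCS}
    # trans[i][prev][nt] holds counts, then probabilities, for position i+1 given position i.
    trans = [{p: {n: pseudocount for n in NUCS} for p in NUCS}
             for _ in range(length - 1)]
    for site in sites:
        start[site[0]] += 1
        for i in range(1, length):
            trans[i - 1][site[i - 1]][site[i]] += 1
    total = sum(start.values())
    start = {nt: count / total for nt, count in start.items()}
    for table in trans:
        for prev in NUCS:
            row_total = sum(table[prev].values())
            for nt in NUCS:
                table[prev][nt] /= row_total
    return start, trans

def log_prob(model, seq):
    """Log-probability of seq, each position conditioned on the previous nucleotide."""
    start, trans = model
    lp = math.log(start[seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(trans[i - 1][seq[i - 1]][seq[i]])
    return lp

# Hypothetical sites in which the second nucleotide depends on the first.
model = train_first_order(["AC", "AC", "GT", "GT"])
print(log_prob(model, "AC") > log_prob(model, "AT"))
```

In this toy set "C" and "T" are equally frequent at the second position, so a column-wise model would score "AC" and "AT" identically, whereas the conditional table favours "AC".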

**Figure 2. Sequence logo representing a TFFM.**
(A) Graphical representation of a TFFM constructed for the Hnf4A TF. Each column corresponds to a position within a TFBS. Each row gives the probability of each nucleotide appearing, conditional on the nucleotide found at the previous position. The opacity of a cell represents the probability of reaching that cell, given the probability of the corresponding nucleotide appearing at the previous position (the higher the opacity, the higher the probability). (B) The summary logo condenses the information in the dense logo in (A). (C) Zoom on the dense TFFM logo for positions 10 to 13 (corresponding to the box in (A)). We observe that a “C” is more likely to appear at position 12 if nucleotide “T” was found at position 11, whereas a “T” is more likely to appear at position 12 if nucleotide “G” was found at position 11.

**Figure 3. Performance comparison between TFFMs and weight matrices.**
For the 96 ChIP-seq data sets obtaining an % for at least one method (using a genomic background), the ratio between the AUC value using a specific model and the best AUC obtained is plotted. The four types of models were used (1st-order TFFM, detailed TFFM, PWM, and DWM). By considering a similar performance between two methods when the AUC ratio is %, we plot at the top of the figure the region where the weight matrices (WMs) best perform, where the TFFMs best perform, and where they are similar. AUC ratios are ranked from the least to the most favourable to the TFFMs.
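The AUC ratio described in this caption can be sketched as follows, under stated assumptions. The AUC is computed here with the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve; the caption does not say how the authors computed it. The function names and the example AUC values are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def auc_ratios(auc_by_model):
    """Divide each model's AUC by the best AUC obtained on the same data set."""
    best = max(auc_by_model.values())
    return {model: value / best for model, value in auc_by_model.items()}

# Hypothetical AUC values for one data set, for illustration only.
example = {"1st-order TFFM": 0.90, "detailed TFFM": 0.88, "PWM": 0.81, "DWM": 0.85}
ratios = auc_ratios(example)
print(max(ratios, key=ratios.get))
```

The best model on a data set always gets a ratio of 1, so the plotted values show how far each other model falls short of it.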

**Figure 4. Performance comparison between 0-order TFFMs, other TFFMs, and weight matrices.**
For the 96 ChIP-seq data sets used in Figure 3 (using genomic background), the ratio between the AUC value using a specific model and the best AUC obtained is plotted. (A) The three types of TFFMs were used (1st-order, detailed, and 0-order TFFMs). AUC ratios are ranked from the least to the most favourable to the 1st-order and detailed TFFMs. We observe that the 1st-order and detailed TFFMs outperform the 0-order TFFMs when discriminating ChIP-seq sequences from genomic background sequences. (B) 0-order TFFMs and WMs were used. AUC ratios are ranked from the least to the most favourable to the 0-order TFFM. We observe that the WMs outperform the 0-order TFFMs when discriminating ChIP-seq sequences from genomic background sequences.

**Figure 5. Correlations between prediction scores and ChIP-seq peak scores or binding affinities.**
(A) ChIP-seq signal values obtained from ENCODE data sets were compared to prediction values obtained with the four different predictive methods. The distribution of Spearman's correlation values from all data sets is given for 1st-order TFFMs, detailed TFFMs, PWMs, and DWMs. An over-representation of Spearman's correlations around 1 (perfect correlation) is found for the four methods. (B) Pearson correlation between scores obtained using the different predictive methods and DNA-binding affinities from .
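As a reminder of what the two correlation measures in this caption compute, here is a minimal pure-Python sketch. Spearman's correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the toy numbers are assumptions for illustration, and no data from the paper are reproduced.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical prediction scores and peak signals, monotonic but not linear.
scores = [1.0, 2.0, 3.0, 4.0]
signals = [10.0, 100.0, 1000.0, 10000.0]
print(spearman(scores, signals), pearson(scores, signals))
```

Spearman's measure depends only on the ordering of the values, which suits a comparison of prediction scores with peak signals whose relationship need not be linear.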

**Figure 6. ROC curve analysis of JunD ChIP-seq data.**
TFFMs allowing a flexible length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs outperform the other models since the corresponding ROC curves are above ROC curves corresponding to other models.

**Figure 7. ROC curve analysis of STAT4 and STAT6 ChIP-seq data.**
TFFMs allowing a flexible-length motif have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs on STAT4 (A) and STAT6 (B) ChIP-seq data. Flexible TFFMs do not perform significantly better than fixed-length TFFMs. DWMs, PWMs, and GLAM2 show lower discriminative power than the TFFMs.

**Figure 8. ROC curve analysis of MafK ChIP-seq data.**
TFFMs allowing a motif with a flexible edge have been compared to PWMs, DWMs, GLAM2, and fixed-length TFFMs. Flexible TFFMs perform slightly better than fixed-length TFFMs and both outperform the other models.

Cited by

A map of direct TF-DNA interactions in the human genome.
Gheorghe M, Sandve GK, Khan A, Chèneby J, Ballester B, Mathelier A. Nucleic Acids Res. 2019 Feb 28;47(4):e21. doi: 10.1093/nar/gky1210. PMID: 30517703. Free PMC article.
Transcription factor-DNA binding: beyond binding site motifs.
Inukai S, Kock KH, Bulyk ML. Curr Opin Genet Dev. 2017 Apr;43:110-119. doi: 10.1016/j.gde.2017.02.007. Epub 2017 Mar 27. PMID: 28359978. Free PMC article. Review.
The long noncoding RNA lnc-HLX-2-7 is oncogenic in Group 3 medulloblastomas.
Katsushima K, Lee B, Kunhiraman H, Zhong C, Murad R, Yin J, Liu B, Garancher A, Gonzalez-Gomez I, Monforte HL, Stapleton S, Vibhakar R, Bettegowda C, Wechsler-Reya RJ, Jallo G, Raabe E, Eberhart CG, Perera RJ. Neuro Oncol. 2021 Apr 12;23(4):572-585. doi: 10.1093/neuonc/noaa235. PMID: 33844835. Free PMC article.
JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles.
Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon JA, Ferenc K, Kumar V, Lemma RB, Lucas J, Chèneby J, Baranasic D, Khan A, Fornes O, Gundersen S, Johansen M, Hovig E, Lenhard B, Sandelin A, Wasserman WW, Parcy F, Mathelier A. Nucleic Acids Res. 2024 Jan 5;52(D1):D174-D182. doi: 10.1093/nar/gkad1059. PMID: 37962376. Free PMC article.
JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles.
Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, Buchman S, Chen CY, Chou A, Ienasescu H, Lim J, Shyr C, Tan G, Zhou M, Lenhard B, Sandelin A, Wasserman WW. Nucleic Acids Res. 2014 Jan;42(Database issue):D142-7. doi: 10.1093/nar/gkt997. Epub 2013 Nov 4. PMID: 24194598. Free PMC article.
