BMC Bioinformatics. 2009 Feb 11:10:57.
doi: 10.1186/1471-2105-10-57.

A reexamination of information theory-based methods for DNA-binding site identification

Ivan Erill et al. BMC Bioinformatics. 2009.

Abstract

Background: Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods.

Results: Our results reveal that conventional benchmarking against artificial sequence data frequently leads to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and must therefore be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions about information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results.

Conclusion: We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than the alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
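
As a rough illustration of the two per-position measures contrasted above, the sketch below computes an Rsequence-style profile and a Relative Entropy (RE) profile from a set of aligned sites. It assumes the textbook forms (Rsequence as background entropy minus column entropy, RE as the Kullback-Leibler divergence between column and background frequencies) and omits small-sample corrections; the function names and the toy sites are hypothetical and are not taken from the paper.

```python
import math

BASES = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies for equal-length aligned sites."""
    profile = []
    for pos in range(len(sites[0])):
        counts = {b: 0 for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile

def entropy(freqs):
    """Shannon entropy in bits; zero-frequency bases contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)

def rsequence_profile(profile, background):
    """Assumed form: background entropy minus column entropy, per position."""
    h_background = entropy(background)
    return [h_background - entropy(col) for col in profile]

def relative_entropy_profile(profile, background):
    """Assumed form: KL divergence of each column from the background."""
    return [
        sum(p * math.log2(p / background[b]) for b, p in col.items() if p > 0)
        for col in profile
    ]

# Toy aligned sites (illustrative only, not real binding sites).
sites = ["AAAA", "AAAT", "AACT", "AGCT"]
profile = column_frequencies(sites)

uniform = {b: 0.25 for b in BASES}
skewed = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}  # hypothetical GC-rich background

rseq_uniform = rsequence_profile(profile, uniform)
re_uniform = relative_entropy_profile(profile, uniform)
rseq_skewed = rsequence_profile(profile, skewed)
re_skewed = relative_entropy_profile(profile, skewed)
```

Under these assumed definitions the two profiles coincide for an equiprobable background and diverge only when the background is skewed, which is the regime where the abstract reports that RE-based searches are not effective.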

Figures

Figure 1
Search efficiency in the E. coli genome. ROC curves for different IT-based binding site search methods attempting to locate known LexA, Fur, CRP and Fis sites on the E. coli genome. The plot is scaled to encompass a 1/10 true to false positive ratio for the transcription factor with the largest number of known sites (CRP; 210 sites). Vertical arrows indicate this same ratio for all transcription factors.
Figure 2
Search efficiency for E. coli CRP sites in a skewed random background. ROC curves for search methods trying to locate 210 CRP binding sites on randomly generated backgrounds. The ROC curve depicts the mean and standard deviation of three independent experiments (searches against three independently generated backgrounds). The plot is scaled to encompass a 1/10 true to false positive ratio (2100 false positives) in the equiprobable background. RE' results, which completely overlap RE · BvH ones, are not shown for clarity. The RE profiles for CRP against the different backgrounds are shown in the bottom-right inset.
Figure 3
Search efficiency for E. coli Fur sites in a skewed random background. ROC curves for search methods trying to locate 45 Fur binding sites on randomly generated backgrounds. The ROC curve depicts the mean and standard deviation of three independent experiments (searches against three independently generated backgrounds). The plot is scaled to encompass a 1/10 true to false positive ratio (450 false positives) in the equiprobable background. RE' results, which completely overlap RE · BvH ones, are not shown for clarity. The RE profiles for Fur against the different backgrounds are shown in the bottom-right inset.
Figure 4
Search efficiency for Fur sites in E. coli and P. aeruginosa. ROC curves for search methods trying to locate P. aeruginosa and E. coli Fur binding sites on, respectively, P. aeruginosa and E. coli genomes. Abbreviations: Eco – E. coli, Hin – H. influenzae. The plot is scaled to encompass a 1/10 true to false positive ratio (320 false positives) in P. aeruginosa.
Figure 5
Search efficiency for CRP sites in E. coli and H. influenzae. ROC curves for search methods trying to locate H. influenzae and E. coli CRP binding sites on, respectively, H. influenzae and E. coli genomes. Abbreviations: Eco – E. coli, Hin – H. influenzae. The plot is scaled to encompass a 1/10 true to false positive ratio (450 false positives) in H. influenzae.
Figure 6
Information profile for P. aeruginosa Fur and H. influenzae CRP motifs. (A) Rsequence and RE profiles for Fur on the P. aeruginosa genome. (B) Rsequence and RE profiles for CRP on the H. influenzae genome, and for the mean Rsequence profile obtained from 10,000 45-site subsamples of the 210 E. coli binding sites. Vertical bars show the standard deviation.
Figure 7
Observed vs. expected frequency of 20-mers in genomes. Mean ratio between observed and expected 20-mers in real genomes versus randomly generated sequences. Ratios were computed independently for 3 different genomes and 3 random sequences of similar %GC composition. Vertical bars show the standard deviation of these ratios. Genomes used for calculations: E. coli str. K-12 substr. MG1655 [50.8% GC], P. aeruginosa PAO1 [66.6% GC], H. influenzae Rd KW20 [38.1% GC], Colwellia psychrerythraea 34H [38.0% GC], Salinibacter ruber DSM 13855 [66.2% GC], Thiobacillus denitrificans ATCC 25259 [66.1% GC], Enterococcus faecalis V583 [37.5% GC], Anaplasma marginale str. St. Maries [49.8% GC] and Nitrosococcus oceani ATCC 19707 [50.3% GC].
Figure 8
Standard and effective affinity range for different transcription factors. (a) Estimation of the affinity range for the different transcription factors analyzed in this work. For each transcription factor, the affinity range is represented as the distribution of affinities for all its experimentally determined binding sites. The affinity of each binding site is estimated using the Rsequence · BvH ranking index. (b) Estimation of the effective affinity range. For each transcription factor, the effective affinity range is represented as the distribution of normalized affinities for all its experimentally determined binding sites. Normalized affinities are estimated by normalizing the Rsequence · BvH ranking index for each site with the number of false positives required to find that site. For comparison purposes, in both affinity range plots Rsequence · BvH values (Y-axis) are normalized to the length of the binding motif for each transcription factor and ranges (X-axis) are shown as the percentage of experimentally determined sites (collection).
Figure 9
Search efficiency in E. coli with "weakened" CRP sites. Mean ROC curves for the Ri search method trying to locate CRP binding sites on the E. coli genome, using the original, asymmetric and mirrored collections of CRP. The plot is scaled to encompass a 1/10 true to false positive ratio for CRP (2100 false positives). The Rsequence profiles of the original, asymmetric and mirrored CRP motifs are shown in the inset.
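
The ROC-style curves described in these captions count how many known sites are recovered as a function of the number of false positives accepted, with the plots cut off at ten false positives per known site (for example, 2100 for 210 CRP sites). A minimal sketch of that bookkeeping, assuming each genome position already has a score from some search method, could look as follows; the function names, scores and positions are hypothetical and purely illustrative.

```python
def roc_points(scores, known_sites):
    """Walk positions from best to worst score and record the running
    (false_positives, true_positives) counts after each position."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    true_pos = false_pos = 0
    points = []
    for position in ranked:
        if position in known_sites:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))
    return points

def sites_found_within(points, max_false_positives):
    """Known sites recovered before exceeding the false-positive budget."""
    return max((tp for fp, tp in points if fp <= max_false_positives), default=0)

# Toy data: position -> score, plus the positions of known sites.
scores = {0: 9.0, 1: 8.0, 2: 7.5, 3: 3.0, 4: 2.0}
known_sites = {0, 2, 4}

points = roc_points(scores, known_sites)
found = sites_found_within(points, max_false_positives=1)

# The captions scale each plot to ten false positives per known site.
budget = 10 * len(known_sites)
```

Here `found` is the number of known sites ranked ahead of the second false positive; with real data the budget would be set to `10 * len(known_sites)` to match the scaling used in the figures.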
