Nucleic Acids Res. 2010 Jul;38(12):e135.
doi: 10.1093/nar/gkq274. Epub 2010 May 3.

Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction

Rafik A Salama et al. Nucleic Acids Res. 2010 Jul.

Abstract

Prediction of transcription factor binding sites is an important challenge in genome analysis. The advent of next generation genome sequencing technologies makes the development of effective computational approaches particularly imperative. We have developed a novel training-based methodology intended for prokaryotic transcription factor binding site prediction. Our methodology extends existing models by taking into account base interdependencies between neighbouring positions using conditional probabilities and includes genomic background weighting. This has been tested against other existing and novel methodologies including position-specific weight matrices, first-order Hidden Markov Models and joint probability models. We have also tested the use of gapped and ungapped alignments and the inclusion or exclusion of background weighting. We show that our best method enhances binding site prediction for all of the 22 Escherichia coli transcription factors with at least 20 known binding sites, with many showing substantial improvements. We highlight the advantage of using block alignments of binding sites over gapped alignments to capture neighbouring position interdependencies. We also show that combining these methods with ChIP-on-chip data has the potential to further improve binding site prediction. Finally we have developed the ungapped likelihood under positional background platform: a user friendly website that gives access to the prediction method devised in this work.
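
The idea of scoring a candidate site with conditional probabilities between neighbouring positions, weighted against a background, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`train_first_order`, `log_odds`), the pseudocount, and the use of a simple zero-order base-frequency background are all assumptions made for illustration. The sketch assumes an ungapped block alignment in which every training site has the same length.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_first_order(sites, pseudocount=1.0):
    """Estimate P(base at position 0) and P(base at i | base at i-1)
    from equal-length, ungapped sites (hypothetical helper)."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must be ungapped and equal length"
    n = len(sites)
    first = Counter(s[0] for s in sites)
    start = {b: (first[b] + pseudocount) / (n + 4 * pseudocount) for b in BASES}
    cond = []
    for i in range(1, length):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        prev = Counter(s[i - 1] for s in sites)
        cond.append({
            (a, b): (pairs[(a, b)] + pseudocount) / (prev[a] + 4 * pseudocount)
            for a in BASES for b in BASES
        })
    return start, cond

def log_odds(seq, model, background):
    """Log-likelihood ratio of seq under the neighbour-dependent model
    versus an assumed zero-order background (base -> frequency)."""
    start, cond = model
    score = math.log(start[seq[0]] / background[seq[0]])
    for i in range(1, len(seq)):
        score += math.log(cond[i - 1][(seq[i - 1], seq[i])] / background[seq[i]])
    return score

# Toy data, invented for illustration only.
toy_sites = ["ACGT", "ACGT", "ACGA", "TCGT"]
toy_background = {b: 0.25 for b in BASES}
toy_model = train_first_order(toy_sites)
```

With these toy inputs, `log_odds("ACGT", toy_model, toy_background)` is positive and `log_odds("TTTT", toy_model, toy_background)` is negative, so the consensus-like sequence ranks above the unrelated one.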

Figures

Figure 1.
First-order HMM states for the DNA sequence, with four match states {A, C, G, T} emitting A, C, G or T, respectively, with probability 1. D is the delete (silent) state, which emits no bases, and I is the insert state, which emits A, C, G or T with equal probability. B and E denote the beginning and end states of the HMM.
Figure 2.
Heat maps and sequence logos (35) of the three binding sites under study showing mutual information between bases. Darker squares indicate higher mutual information. (A) Heat map for CRP, (B) sequence logo for CRP, (C) heat map for LexA, (D) sequence logo for LexA, (E) heat map for ArcA and (F) sequence logo for ArcA. For all three genes, there are high levels of mutual information between many neighboring bases, as well as longer range interactions. Mutual information on the minor diagonal represents palindromic correlations.
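
As an illustration of the quantity plotted in such heat maps, the mutual information between two columns of an ungapped alignment can be computed as in the sketch below. The function name `mutual_information` and the choice of bits (log base 2) with no pseudocounts or small-sample correction are assumptions; the source does not state how its values were estimated.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between columns i and j of
    equal-length, ungapped sites (hypothetical helper)."""
    n = len(sites)
    col_i = Counter(s[i] for s in sites)
    col_j = Counter(s[j] for s in sites)
    joint = Counter((s[i], s[j]) for s in sites)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Only observed pairs contribute; unobserved pairs add zero.
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

# Toy columns, invented for illustration only.
fully_dependent = ["AA", "CC", "GG", "TT"]
independent = ["AA", "AC", "CA", "CC"]
```

Evaluating the function for every pair of columns gives a symmetric matrix that could be drawn as a heat map; entries next to the main diagonal correspond to neighbouring positions.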
Figure 3.
ROC curves for the binding sites being studied. (A) CRP, (B) LexA, (C) ArcA and (D) figure legend. Each plot compares the ULPB (green), gapped alignment scoring using the Viterbi algorithm (blue), ungapped alignment using the conditional probability (red), normal PSWM scoring (purple) and ungapped joint probability (grey). In all cases our novel ungapped method either outperforms or matches all other methods.
Figure 4.
(A) ChIP-on-chip analysis of CRP linked with the whole genome showing probes corresponding to known binding sites as blue dots and other probes on the chip in brown. The horizontal line shows the optimal signal cut-off and the vertical line shows the optimal likelihood cut-off. (B) ChIP-on-chip analysis of LexA linked with the whole genome; details as in (A).
Figure 5.
The ULPB website performs the TFBS search in three stages. In the first stage, the likelihood of the training sequences is computed using ULPB. In the second stage, a background model is generated from the search sequences and used as the null hypothesis. In the third stage, the cut-off for the transcription factor likelihood is set at 5% of the background sequences, and the given search sequences are then scored and the binding sites >5% are output.
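
The third stage can be sketched as an empirical threshold on background scores. This is one reading of the caption, not the website's documented behaviour: the sketch assumes the cut-off is the score exceeded by roughly the top 5% of background sequences, and that a window is reported only if it scores strictly above that cut-off. The names `cutoff_from_background` and `call_sites` are hypothetical.

```python
def cutoff_from_background(background_scores, fraction=0.05):
    """Return the score at the boundary of the top `fraction` of
    background scores (assumed interpretation of the 5% cut-off)."""
    ranked = sorted(background_scores, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[k - 1]

def call_sites(scored_windows, cutoff):
    """Keep (position, score) pairs scoring strictly above the cut-off."""
    return [(pos, score) for pos, score in scored_windows if score > cutoff]

# Toy scores, invented for illustration only.
toy_background_scores = list(range(100))
toy_cutoff = cutoff_from_background(toy_background_scores)
toy_hits = call_sites([(0, 96), (1, 95), (2, 10)], toy_cutoff)
```

Deriving the threshold from the search sequences themselves means it adapts to the base composition of the genome being searched, at the cost of an expected false-positive rate tied to the chosen fraction.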
Figure 6.
ClustalW MSAs of ArcA and LexA. The ArcA alignment has many gaps, especially at the start of the sequences. The LexA alignment has few gaps.

