Nat Biotechnol. 2013 Feb;31(2):126-34.
doi: 10.1038/nbt.2486. Epub 2013 Jan 27.

Evaluation of methods for modeling transcription factor sequence specificity

Matthew T Weirauch et al. Nat Biotechnol. 2013 Feb.

Abstract

Genomic analyses often involve scanning for potential transcription factor (TF) binding sites using models of the sequence specificity of DNA binding proteins. Many approaches have been developed to model and learn a protein's DNA-binding specificity, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray data for 66 mouse TFs belonging to various families. For nine TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro-derived motifs performed similarly to motifs derived from the in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices trained by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases (<10% of the TFs examined here). In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.

Figures

Figure 1
Evaluation criteria used in this study. For each TF, we scored an algorithm’s probe intensity predictions using two evaluation criteria, which are illustrated here for TF_16 (Prdm11), using the predictions of BEEML-PBM on the raw array intensity data. (a) Pearson correlation between predicted and actual probe intensities across all ~40,000 probes. (b) Area under the receiver operating characteristic curve (AUROC) of the set of positive probes. Positive probes (black dashed lines) were defined as all probes on the test array with intensities greater than four standard deviations above the mean probe intensity for the given array.
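The two criteria in this caption can be sketched in a few lines of NumPy. This is a minimal illustration and not the study's evaluation code: the function names are hypothetical, the population standard deviation is assumed (the caption does not say which estimator was used), and the AUROC is computed by direct pairwise comparison with ties counted as one half.

```python
import numpy as np


def pearson(pred, actual):
    """Pearson correlation between predicted and measured probe intensities."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.corrcoef(pred, actual)[0, 1])


def positive_mask(actual, n_sd=4.0):
    """Flag probes whose intensity exceeds mean + n_sd * SD for the array.

    Assumption: population SD (NumPy's default, ddof=0).
    """
    actual = np.asarray(actual, dtype=float)
    return actual > actual.mean() + n_sd * actual.std()


def auroc(pred, positives):
    """Probability that a positive probe is scored above a negative probe.

    Ties contribute 0.5. Memory use is n_pos * n_neg, which is acceptable
    when only a small fraction of probes are positive.
    """
    pred = np.asarray(pred, dtype=float)
    positives = np.asarray(positives, dtype=bool)
    pos, neg = pred[positives], pred[~positives]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative probe")
    greater = np.sum(pos[:, None] > neg[None, :])
    equal = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * equal) / (pos.size * neg.size))
```

Given a vector of measured intensities `actual` and a model's predictions `pred` for the same probes, `pearson(pred, actual)` corresponds to panel (a) and `auroc(pred, positive_mask(actual))` to panel (b).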
Figure 2
Comparison of algorithm performance by transcription factor. (a) Final score of each algorithm for each TF. TF name, ID and family are depicted across the columns, and sequence specificity model type and name are depicted across the rows. Color scale is indicated at the upper right. Algorithms are sorted in decreasing order of final performance across all TFs. TFs are sorted in decreasing order of mean final score across all algorithms. (b) Summary statistics for each TF across all algorithms: mean final score, maximum final score achieved by any k-mer, dinucleotide or PWM-based algorithm, Pearson correlation of 8-mer Z-scores between replicate arrays, and the number of 8-mers with E-scores > 0.45 on the training array (normalized by the maximum such value across all TFs). (c) Difference between the best score achieved by any k-mer based algorithm and the best score achieved by any PWM-based algorithm for each TF.
Figure 3
Comparison of algorithm performance on in vivo data. For each algorithm, we learned a model (PWM, k-mer or dinucleotide) using PBM data, and gauged its ability to discriminate real from random ChIP peaks using the AUROC (Online Methods). Data for the first five TFs were taken from mouse ChIP-seq data. The final four are from yeast ChIP-exo data. The color scale is indicated at the bottom. Team_E was not run on the ChIP-exo data, because it requires initialization parameters specific to the individual TF. FeatureREDUCE was run using models of length 8, instead of length 10, owing to the superior performance of this length model on in vivo data (T.R. Riley and H.J. Bussemaker, manuscript in preparation).
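To illustrate how a PWM can assign a single score to a peak sequence before an AUROC is computed, here is a minimal sketch. The scoring rule is an assumption made for illustration (best log-odds window on the forward strand, uniform background, small pseudocount); the actual procedure is described in the paper's Online Methods and may differ. The names `log_odds` and `best_window_score` are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def log_odds(freqs, background=0.25, pseudocount=0.01):
    """Convert per-position base frequencies (rows: positions, columns: A, C, G, T)
    into log2 odds against a uniform background (assumed)."""
    p = np.asarray(freqs, dtype=float) + pseudocount
    p = p / p.sum(axis=1, keepdims=True)
    return np.log2(p / background)


def best_window_score(seq, pwm):
    """Highest PWM score over all forward-strand windows of the sequence.

    Raises ValueError for non-ACGT characters or sequences shorter than the PWM.
    """
    width = pwm.shape[0]
    idx = [BASES.index(base) for base in seq.upper()]
    if len(idx) < width:
        raise ValueError("sequence is shorter than the PWM")
    return max(
        sum(pwm[j, idx[i + j]] for j in range(width))
        for i in range(len(idx) - width + 1)
    )
```

Scoring every real and random peak this way yields one number per sequence; the AUROC then measures how well those numbers separate the two sets. A fuller treatment would also scan the reverse complement.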
Figure 4
Characteristics of Klf9 motifs produced by the eight PWM-based algorithms evaluated in this study. The algorithms are ranked top to bottom in order of the overall score of their PWM for this TF in our evaluation scheme. Two popular visualization methods of the PWMs produced by each algorithm are depicted: on the left are traditional sequence logos, which display the information content of each nucleotide at each position; the total information content of the PWM is given to the left of this logo. On the right are frequency logos, in which the height of each nucleotide corresponds to its frequency of occurrence at the given position.
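The difference between the two logo types can be made concrete with a short sketch. It uses the standard definition of per-position information content, 2 + sum(p * log2 p) bits, under the assumptions of a uniform background and no small-sample correction; the paper may use a different variant. The function names are hypothetical.

```python
import numpy as np


def information_content(freqs):
    """Bits of information at each PWM position (rows: positions, columns: A, C, G, T).

    Assumes a uniform background, so the maximum is 2 bits per position.
    """
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)
    # Treat 0 * log2(0) as 0, its limiting value.
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    return 2.0 + plogp.sum(axis=1)


def sequence_logo_heights(freqs):
    """Letter heights in a sequence logo: frequency scaled by position information."""
    p = np.asarray(freqs, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)
    return p * information_content(p)[:, None]
```

In a frequency logo the letter heights are simply the normalized frequencies, so every column has total height 1. In a sequence logo a degenerate position shrinks toward zero, and summing `information_content` over positions gives the total information content reported beside each logo.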
