Proc Natl Acad Sci U S A. 2015 Apr 14;112(15):4654-9.
doi: 10.1073/pnas.1422023112. Epub 2015 Mar 9.

Quantitative modeling of transcription factor binding specificities using DNA shape

Tianyin Zhou et al. Proc Natl Acad Sci U S A. 2015.

Abstract

DNA binding specificities of transcription factors (TFs) are a key component of gene regulatory processes. Underlying mechanisms that explain the highly specific binding of TFs to their genomic target sites are poorly understood. A better understanding of TF-DNA binding requires the ability to quantitatively model TF binding to accessible DNA as its basic step, before additional in vivo components can be considered. Traditionally, these models were built based on nucleotide sequence. Here, we integrated 3D DNA shape information derived with a high-throughput approach into the modeling of TF binding specificities. Using support vector regression, we trained quantitative models of TF binding specificity based on protein binding microarray (PBM) data for 68 mammalian TFs. The evaluation of our models included cross-validation on specific PBM array designs, testing across different PBM array designs, and using PBM-trained models to predict relative binding affinities derived from in vitro selection combined with deep sequencing (SELEX-seq). Our results showed that shape-augmented models compared favorably to sequence-based models. Although both k-mer and DNA shape features can encode interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space. In addition, analyzing the feature weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not revealed by the DNA sequence. As such, this work combines knowledge from structural biology and genomics, and suggests a new path toward understanding TF binding and genome function.

Keywords: DNA structure; protein binding microarray; protein−DNA recognition; statistical machine learning; support vector regression.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Design of the sequence+shape feature vector, and TF family-specific performance comparison of binding specificity predictions. (A) The feature vector used in the 1mer+shape model combined binary features for the sequence (1-mers) with continuous values for the DNA shape features (MGW, ProT, Roll, and HelT). In addition, second-order shape features were also used throughout this study (see SI Methods for details). (B) Performance comparison for different TF families tested in this study. DNA shape contributed to the DNA binding specificities of all homeodomain and bHLH TFs in the uPBM, gcPBM, and SELEX-seq datasets, consistent with previous work on these TF families (9, 22, 30, 31, 34).
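The feature layout described in panel A can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary input format, and all shape values are hypothetical placeholders, and second-order shape features are left out. It assumes the caller already has per-position shape values from some prediction tool.

```python
# One-hot encode each nucleotide (1-mer features) and append the shape values.
# All shape numbers below are made-up placeholders, not real predictions.
BASES = "ACGT"
SHAPE_NAMES = ("MGW", "ProT", "Roll", "HelT")


def one_hot(seq):
    """Return 4 binary features per position, in A, C, G, T order."""
    features = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        features.extend(1.0 if base == b else 0.0 for b in BASES)
    return features


def sequence_plus_shape(seq, shape):
    """Concatenate 1-mer features with caller-supplied shape values.

    `shape` maps each name in SHAPE_NAMES to a list of floats (a hypothetical
    input format). Lists may differ in length because Roll and HelT are
    defined for base-pair steps rather than single base pairs.
    """
    vector = one_hot(seq)
    for name in SHAPE_NAMES:
        vector.extend(float(v) for v in shape[name])
    return vector


example_shape = {
    "MGW": [5.1, 4.8, 5.3, 5.0],       # placeholder values
    "ProT": [-6.2, -7.0, -5.5, -6.8],  # placeholder values
    "Roll": [1.2, -0.4, 3.1],          # one value per base-pair step
    "HelT": [34.0, 35.2, 33.1],        # one value per base-pair step
}
vec = sequence_plus_shape("CACG", example_shape)
print(len(vec))  # 16 binary sequence features + 14 continuous shape features
```

In practice the continuous shape values would usually be rescaled before training so that they are comparable in magnitude to the binary features; that step is omitted here.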
Fig. 2.
Performance of various models on uPBM data for 65 mouse TFs. (A and B) Using R² as a measure for prediction accuracy, the performance of the shape-augmented model (1mer+shape) was compared with the performances of sequence-based (A) 1mer+2mer and (B) 1mer+2mer+3mer models. (C) Performance comparison of the shape-only model to the sequence-only (1mer) model. (D) Performance comparison of the shape-only model to a model augmented by randomized shape features.
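For readers unfamiliar with the accuracy measure, the sketch below computes the coefficient of determination between observed and predicted values. The caption does not state which definition the authors used (the squared Pearson correlation is another common convention), so this is an assumption, and the function name is illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction: 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # predicting the mean: 0.0
```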
Fig. 3.
Performance of binding specificity models across experimental platforms. (A) Flowchart illustrating that Max−DNA binding specificity models were trained on natural logarithms of fluorescence binding intensities measured by a gcPBM experiment and used to predict the binding level of DNA targets derived from a SELEX-seq experiment for the same TF. (B) Performance of various models for cross-platform predictions based on Spearman’s rank correlation coefficients between observed SELEX-seq relative binding affinities and predicted gcPBM signal intensities. The P values indicate that the improvement in prediction accuracy using the 1mer+shape model is significant compared with the sequence-based models. (C and D) Scatter plots of predicted versus observed binding site ranks, showing the performance of the (C) 1mer and (D) 1mer+shape models trained on gcPBM data and tested on SELEX-seq data. Here, higher ranks represent higher-affinity binding sites.
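The cross-platform comparison in panel B relies on rank correlation, which can be sketched with the standard library alone. This is an illustrative implementation (Pearson correlation of average ranks), not the authors' pipeline; the intensity and affinity numbers are invented for the example.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    return pearson(average_ranks(x), average_ranks(y))


# Invented numbers: PBM-style intensities and SELEX-style relative affinities.
intensities = [1200.0, 5400.0, 800.0, 15000.0]
log_intensities = [math.log(v) for v in intensities]  # natural logarithm
relative_affinities = [0.2, 0.5, 0.1, 1.0]
rho = spearman(log_intensities, relative_affinities)
print(rho)
```

Because the logarithm is monotonic, taking logs of the intensities does not change their ranks, so the rank correlation is the same with or without the transform. With ascending ranks, a higher rank corresponds to a higher-affinity site, matching the convention in panels C and D.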
Fig. 4.
Comparison of various models using gcPBM data for human Max TF. (A) The number of required features (per nucleotide position) was correlated with the average running time for the training and testing of different models for Max−DNA binding, based on gcPBM data. (B) Performances of the sequence- and shape-based models for Max−DNA binding as the sample size was decreased.
Fig. 5.
Insights into TF-specific readout mechanisms derived from shape-augmented binding specificity models. (A) Number of features per nucleotide position introduced in different models. Models that included only one shape feature further reduced the total number of features compared with the 1mer+2mer+3mer and 1mer+shape models. (B) The single-shape-feature models 1mer+ProT and 1mer+Roll performed better than the 1mer+shape and 1mer+2mer+3mer models on smaller datasets. (C) Feature weights for Roll (heat map) derived from the 1mer+first order shape model using SVR accurately reflected the Roll characteristics (plot) observed in the cocrystal structure of the ternary Max homodimer/DNA complex [Protein Data Bank identifier (PDB ID) 1HLO] (32). (D) The CACGTG E-box in the cocrystal structure was the highest-affinity core among the nine observed E-box cores. Although other cores were present in the gcPBM data for Max, the SVR feature weights correctly reflected the Roll features of the CACGTG core observed in the cocrystal structure (32).
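The dimensionality argument behind panel A can be made concrete with a short count. The k-mer numbers follow directly from the four-letter alphabet (4^k possible k-mers). The shape numbers rest on an assumption not confirmed by the caption: one first-order value and, optionally, one second-order value per shape feature per position, ignoring edge effects. The function names are illustrative.

```python
def kmer_features_per_position(max_k):
    """Binary features per position for all k-mers with 1 <= k <= max_k."""
    return sum(4 ** k for k in range(1, max_k + 1))


def shape_features_per_position(n_shape_types, second_order=True):
    """Assumed count: one first-order value per shape type, plus one
    second-order value per shape type when second_order is True."""
    per_type = 2 if second_order else 1
    return n_shape_types * per_type


counts = {
    "1mer": kmer_features_per_position(1),
    "1mer+2mer": kmer_features_per_position(2),
    "1mer+2mer+3mer": kmer_features_per_position(3),
    # Four shape types: MGW, ProT, Roll, HelT.
    "1mer+shape": kmer_features_per_position(1) + shape_features_per_position(4),
    # Single-shape-feature model, e.g. 1mer+Roll or 1mer+ProT.
    "1mer+Roll": kmer_features_per_position(1) + shape_features_per_position(1),
}
for model, n in counts.items():
    print(f"{model}: {n} features per position")
```

Under these assumptions the 1mer+2mer+3mer model needs 84 features per position, whereas the shape-augmented models stay in the low double digits or below, which illustrates why adding shape grows the feature space far more slowly than adding higher-order k-mers.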

Comment in

  • Shapely DNA attracts the right partner.
    Przytycka TM, Levens D. Proc Natl Acad Sci U S A. 2015 Apr 14;112(15):4516-7. doi: 10.1073/pnas.1503951112. Epub 2015 Apr 6. PMID: 25848058. No abstract available.

References

    1. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: From properties to genome-wide predictions. Nat Rev Genet. 2014;15(4):272–286.
    2. Levo M, Segal E. In pursuit of design principles of regulatory sequences. Nat Rev Genet. 2014;15(7):453–468.
    3. Slattery M, et al. Absence of a simple code: How transcription factors read the genome. Trends Biochem Sci. 2014;39(9):381–399.
    4. Stormo GD, Zhao Y. Determining the specificity of protein-DNA interactions. Nat Rev Genet. 2010;11(11):751–760.
    5. Stormo GD. Quantitative Biology. Vol 1. Springer; Berlin: 2013. Modeling the specificity of protein-DNA interactions; pp. 115–130.
