Randomized Controlled Trial
BMC Bioinformatics. 2013;14 Suppl 10(Suppl 10):S2.
doi: 10.1186/1471-2105-14-S10-S2. Epub 2013 Aug 12.

Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data

Juhani Kähärä et al. BMC Bioinformatics. 2013.

Abstract

Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented a standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with the cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for the k-mer and PWM models, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1, the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1, the performance of the model was no better than chance.
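To make the idea of a linear k-mer model concrete, the following is a minimal sketch in Python. It is not the authors' implementation. It only illustrates counting overlapping k-mers, selecting the most frequent ones as candidate features, and scoring a sequence as a weighted sum of k-mer counts. The function names, the `bias` parameter and any weights passed in are hypothetical, and how the weights are fitted is left out.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of length k in seq (one strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def top_kmers(sequences, k, n):
    """Return the n most frequent k-mers over all sequences.

    Ties are broken alphabetically so the result is deterministic.
    """
    total = Counter()
    for seq in sequences:
        total.update(kmer_counts(seq, k))
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [kmer for kmer, _ in ranked[:n]]


def linear_score(seq, weights, k, bias=0.0):
    """Score seq as bias + sum over model k-mers of weight * count.

    `weights` maps k-mer -> weight. The values are assumptions for
    illustration; in practice they would be fitted to training data.
    """
    counts = kmer_counts(seq, k)
    return bias + sum(w * counts[kmer] for kmer, w in weights.items())
```

For example, `linear_score("ACGTACG", {"ACG": 0.5, "TTT": 1.0}, 3)` counts "ACG" twice and "TTT" zero times, giving a score of 1.0. A sequence could then be called bound or unbound by comparing the score with a threshold chosen on held-out data.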

Figures

Figure 1
Classification accuracies of the k-mer model and PWM models. The test set classification accuracy is plotted for 3 different k-mer model approaches and the PWM model. The 95% normal approximation confidence intervals are plotted on top of each bar.
Figure 2
Classification accuracy as a function of number of k-mers in the model. The test set classification accuracy is plotted as a function of the number of k-mers in the model. The 95% normal approximation confidence intervals are plotted around each curve as a grey area.
Figure 3
Illustration of classification accuracy using the CV-scheme for four proteins. The test set classification accuracy is shown as a function of the number of k-mers, when the k-mers are chosen with the CV-scheme. The CV is started from the 100 most frequent k-mers. The 95% normal approximation confidence intervals are plotted around the curves.
Figure 4
Illustration of classification accuracy using the CV-scheme when starting with a greater number of k-mers. The test set classification accuracy is shown as a function of the number of k-mers, when the k-mers are chosen with the CV-scheme. The CV is started from the 450 and 600 most frequent k-mers for proteins HSF2 and FOXJ3. The 95% normal approximation confidence intervals are plotted around the curves.
Figure 5
Classification accuracy in different enrichment cycles. The classification accuracy from different SELEX enrichment cycles for five proteins. The 95% normal approximation confidence intervals are plotted around the accuracies.
Figure 6
Top k-mers aligned to FOXJ3 logo. The top k-mers chosen with the CV-scheme (left) and the top affinity k-mers from the most frequent approach (right) aligned to FOXJ3 sequence logo.
Figure 7
Top k-mers aligned to HSF2 logo. The top k-mers chosen with the CV-scheme (left) and the top affinity k-mers from the most frequent approach (right) aligned to HSF2 sequence logo.
Figure 8
AUC as a function of number of k-mers in the model for two NFATC1 ChIP-seq samples. (left) AUCs when the true binding sites are taken to be within 100 nucleotides around the summit of the ChIP-seq peak. (right) The same as (left) except each binding site is taken to be the whole ChIP-seq peak region. The 95% Mann-Whitney confidence intervals are plotted around the curves.
Figure 9
AUC as a function of number of k-mers in the model for GATA1 ChIP-seq samples. AUC as a function of the number of k-mers for GATA1, when k-mers are selected with the CV-scheme. In the left figure (red), the AUC is plotted when using the most frequent k-mers. In the right figure, the AUC is calculated when only the centers of the ChIP-seq peaks are used. The 95% Mann-Whitney confidence intervals are plotted around the curves.
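The figure captions report two evaluation quantities: classification accuracy with a 95% normal approximation confidence interval, and AUC. As a rough sketch of how such numbers can be computed (not the authors' code), the Python below uses the standard normal approximation (Wald) interval for a proportion and computes AUC as the Mann-Whitney statistic, that is, the fraction of (positive, negative) score pairs ranked correctly. The function names are hypothetical, and the confidence interval for the AUC itself is not implemented here.

```python
import math


def accuracy_with_ci(correct, total, z=1.96):
    """Return (accuracy, lower, upper) using the normal approximation.

    z = 1.96 corresponds to a two-sided 95% interval. The bounds are
    clipped to the valid range [0, 1].
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def auc_mann_whitney(pos_scores, neg_scores):
    """AUC as the normalized Mann-Whitney U statistic.

    Counts the (positive, negative) pairs in which the positive example
    scores higher; ties contribute one half. Quadratic in the number of
    examples, which is fine for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

As a usage example with made-up counts, `accuracy_with_ci(71, 100)` gives an accuracy of 0.71 with a half-width of about 0.089. An AUC of 0.5 from `auc_mann_whitney` corresponds to a model that performs no better than chance.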
