BMC Genomics. 2015;16 Suppl 7(Suppl 7):S13.
doi: 10.1186/1471-2164-16-S7-S13. Epub 2015 Jun 11.

MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures

Yizhe Zhang et al. BMC Genomics. 2015.

Abstract

Background: Motifs are regulatory elements that activate or inhibit the expression of related genes when proteins such as transcription factors (TFs) bind to them. Motif finding is therefore important for understanding the mechanisms of gene regulation. De novo discovery of regulatory elements such as transcription factor binding sites (TFBSs) has long been a major challenge in gaining insight into these mechanisms. Recent advances in experimental profiling of genome-wide signals, such as histone modifications and DNase I hypersensitivity sites, allow scientists to develop better computational methods for motif discovery. However, existing motif-finding methods suffer from high false positive rates and slow speed, and it is difficult to evaluate their performance systematically.

Results: Here we present MOST+, a motif finder that integrates genomic sequences with genome-wide signals, such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve prediction accuracy. MOST+ can detect motifs in a large input of about 100 Mb of sequence within a few minutes. We also established a systematic comparison method and used it to compare MOST+ with existing methods.

Conclusion: MOST+ is a fast and accurate de novo method for motif finding by integrating genomic sequence and experimental signals as clues.

Figures

Figure 1
The pipeline of the MOST+ system. A set of target genomic sequences is extracted from a genome and then indexed by a suffix tree to count the occurrences of each word (or k-mer). In MOST+ mode, the histone modification marks and/or DNase I hypersensitivity signals of each word (referred to in this schema as tag signals or mark distribution) are used to yield mark distribution scores. Top-ranked words are clustered, and motifs are generated from the resulting clusters. The clustering strategy is illustrated in the right panel of the figure.
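The word-counting and ranking step of this pipeline can be sketched as follows. This is a minimal illustration, not the MOST+ implementation: it swaps the suffix tree for a plain hash-based counter, and the function names, the pseudocount, and the enrichment ratio (foreground frequency divided by background frequency) are assumptions introduced here.

```python
from collections import Counter

DNA = set("ACGT")

def count_kmers(sequences, k):
    """Count every k-mer made only of A/C/G/T (a dict stands in for the suffix tree)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= DNA:  # skip words containing N or other symbols
                counts[word] += 1
    return counts

def rank_by_enrichment(target_seqs, background_seqs, k, pseudocount=1.0):
    """Rank target words by an assumed enrichment ratio against a background set."""
    fg = count_kmers(target_seqs, k)
    bg = count_kmers(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for word, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[word] + pseudocount) / (bg_total + pseudocount)
        scores[word] = fg_freq / bg_freq
    # Highest ratio first; these top-ranked words would go on to clustering.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

targets = ["ACGTACGT", "TTACGTTT"]
background = ["TTTTTTTT", "AAAATTTT"]
ranked = rank_by_enrichment(targets, background, k=4)
print(ranked[:3])
```

A hash-based counter is enough for a toy input; a suffix tree, as in the figure, is one way to keep counting tractable on inputs of the size the abstract mentions.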
Figure 2
Comparison of different motif finding methods. The X-axis is the running time on a logarithmic scale, and the Y-axis is the total size of the input sequences (in Mbp).
Figure 3
Distributions of several highly enriched word instances found in the CTCF and Esrrb ChIP-seq datasets. (A) The upper three panels are from the CTCF dataset. Spurious words show irregular or flat patterns (CTCF word "CTGCCCTCT" versus repeat words "CTCTCTCTC" and "TTTTAAAAA"). All three words have odds ratio scores ranging from 3.4 to 4, i.e. a similar level of over-representation, indicating that tag signals can be used to discriminate motif words from their background. (B) The lower three panels are from the Esrrb dataset. The distributions of words from the Esrrb motif ("CCAAGGTCA" and "CAGAGGTCA", both of which contain the core 'AGGTCA') strongly resemble each other, while the MYF motif component word (lower right corner: "CGGGAGGGG") shows a distinct distribution pattern. Dotted lines show distributions smoothed by a DFT with the top 5/8 higher-frequency components removed.
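The DFT smoothing mentioned in this caption can be illustrated with a small low-pass filter. The sketch below is an assumption about how "top 5/8 higher frequency components removed" might be applied: it keeps the bins whose frequency index is at most 3/8 of the highest representable frequency and zeroes the rest. The function names are hypothetical, and a naive O(n^2) transform is used only to keep the example dependency-free.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def smooth_low_pass(signal, removed_fraction=5 / 8):
    """Zero the highest-frequency bins and transform back to a real signal."""
    n = len(signal)
    spectrum = dft(signal)
    max_freq = n // 2
    cutoff = int((1 - removed_fraction) * max_freq)  # highest frequency index kept
    # Bins k and n - k carry the same frequency, so both are kept or dropped together.
    filtered = [c if min(k, n - k) <= cutoff else 0
                for k, c in enumerate(spectrum)]
    return [v.real for v in idft(filtered)]

# A flat tag signal of height 2 with rapid alternating noise on top.
noisy = [2 + (-1) ** t for t in range(16)]
smoothed = smooth_low_pass(noisy)
print([round(v, 6) for v in smoothed])
```

Dropping bins symmetrically keeps the inverse transform real up to rounding error, which is why only the real part is returned.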
Figure 4
Diagram of the pipeline for parameter optimization and method comparison. A motif-finding step is followed by a TFBS identification step (performed by CisFinder) that takes motifs and genomic sequences as input. Training data (8 of 10 folds) are fed into the motif finding tools, and accuracy is then evaluated by how well the recovered motifs pinpoint TFBSs. AUROC is used to represent the accuracy of each method.
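The evaluation idea in this caption, training on 8 of 10 folds and scoring with AUROC, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names are hypothetical, the fold assignment is a simple round-robin, and the CisFinder step is replaced by a list of made-up site scores.

```python
def split_folds(items, n_folds=10, n_train=8):
    """Assign items to folds round-robin and return (training, held-out) lists."""
    folds = [items[i::n_folds] for i in range(n_folds)]
    train = [x for fold in folds[:n_train] for x in fold]
    held_out = [x for fold in folds[n_train:] for x in fold]
    return train, held_out

def auroc(scores, labels):
    """AUROC as the probability that a true site outscores a non-site (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

train, held_out = split_folds(list(range(20)))
print(len(train), len(held_out))

# Made-up scores for candidate sites; label 1 marks a true binding site.
print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of sites; a rank-based computation gives the same value and scales better on genome-wide data.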
Figure 5
Comparison of site-level accuracy for different methods. The AUROC of each method in recovering motifs for the different assayed TFs is shown.
Figure 6
(A) The DNase I hypersensitivity signal shows an evident cleavage pattern around ChIP-seq peaks in human LCL datasets. (B) With the help of DNase I hypersensitivity signals, MOST+ found additional motifs in the VDR datasets. Some long motifs are similar to those reported in Xie et al. (2005, 2007) [2, 36].
Figure 7
Motifs discovered by MOST+ in all promoters of the mouse genome. Left panel: a motif discovered by MOST+ that resembles the GABPA motif in JASPAR. Right panel: examples of unknown motifs with an obvious kurtosis pattern in their histone modification distributions.

References

    1. Kim JT, Martinetz TT, Polani DD. Bioinformatic principles underlying the information content of transcription factor binding sites. Journal of Theoretical Biology. 2003;220(4):529–544. doi: 10.1006/jtbi.2003.3153.
    2. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497–1502. doi: 10.1126/science.1141319.
    3. Sandve GK, Abul O, Walseng V, Drablos F. Improved benchmarks for computational motif discovery. BMC Bioinformatics. 2007;8:193. doi: 10.1186/1471-2105-8-193.
    4. Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011;21(3):447–455. doi: 10.1101/gr.112623.110.
    5. Bailey TL, Elkan C. Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. University of California San Diego, Dept. of Computer Science and Engineering; 1993.
