Genetics. 2020 Oct;216(2):353-358.
doi: 10.1534/genetics.120.303597. Epub 2020 Aug 19.

Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss

Mengchi Wang et al. Genetics. 2020 Oct.

Abstract

Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, such as interpreting motif information or searching for motif matches, a wildcard-style consensus sequence (such as [GC][AT]GATAAG[GAC]) is a compact and sufficient representation. Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We call this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Mottos over nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves an area under the precision-recall curve of 0.81, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener's method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, along with its statistical justification.

Keywords: consensus; information theory; motif; sequence logo; transcription factor binding.
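The conversion described in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not spell out: each PWM column is compared against a uniform distribution over its k most frequent characters, and the k with the smallest Jensen-Shannon divergence is kept. The function names (`jsd`, `consensus_column`, `pwm_to_consensus`) are hypothetical and are not taken from the Motto implementation; the ambiguity penalty option is not modeled here.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is the mixture, so b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def consensus_column(freqs, alphabet="ACGT"):
    """Assumed rule: keep the top-k characters whose uniform distribution
    has the smallest JSD to the observed column frequencies."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    best_set, best_d = None, float("inf")
    for k in range(1, len(freqs) + 1):
        chosen = order[:k]
        q = [1.0 / k if i in chosen else 0.0 for i in range(len(freqs))]
        d = jsd(freqs, q)
        if d < best_d:
            best_set, best_d = chosen, d
    # Characters are emitted in descending frequency order (a choice of this sketch).
    chars = "".join(alphabet[i] for i in best_set)
    return chars if len(chars) == 1 else "[" + chars + "]"


def pwm_to_consensus(pwm, alphabet="ACGT"):
    """Convert a PWM (list of per-position frequency lists) to a regex-style consensus."""
    return "".join(consensus_column(col, alphabet) for col in pwm)


# Illustrative, made-up PWM: one strong position, one two-way position, one uniform position.
example_pwm = [
    [0.97, 0.01, 0.01, 0.01],
    [0.48, 0.02, 0.48, 0.02],
    [0.25, 0.25, 0.25, 0.25],
]
print(pwm_to_consensus(example_pwm))
```

Because the output is an ordinary character-class pattern, it can be passed directly to a regular expression engine (for example Python's `re.finditer`) to search a sequence for matches.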

Figures

Figure 1
Overview of sequence Motto and comparison with sequence logo. Given a motif PWM as the input, Motto outputs a consensus that minimizes information loss. Here we show how the sequence Motto of the human transcription factor P73 is determined.
Figure 2
Example usage with human CTCF (upper panel) and lipoprotein binding sites from Bailey and Elkan (1994) (lower panel). The original PWM is shown as a sequence logo. Different Motto options produce different consensus sequences at each position. In particular, “-m/--method” specifies the method: Motto (default), MSE (minimal mean square error), Cavener (Cavener 1987), or Max (using maximal frequency at each position); “-s/--style” specifies the output style: IUPAC (Johnson 2010) (single character for nucleotide combinations), regex (regular expression), or compact (convert [ACGT] to N in regex); “-t/--trim” trims off the flanking Ns; “-p/--penalty” specifies a weight between 0 and 1 that penalizes ambiguity at each position (for details see Materials and Methods).
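To make the three output styles in the Figure 2 caption concrete, the following Python sketch converts a regex-style consensus into IUPAC and compact forms and trims flanking Ns. It is an illustration only: the helper names (`tokens`, `to_iupac`, `to_compact`, `trim_flanking_n`) are hypothetical and do not reflect how Motto implements these options. The lookup table uses the standard IUPAC nucleotide ambiguity codes.

```python
import re

# Standard IUPAC nucleotide codes, keyed by the sorted set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

# One token per motif position: either a bracketed class or a single base.
TOKEN = re.compile(r"\[([ACGT]+)\]|([ACGT])")


def tokens(regex_consensus):
    """Split a regex-style consensus into one string of allowed bases per position."""
    return [m.group(1) or m.group(2) for m in TOKEN.finditer(regex_consensus)]


def to_iupac(regex_consensus):
    """Regex style -> IUPAC style (one character per position)."""
    return "".join(IUPAC["".join(sorted(set(t)))] for t in tokens(regex_consensus))


def to_compact(regex_consensus):
    """Regex style -> compact style: only the full class [ACGT] becomes N."""
    out = []
    for t in tokens(regex_consensus):
        if set(t) == set("ACGT"):
            out.append("N")
        elif len(t) == 1:
            out.append(t)
        else:
            out.append("[" + t + "]")
    return "".join(out)


def trim_flanking_n(consensus):
    """Remove uninformative N positions from both ends of an IUPAC or compact string."""
    return consensus.strip("N")


print(to_iupac("[GC][AT]GATAAG[GAC]"))
print(to_compact("[ACGT]GA[TGCA]"))
print(trim_flanking_n(to_compact("[ACGT]GA[TGCA]")))
```

The first call uses the consensus example from the abstract; the other two use a made-up input chosen to show how fully ambiguous flanking positions collapse to N and are then trimmed.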
Figure 3
Converted sequence Mottos recapitulate motif occurrence sites of 1156 common human and mouse transcription factors (TFs) in the human genome (hg19). (A) The average area under the precision-recall curve (auPRC) using Motto (default method with minimal JSD, ambiguity penalty at -P = 0.2, and at -P = 0.5) compared with existing alternative methods. P-values were determined by paired t-test. (B) Comparison of three example TFs showing differences in consensus sequences [shown in IUPAC (Johnson 2010) coding for better alignment] and performance.

References

    1. Amanchy R., Kandasamy K., Mathivanan S., Periaswamy B., Reddy R. et al., 2011. Identification of novel phosphorylation motifs through an integrative computational and experimental analysis of the human phosphoproteome. J. Proteomics Bioinform. 4: 22–35. 10.4172/jpb.1000163
    2. Bailey T. L., and Elkan C., 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 2: 28–36.
    3. Bailey T. L., Boden M., Buske F. A., Frith M., Grant C. E. et al., 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37: W202–W208. 10.1093/nar/gkp335
    4. Cavener D. R., 1987. Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res. 15: 1353–1361. 10.1093/nar/15.4.1353
    5. Davis J., and Goadrich M., 2006. The Relationship Between Precision-Recall and ROC Curves, pp. 233–240 in Proceedings of the 23rd International Conference on Machine Learning, ICML ’06. ACM, New York. 10.1145/1143844.1143874
