Combining evidence using p-values: application to sequence homology searches
- PMID: 9520501
- DOI: 10.1093/bioinformatics/14.1.48
Abstract
Motivation: To illustrate an intuitive and statistically valid method for combining independent sources of evidence that yields a p-value for the complete evidence, and to apply it to the problem of detecting simultaneous matches to multiple patterns in sequence homology searches.
Results: In sequence analysis, two or more (approximately) independent measures of the membership of a sequence (or sequence region) in some class are often available. We would like to estimate the likelihood of the sequence being a member of the class in view of all the available evidence. An example is estimating the significance of the observed match of a macromolecular sequence (DNA or protein) to a set of patterns (motifs) that characterize a biological sequence family. An intuitive way to do this is to express each piece of evidence as a p-value, and then use the product of these p-values as the measure of membership in the family. We derive a formula and algorithm (QFAST) for calculating the statistical distribution of the product of n independent p-values. We demonstrate that sorting sequences by this p-value effectively combines the information present in multiple motifs, leading to highly accurate and sensitive sequence homology searches.
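The abstract does not reproduce the derived formula. As a hedged illustration only, the sketch below uses the standard closed form for the distribution of a product of n independent p-values, each uniform on (0, 1) under the null hypothesis: if x is the observed product, then P(product <= x) = x * sum_{i=0}^{n-1} (-ln x)^i / i!. The function name `combined_p_value` and the log-space evaluation are assumptions made for this example; this is not the published QFAST algorithm, which may compute the same quantity differently.

```python
import math


def combined_p_value(p_values):
    """P-value of the product of n independent p-values.

    Assumes each input is uniform on (0, 1) under the null hypothesis.
    Illustrative sketch only; not the published QFAST implementation.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 <= p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    if any(p == 0.0 for p in p_values):
        return 0.0  # the product is 0, so the combined p-value is 0

    # Work with -ln(product) so that very small products do not underflow
    # before the logarithm is taken.
    neg_log_product = -sum(math.log(p) for p in p_values)

    # Accumulate sum_{i=0}^{n-1} (-ln x)^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, len(p_values)):
        term *= neg_log_product / i
        total += term

    # Multiply by x = exp(-(-ln x)); clamp against rounding above 1.
    return min(1.0, math.exp(-neg_log_product) * total)
```

With a single p-value the function returns that value unchanged. Two p-values of 0.5 have product 0.25, but the combined p-value is 0.25 * (1 + ln 4), about 0.597, which shows why the raw product cannot itself be read as a p-value. Sorting sequences by the combined value, smallest first, gives the kind of ranking the abstract describes.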
Comment in
- Concerning the accuracy of MAST E-values. Bioinformatics. 2000 May;16(5):488-9. doi: 10.1093/bioinformatics/16.5.488. PMID: 10871274