Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements
- PMID: 11452024
- PMCID: PMC55814
- DOI: 10.1093/nar/29.14.2994
Abstract
PSI-BLAST is an iterative program to search a database for proteins with distant similarity to a query sequence. We investigated over a dozen modifications to the methods used in PSI-BLAST, with the goal of improving accuracy in finding true positive matches. To evaluate performance we used a set of 103 queries for which the true positives in yeast had been annotated by human experts, and a popular measure of retrieval accuracy (ROC) that can be normalized to take on values between 0 (worst) and 1 (best). The modifications we consider novel improve the ROC score from 0.758 +/- 0.005 to 0.895 +/- 0.003. This does not include the benefits from four modifications we included in the 'baseline' version, even though they were not implemented in PSI-BLAST version 2.0. The improvement in accuracy was confirmed on a small second test set. This test involved analyzing three protein families with curated lists of true positives from the non-redundant protein database. The modification that accounts for the majority of the improvement is the use, for each database sequence, of a position-specific scoring system tuned to that sequence's amino acid composition. The use of composition-based statistics is particularly beneficial for large-scale automated applications of PSI-BLAST.
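The abstract does not say which ROC variant was used. As an illustration only, the sketch below computes one common normalized form, often written ROC_n: for each of the first n false positives in a ranked hit list, count the true positives ranked above it, then divide the sum by n times the total number of true positives. The function name `roc_n`, its parameters, and the handling of hit lists that end before n false positives are assumptions made for this example, not details taken from the paper.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], total_true: int, n: int) -> float:
    """Normalized ROC_n score for one ranked hit list (best hit first).

    ranked_labels[i] is True if the i-th hit is an annotated true positive.
    total_true is the number of true positives that exist for the query,
    whether or not they were retrieved.
    Returns a value between 0 (worst) and 1 (best).
    """
    if total_true <= 0 or n <= 0:
        raise ValueError("total_true and n must be positive")

    true_seen = 0   # true positives encountered so far
    false_seen = 0  # false positives encountered so far
    area = 0        # sum of true positives ranked above each false positive

    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list ends before n false positives appear, the
    # missing false positives are treated as ranking below every hit
    # that was retrieved.
    area += (n - false_seen) * true_seen

    return area / (n * total_true)


if __name__ == "__main__":
    # Hypothetical hit list: T, T, F, T, F with 3 true positives in total.
    print(roc_n([True, True, False, True, False], total_true=3, n=2))
```

A score of 1 means every true positive outranks the first n false positives; a score of 0 means none does.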