Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases
- PMID: 11159310
- DOI: 10.1093/bioinformatics/16.11.988
Abstract
Motivation: Sequence alignment techniques have been developed into extremely powerful tools for identifying the folding families and function of proteins in newly sequenced genomes. For a sufficiently low sequence identity it is necessary to incorporate additional structural information to positively detect homologous proteins. We have carried out an extensive analysis of the effectiveness of incorporating secondary structure information directly into the alignments for fold recognition and identification of distant protein homologs. A secondary structure similarity matrix based on a database of three-dimensionally aligned proteins was first constructed. An iterative application of dynamic programming was used which incorporates linear combinations of amino acid and secondary structure sequence similarity scores. Initially, only primary sequence information is used. Subsequently contributions from secondary structure are phased in and new homologous proteins are positively identified if their scores are consistent with the predetermined error rate.
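The search procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring functions `aa_score` and `ss_score` are toy stand-ins for a real amino acid substitution matrix and for the secondary structure similarity matrix, and the gap penalty, weight schedule, and acceptance thresholds are hypothetical values chosen only to make the example run. The sketch uses Smith-Waterman local alignment with a linear gap penalty and scores each aligned pair as a linear combination of the two similarity terms.

```python
def aa_score(a, b):
    # Toy stand-in for an amino acid substitution matrix (assumption).
    return 2.0 if a == b else -1.0


def ss_score(s, t):
    # Toy stand-in for the secondary structure similarity matrix (assumption).
    return 2.0 if s == t else -1.0


def combined_local_score(seq_a, ss_a, seq_b, ss_b, w, gap=-2.0):
    """Best Smith-Waterman local score where each aligned pair scores
    (1 - w) * amino acid term + w * secondary structure term."""
    n, m = len(seq_a), len(seq_b)
    prev = [0.0] * (m + 1)
    best = 0.0
    for i in range(1, n + 1):
        cur = [0.0] * (m + 1)
        for j in range(1, m + 1):
            pair = ((1.0 - w) * aa_score(seq_a[i - 1], seq_b[j - 1])
                    + w * ss_score(ss_a[i - 1], ss_b[j - 1]))
            cur[j] = max(0.0, prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def iterative_search(query, database, weights, thresholds):
    """Phase in the secondary structure weight round by round.
    A target is accepted in the first round where its score reaches that
    round's threshold; thresholds are hypothetical and would in practice
    come from calibration against a known error rate."""
    q_seq, q_ss = query
    accepted = {}
    for w, threshold in zip(weights, thresholds):
        for name, (seq, ss) in database.items():
            if name in accepted:
                continue
            if combined_local_score(q_seq, q_ss, seq, ss, w) >= threshold:
                accepted[name] = w
    return accepted


query = ("ACDEF", "HHHEE")
database = {
    "close": ("ACDEF", "HHHEE"),      # same sequence, same structure
    "distant": ("GCKEL", "HHHEE"),    # diverged sequence, same structure
    "unrelated": ("GKLMN", "EEEHH"),  # different sequence and structure
}
# Round 1 uses sequence only (w = 0); round 2 adds secondary structure.
result = iterative_search(query, database, weights=[0.0, 0.5], thresholds=[5.0, 5.0])
print(result)
```

In this toy setup the close homolog is accepted in the sequence-only round, while the diverged sequence is accepted only once the secondary structure term contributes to the score.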
Results: We used the SCOP40 database, where only PDB sequences that have 40% homology or less are included, to calibrate homology detection by the combined amino acid and secondary structure sequence alignments. Combining predicted secondary structure with sequence information results in an 8-15% increase in homology detection within SCOP40 relative to the pairwise alignments using only amino acid sequence data at an error rate of 0.01 errors per query; a 35% increase is observed when the actual secondary structure sequences are used. Incorporating predicted secondary structure information in the analysis of six small genomes yields an improvement in the homology detection of approximately 20% over SSEARCH pairwise alignments, but no improvement in the total number of homologs detected over PSI-BLAST, at an error rate of 0.01 errors per query. However, because the pairwise alignments based on combinations of amino acid and secondary structure similarity are different from those produced by PSI-BLAST and the error rates can be calibrated, it is possible to combine the results of both searches. An additional 25% relative improvement in the number of genes identified at an error rate of 0.01 is observed when the data is pooled in this way. Similarly for the SCOP40 dataset, PSI-BLAST detected 15% of all possible homologs, whereas the pooled results increased the total number of homologs detected to 19%. These results are compared with recent reports of homology detection using sequence profiling methods.
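The calibration and pooling steps mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: the function names, the data layout, and all scores are hypothetical. It assumes a benchmark in which every hit is labeled as a true or false homolog, distinct scores, and an error rate defined as false positives divided by the number of queries.

```python
def calibrate_threshold(hits, n_queries, max_epq):
    """Return the lowest score cutoff whose false positives per query stay
    at or below max_epq, or None if even the top hit exceeds it.
    hits: list of (score, is_true_homolog) pairs from a labeled benchmark.
    Assumes scores are distinct (ties are not handled)."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    false_positives = 0
    cutoff = None
    for score, is_true in ranked:
        if not is_true:
            false_positives += 1
        if false_positives / n_queries > max_epq:
            break
        cutoff = score
    return cutoff


def pool_hits(hits_a, cutoff_a, hits_b, cutoff_b):
    """Union of the (query, target) pairs each method accepts at its own
    calibrated cutoff. The pooled error rate is at most the sum of the two
    individual error rates."""
    accepted_a = {p for p, s in hits_a.items() if cutoff_a is not None and s >= cutoff_a}
    accepted_b = {p for p, s in hits_b.items() if cutoff_b is not None and s >= cutoff_b}
    return accepted_a | accepted_b


# Hypothetical labeled benchmark for one method, run over 50 queries.
benchmark = [(9.0, True), (8.0, True), (7.0, False), (6.0, True)]
cutoff = calibrate_threshold(benchmark, n_queries=50, max_epq=0.01)

# Hypothetical genome hits from two methods with separately calibrated cutoffs.
method_a = {("q1", "t1"): 9.0, ("q1", "t2"): 5.0}
method_b = {("q1", "t2"): 30.0, ("q1", "t3"): 10.0}
pooled = pool_hits(method_a, cutoff, method_b, 20.0)
print(cutoff, sorted(pooled))
```

Because each method is thresholded on its own score scale before the union is taken, hits found by only one method can be added without comparing raw scores across methods.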
Availability: Secondary structure alignment homepage at http://lutece.rutgers.edu/ssas
Contact: anders@rutchem.rutgers.edu; ronlevy@lutece.rutgers.edu
Supplementary information: Genome sequence/structure alignment results at http://lutece.rutgers.edu/ss_fold_predictions.