Protein Sci. 2004 Apr;13(4):1071-87.
doi: 10.1110/ps.03379804.

Alignment of protein sequences by their profiles

Marc A Marti-Renom et al. Protein Sci. 2004 Apr.

Abstract

The accuracy of an alignment between two protein sequences can be improved by including other detectably related sequences in the comparison. We optimize and benchmark such an approach that relies on aligning two multiple sequence alignments, each one including one of the two protein sequences. Thirteen different protocols for creating and comparing profiles corresponding to the multiple sequence alignments are implemented in the SALIGN command of MODELLER. A test set of 200 pairwise, structure-based alignments with sequence identities below 40% is used to benchmark the 13 protocols as well as a number of previously described sequence alignment methods, including heuristic pairwise sequence alignment by BLAST, pairwise sequence alignment by global dynamic programming with an affine gap penalty function by the ALIGN command of MODELLER, sequence-profile alignment by PSI-BLAST, Hidden Markov Model methods implemented in SAM and LOBSTER, pairwise sequence alignment relying on predicted local structure by SEA, and multiple sequence alignment by CLUSTALW and COMPASS. The alignment accuracies of the best new protocols were significantly better than those of the other tested methods. For example, the fraction of the correctly aligned residues relative to the structure-based alignment by the best protocol is 56%, which can be compared with the accuracies of 26%, 42%, 43%, 48%, 50%, 49%, 43%, and 43% for the other methods, respectively. The new method is currently applied to large-scale comparative protein structure modeling of all known sequences.
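The accuracy figures quoted above are fractions of correctly aligned residues relative to a structure-based reference alignment. A minimal sketch of such a measure follows, assuming that a residue pair counts as correct when the same pair appears in the reference and that the denominator is the number of pairs in the reference. The function names and the gap character are illustrative assumptions, not part of MODELLER or of the paper's benchmark code.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs matched by a pairwise alignment.

    row_a and row_b are the two gapped rows of the alignment (equal length).
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for char_a, char_b in zip(row_a, row_b):
        a_is_residue = char_a != gap
        b_is_residue = char_b != gap
        if a_is_residue and b_is_residue:
            pairs.add((i, j))
        if a_is_residue:
            i += 1
        if b_is_residue:
            j += 1
    return pairs


def fraction_correct(test_alignment, reference_alignment):
    """Fraction of reference residue pairs reproduced by the test alignment.

    Assumption: the denominator is the number of aligned pairs in the reference.
    """
    reference = aligned_pairs(*reference_alignment)
    if not reference:
        raise ValueError("reference alignment has no aligned residue pairs")
    test = aligned_pairs(*test_alignment)
    return len(test & reference) / len(reference)


# Hypothetical toy alignments of the sequences ACDEF and ACEF.
reference = ("ACDEF", "ACE-F")
test = ("ACDEF", "AC-EF")
accuracy = fraction_correct(test, reference)
```

In this toy case three of the four reference pairs are recovered, so `accuracy` is 0.75.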

Figures

Figure 1.
Composition of the 300 reference alignments that constitute the training and testing sets. (A) Distributions corresponding to the 100 alignments in the training set. (B) Distributions corresponding to the 200 alignments in the testing set. The percentage sequence identity is defined by the ratio of the alignment positions with the same residue types and the number of aligned positions. The RMSD is calculated over the aligned Cα atoms. The percentage of structurally equivalent residues was calculated as the percentage of residues within 3.5 Å after rigid superimposition.
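The three quantities defined in this caption can be sketched in a few lines. The example assumes that the Cα coordinates are already rigidly superimposed and listed in aligned order (no superposition is performed here), and that the percentage of structurally equivalent residues is taken over the aligned pairs supplied. The function names are illustrative and do not come from the paper.

```python
import math


def percent_identity(row_a, row_b, gap="-"):
    """Identical aligned positions divided by all aligned (non-gap) positions, in percent."""
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation over paired C-alpha coordinates (already superimposed)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


def percent_equivalent(coords_a, coords_b, cutoff=3.5):
    """Percentage of paired C-alpha atoms lying within `cutoff` angstroms of each other."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for p, q in zip(coords_a, coords_b) if math.dist(p, q) <= cutoff)
    return 100.0 * close / len(coords_a)


# Hypothetical toy data: two aligned C-alpha pairs, one 5 angstroms apart, one coincident.
coords_a = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
coords_b = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
identity = percent_identity("ACDE", "ACEE")
deviation = rmsd(coords_a, coords_b)
equivalent = percent_equivalent(coords_a, coords_b)
```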
Figure 2.
Statistical significance of the differences in the accuracies of the tested SALIGN protocols. (Upper diagonal) Gray and white squares indicate pairs of methods whose performance is and is not significantly different at a confidence level of 95%, respectively. (Lower diagonal) The intensity of gray indicates the degree of the average difference between the corresponding methods. (A) The accuracy of a method measured by the average CE overlap. (B) The accuracy of a method measured by the average shift score.
Figure 3.
The extent of the protein sequence that is aligned by the assessed alignment methods. (A) BLAST, (B) PSI-BLAST, (C) SEA, and (D) COMPASS. In contrast, ALIGN, SAM, CLUSTALW, and our profile–profile alignment protocols generally align the whole of the input protein sequences, either because they rely on global dynamic programming or because the aligned sequences are of similar lengths.
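Global dynamic programming with an affine gap penalty aligns both sequences end to end, which is why such methods cover the whole input. The sketch below computes the optimal global score with the standard three-state recurrence (often attributed to Gotoh). It is not the ALIGN or SALIGN implementation: the match and mismatch scores and the gap parameters are placeholder assumptions, and a gap of length k is assumed to cost `gap_open + (k - 1) * gap_extend`.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Optimal global alignment score of a and b under an affine gap penalty."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)  # leading gap in b
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)  # leading gap in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap costs gap_open; extending an existing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# Toy usage: aligning ACGT to AT needs one gap of length two between the two matches.
score = affine_global_score("ACGT", "AT")
```

A real protein aligner would replace the match and mismatch constants with a substitution matrix, or with a profile-profile score in the case of the protocols discussed here, and would add a traceback to recover the alignment itself.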
Figure 4.
Statistical significance of the differences in the accuracies of the tested alignment methods. (Upper diagonal) Gray and white squares indicate pairs of methods whose performance is and is not significantly different at a confidence level of 95%, respectively. (Lower diagonal) The intensity of gray indicates the magnitude of the average difference between the corresponding methods: white indicates no difference in accuracy; black indicates maximum difference. (A) The accuracy of a method measured by the average CE overlap. (B) The accuracy of a method measured by the average shift score.
Figure 5.
Percentage of the 200 testing alignments as a function of the minimal fraction of structurally equivalent positions at the 5 Å cutoff. The vertical line indicates the threshold of alignments that have structural overlap of at least 30%.
Figure 6.
Comparative protein structure modeling with SALIGN and PSI-BLAST alignments. The comparative protein structure models were built by satisfaction of spatial restraints, as implemented in MODELLER-7 (Sali and Blundell 1993). The default model building routine ‘model’ was used. (A) CASP4 target T0011. The RMSD errors (percentage of structurally equivalent Cα positions within the 3 Å cutoff) for the SALIGN, PSI-BLAST, and the best CASP4 model (not shown; http://predictioncenter.llnl.gov) are 1.8 Å (96.1%), 1.0 Å (95.8%), and 1.8 Å (96.7%), respectively. (B) CASP4 target T0092. The RMSD errors (percentage of structurally equivalent Cα positions within the 3 Å cutoff) for the SALIGN, PSI-BLAST, and the best CASP4 model (not shown; http://predictioncenter.llnl.gov) are 5.9 Å (67.8%), 4.0 Å (31.7%), and 6.0 Å (65.2%), respectively.
