Protein Sci. 2007 Feb;16(2):216-26.
doi: 10.1110/ps.062523907. Epub 2006 Dec 22.

Evaluation of features for catalytic residue prediction in novel folds

Eunseog Youn et al. Protein Sci. 2007 Feb.

Abstract

Structural genomics projects are determining the three-dimensional structure of proteins without full characterization of their function. A critical part of the annotation process involves appropriate knowledge representation and prediction of functionally important residue environments. We have developed a method to extract features from sequence, sequence alignments, three-dimensional structure, and structural environment conservation, and used support vector machines to annotate homologous and nonhomologous residue positions based on a specific training set of residue functions. In order to evaluate this pipeline for automated protein annotation, we applied it to the challenging problem of prediction of catalytic residues in enzymes. We also ranked the features based on their ability to discriminate catalytic from noncatalytic residues. When applying our method to a well-annotated set of protein structures, we found that top-ranked features were a measure of sequence conservation, a measure of structural conservation, a degree of uniqueness of a residue's structural environment, solvent accessibility, and residue hydrophobicity. We also found that features based on structural conservation were complementary to those based on sequence conservation and that they were capable of increasing predictor performance. Using a family nonredundant version of the ASTRAL 40 v1.65 data set, we estimated that the true catalytic residues were correctly predicted in 57.0% of the cases, with a precision of 18.5%. When testing on proteins containing novel folds not used in training, the best features were highly correlated with the training on families, thus validating the approach to nonhomologous catalytic residue prediction in general. We then applied the method to 2781 coordinate files from the structural genomics target pipeline and identified both highly ranked and highly clustered groups of predicted catalytic residues.

Figures

Figure 1.
ROC curves for different data sets based on SCOP family, superfamily, and fold. This plot is analogous to the data in Table 1.
Figure 2.
Feature ranking correlation between family-based and fold-based data sets. Each of the 314 features was ranked by its class-discriminating value, measured as the AUC. Several S-BLEST features are not informative (constant across both catalytic and noncatalytic residues) and were removed from this plot.
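As an illustration of the per-feature ranking described in the Figure 2 caption, the following sketch scores each feature by its AUC and sorts the features from most to least discriminating. It is a minimal sketch and not the authors' code: the function names, the toy feature names and values, and the pairwise AUC computation are all assumptions made for illustration. Constant features are skipped, in the spirit of the uninformative features removed from the plot.

```python
def auc(values, labels):
    """AUC as the probability that a random catalytic residue (label 1)
    has a higher feature value than a random noncatalytic one (label 0).
    Ties count as half a win. Quadratic in the number of residues, which
    is fine for a small illustration."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one residue of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_features(columns, labels):
    """Return (feature name, AUC) pairs sorted by AUC, highest first.
    Features that take a single value everywhere are dropped."""
    ranked = []
    for name, values in columns.items():
        if len(set(values)) < 2:
            continue  # constant feature: carries no class information
        ranked.append((name, auc(values, labels)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical toy data: five residues, two of them catalytic.
labels = [1, 1, 0, 0, 0]
columns = {
    "accessibility": [0.2, 0.6, 0.5, 0.1, 0.7],
    "conservation": [0.9, 0.8, 0.3, 0.4, 0.1],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
ranked = rank_features(columns, labels)
```

Computing such a ranking once on a family-based data set and once on a fold-based data set, and then comparing the two orderings, would give a correlation of the kind the figure plots.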
Figure 3.
Information per position score from PSI-BLAST and SCS from S-BLEST distributions for catalytic and noncatalytic residues.
Figure 4.
SVM predicted score vs. distance (Å) to the closest catalytic residue. The SVM outputs a predicted score for each residue. Distance (Å) between the residue and the closest catalytic residue is computed. The predicted score x is binned by rounding x to one decimal place (X-axis). The distances in each bin are averaged (Y-axis). Averaged distances are then fitted by a linear regression. Averaged distance decreases almost linearly as predicted score increases.
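The binning and fitting steps in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the toy scores and distances are hypothetical, and the fit is an ordinary least-squares line through the binned averages.

```python
from collections import defaultdict


def bin_and_average(scores, distances):
    """Group distances by the score rounded to one decimal place and
    return (bin, mean distance) pairs sorted by bin."""
    bins = defaultdict(list)
    for score, distance in zip(scores, distances):
        bins[round(score, 1)].append(distance)
    return sorted((b, sum(d) / len(d)) for b, d in bins.items())


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct bins to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: predicted scores and distances (in angstroms)
# to the closest catalytic residue.
scores = [0.04, 0.01, 0.52, 0.48, 1.01, 0.97]
distances = [20.0, 22.0, 15.0, 17.0, 10.0, 12.0]

binned = bin_and_average(scores, distances)
slope, intercept = fit_line(binned)
```

A negative slope in such a fit corresponds to the trend the caption reports: the averaged distance falls as the predicted score rises.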
Figure 5.
Precision of prediction as decision threshold is shifted. After the SVM predicted scores are sorted in descending order, the top k scores (X-axis) are used to compute the precision (Y-axis), where k = 1,…,5000.
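The precision curve described in the Figure 5 caption can be illustrated with a short sketch. The function name and the toy scores and labels below are hypothetical and are not taken from the paper; the code only shows the general top-k calculation: sort by predicted score, then divide the number of true catalytic residues among the top k by k.

```python
def precision_at_top_k(scores, labels, max_k):
    """Return precision for k = 1..max_k, where the top k residues are
    those with the highest predicted scores and labels are 1 for
    catalytic residues and 0 otherwise."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    curve = []
    for k, index in enumerate(order[:max_k], start=1):
        hits += labels[index]  # count true catalytic residues so far
        curve.append(hits / k)
    return curve


# Hypothetical toy data: five residues, two of them catalytic.
scores = [2.1, 0.3, 1.5, -0.2, 0.9]
labels = [1, 0, 0, 0, 1]
curve = precision_at_top_k(scores, labels, max_k=5)
```

With k running up to 5000 as in the caption, the same loop traces how precision changes as the decision threshold is relaxed.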
Figure 6.
Performance by catalytic residue type. The plot on the left considers only catalytic residues: light bars show the distribution of the different types of catalytic residues, and dark bars show sensitivity. The number above each residue type letter in the left panel is the precision for that residue type. The plot on the right shows the frequencies of the different residue types.
Figure 7.
Analysis of the top-ranked residues from the solved structural genomics targets. (A) The crystal structure of the hypothetical protein aq_1665 from Aquifex aeolicus as determined by the Midwest Structural Genomics Center (pdb:1XM7 chain A). Here we highlight the four top-scored residues: histidine 145 (SVM score of 4.5), aspartate 50 (3.5), histidine 111 (3.4), and aspartate 7 (3.3). (B) The crystal structure of a hypothetical protein from Enterococcus faecalis V583 as determined by the Midwest Structural Genomics Center (pdb:2AZ4 chain A). Here we highlight the top-scoring residues: aspartate 189 (4.2), histidine 94 (3.8), histidine 92 (3.5), histidine 167 (3.3), and histidine 97 (3.2) along with the bound Zn2+ ions. (C) The crystal structure of YJJV, TATD Homolog from Escherichia coli k12 as determined by the New York Structural Genomics Research Consortium (NYSGRC) (pdb:1ZZM chain A). Again, the top residues are highlighted along with the bound ion. (D) The crystal structure of conserved hypothetical protein from Pseudomonas aeruginosa PAO1 as determined by the Midwest Structural Genomics Center (pdb:1Z7A chain E). Here we highlight the three top-scored residues: histidine 126 (SVM score of 3.1), histidine 259 (SVM score of 3.0), and glutamate 36 (SVM score of 2.3).
