BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S1.
doi: 10.1186/1471-2105-15-S2-S1. Epub 2014 Jan 24.

Automatic classification of protein structures using low-dimensional structure space mappings

Daniel Asarnow et al. BMC Bioinformatics. 2014.

Abstract

Background: Protein function is closely intertwined with protein structure. Discovering meaningful structure-function relationships is of utmost importance in protein biochemistry and has led to the creation of high-quality, manually curated classification databases, such as the gold-standard SCOP (Structural Classification of Proteins) database. SCOP and its counterparts, such as CATH, provide a detailed and comprehensive description of the structural and evolutionary relationships of proteins of known structure and are widely used in structural and computational biology. Because manual classification is both subjective and highly laborious, automated classification of novel structures is an increasingly active area of research. Methods for automated structure classification have become even more important in recent years because of the explosion in the number of solved structures arising from various structural biology initiatives. In this paper we propose an approach to structure classification based on creating and tessellating low-dimensional maps of the protein structure space (MPSS). Given a set of protein structures, an MPSS is a low-dimensional embedding of structural similarity-based distances between the molecules. In an MPSS, the proteins under consideration (such as all the proteins in the PDB or sub-samplings thereof) are represented as a point cloud, and structural relatedness maps to spatial adjacency of the points. We present methods and results showing that MPSS can be used to create tessellations of the protein space comparable to the clade systems within SCOP. Although we have used SCOP as the gold standard, the proposed approach is equally applicable to other structural classifications.

Methods: In the proposed approach, we construct MPSS from pairwise alignment distances obtained from four established structure alignment algorithms (CE, Dali, FATCAT and MATT). The low-dimensional embeddings are computed using multidimensional scaling (MDS). Then, using the remotely homologous Superfamily and Fold levels of the hierarchical SCOP database, a distance threshold is determined that relates adjacency in the low-dimensional map to functional relationships. The optimal threshold is the value that maximizes the total true classification rate with respect to the SCOP classification. We also show that determining such a threshold is often straightforward once the structural relationships are represented using MPSS.
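The two steps described in the Methods (embedding pairwise distances with MDS, then choosing a distance threshold) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes unweighted SMACOF with a random start, raw stress as the objective, and a "total true classification rate" defined as the fraction of pairs whose predicted relation (distance at or below the threshold) matches the reference label. The function names and the toy coordinates are hypothetical and chosen only for illustration.

```python
import math
import random


def pairwise_distances(points):
    """Euclidean distance matrix for a list of coordinate lists."""
    return [[math.dist(p, q) for q in points] for p in points]


def raw_stress(delta, points):
    """Sum of squared gaps between target and embedded distances."""
    d = pairwise_distances(points)
    n = len(points)
    return sum((delta[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof(delta, dim=2, iterations=100, seed=0):
    """Unweighted SMACOF: repeated Guttman transforms from a random start."""
    rng = random.Random(seed)
    n = len(delta)
    points = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    history = [raw_stress(delta, points)]
    for _ in range(iterations):
        d = pairwise_distances(points)
        updated = []
        for i in range(n):
            row = [0.0] * dim
            for j in range(n):
                if i == j or d[i][j] == 0.0:
                    continue  # coincident points contribute nothing
                ratio = delta[i][j] / d[i][j]
                for k in range(dim):
                    row[k] += ratio * (points[i][k] - points[j][k])
            updated.append([value / n for value in row])
        points = updated
        history.append(raw_stress(delta, points))
    return points, history


def best_threshold(distances, related):
    """Threshold maximizing the fraction of correctly classified pairs.

    A pair is predicted 'related' when its distance is <= threshold.
    Ties are resolved in favor of the smallest threshold.
    """
    best_t, best_rate = None, -1.0
    for t in sorted(set(distances)):
        correct = sum((d <= t) == label for d, label in zip(distances, related))
        rate = correct / len(distances)
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t, best_rate


# Toy input (made up): six "structures" in two well-separated groups.
true_coords = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
group = [0, 0, 0, 1, 1, 1]
delta = pairwise_distances(true_coords)
embedding, stress_history = smacof(delta, dim=2)

embedded = pairwise_distances(embedding)
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)]
threshold, accuracy = best_threshold(
    [embedded[i][j] for i, j in pairs],
    [group[i] == group[j] for i, j in pairs],
)
print(f"stress {stress_history[0]:.3f} -> {stress_history[-1]:.3f}")
print(f"threshold {threshold:.3f}, true classification rate {accuracy:.3f}")
```

The sketch scans only thresholds that occur as observed distances, which is sufficient because the classification rate can change only at those values. A random start is used here for brevity; the caption of Figure 1 reports that a CMDS initialization starts from much lower stress and converges in fewer iterations.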

Results and conclusion: We demonstrate that MPSS constitute highly accurate representations of protein fold space and enable automatic classification of SCOP Superfamily and Fold-level relationships. The results from our automatic classification approach are remarkably similar to those found in the distantly homologous Superfamily level and the quite remotely homologous Fold level of SCOP. The significance of our results is underlined by the fact that most automated methods developed thus far have only managed to match the closest-homology Family level of the SCOP hierarchy and tend to differ considerably at the Superfamily and Fold levels. Furthermore, our research demonstrates that projection into a low-dimensional space using MDS is a better noise-reducing transformation of pairwise distances than the variety of probability- and alignment-length-based transformations currently used by structure alignment algorithms.


Figures

Figure 1
Convergence of SMACOF with random and CMDS initializations. Left: with the random initialization, the initial stress is much greater and SMACOF takes many more iterations to converge. Right: The Procrustes statistic between the random and CMDS initializations shows that their SMACOF coordinates become very similar as the number of iterations increases.
Figure 2
MPSS MATT(C3). A 3D MPSS, constructed using CMDS in conjunction with raw MATT distances. Points in the MPSS are colored by SCOP Class. The reader may note the strong separation of the major protein classes. In particular, small proteins ('g,' green) cluster densely near the origin, while the all alpha ('a,' brown) and all beta ('b,' blue) classes form two roughly orthogonal axial structures. Between these lies the α+β class ('d,' magenta), with the α/β class ('c,' cyan) rising high above the α,β plane.
Figure 3
MPSS MATT(S3). A 3D MPSS constructed using SMACOF and raw MATT distances. As in Figure 2, the protein classes are well separated. However, the qualitative appearance of the MPSS is very different and points in the MPSS appear to lie near a spherical manifold. Compared with Figure 2, the α/β class ('c,' cyan) is very tightly clustered, while the small proteins ('g,' green) are spread farther apart.
Figure 4
Example ROC and threshold-accuracy curves. A) Plot of true classification rate vs. threshold distance for a 24D MPSS based on Dali and SMACOF. The maximum, corresponding to the intersection of the ROC curve and the iso-performance tangent, is marked in red. B) ROC curve for the same set of MPSS distances. The point associated with the best threshold from A) is indicated by the red circle.
Figure 5
Neighbor-joining tree for Dali(S12). A) The complete cluster tree for distances from the 12D SMACOF MPSS based on Dali. Colors indicate membership in the 200 most populated SCOP Superfamilies. Because it is difficult to distinguish highly similar colors, a smaller number of distinct colors are reused periodically across the tree, while guaranteeing that color labels are locally unique. The reader may note that different superfamilies are found at different depths within the tree, so that the use of a single threshold for all clusters is only approximate. B) The part of the tree that corresponds to the PH domain-like superfamily. This superfamily is entirely and exclusively contained within a single subtree. C) The region of the tree containing the Immunoglobulin and Fibronectin Type III superfamilies. These superfamilies do not correspond either to a single subtree or to a single depth within the tree.
Figure 6
MPSS with selected SCOP superfamilies highlighted. The figure shows a view of FATCAT(C3), with several selected SCOP superfamilies highlighted. Points representing members of other superfamilies are transparent and dimmed, but present in the map. It can be seen that the superfamilies are well separated in the mapping space.
Figure 7
AUC for SCOP Superfamily prediction vs. MPSS dimensionality. Superfamily classification AUC values are shown for all 24 PSS representations. The AUC is plotted against MPSS dimensionality; pairwise distances are depicted as a flat line for reference. The legend identifies each curve by its PSS name code, which combines the aligner name with Z for probability scores, C for classical scaling, or S for SMACOF. Detailed descriptions of the trends found in this figure are given in the text. The figure shows that the MPSS Dali(S12-30) and MATT(S12-30) are the most accurate.
Figure 8
AUC for SCOP Fold prediction vs. MPSS dimensionality. Fold classification AUC values are shown for all 24 PSS representations. The AUC is plotted against the MPSS dimensionality; pairwise distances are depicted as a flat line for reference. Detailed descriptions of the trends found in this figure are given in the text. The figure shows that the MPSS Dali(S12) is the most accurate at this SCOP level.
Figure 9
Histogram of completed Dali alignments. A 100-point histogram of the number of successful Dali alignments for each structure. The histogram counts fit tightly to a power law, R² = 0.988. This corroborates the hypothesis that high-degree nodes mediate the relationships between most other pairs of proteins in a scale-free manner.
Figure 10
Histogram of cluster sizes for SCOP Superfamily level. Frequency of occurrence is plotted versus the size of clusters obtained by hierarchical clustering using the Superfamily-level distance threshold. Pairwise MATT distances result in a large number of small clusters, including many singletons. For MPSS, in comparison, a smaller number of larger clusters which map more directly to the SCOP Superfamily classification are obtained.
Figure 11
Histogram of cluster sizes for SCOP Fold level. The frequency of occurrence is plotted against the size of clusters obtained by hierarchical clustering using the Fold-level distance threshold. As for the SCOP Superfamily level, pairwise MATT distances result in a large number of small clusters, including many singletons. In comparison, for MPSS, a smaller number of larger clusters which map more directly to the SCOP Fold classification are obtained.
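
The cluster-size histograms in Figures 10 and 11 come from cutting a hierarchical clustering at a distance threshold. The abstract does not state which linkage was used, so the following is only a rough sketch that assumes single linkage: any two structures whose distance is at or below the threshold end up in the same cluster. The function names and the toy distances are hypothetical.

```python
from collections import Counter


def threshold_clusters(dist, threshold):
    """Single-linkage clusters: merge items whose distance is <= threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def cluster_size_histogram(labels):
    """Map each cluster size to the number of clusters of that size."""
    sizes = Counter(labels).values()
    return dict(Counter(sizes))


# Toy input (made up): five items placed on a line.
positions = [0.0, 1.0, 2.0, 10.0, 20.0]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = threshold_clusters(dist, threshold=1.5)
histogram = cluster_size_histogram(labels)
print(histogram)
```

With this toy input the first three items chain into one cluster and the remaining two are singletons. A histogram dominated by singletons is the pattern the captions attribute to raw pairwise MATT distances.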

Cited by

  • Structural Bridges through Fold Space.
    Edwards H, Deane CM. PLoS Comput Biol. 2015 Sep 15;11(9):e1004466. doi: 10.1371/journal.pcbi.1004466. PMID: 26372166.

