PLoS One. 2014 Jan 28;9(1):e87107.
doi: 10.1371/journal.pone.0087107. eCollection 2014.

Improving predictions of protein-protein interfaces by combining amino acid-specific classifiers based on structural and physicochemical descriptors with their weighted neighbor averages

Fábio R de Moraes et al. PLoS One.

Abstract

Protein-protein interactions are involved in nearly all regulatory processes in the cell and are considered one of the most important topics in molecular biology and pharmaceutical sciences, yet they are still not fully understood. Structural and computational biology have contributed greatly to elucidating the mechanisms of protein interactions. In this paper, we present a collection of the physicochemical and structural characteristics that distinguish interface-forming residues (IFR) from free surface residues (FSR). We formulated a linear discriminant analysis (LDA) classifier to assess whether chosen descriptors from the BlueStar STING database (http://www.cbi.cnptia.embrapa.br/SMS/) are suitable for such a task. Receiver operating characteristic (ROC) analysis indicates that the particular physicochemical and structural descriptors used for building the linear classifier perform much better than a random classifier and, in fact, outperform some of the previously published procedures whose performance indicators were recently compared by other research groups. The results presented here show that the selected set of descriptors can be used to predict IFRs even when homologous proteins are missing (particularly important for orphan proteins, for which no homologue is available for comparative analysis) or when certain conformational changes accompany interface formation. The development of amino acid type-specific classifiers is shown to increase IFR classification performance. We also found that adding an amino acid conservation attribute did not improve the classification. This result indicates that the increase in predictive power associated with amino acid conservation is exhausted by adequate use of an extensive list of independent physicochemical and structural parameters that, by themselves, fully describe the nano-environment at protein-protein interfaces.
The IFR classifier developed in this study is now integrated into the BlueStar STING suite of programs. Consequently, the prediction of protein-protein interfaces for all proteins available in the PDB is possible through the STING_interfaces module, accessible at http://www.cbi.cnptia.embrapa.br/SMS/predictions/index.html.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Comparing LDA classifiers with ROC analysis.
Performance evaluation using ROC analysis for the tryptophan LDA classifier (a), the glycine LDA classifier (b) and the aggregated result (in gray) of the 20 independent amino acid LDA classifiers (d). The 10 ROC curves in blue (one per fold of the 10-fold cross validation) correspond to the classifier that is not specific to the type of amino acid. The performance indicators AUC and MCC are displayed for both classifiers in (a) and (b). The results for all generated classifiers are shown in (c): the AUC (white) and MCC (gray). The tryptophan classifier performs best, closely followed (by AUC) by those for aspartic acid, methionine, isoleucine, leucine and valine. The Sting-LDA aggregated classifier, which uses 20 amino acid-specific classifiers with no WNA descriptors, has an average AUC of 0.828, whereas the amino acid-unspecific classifier has an average AUC of 0.751.
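As an illustration of the AUC indicator quoted in this caption, the following is a minimal sketch that computes AUC as the probability that a randomly chosen IFR receives a higher classifier score than a randomly chosen FSR (ties count as one half). The function name `roc_auc` and the toy scores are assumptions made for illustration; they are not taken from the paper or from the STING software.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison: P(score of a positive > score of a negative).

    labels: 1 = interface-forming residue (IFR), 0 = free surface residue (FSR).
    Ties between a positive and a negative contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical scores, not results from the paper).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)  # 8 of the 9 positive/negative pairs are ranked correctly
```

A random classifier scores 0.5 on this measure, which is the baseline the AUC values in the caption are implicitly compared against.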
Figure 2. Comparing LDA classifiers using weighted neighbor averages (WNA) descriptors with ROC analysis.
The results for all generated amino acid classifiers are shown in (a): the AUC (white) and MCC (gray). The amino acid ranking order is similar to that of Figure 1c, except for the glycine and cysteine LDA models. The Sting-LDA-WNA aggregated classifier, built from the 20 amino acid-specific classifiers with WNA descriptors, has an average AUC of 0.949, whereas the amino acid-unspecific classifier has an average AUC of 0.944 (indicating that the performance gain from the amino acid-specific classifiers is statistically relevant; they are therefore recommended for better IFR classification).
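The text on this page does not state how the weighted neighbor averages are computed. The sketch below shows one plausible form, assuming that each neighbor within a fixed radius contributes its descriptor value weighted by the inverse of its distance. The weighting scheme, the 10 Å radius, and the name `weighted_neighbor_average` are all assumptions for illustration, not the authors' definition.

```python
import math


def weighted_neighbor_average(coords, values, index, radius=10.0):
    """Inverse-distance weighted mean of a descriptor over spatial neighbors.

    ASSUMPTION: weights are 1/distance and neighbors are residues within
    `radius` of residue `index`; the paper's actual scheme may differ.
    coords: list of (x, y, z) tuples, one per residue.
    values: descriptor value for each residue.
    """
    center = coords[index]
    weighted_sum = 0.0
    weight_total = 0.0
    for j, (point, value) in enumerate(zip(coords, values)):
        if j == index:
            continue  # the residue itself is not its own neighbor
        d = math.dist(center, point)
        if 0.0 < d <= radius:
            w = 1.0 / d
            weighted_sum += w * value
            weight_total += w
    # With no neighbors in range, fall back to 0.0 (another assumption).
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Toy example: two neighbors in range (2 and 4 units away), one far outside.
coords = [(0, 0, 0), (2, 0, 0), (4, 0, 0), (50, 0, 0)]
values = [1.0, 3.0, 6.0, 100.0]
wna = weighted_neighbor_average(coords, values, index=0)
```

In this toy case the weights are 0.5 and 0.25, so the result is (0.5·3 + 0.25·6) / 0.75 = 4.0; the distant residue does not contribute.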
Figure 3. Cut-off performance dependence of Sting-LDA-WNA classifier.
Performance indicators observed for the classifiers built from DS30. As the cut-off is increased from 0.10 to 0.50, the accuracy increases to its peak; it then gradually decreases as the cut-off is raised further. The precision increases steadily with the cut-off: at higher cut-off values, fewer entries are classified as IFR, leading to fewer false positives. Sensitivity shows the opposite behavior, illustrating the performance trade-off: as the cut-off increases, more entries are labeled as FSR, so fewer of the entries labeled as IFR are misclassified, but more true IFRs are missed. The highest MCC value occurred at the same cut-off as the highest accuracy, 0.50. Box plots were obtained with 10-fold cross validation.
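The trade-off described in this caption can be reproduced on toy data. The sketch below computes accuracy, precision, sensitivity and MCC from a confusion matrix at a given cut-off, assuming a residue is called IFR when its score is at or above the cut-off. The function names and the scores are hypothetical; only the standard metric definitions are relied on.

```python
import math


def confusion(scores, labels, cutoff):
    """Count TP, FP, TN, FN when score >= cutoff is predicted IFR (label 1)."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(scores, labels, cutoff):
    """Accuracy, precision, sensitivity and Matthews correlation coefficient."""
    tp, fp, tn, fn = confusion(scores, labels, cutoff)
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Toy scores (hypothetical, not data from the paper).
scores = [0.95, 0.85, 0.60, 0.55, 0.45, 0.35, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
low, mid, high = (metrics(scores, labels, c) for c in (0.1, 0.5, 0.9))
```

On this toy set, raising the cut-off from 0.1 to 0.9 moves precision from 0.5 to 1.0 while sensitivity falls from 1.0 to 0.25, and both accuracy and MCC peak at the intermediate cut-off of 0.5.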
Figure 4. IFR prediction performance dependence on cut-off values for the LDA classifier with conservation attributes and comparison with Sting-LDA-WNA.
(a) Performance of the classifier with the amino acid conservation descriptor: above a classification cut-off of 0.5, the precision is always above 80%, reaching more than 95% at a cut-off of 0.9. The MCC is highest at a cut-off of 0.3; nevertheless, a cut-off of 0.5 gives a similar MCC. (b) Comparing the performance of Sting-LDA-WNA with STING-LDA, no difference is noted for the three selected cut-off values.
Figure 5. Comparison of Sting-LDA-WNA to other methods based on the test set 35Enz and induced fit assessment on benchmark 4.0 cases.
(a) By selecting different thresholds for sensitivity, or coverage, the precision of the methods for IFR classification can be compared. For high interface coverage (75%), Sting-LDA-WNA (marked in this figure as STING-LDA) has the highest precision among the methods compared (37%). For balanced coverage (50%), Sting-LDA ranks third (47%), not far behind PINUP (48%) and Meta-PPISP (50%). For lower coverage (25%), Meta-PPISP still ranks first with 70% precision, while PINUP and Sting-LDA have similar precision (59%). (b) Sting-LDA-WNA performance on the “medium” and “difficult” classes of the protein-protein docking benchmark: the AUC is 0.72, a 6% decrease compared to the DS30 performance.
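Comparing methods at fixed coverage, as in panel (a), amounts to ranking residues by score and reading off the precision at the point where the desired fraction of true interface residues has been recovered. The sketch below is one simple way to do this; the function name `precision_at_coverage` and the toy data are assumptions, and tied scores are not given special treatment.

```python
def precision_at_coverage(scores, labels, coverage):
    """Precision at the smallest top-k prediction set reaching the given coverage.

    coverage: target sensitivity in (0, 1], e.g. 0.25, 0.50 or 0.75.
    labels: 1 = true interface residue, 0 = free surface residue.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive examples")
    # Rank residues from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        if tp / total_pos >= coverage:
            return tp / k  # precision among the top-k predictions
    return tp / len(ranked)


# Toy data (hypothetical, not from the 35Enz test set).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 1, 0, 1]
p25 = precision_at_coverage(scores, labels, 0.25)
p50 = precision_at_coverage(scores, labels, 0.50)
p75 = precision_at_coverage(scores, labels, 0.75)
```

As in the caption, precision in this toy example falls as the required coverage rises, because more lower-scoring residues must be accepted to recover more of the interface.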
Figure 6. Percentage of chains (relative to the total number present in the DS35), ordered by the size of their interfaces, showing number of cysteine residues located at those interfaces.
The numbers on the x-axis represent the number of cysteine residues at the interface. The interfaces are grouped into three major groups, ordered by size. The largest number of cysteine residues is found at very large interfaces.
Figure 7. Filters used to establish DS30.
Filters were applied in sequential order to eliminate structures retrieved from three databases (PDB, PISA and UniProt) and to select the protein complexes used for this study. In November 2010, there were 68,997 structures in the PDB, and 60,031 of those entries were solved using X-ray crystallography. Removing DNA/RNA chains resulted in 55,962 files. Only 29,578 entries had an oligomeric state, as defined by PISA, equal to the PDB asymmetric unit. Approximately half of these entries were found to have at least two chains. We defined a crystal resolution threshold and used only structures solved at resolutions up to 3 Å. The double filter removed chains smaller than 50 residues and interface areas smaller than 200 Å², yielding 14,094 files. A few entries were removed because they had a fragment flag in the UniProt database. The last filter removed redundant entries at a 30% positional sequence identity threshold. The resulting dataset is referred to as DS30 (4,219 PDB files).
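The sequential filtering described in this caption can be sketched as an ordered list of predicates applied one after another, recording how many entries survive each step. The `Entry` fields and the example records below are hypothetical stand-ins for data that would come from the PDB, PISA and UniProt. The final redundancy filter at 30% sequence identity is omitted because it requires sequence clustering, which is beyond a short example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical per-structure record; field names are illustrative only."""
    pdb_id: str
    method: str
    has_nucleic_acid: bool
    pisa_matches_asu: bool
    n_chains: int
    resolution: float        # in angstroms
    min_chain_length: int    # residues in the shortest chain
    interface_area: float    # in square angstroms
    is_fragment: bool


# Each filter is (description, predicate that returns True to KEEP the entry).
FILTERS = [
    ("X-ray structures only", lambda e: e.method == "X-RAY"),
    ("no DNA/RNA chains", lambda e: not e.has_nucleic_acid),
    ("PISA oligomeric state equals asymmetric unit", lambda e: e.pisa_matches_asu),
    ("at least two chains", lambda e: e.n_chains >= 2),
    ("resolution up to 3 angstroms", lambda e: e.resolution <= 3.0),
    ("chain >= 50 residues and interface >= 200 square angstroms",
     lambda e: e.min_chain_length >= 50 and e.interface_area >= 200.0),
    ("no UniProt fragment flag", lambda e: not e.is_fragment),
]


def apply_filters(entries, filters=FILTERS):
    """Apply filters in order; return survivors and the count after each step."""
    kept = list(entries)
    counts = []
    for name, keep in filters:
        kept = [e for e in kept if keep(e)]
        counts.append((name, len(kept)))
    return kept, counts


# Made-up records, each designed to fail at a different step (or none).
entries = [
    Entry("AAAA", "X-RAY", False, True, 2, 2.0, 120, 800.0, False),
    Entry("BBBB", "NMR", False, True, 2, 0.0, 90, 600.0, False),
    Entry("CCCC", "X-RAY", True, True, 3, 2.2, 150, 900.0, False),
    Entry("DDDD", "X-RAY", False, True, 2, 3.5, 200, 700.0, False),
    Entry("EEEE", "X-RAY", False, True, 2, 2.5, 40, 500.0, False),
]
kept, counts = apply_filters(entries)
```

Keeping the filters as an ordered list makes it easy to report the number of surviving structures after each step, in the same way the caption reports them.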

Grants and funding

The authors thank Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, Grants #2009/03108-1 and #2009/16376-4). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.