PLoS Comput Biol. 2017 Jan 5;13(1):e1005324.
doi: 10.1371/journal.pcbi.1005324. eCollection 2017 Jan.

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model

Sheng Wang et al. PLoS Comput Biol. 2017.

Abstract

Motivation: Protein contacts contain key information for understanding protein structure and function, so contact prediction from sequence is an important problem. Exciting progress has recently been made on this problem, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction.

Method: This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformations of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformations of pairwise information, including the output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationships and thus obtain higher-quality contact predictions regardless of how many sequence homologs are available for the proteins in question.
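The two ideas in the Method paragraph, a residual block of convolutions and the hand-off from per-residue (1-dimensional) features to pairwise (2-dimensional) features, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the ReLU activation, the order of activation and convolution, and the choice to build the feature for pair (i, j) by simply concatenating the vectors of residues i and j are all assumptions made for illustration.

```python
from typing import List

Features = List[List[float]]        # L residues x C channels
Kernel = List[List[List[float]]]    # width x C_in x C_out (odd width)


def relu(x: Features) -> Features:
    """Element-wise activation (ReLU is an assumption, not from the abstract)."""
    return [[max(0.0, v) for v in row] for row in x]


def conv1d_same(x: Features, kernel: Kernel) -> Features:
    """1D convolution along the sequence with zero padding (output length = L)."""
    length, width = len(x), len(kernel)
    c_in, c_out = len(kernel[0]), len(kernel[0][0])
    half = width // 2
    out = [[0.0] * c_out for _ in range(length)]
    for i in range(length):
        for k in range(width):
            j = i + k - half
            if 0 <= j < length:  # positions outside the sequence act as zeros
                for a in range(c_in):
                    for b in range(c_out):
                        out[i][b] += x[j][a] * kernel[k][a][b]
    return out


def residual_block(x: Features, k1: Kernel, k2: Kernel) -> Features:
    """Return x + F(x), where F is two (activation, convolution) stages.

    The channel count must be preserved so the shortcut addition is valid.
    """
    h = conv1d_same(relu(x), k1)
    h = conv1d_same(relu(h), k2)
    return [[xv + hv for xv, hv in zip(xr, hr)] for xr, hr in zip(x, h)]


def to_pairwise(x: Features) -> List[List[List[float]]]:
    """Hypothetical 1D-to-2D conversion: pair (i, j) gets x[i] joined with x[j]."""
    return [[xi + xj for xj in x] for xi in x]


# Example: 3 residues, 2 channels, all-zero kernels of width 3.
x = [[1.0, -2.0], [0.5, 3.0], [0.0, 1.5]]
zero_kernel = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
y = residual_block(x, zero_kernel, zero_kernel)
pairs = to_pairwise(y)
```

With all-zero kernels the block reduces to the identity mapping, which is the property that makes very deep stacks of such blocks trainable: each block only has to learn a correction to its input. The pairwise tensor has shape L x L x 2C and would be stacked with EC information and pairwise potentials before a 2-dimensional residual network.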

Results: Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy of our method, CCMpred (a representative EC method) and MetaPSICOV (the CASP11 winner) is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Our contact-assisted models also have much better quality than template-based models, especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, although trained mostly on soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β protein of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then.
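For readers unfamiliar with the "top L" and "top L/10" accuracy figures quoted in the Results, the following sketch shows one common way such a number is computed: rank all residue pairs by predicted score, keep the top L/k pairs, and report the fraction that are true contacts. The abstract does not define "long-range"; the default minimum sequence separation of 24 residues used here is an assumption based on common practice in contact prediction benchmarks, and the function name and matrix layout are hypothetical.

```python
from typing import List


def top_accuracy(scores: List[List[float]],
                 native: List[List[bool]],
                 k: int = 1,
                 min_sep: int = 24) -> float:
    """Fraction of true contacts among the top L/k predicted pairs.

    scores[i][j] is the predicted contact score for residues i < j and
    native[i][j] says whether that pair is a contact in the true structure.
    Only pairs with j - i >= min_sep are ranked (assumed long-range cutoff).
    """
    length = len(native)
    ranked = [(scores[i][j], i, j)
              for i in range(length)
              for j in range(i + min_sep, length)]
    ranked.sort(key=lambda t: t[0], reverse=True)
    top = ranked[:max(1, length // k)]
    if not top:
        return 0.0  # protein too short to have any pair at this separation
    # If fewer than L/k pairs exist, divide by the number actually selected.
    return sum(1 for _, i, j in top if native[i][j]) / len(top)


# Toy example: 6 residues, separation cutoff lowered to 2 for illustration.
L = 6
scores = [[0.1] * L for _ in range(L)]
native = [[False] * L for _ in range(L)]
scores[0][5], native[0][5] = 0.9, True    # correct, ranked first
scores[1][4], native[1][4] = 0.8, True    # correct, ranked second
scores[0][3], native[0][3] = 0.7, False   # incorrect, ranked third
scores[0][1], native[0][1] = 1.0, True    # adjacent pair, excluded by min_sep

acc_top_half = top_accuracy(scores, native, k=2, min_sep=2)   # top L/2 = 3 pairs
acc_top_one = top_accuracy(scores, native, k=6, min_sep=2)    # top L/6 = 1 pair
```

In the toy example two of the three top-ranked pairs are true contacts, so the top L/2 accuracy is 2/3, while the single top-ranked pair is correct, giving a top L/6 accuracy of 1. Accuracy typically rises as k grows because only the most confident predictions are kept, which matches the pattern of the top L versus top L/10 numbers reported above.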

Availability: http://raptorx.uchicago.edu/ContactMap/.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Illustration of our deep learning model for contact prediction where L is the sequence length of one protein under prediction.
Fig 2. Top L/5 accuracy of our method (green), CCMpred (blue) and MetaPSICOV (red) with respect to the amount of homologous information measured by ln(Meff). The accuracy on the union of the 105 CASP and 76 CAMEO targets is displayed in (A) medium-range and (B) long-range. The accuracy on the membrane protein set is displayed in (C) medium-range and (D) long-range.
Fig 3. Quality comparison of top 1 contact-assisted models generated by our method, CCMpred and MetaPSICOV on the 105 CASP11 targets (red square), 76 CAMEO targets (blue diamond) and 398 membrane protein targets (green triangle), respectively. (A) and (B): comparison between our method (X-axis) and CCMpred (Y-axis) in terms of TMscore and lDDT, respectively. (C) and (D): comparison between our method (X-axis) and MetaPSICOV (Y-axis) in terms of TMscore and lDDT, respectively. lDDT is scaled to between 0 and 1.
Fig 4. Comparison between our contact-assisted models of the three test sets and their template-based models in terms of (A) TMscore and (B) lDDT score. The top 1 models are evaluated.
Fig 5. Quality comparison (measured by TMscore) of contact-assisted models generated by our server, CCMpred and MetaPSICOV on the 41 CAMEO hard targets. (A) our server (X-axis) vs. CCMpred and (B) our server (X-axis) vs. MetaPSICOV.
Fig 6. Overlap between top L/2 predicted contacts (in red or green) and the native contact map (in grey) for CAMEO target 2nc8A. Red (green) dots indicate correct (incorrect) prediction. (A) The comparison between our prediction (in upper-left triangle) and CCMpred (in lower-right triangle). (B) The comparison between our prediction (in upper-left triangle) and MetaPSICOV (in lower-right triangle).
Fig 7. Superimposition between our predicted model (red) and its native structure (blue) for the CAMEO test protein (PDB ID 2nc8 and chain A).
Fig 8. The list of top models submitted by CAMEO servers for 2nc8A and their quality scores. The rightmost column displays the TMscore of submitted models. Server60 is our contact web server.
Fig 9. Overlap between top L/2 predicted contacts (in red or green) and the native contact map (in grey) for CAMEO target 5dcjA. Red (green) dots indicate correct (incorrect) prediction. (A) The comparison between our prediction (in upper-left triangle) and CCMpred (in lower-right triangle). (B) The comparison between our prediction (in upper-left triangle) and MetaPSICOV (in lower-right triangle).
Fig 10. Superimposition between the predicted models (red) and the native structure (blue) for the CAMEO test protein (PDB ID 5dcj and chain A). The models are built by CNS from the contacts predicted by (A) our method, (B) CCMpred, and (C) MetaPSICOV.
Fig 11. The list of top models submitted by CAMEO-participating servers for 5dcjA and their quality scores. The rightmost column displays the TMscore of submitted models. Server60 is our contact web server.
Fig 12. Overlap between top L/2 predicted contacts (in red and green) and the native contact map (in grey) for CAMEO target 5djeB. Red (green) dots indicate correct (incorrect) prediction. (A) The comparison between our prediction (in upper-left triangle) and CCMpred (in lower-right triangle). (B) The comparison between our prediction (in upper-left triangle) and MetaPSICOV (in lower-right triangle).
Fig 13. Superimposition between the predicted models (red) and the native structure (blue) for the CAMEO test protein (PDB ID 5dje and chain B). The models are built by CNS from the contacts predicted by (A) our method, (B) CCMpred, and (C) MetaPSICOV.
Fig 14. The list of top models submitted by CAMEO-participating servers for 5djeB and their quality scores. The rightmost column displays the TMscore of submitted models. Server60 is our contact web server.
Fig 15. Overlap between top L/2 predicted contacts (in red and green) and the native contact map (in grey) for CAMEO target 5f5pH. Red (green) dots indicate correct (incorrect) prediction. (A) The comparison between our prediction (in upper-left triangle) and CCMpred (in lower-right triangle). (B) The comparison between our prediction (in upper-left triangle) and MetaPSICOV (in lower-right triangle).
Fig 16. Superimposition between the predicted models (red) and the native structure (blue) for the CAMEO target 5f5pH. The models are built by CNS from the contacts predicted by (A) our method, (B) CCMpred, and (C) MetaPSICOV.
Fig 17. The list of top models submitted by CAMEO-participating servers for 5f5pH and their quality scores. The rightmost column displays the TMscore of submitted models. Server60 is our contact web server.
Fig 18. (A) Structure superimposition of Drosophila SD2 and Human SD2. (B) Conformation change of Drosophila SD2 in binding with Rock-SBD.
Fig 19. Overlap between predicted contacts (in red and green) and the native (in grey) for CAMEO target 5flgB. Red (green) dots indicate correct (incorrect) prediction. Top L/2 predicted contacts by each method are shown. The left picture shows the comparison between our prediction (in upper-left triangle) and CCMpred (in lower-right triangle) and the right picture shows the comparison between our prediction (in upper-left triangle) and MetaPSICOV (in lower-right triangle).
Fig 20. Superimposition between the predicted models (red) and the native structure (blue) for the CAMEO test protein 5flgB. The models are built by CNS from the contacts predicted by (A) our method, (B) CCMpred, and (C) MetaPSICOV.
Fig 21. The list of top models submitted by CAMEO-participating servers for 5flgB and their quality scores. The rightmost column displays the model TMscore. Server60 is our contact web server.
Fig 22. A building block of our residual network with X_l and X_{l+1} being input and output, respectively. Each block consists of two convolution layers and two activation layers.
