Cell Rep Methods. 2021 Jul 26;1(3):100014.
doi: 10.1016/j.crmeth.2021.100014. Epub 2021 Jun 21.

Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations

Wei Zheng et al. Cell Rep Methods.

Abstract

Structure prediction for proteins lacking homologous templates in the Protein Data Bank (PDB) remains a significant unsolved problem. We developed a protocol, C-I-TASSER, that integrates interresidue contact maps from deep neural-network learning with the cutting-edge I-TASSER fragment assembly simulations. Large-scale benchmark tests showed that C-I-TASSER can fold more than twice as many non-homologous proteins as I-TASSER, which does not use contacts. When applied to a folding experiment on 8,266 unsolved Pfam families, C-I-TASSER successfully folded 4,162 domain families, including 504 folds that are not found in the PDB. Furthermore, it created correct folds for 85% of proteins in the SARS-CoV-2 genome, despite the high mutation rate of the virus and sparse sequence profiles. The results demonstrate the critical importance of coupling whole-genome and metagenome-based evolutionary information with optimal structure assembly simulations for solving the problem of non-homologous protein structure prediction.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
The C-I-TASSER pipeline for protein structure prediction. The pipeline starts with contact-map prediction from whole-genome and metagenome sequences based on deep residual convolutional neural networks (top) and LOMETS-based threading template identification (bottom). Full-length structure models are then constructed by iterative REMC fragment assembly simulations under the guidance of the deep-learning contact maps and template-based restraints. Abbreviations: MSA, multiple sequence alignment; REMC, replica-exchange Monte Carlo.
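The caption above names replica-exchange Monte Carlo (REMC) without describing the exchange step. As a minimal sketch of the generic textbook technique, and not of I-TASSER's actual implementation, the standard Metropolis criterion for swapping two replicas can be written as follows. The function names, the unit Boltzmann constant, and the neighbour-only swap schedule are assumptions made for illustration.

```python
import math
import random


def swap_probability(energy_i, energy_j, temp_i, temp_j, k_b=1.0):
    """Metropolis probability of exchanging replica i (at temp_i) with replica j (at temp_j).

    Generic REMC criterion: min(1, exp((beta_i - beta_j) * (E_i - E_j))).
    k_b = 1.0 is an assumed unit system, not a value from the paper.
    """
    beta_i = 1.0 / (k_b * temp_i)
    beta_j = 1.0 / (k_b * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # A non-negative exponent means the swap is always accepted.
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swaps(energies, temperatures, rng=random):
    """Attempt one exchange between each pair of neighbouring temperatures.

    Returns `order`, where order[k] is the index of the conformation
    that ends up at temperatures[k].
    """
    order = list(range(len(energies)))
    for k in range(len(order) - 1):
        p = swap_probability(
            energies[order[k]], energies[order[k + 1]],
            temperatures[k], temperatures[k + 1],
        )
        if rng.random() < p:
            order[k], order[k + 1] = order[k + 1], order[k]
    return order
```

Under this criterion a low-energy conformation found at a high temperature tends to migrate toward the colder replicas, while the hot replicas keep crossing energy barriers.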
Figure 2
C-I-TASSER modeling results on the 342 hard targets in the benchmark dataset. (A) Comparison between TM scores of the first models built by C-I-TASSER and I-TASSER. (B) TM score of LOMETS templates versus accuracy of the contact map utilized by C-I-TASSER. The red circles denote the targets that can be folded by both C-I-TASSER and I-TASSER with a TM score ≥ 0.5; the black points are the targets that can be folded only by C-I-TASSER and not I-TASSER; the yellow crosses are the targets that can be folded only by I-TASSER and not C-I-TASSER; the blue crosses indicate the targets that cannot be folded by either C-I-TASSER or I-TASSER. (C) An illustrative example from 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase IspF (SCOPe: d3fpia_). The upper left shows the structure superpositions of the best LOMETS template (yellow), I-TASSER first model (pink), and C-I-TASSER first model (cyan) with the target structure (gray), and the lower right displays an overlay of predicted contacts (red) with the contacts of the target structure (gray), as well as the contacts from the C-I-TASSER model (cyan).
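The caption above treats a target as folded when the TM score is at least 0.5. A minimal sketch of the standard TM-score formula may help make that threshold concrete. It assumes the model has already been superposed onto the target and that per-residue distances (in Å) between aligned residue pairs are available; the real TM-score additionally searches for the superposition that maximizes the sum, which is omitted here. The function names and the short-chain guard on d0 are assumptions for illustration.

```python
def d0_for_length(target_length):
    """Length-dependent distance scale d0 (in angstroms) of the TM-score formula."""
    if target_length <= 21:
        # Assumed guard for very short chains, where the formula is ill-behaved.
        return 0.5
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)


def tm_score(distances, target_length):
    """TM score for a fixed superposition.

    `distances` holds the distance for each aligned residue pair;
    unaligned target residues contribute nothing to the sum.
    """
    d0 = d0_for_length(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def is_folded(distances, target_length, threshold=0.5):
    """Apply the TM score >= 0.5 criterion used in the caption."""
    return tm_score(distances, target_length) >= threshold
```

Because the sum is normalized by the target length rather than by the number of aligned residues, a model that covers only half of the target perfectly scores 0.5 at most.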
Figure 3
Case study of C-I-TASSER folding on the platypus lactating protein (PDB: 4v00). (A) The upper left shows the structure superpositions of the template (yellow) and the C-I-TASSER model (cyan) with the target structure (gray), and the lower right shows the overlay of the contact maps from contact predictors (red), the native structure (gray), and the C-I-TASSER model (cyan). (B) Comparison of contact satisfaction rates of the REMC trajectories of C-I-TASSER on two decoys. (C) Comparison of the energy during the REMC cycles for two decoys. (D) Comparison of the model TM scores during the REMC cycles. The structures are the decoy models for different simulation states.
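Panel (B) above tracks a contact satisfaction rate. One plausible reading, sketched below under stated assumptions, is the fraction of predicted contacts whose residue pair lies within a distance cutoff in a given decoy. The 8 Å cutoff is a common convention in contact prediction rather than a value stated on this page, and the function name and input layout (one 3D coordinate per residue) are hypothetical.

```python
import math


def contact_satisfaction_rate(predicted_contacts, coords, cutoff=8.0):
    """Fraction of predicted contacts (i, j) that are within `cutoff` in a decoy.

    predicted_contacts: iterable of (i, j) residue index pairs.
    coords: one (x, y, z) tuple per residue, e.g. C-beta positions (assumed).
    cutoff: distance threshold in angstroms (assumed convention).
    """
    contacts = list(predicted_contacts)
    if not contacts:
        return 0.0
    satisfied = sum(
        1 for i, j in contacts if math.dist(coords[i], coords[j]) < cutoff
    )
    return satisfied / len(contacts)


# Toy decoy: four residues on a line, 3.8 angstroms apart.
decoy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
rate = contact_satisfaction_rate([(0, 2), (0, 3)], decoy)
```

In the toy decoy, residues 0 and 2 are 7.6 Å apart and count as satisfied, while residues 0 and 3 are 11.4 Å apart and do not.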
Figure 4
Structural modeling results for unsolved Pfam families. (A) The distribution of Pfam families and benchmark targets in different C-score bins. The black circles represent the number of Pfam targets in a specific C-score bin, and the histograms are from benchmark proteins; the gray bars indicate the number of foldable targets with TM ≥ 0.5, and the white bars indicate the number of non-foldable targets. (B) Number of Pfam families at each stage of the analysis, where each set is a subset of the previous set. (C) Venn diagram for the number of foldable models for the Pfam families constructed by C-I-TASSER, Rosetta, DMPfold, and PconsFam. (D) Venn diagram for the number of novel folds for the Pfam families produced by C-I-TASSER, Rosetta, and DMPfold. (E) Comparison of the TM scores for the first models produced by C-I-TASSER versus those by DMPfold (red crosses) and PconsFam (blue circles) for 96 Pfam families that have at least one member newly solved after modeling. (F) Case study of 20 Pfam families regarded as hard by LOMETS. In each case, the model is shown in rainbow color and the solved experimental structure of a member from the same Pfam family, if available, is shown in gray.
Figure 5
Comparison of the C-I-TASSER results for the Pfam families and benchmark dataset for different C scores, Z scores, and Neff values. (A) Normalized Z score of the first LOMETS template versus the Neff of DeepMSA for the Pfam families (points) and benchmark dataset (background). The black crosses represent the Pfam targets with C ≥ −2.5, and the gray dots are Pfam targets with C < −2.5. The heatmap in the background depicts the TM scores for benchmark targets, where white regions indicate no data. (B) Box-and-whisker chart of the logarithm of the Neff values of MSAs for easy and hard targets in the Pfam families and benchmark dataset. The left corresponds to the results of the benchmark dataset, and the right contains the results for the Pfam families. The yellow boxes indicate the hard targets, and the blue boxes are the easy targets.
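The caption above relies on Neff, the number of effective sequences in an MSA, without defining it. A minimal sketch of one common definition follows: each sequence is down-weighted by the number of sequences at or above an identity cutoff to it, and the weights are summed. The 0.8 identity cutoff, the optional normalization by the square root of the alignment length, and all function names are assumptions for illustration; this page does not state which variant DeepMSA uses.

```python
import math


def sequence_identity(seq_a, seq_b):
    """Fraction of alignment columns where both sequences share the same non-gap residue."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return matches / len(seq_a)


def neff(msa, identity_cutoff=0.8, normalize_by_sqrt_length=True):
    """Number of effective sequences in an MSA of equal-length aligned strings."""
    total = 0.0
    for n, seq_n in enumerate(msa):
        # Count the sequence itself plus every other sequence similar to it.
        cluster_size = 1 + sum(
            1
            for m, seq_m in enumerate(msa)
            if m != n and sequence_identity(seq_n, seq_m) >= identity_cutoff
        )
        total += 1.0 / cluster_size
    if normalize_by_sqrt_length:
        total /= math.sqrt(len(msa[0]))
    return total


# Two identical sequences count as one effective sequence; the third is distinct.
toy_msa = ["AAAA", "AAAA", "CCCC"]
toy_neff = neff(toy_msa)
```

A shallow or highly redundant MSA gives a low Neff, which is why the caption uses it as a proxy for how much evolutionary information is available for a target.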
Figure 6
Application of C-I-TASSER to COVID-19 structure modeling. (A) C-I-TASSER models for all 24 proteins in the SARS-CoV-2 genome, including 4 structural proteins and 20 non-structural proteins. (B) The structure superpositions of the C-I-TASSER models (red) with the experimental structures (cyan) for 17 solved SARS-CoV-2 proteins/domains, for which C-I-TASSER created models with the correct fold (TM > 0.5).
