Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations

Cell Rep Methods. 2021 Jul 26;1(3):100014.

doi: 10.1016/j.crmeth.2021.100014. Epub 2021 Jun 21.

Wei Zheng¹,², Chengxin Zhang¹,², Yang Li¹, Robin Pearce¹, Eric W Bell¹, Yang Zhang¹,³,⁴

Affiliations

¹ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
² These authors contributed equally.
³ Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA.
⁴ Lead contact.

PMID: 34355210
PMCID: PMC8336924
DOI: 10.1016/j.crmeth.2021.100014

Wei Zheng et al. Cell Rep Methods. 2021.


Abstract

Structure prediction for proteins lacking homologous templates in the Protein Data Bank (PDB) remains a significant unsolved problem. We developed a protocol, C-I-TASSER, to integrate interresidue contact maps from deep neural-network learning with the cutting-edge I-TASSER fragment assembly simulations. Large-scale benchmark tests showed that C-I-TASSER can fold more than twice as many non-homologous proteins as I-TASSER, which does not use contacts. When applied to a folding experiment on 8,266 unsolved Pfam families, C-I-TASSER successfully folded 4,162 domain families, including 504 folds that are not found in the PDB. Furthermore, it created correct folds for 85% of proteins in the SARS-CoV-2 genome, despite the high mutation rate of the virus and its sparse sequence profiles. The results demonstrated the critical importance of coupling whole-genome and metagenome-based evolutionary information with optimal structure assembly simulations for solving the problem of non-homologous protein structure prediction.

Conflict of interest statement

Declaration of interests: The authors declare no competing interests.

Figures

**Figure 1**
The C-I-TASSER pipeline for protein structure prediction. The pipeline starts with contact-map prediction from whole-genome and metagenome sequences based on deep residual convolutional neural networks (top) and LOMETS-based threading template identification (bottom). Full-length structure models are then constructed by iterative REMC fragment assembly simulations under the guidance of the deep-learning contact maps and template-based restraints. Abbreviations: MSA, multiple sequence alignment; REMC, replica-exchange Monte Carlo.
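The replica-exchange step named in this caption can be illustrated with a minimal sketch. It uses the standard Metropolis criterion for swapping neighbouring replicas and is not taken from the C-I-TASSER code; the function names, the temperature ladder, and the use of bare energies as stand-ins for conformations are all assumptions made for illustration.

```python
import math
import random


def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Metropolis probability of exchanging two replicas (k_B = 1).

    Standard replica-exchange criterion:
    min(1, exp((1/T_i - 1/T_j) * (E_i - E_j))).
    """
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    if delta >= 0:
        return 1.0
    return math.exp(delta)


def attempt_swaps(energies, temps, rng):
    """Try one swap between each pair of neighbouring replicas, in order.

    Each energy stands in for the conformation held by that replica
    (hypothetical simplification); temps[k] is the temperature of slot k.
    """
    energies = list(energies)
    for k in range(len(energies) - 1):
        p = swap_probability(energies[k], energies[k + 1], temps[k], temps[k + 1])
        if rng.random() < p:
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return energies


if __name__ == "__main__":
    rng = random.Random(0)
    # The colder slot (T = 1.0) holds the higher energy, so the swap is always accepted.
    print(attempt_swaps([-10.0, -20.0], [1.0, 2.0], rng))
```

Under this criterion a lower-energy conformation sitting at a higher temperature always moves to the colder replica, while the reverse move is accepted only with an exponentially small probability.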

**Figure 2**
C-I-TASSER modeling results on the 342 hard targets in the benchmark dataset. (A) Comparison between TM scores of the first models built by C-I-TASSER and I-TASSER. (B) TM score of LOMETS templates versus accuracy of the contact map utilized by C-I-TASSER. The red circles denote the targets that can be folded by both C-I-TASSER and I-TASSER with a TM score ≥ 0.5; the black points are the targets that can be folded only by C-I-TASSER and not I-TASSER; the yellow crosses are the targets that can be folded only by I-TASSER and not C-I-TASSER; the blue crosses indicate the targets that cannot be folded by either C-I-TASSER or I-TASSER. (C) An illustrative example from 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase IspF (SCOPe: d3fpia_). The upper left shows the structure superpositions of the best LOMETS template (yellow), I-TASSER first model (pink), and C-I-TASSER first model (cyan) with the target structure (gray), and the lower right displays an overlay of predicted contacts (red) with the contacts of the target structure (gray), as well as the contacts from the C-I-TASSER model (cyan).
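The TM score ≥ 0.5 criterion used in this caption can be made concrete with a short sketch of the standard TM-score formula, evaluated for a single fixed superposition. A full TM-score calculation also searches for the superposition that maximizes the sum; that search is omitted here. The function names and the 0.5 Å floor on the distance scale are assumptions made for illustration.

```python
def d0_scale(target_length):
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.

    The 0.5 angstrom floor for very short chains is an assumption of this sketch.
    """
    if target_length <= 15:
        return 0.5
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, target_length):
    """TM score for one fixed superposition.

    distances: distance in angstroms between each aligned residue pair.
    target_length: number of residues in the target structure.
    """
    d0 = d0_scale(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def is_foldable(score, threshold=0.5):
    """Apply the TM score >= 0.5 cutoff used in the caption."""
    return score >= threshold


if __name__ == "__main__":
    perfect = tm_score([0.0] * 100, 100)
    print(perfect, is_foldable(perfect))
```

Because every term is divided by the target length, unaligned residues contribute nothing and lower the score, which is why the measure is less sensitive to local errors than a plain RMSD.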

**Figure 3**
Case study of C-I-TASSER folding on the platypus lactating protein (PDB: 4v00). (A) The upper left shows the structure superpositions of the template (yellow) and the C-I-TASSER model (cyan) with the target structure (gray), and the lower right shows the overlay of the contact maps from contact predictors (red), the native structure (gray), and the C-I-TASSER model (cyan). (B) Comparison of contact satisfaction rates of the REMC trajectories of C-I-TASSER on two decoys. (C) Comparison of the energy during the REMC cycles for two decoys. (D) Comparison of the model TM scores during the REMC cycles. The structures are the decoy models for different simulation states.
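A contact satisfaction rate of the kind plotted in panel (B) can be sketched as the fraction of predicted contacts that a decoy actually realizes. The 8 Å distance cutoff, the use of one representative atom per residue, and the function name are assumptions; the record does not state the exact definition used by the authors.

```python
import math


def contact_satisfaction(coords, predicted_contacts, cutoff=8.0):
    """Fraction of predicted residue pairs closer than `cutoff` in a decoy.

    coords: one (x, y, z) tuple per residue, in angstroms
            (one representative atom per residue is assumed).
    predicted_contacts: list of (i, j) residue index pairs.
    cutoff: contact distance threshold; 8 angstroms is an assumed convention.
    """
    if not predicted_contacts:
        return 0.0
    satisfied = sum(
        1 for i, j in predicted_contacts
        if math.dist(coords[i], coords[j]) < cutoff
    )
    return satisfied / len(predicted_contacts)


if __name__ == "__main__":
    # Toy decoy: four residues placed along the x axis.
    decoy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
    print(contact_satisfaction(decoy, [(0, 2), (0, 3)]))
```

Tracking this value over simulation cycles for two decoys gives the kind of comparison the panel describes.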

**Figure 4**
Structural modeling results for unsolved Pfam families. (A) The distribution of Pfam families and benchmark targets in different C-score bins. The black circles represent the number of Pfam targets in a specific C-score bin, and the histograms are from benchmark proteins; the gray bars indicate the number of foldable targets with TM ≥ 0.5, and the white bars indicate the number of non-foldable targets. (B) Number of Pfam families at each stage of the analysis, where each set is a subset of the previous set. (C) Venn diagram for the number of foldable models for the Pfam families constructed by C-I-TASSER, Rosetta, DMPfold, and PconsFam. (D) Venn diagram for the number of novel folds for the Pfam families produced by C-I-TASSER, Rosetta, and DMPfold. (E) Comparison of the TM scores for the first models produced by C-I-TASSER versus those by DMPfold (red crosses) and PconsFam (blue circles) for 96 Pfam families that have at least one member newly solved after modeling. (F) Case study of 20 Pfam families regarded as hard by LOMETS. In each case, the model is shown in rainbow color and the solved experimental structure of a member from the same Pfam family, if available, is shown in gray.

**Figure 5**
Comparison of the C-I-TASSER results for the Pfam families and benchmark dataset for different C scores, Z scores, and Neff values. (A) Normalized Z score of the first LOMETS template versus the Neff of DeepMSA for the Pfam families (points) and benchmark dataset (background). The black crosses represent the Pfam targets with C ≥ −2.5, and the gray dots are Pfam targets with C < −2.5. The heatmap in the background depicts the TM scores for benchmark targets, where white regions indicate no data. (B) Box-and-whisker chart of the logarithm of the Neff values of MSAs for easy and hard targets in the Pfam families and benchmark dataset. The left corresponds to the results of the benchmark dataset, and the right contains the results for the Pfam families. The yellow boxes indicate the hard targets, and the blue boxes are the easy targets.
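Neff, the number of effective sequences in an MSA, is not defined in this record. The sketch below shows one commonly used form, in which each sequence is down-weighted by the number of sequences at or above 80% identity to it and the total is divided by the square root of the alignment length. The identity cutoff, the normalization, the gap handling, and the function names are all assumptions and may differ from what DeepMSA computes.

```python
import math


def sequence_identity(a, b):
    """Fraction of aligned columns with identical, non-gap residues."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def neff(msa, identity_cutoff=0.8):
    """Effective number of sequences in an aligned MSA (assumed definition).

    Each sequence contributes 1 / (1 + number of other sequences with
    identity >= identity_cutoff); the sum is divided by sqrt(alignment length).
    """
    length = len(msa[0])
    total = 0.0
    for n, seq in enumerate(msa):
        similar = sum(
            1 for m, other in enumerate(msa)
            if m != n and sequence_identity(seq, other) >= identity_cutoff
        )
        total += 1.0 / (1 + similar)
    return total / math.sqrt(length)


if __name__ == "__main__":
    # Two identical sequences share one unit of weight; the third counts fully.
    print(neff(["ACDE", "ACDE", "WYFH"]))
```

With this weighting, adding near-duplicate sequences to an alignment barely changes Neff, so the value reflects the diversity of the alignment and not its raw depth.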

**Figure 6**
Application of C-I-TASSER to COVID-19 structure modeling. (A) C-I-TASSER models for all 24 proteins in the SARS-CoV-2 genome, including 4 structural proteins and 20 non-structural proteins. (B) The structure superpositions of the C-I-TASSER models (red) with the experimental structures (cyan) for 17 solved SARS-CoV-2 proteins/domains, for which C-I-TASSER created models with the correct fold (TM > 0.5).

Cited by

Integration: Gospel for immune bioinformatician on epitope-based therapy.
Sun B, Zhang J, Li Z, Xie M, Luo C, Wang Y, Chen L, Wang Y, Jiang D, Yang K. Front Immunol. 2023 Jan 31;14:1075419. doi: 10.3389/fimmu.2023.1075419. PMID: 36798136.
Molecular and Genetic Characterization of Hepatitis B Virus (HBV) among Saudi Chronically HBV-Infected Individuals.
Di Stefano M, Faleo G, Leitner T, Zheng W, Zhang Y, Hassan A, Alwazzeh MJ, Fiore JR, Ismail M, Santantonio TA. Viruses. 2023 Feb 6;15(2):458. doi: 10.3390/v15020458. PMID: 36851671.
Genome wide association analysis for grain micronutrients and anti-nutritional traits in mungbean [Vigna radiata (L.) R. Wilczek] using SNP markers.
Sinha MK, Aski MS, Mishra GP, Kumar MBA, Yadav PS, Tokas JP, Gupta S, Pratap A, Kumar S, Nair RM, Schafleitner R, Dikshit HK. Front Nutr. 2023 Feb 7;10:1099004. doi: 10.3389/fnut.2023.1099004. PMID: 36824166.
In Silico Structural Analysis Predicting the Pathogenicity of PLP1 Mutations in Multiple Sclerosis.
Avramouli A, Krokidis MG, Exarchos TP, Vlamos P. Brain Sci. 2022 Dec 24;13(1):42. doi: 10.3390/brainsci13010042. PMID: 36672024.
Examination of phase-variable haemoglobin-haptoglobin binding proteins in non-typeable Haemophilus influenzae reveals a diverse distribution of multiple variants.
Phillips ZN, Jennison AV, Whitby PW, Stull TL, Staples M, Atack JM. FEMS Microbiol Lett. 2022 Aug 1;369(1):fnac064. doi: 10.1093/femsle/fnac064. PMID: 35867873.

Grants and funding

T32 CA140044/CA/NCI NIH HHS/United States
