PLoS Comput Biol. 2024 Mar 14;20(3):e1011939.
doi: 10.1371/journal.pcbi.1011939. eCollection 2024 Mar.

Combining machine learning with structure-based protein design to predict and engineer post-translational modifications of proteins

Moritz Ertelt et al. PLoS Comput Biol. 2024.

Abstract

Post-translational modifications (PTMs) of proteins play a vital role in their function and stability. These modifications influence protein folding, signaling, protein-protein interactions, enzyme activity, binding affinity, aggregation, degradation, and much more. To date, over 400 types of PTMs have been described, representing chemical diversity well beyond the genetically encoded amino acids. Such modifications pose a challenge to the successful design of proteins, but also represent a major opportunity to diversify the protein engineering toolbox. To this end, we first trained artificial neural networks (ANNs) to predict eighteen of the most abundant PTMs, including protein glycosylation, phosphorylation, methylation, and deamidation. In a second step, these models were implemented inside the computational protein modeling suite Rosetta, which allows flexible combination with existing protocols to model the modified sites and understand their impact on protein stability as well as function. Lastly, we developed a new design protocol that either maximizes or minimizes the predicted probability of a particular site being modified. We find that this combination of ANN prediction and structure-based design can enable the modification of existing, as well as the introduction of novel, PTMs. The potential applications of our work include, but are not limited to, glycan masking of epitopes, strengthening protein-protein interactions through phosphorylation, as well as protecting proteins from deamidation liabilities. These applications are especially important for the design of new protein therapeutics where PTMs can drastically change the therapeutic properties of a protein. Our work adds novel tools to Rosetta's protein engineering toolbox that allow for the rational design of PTMs.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Collection of data and feature calculation.
A) For all modifications except N-linked glycosylation and deamidation, data were collected from the dbPTM database, and sequence windows of ten residues before and after the modified site were filtered with CD-HIT to 90% sequence identity. Predicted structural models were downloaded from the AlphaFold2 database and filtered to keep those with overall and local pLDDT above 50. PyRosetta was used to calculate dihedral angles, secondary structure, and solvent-accessible surface area (SASA). B) For N-linked glycosylation, structures of eukaryotic proteins produced in a eukaryotic expression system with at least one glycan were collected from the Protein Data Bank (PDB), and sequence windows of ten residues before and after the modified site were filtered with CD-HIT to 90% sequence identity. To avoid false negatives, glycosylation sites were compared to UniProt annotations of experimentally verified glycosylation sites and further manually screened for spurious electron density (potentially representing glycan occupancy) or endoglycosidase treatment; any such cases were removed from the dataset. PyRosetta was used to calculate the same set of features as for the other modifications.
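The windowing and confidence filtering described in this caption can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function names, the padding symbol, and the choice to define "local" pLDDT as the mean over the same ten-residue flanks are all assumptions made for the example.

```python
PAD = "-"  # assumed padding symbol for window positions beyond the termini


def sequence_window(sequence, site_index, flank=10, pad=PAD):
    """Return the residues from site_index - flank to site_index + flank
    (0-based, inclusive), padding positions that fall outside the sequence."""
    if not 0 <= site_index < len(sequence):
        raise IndexError("site_index outside sequence")
    window = []
    for i in range(site_index - flank, site_index + flank + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)


def passes_plddt(plddt, site_index, flank=10, cutoff=50.0):
    """Keep a site only if the mean pLDDT of the whole model and of the
    local window both exceed the cutoff (assumed definition of 'local')."""
    lo = max(0, site_index - flank)
    hi = min(len(plddt), site_index + flank + 1)
    local = plddt[lo:hi]
    overall_mean = sum(plddt) / len(plddt)
    local_mean = sum(local) / len(local)
    return overall_mean > cutoff and local_mean > cutoff
```

With the default flank of ten, every window has 21 positions, so sites near either terminus still produce fixed-length input.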
Fig 2. Neural network architecture for predicting post-translational modifications (PTMs).
Starting from a Rosetta pose object representing a protein structure and its attributes, sequence and structural features are calculated by methods already implemented in Rosetta and then input into an artificial neural network (ANN) built using the Keras functional API. A) Single PTM classification using an embedded sequence window and structural features as input to two tracks of fully connected layers. Here, one model is trained for each type of PTM. B) Multi PTM classification using the same features but with an additional transformer layer in the sequence track and an additional fully connected layer in the structure track of the network. This model combines PTM types with unique amino acids in training and therefore predicts probabilities for multiple PTMs.
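The two-track idea in panel A can be illustrated with a dependency-free toy forward pass. This is not the published Keras model: the layer sizes, the number of structural features, the padding symbol, and the random untrained weights are assumptions chosen only to show how an embedded sequence window and a structural feature vector are processed separately and then merged into one probability.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus an assumed padding symbol


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer with one weight row per output unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_matrix(rng, n_out, n_in):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def build_toy_model(window_len=21, embed_dim=4, n_struct=5, hidden=8, seed=0):
    """Create random, untrained weights; all sizes are illustrative."""
    rng = random.Random(seed)
    return {
        "embedding": random_matrix(rng, len(AMINO_ACIDS), embed_dim),
        "seq_w": random_matrix(rng, hidden, window_len * embed_dim),
        "struct_w": random_matrix(rng, hidden, n_struct),
        "out_w": random_matrix(rng, 1, 2 * hidden),
        "hidden": hidden,
    }


def predict(model, window, struct_features):
    """Return a value in (0, 1) for one site; window length must match the model."""
    # Sequence track: embed each residue, flatten, then one dense layer.
    embedded = [v for aa in window
                for v in model["embedding"][AMINO_ACIDS.index(aa)]]
    seq_track = dense(embedded, model["seq_w"], [0.0] * model["hidden"], relu)
    # Structure track: one dense layer over numeric features (e.g. angles, SASA).
    struct_track = dense(struct_features, model["struct_w"],
                         [0.0] * model["hidden"], relu)
    # Merge both tracks and map them to a single probability.
    merged = seq_track + struct_track
    return dense(merged, model["out_w"], [0.0], sigmoid)[0]
```

In a trained model the weights would be learned per PTM type; here they only demonstrate the data flow.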
Fig 3. Using structure-based design to predict deamidation rates of Protein A mutations.
A) Overview of the Protein A structure (PDB ID: 1DEE) with susceptible deamidation sites colored in red and non-susceptible asparagines colored in blue. B) Predicted deamidation probabilities for all asparagine residues in Protein A, colored by known susceptibility. The prediction threshold of 0.5 is shown as a gray dotted line. C-D) Predicted deamidation probabilities for mutations of the residue following (n+1) asparagine residues N23 and N28, plotted against the predicted stability in Rosetta energy units (where more negative means more stable). The prediction threshold of 0.5 is shown as a gray dotted horizontal line; the vertical line marks the total score of the native amino acid, which is indicated by a red circle.
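One way to read panels C-D is as a two-criterion filter: keep n+1 mutations whose predicted deamidation probability falls below the 0.5 threshold and whose Rosetta total score is no worse than that of the native residue. The sketch below assumes a simple dictionary layout for each candidate; the function name and field names are hypothetical.

```python
THRESHOLD = 0.5  # prediction threshold shown in the figure


def protective_mutations(candidates, native_score, threshold=THRESHOLD):
    """Return candidates predicted as non-deamidating and at least as stable
    as the native residue, sorted with the most stable (lowest score) first."""
    keep = [c for c in candidates
            if c["probability"] < threshold and c["total_score"] <= native_score]
    return sorted(keep, key=lambda c: c["total_score"])


# Hypothetical placeholder values, not results from the paper.
example = [
    {"mutation": "X1", "probability": 0.8, "total_score": -100.0},
    {"mutation": "X2", "probability": 0.2, "total_score": -101.0},
    {"mutation": "X3", "probability": 0.1, "total_score": -90.0},
]
selected = protective_mutations(example, native_score=-100.0)
```

Candidates in the lower-left quadrant of such a plot (below the horizontal threshold and left of the native score line) are the ones this filter retains.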
Fig 4. Using structure-based modeling to predict experimentally verified glycosylation sites in influenza hemagglutinin.
A) Hemagglutinin structure of the H3N2 Hong Kong 1968 (HK 68) influenza strain (PDB ID: 4FNK) with N-linked glycosylation sites visualized through Rosetta glycan modeling (blue). B) N-linked glycosylation sites (orange) of later observed influenza strains threaded onto the original HK 68 structure using structure-based modeling. C) Predicted glycosylation probabilities of known N-linked glycosylation sites from the early HK 68 strain (blue) or later observed strains (orange), which were modeled onto the HK 68 structure. The prediction threshold of 0.5 is shown as a gray dotted line.
Fig 5. Optimizing the predicted phosphorylation probability of a de novo protein using structure-based design.
A) Structure of the de novo serine-kinase driven protein switch from Woodall et al. [44]; originally introduced phosphorylation sites are colored red. Mutations predicted to improve the phosphorylation probability of site S93 are colored in yellow. B) Monte Carlo optimization protocol using the GenericMonteCarloMover, starting from the original protein structure, randomly mutating a neighborhood residue of the phosphorylation site, and then accepting or rejecting the mutation based on the Rosetta total score (using a Metropolis criterion to avoid local minima) and predicted phosphorylation probability. This inner loop is repeated 50 times and the pose with the highest phosphorylation probability is output. C) Predicted phosphorylation probabilities of sites introduced by Woodall et al. [44] (red) and other Ser/Thr residues found in the de novo protein. The prediction threshold of 0.5 is shown as a gray dotted line. D) Results of the Monte Carlo optimization protocol for phosphorylation site S93, showing the predicted phosphorylation probability versus the Rosetta total score for 1000 trajectories. The original design is marked as a red square and the best design (highest predicted phosphorylation probability) is marked as a yellow star. The Rosetta score and predicted phosphorylation probability of the original design are highlighted as blue and yellow dotted lines, respectively.
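The loop in panel B can be sketched without Rosetta as a generic Metropolis Monte Carlo search over a sequence string. This is not the GenericMonteCarloMover and makes several assumptions: the two criteria are combined into one objective (score minus a weighted probability), the temperature and weight are arbitrary, and `score_fn` and `prob_fn` are hypothetical callables standing in for the Rosetta total score and the ANN prediction.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def metropolis_accept(delta, temperature, rng):
    """Always accept improvements; accept worsening moves with
    probability exp(-delta / temperature)."""
    return delta <= 0 or rng.random() < math.exp(-delta / temperature)


def optimize_site(sequence, neighborhood, score_fn, prob_fn,
                  n_steps=50, temperature=1.0, prob_weight=10.0, seed=0):
    """Mutate random neighborhood positions for n_steps and return the
    accepted sequence with the highest predicted probability."""
    rng = random.Random(seed)

    def objective(seq):
        # Assumed combination: lower is better, so probability is subtracted.
        return score_fn(seq) - prob_weight * prob_fn(seq)

    current = sequence
    current_obj = objective(current)
    best, best_prob = current, prob_fn(current)
    for _ in range(n_steps):
        pos = rng.choice(neighborhood)
        trial = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        trial_obj = objective(trial)
        if metropolis_accept(trial_obj - current_obj, temperature, rng):
            current, current_obj = trial, trial_obj
            p = prob_fn(current)
            if p > best_prob:
                best, best_prob = current, p
    return best, best_prob
```

Running many such trajectories with different seeds and plotting probability against score for each result would give a scatter of the kind described for panel D.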

References

    1. Walsh G. Post-translational modifications of protein biopharmaceuticals. Drug Discovery Today 2010;15:773–780. doi: 10.1016/j.drudis.2010.06.009
    2. Schwarz F, Aebi M. Mechanisms and principles of N-linked protein glycosylation. Current Opinion in Structural Biology 2011;21:576–582.
    3. Hart GW, Haltiwanger RS, Holt GD, Kelly WG. Nucleoplasmic and cytoplasmic glycoproteins. In Ciba Foundation Symposium 145-Carbohydrate Recognition in Cellular Function: Carbohydrate Recognition in Cellular Function: Ciba Foundation Symposium 145; 2007, 102–18.
    4. Shental-Bechor D, Levy Y. Effect of glycosylation on protein folding: a close look at thermodynamic stabilization. Proceedings of the National Academy of Sciences of the United States of America 2008;105:8256–8261. doi: 10.1073/pnas.0801340105
    5. Shakin-Eshleman SH, Spitalnik SL, Kasturi L. The Amino Acid at the X Position of an Asn-X-Ser Sequon Is an Important Determinant of N-Linked Core-glycosylation Efficiency. The Journal of Biological Chemistry 1996;271:6363–6366. doi: 10.1074/jbc.271.11.6363

Grants and funding

This work is supported through a Rosetta mini-grant under award number RC22021 from RosettaCommons (www.rosettacommons.org) held by CTS. ME, JM and CTS acknowledge the financial support by the Federal Ministry of Education and Research of Germany and by the Sächsische Staatsministerium für Wissenschaft Kultur und Tourismus in the program Center of Excellence for AI-research "Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig", project identification number: ScaDS.AI (https://scads.ai/). ME's position is funded through an award by ScaDS.AI. VKM is supported by the Simons Foundation (https://www.simonsfoundation.org/). TS is supported by a Sofja Kovalevskaja prize from the Alexander-von-Humboldt foundation (https://www.humboldt-foundation.de/), while JM is supported by an Alexander-von-Humboldt professorship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.