Highly accurate protein structure prediction with AlphaFold

doi:10.1038/s41586-021-03819-2

Nature. 2021 Aug;596(7873):583-589.

doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15.

John Jumper^#¹, Richard Evans^#², Alexander Pritzel^#², Tim Green^#², Michael Figurnov^#², Olaf Ronneberger^#², Kathryn Tunyasuvunakool^#², Russ Bates^#², Augustin Žídek^#², Anna Potapenko^#², Alex Bridgland^#², Clemens Meyer^#², Simon A A Kohl^#², Andrew J Ballard^#², Andrew Cowie^#², Bernardino Romera-Paredes^#², Stanislav Nikolov^#², Rishub Jain^#², Jonas Adler², Trevor Back², Stig Petersen², David Reiman², Ellen Clancy², Michal Zielinski², Martin Steinegger³ ⁴, Michalina Pacholska², Tamas Berghammer², Sebastian Bodenstein², David Silver², Oriol Vinyals², Andrew W Senior², Koray Kavukcuoglu², Pushmeet Kohli², Demis Hassabis^#⁵

Affiliations

¹ DeepMind, London, UK. jumper@deepmind.com.
² DeepMind, London, UK.
³ School of Biological Sciences, Seoul National University, Seoul, South Korea.
⁴ Artificial Intelligence Institute, Seoul National University, Seoul, South Korea.
⁵ DeepMind, London, UK. dhcontact@deepmind.com.

^# Contributed equally.

PMID: 34265844
PMCID: PMC8371605
DOI: 10.1038/s41586-021-03819-2

John Jumper et al. Nature. 2021 Aug.

Abstract

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort¹⁻⁴, the structures of around 100,000 unique proteins have been determined⁵, but this represents a small fraction of the billions of known protein sequences⁶,⁷. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem'⁸) has been an important open research problem for more than 50 years⁹. Despite recent progress¹⁰⁻¹⁴, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)¹⁵, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

Conflict of interest statement

J.J., R.E., A. Pritzel, T.G., M.F., O.R., R.B., A.B., S.A.A.K., D.R. and A.W.S. have filed non-provisional patent applications 16/701,070 and PCT/EP2020/084238, and provisional patent applications 63/107,362, 63/118,917, 63/118,918, 63/118,921 and 63/118,919, each in the name of DeepMind Technologies Limited, each pending, relating to machine learning for predicting protein structures. The other authors declare no competing interests.

Figures

**Fig. 1. AlphaFold produces highly accurate structures.**
a, The performance of AlphaFold on the CASP14 dataset (n = 87 protein domains) relative to the top-15 entries (out of 146 entries), group numbers correspond to the numbers assigned to entrants by CASP. Data are median and the 95% confidence interval of the median, estimated from 10,000 bootstrap samples. b, Our prediction of CASP14 target T1049 (PDB 6Y4F, blue) compared with the true (experimental) structure (green). Four residues in the C terminus of the crystal structure are B-factor outliers and are not depicted. c, CASP14 target T1056 (PDB 6YJ1). An example of a well-predicted zinc-binding site (AlphaFold has accurate side chains even though it does not explicitly predict the zinc ion). d, CASP target T1044 (PDB 6VR4)—a 2,180-residue single chain—was predicted with correct domain packing (the prediction was made after CASP using AlphaFold without intervention). e, Model architecture. Arrows show the information flow among the various components described in this paper. Array shapes are shown in parentheses with s, number of sequences (N_seq in the main text); r, number of residues (N_res in the main text); c, number of channels.
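
The interval in panel a (median with a 95% confidence interval from 10,000 bootstrap samples) can be illustrated with a short sketch. This is not the authors' code. It assumes a plain percentile bootstrap, which the caption does not specify, and the function name `bootstrap_median_ci` and the sample scores are hypothetical.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Return (median, lower, upper) from a percentile bootstrap.

    Assumption: percentile intervals; the caption only says the interval
    was "estimated from 10,000 bootstrap samples".
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = medians[int((alpha / 2) * n_boot)]
    upper = medians[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.median(values), lower, upper


# Hypothetical per-domain errors in angstroms, for illustration only.
scores = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.5]
median, lower, upper = bootstrap_median_ci(scores, n_boot=2000)
print(f"median={median:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With only a handful of values the interval is coarse, because each resampled median can only take values present in the data; the 87 CASP14 domains give a much finer resolution.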

**Fig. 2. Accuracy of AlphaFold on recent PDB structures.**
The analysed structures are newer than any structure in the training set. Further filtering is applied to reduce redundancy (see Methods). a, Histogram of backbone r.m.s.d. for full chains (Cα r.m.s.d. at 95% coverage). Error bars are 95% confidence intervals (Poisson). This dataset excludes proteins with a template (identified by hmmsearch) from the training set with more than 40% sequence identity covering more than 1% of the chain (n = 3,144 protein chains). The overall median is 1.46 Å (95% confidence interval = 1.40–1.56 Å). Note that this measure will be highly sensitive to domain packing and domain accuracy; a high r.m.s.d. is expected for some chains with uncertain packing or packing errors. b, Correlation between backbone accuracy and side-chain accuracy. Filtered to structures with any observed side chains and resolution better than 2.5 Å (n = 5,317 protein chains); side chains were further filtered to B-factor <30 Å². A rotamer is classified as correct if the predicted torsion angle is within 40°. Each point aggregates a range of lDDT-Cα, with a bin size of 2 units above 70 lDDT-Cα and 5 units otherwise. Points correspond to the mean accuracy; error bars are 95% confidence intervals (Student t-test) of the mean on a per-residue basis. c, Confidence score compared to the true accuracy on chains. Least-squares linear fit lDDT-Cα = 0.997 × pLDDT − 1.17 (Pearson’s r = 0.76). n = 10,795 protein chains. The shaded region of the linear fit represents a 95% confidence interval estimated from 10,000 bootstrap samples. In the companion paper, additional quantification of the reliability of pLDDT as a confidence measure is provided. d, Correlation between pTM and full chain TM-score. Least-squares linear fit TM-score = 0.98 × pTM + 0.07 (Pearson’s r = 0.85). n = 10,795 protein chains. The shaded region of the linear fit represents a 95% confidence interval estimated from 10,000 bootstrap samples.
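
Panels c and d each report a least-squares linear fit together with Pearson's r. The sketch below shows how both quantities follow from the same sums of squares. It is an illustration rather than the authors' analysis code, the function name `linear_fit_and_pearson` is hypothetical, and the input points are synthetic values placed exactly on the line quoted in panel c.

```python
import math


def linear_fit_and_pearson(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, plus Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic pLDDT values mapped through the fit reported in panel c.
plddt = [50.0, 60.0, 70.0, 80.0, 90.0]
lddt_ca = [0.997 * p - 1.17 for p in plddt]
slope, intercept, r = linear_fit_and_pearson(plddt, lddt_ca)
print(slope, intercept, r)
```

Real per-chain data scatter around the line, which is why the reported correlation (r = 0.76) is below 1 even though the fitted slope is close to 1.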

**Fig. 3. Architectural details.**
a, Evoformer block. Arrows show the information flow. The shape of the arrays is shown in parentheses. b, The pair representation interpreted as directed edges in a graph. c, Triangle multiplicative update and triangle self-attention. The circles represent residues. Entries in the pair representation are illustrated as directed edges and in each diagram, the edge being updated is ij. d, Structure module including Invariant point attention (IPA) module. The single representation is a copy of the first row of the MSA representation. e, Residue gas: a representation of each residue as one free-floating rigid body for the backbone (blue triangles) and χ angles for the side chains (green circles). The corresponding atomic structure is shown below. f, Frame aligned point error (FAPE). Green, predicted structure; grey, true structure; (R_k, t_k), frames; x_i, atom positions.
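
The idea behind the frame aligned point error in panel f can be sketched as follows: each atom position x_i is expressed in the local coordinates of each frame (R_k, t_k), once for the prediction and once for the true structure, and the distances between the two local positions are averaged. This is a simplified sketch, not the authors' implementation. The 10 Å clamp and the 10 Å scale are assumptions not stated in the caption, and the helper names `to_local` and `fape` are hypothetical.

```python
import math


def to_local(R, t, x):
    """Express point x in the frame (R, t), i.e. compute R^T (x - t)."""
    d = [x[k] - t[k] for k in range(3)]
    return [sum(R[k][c] * d[k] for k in range(3)) for c in range(3)]


def fape(pred_frames, pred_atoms, true_frames, true_atoms,
         clamp=10.0, scale=10.0):
    """Mean clamped distance between local atom positions, divided by scale.

    Frames are (R, t) pairs: R is a 3x3 rotation matrix given as nested
    lists, t is a translation. clamp and scale (in angstroms) are assumed.
    """
    total = 0.0
    count = 0
    for (Rp, tp), (Rt, tt) in zip(pred_frames, true_frames):
        for xp, xt in zip(pred_atoms, true_atoms):
            local_pred = to_local(Rp, tp, xp)
            local_true = to_local(Rt, tt, xt)
            total += min(clamp, math.dist(local_pred, local_true))
            count += 1
    return total / (scale * count)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
true_frames = [(identity, [0.0, 0.0, 0.0])]
true_atoms = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

# The same structure rotated 90 degrees about z and shifted by (5, 5, 5).
rot_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
moved_frames = [(rot_z, [5.0, 5.0, 5.0])]
moved_atoms = [[5.0, 6.0, 5.0], [3.0, 5.0, 5.0]]

print(fape(moved_frames, moved_atoms, true_frames, true_atoms))
```

Because every position is measured relative to a frame of its own structure, a rigid motion of the whole prediction leaves the error unchanged, as the rotated and shifted copy above illustrates.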

**Fig. 4. Interpreting the neural network.**
a, Ablation results on two target sets: the CASP14 set of domains (n = 87 protein domains) and the PDB test set of chains with template coverage of ≤30% at 30% identity (n = 2,261 protein chains). Domains are scored with GDT and chains are scored with lDDT-Cα. The ablations are reported as a difference compared with the average of the three baseline seeds. Means (points) and 95% bootstrap percentile intervals (error bars) are computed using bootstrap estimates of 10,000 samples. b, Domain GDT trajectory over 4 recycling iterations and 48 Evoformer blocks on CASP14 targets LmrP (T1024) and Orf8 (T1064) where D1 and D2 refer to the individual domains as defined by the CASP assessment. Both T1024 domains obtain the correct structure early in the network, whereas the structure of T1064 changes multiple times and requires nearly the full depth of the network to reach the final structure. Note, 48 Evoformer blocks comprise one recycling iteration.

**Fig. 5. Effect of MSA depth and cross-chain contacts.**
a, Backbone accuracy (lDDT-Cα) for the redundancy-reduced set of the PDB after our training data cut-off, restricting to proteins in which at most 25% of the long-range contacts are between different heteromer chains. We further consider two groups of proteins based on template coverage at 30% sequence identity: covering more than 60% of the chain (n = 6,743 protein chains) and covering less than 30% of the chain (n = 1,596 protein chains). MSA depth is computed by counting the number of non-gap residues for each position in the MSA (using the N_eff weighting scheme; see Methods for details) and taking the median across residues. The curves are obtained through Gaussian kernel average smoothing (window size is 0.2 units in log₁₀(N_eff)); the shaded area is the 95% confidence interval estimated using bootstrap of 10,000 samples. b, An intertwined homotrimer (PDB 6SK0) is correctly predicted without input stoichiometry and only a weak template (blue is predicted and green is experimental).
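
The curves in panel a are described as Gaussian kernel average smoothing over log₁₀(N_eff). A minimal sketch of that kind of smoother is given below. It assumes the quoted window size of 0.2 acts as the kernel's standard deviation, which the caption does not confirm, and the function name `gaussian_kernel_smooth` and the data points are hypothetical.

```python
import math


def gaussian_kernel_smooth(xs, ys, query, bandwidth=0.2):
    """Gaussian-weighted average of ys, evaluated at position `query`.

    Assumption: `bandwidth` is the kernel's standard deviation, in the
    same units as xs (here, log10 of the MSA depth).
    """
    weights = [math.exp(-0.5 * ((x - query) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from every data point")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical (log10 depth, lDDT-Ca) pairs, for illustration only.
log_depth = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
accuracy = [45.0, 60.0, 78.0, 88.0, 90.0, 91.0]
curve = [gaussian_kernel_smooth(log_depth, accuracy, q) for q in (1.0, 2.0, 3.0)]
print(curve)
```

Evaluating the smoother on a dense grid of query points gives a continuous curve; repeating it on bootstrap resamples of the points would give a shaded band like the one described in the caption.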

Comment in

Protein-structure prediction revolutionized.
AlQuraishi M. Nature. 2021 Aug;596(7873):487-488. doi: 10.1038/d41586-021-02265-4. PMID: 34426694. No abstract available.
Solution of the protein structure prediction problem at last: crucial innovations and next frontiers.
Agard DA, Bowman GR, DeGrado W, Dokholyan NV, Zhou HX. Fac Rev. 2022 Dec 14;11:38. doi: 10.12703/r-01-0000020. eCollection 2022. PMID: 36644294. Free PMC article.

Cited by

GLP-1 and its derived peptides mediate pain relief through direct TRPV1 inhibition without affecting thermoregulation.
Go EJ, Hwang SM, Jo H, Rahman MM, Park J, Lee JY, Jo YY, Lee BG, Jung Y, Berta T, Kim YH, Park CK. Exp Mol Med. 2024 Nov 1. doi: 10.1038/s12276-024-01342-8. Online ahead of print. PMID: 39482537
MYH1 deficiency disrupts outer hair cell electromotility, resulting in hearing loss.
Jung J, Joo SY, Min H, Roh JW, Kim KA, Ma JH, Rim JH, Kim JA, Kim SJ, Jang SH, Koh YI, Kim HY, Lee H, Kim BC, Gee HY, Bok J, Choi JY, Seong JK. Exp Mol Med. 2024 Nov 1. doi: 10.1038/s12276-024-01338-4. Online ahead of print. PMID: 39482536
Insights into genomic sequence diversity of the SAG surface antigen superfamily in geographically diverse Eimeria tenella isolates.
Kiang AL, Loo SS, Mat-Isa MN, Ng CL, Blake DP, Wan KL. Sci Rep. 2024 Nov 1;14(1):26251. doi: 10.1038/s41598-024-77580-7. PMID: 39482455
Self-assembly antimicrobial peptide for treatment of multidrug-resistant bacterial infection.
Ma X, Yang N, Mao R, Hao Y, Li Y, Guo Y, Teng D, Huang Y, Wang J. J Nanobiotechnology. 2024 Oct 30;22(1):668. doi: 10.1186/s12951-024-02896-5. PMID: 39478570. Free PMC article.
GASIDN: identification of sub-Golgi proteins with multi-scale feature fusion.
Sui J, Chen J, Chen Y, Iwamori N, Sun J. BMC Genomics. 2024 Oct 30;25(1):1019. doi: 10.1186/s12864-024-10954-3. PMID: 39478465. Free PMC article.

References

1. Thompson MC, Yeates TO, Rodriguez JA. Advances in methods for atomic resolution macromolecular structure determination. F1000Res. 2020;9:667. doi: 10.12688/f1000research.25097.1.
2. Bai X-C, McMullan G, Scheres SHW. How cryo-EM is revolutionizing structural biology. Trends Biochem. Sci. 2015;40:49–57. doi: 10.1016/j.tibs.2014.10.005.
3. Jaskolski M, Dauter Z, Wlodawer A. A brief history of macromolecular crystallography, illustrated by a family tree and its Nobel fruits. FEBS J. 2014;281:3985–4009. doi: 10.1111/febs.12796.
4. Wüthrich K. The way to NMR structures of proteins. Nat. Struct. Biol. 2001;8:923–925. doi: 10.1038/nsb1101-923.
5. wwPDB Consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2018;47:D520–D528. doi: 10.1093/nar/gky949.
