Abstract
Macromolecular structures are used in a wide variety of applications, from teaching protein structure principles to ligand optimization in drug development. Applying data mining techniques to these experimentally determined structures requires a highly uniform, standardized structural data source. The Protein Data Bank (PDB) has evolved over the years toward becoming the standard resource for macromolecular structures. However, the process of selecting the data most suitable for a specific application is still largely based on personal preference and on understanding of the experimental techniques used to obtain these models. In this chapter, we first explain the challenges with data standardization, annotation, and uniformity in PDB entries determined by X-ray crystallography. We then discuss the specific effects that crystallographic data quality and model optimization methods have on structural models and how validation tools can be used to make informed choices. We also discuss the specific advantages of using the PDB_REDO databank as a resource for structural data. Finally, we provide guidelines on how to select the most suitable protein structure models for detailed analysis and how to select a set of structure models suitable for data mining.
Acknowledgments
This work was supported by VIDI grant 723.013.003 from the Netherlands Organisation for Scientific Research (NWO).
Copyright information
© 2016 Springer Science+Business Media New York
Cite this protocol
van Beusekom, B., Perrakis, A., Joosten, R.P. (2016). Data Mining of Macromolecular Structures. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_6
DOI: https://doi.org/10.1007/978-1-4939-3572-7_6
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols