Abstract
The molecular structures and functions of most proteins across different species have yet to be identified. The much-needed functional annotation of these gene products often benefits from knowledge of protein–ligand interactions. Towards this goal, we developed eFindSite, an improved version of FINDSITE, designed to identify ligand binding sites and residues more efficiently using only weakly homologous templates. It employs a collection of effective algorithms, including highly sensitive meta-threading approaches, improved clustering techniques, advanced machine learning methods and reliable confidence estimation systems. Depending on the quality of target protein structures, eFindSite outperforms geometric pocket detection algorithms by 15–40% in binding site detection and by 5–35% in binding residue prediction. Moreover, compared to FINDSITE, it identifies 14% more binding residues in the most difficult cases. When multiple putative binding pockets are identified, the ranking accuracy is 75–78%, which can be improved by a further 3–4% by including auxiliary information on binding ligands extracted from biomedical literature. As a first across-genome application, we describe structure modeling and binding site prediction for the entire proteome of Escherichia coli. Carefully calibrated confidence estimates strongly indicate that highly reliable ligand binding predictions are made for the majority of gene products; thus, eFindSite holds significant promise for large-scale genome annotation and drug development projects. eFindSite is freely available to the academic community at http://www.brylinski.org/efindsite.













References
Hoehndorf R, Kelso J, Herre H (2009) The ontology of biological sequences. BMC Bioinformatics 10:377
Stevens R, Goble CA, Bechhofer S (2000) Ontology-based knowledge representation for bioinformatics. Brief Bioinformatics 1(4):398–414
Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
Harris MA et al (2004) The gene ontology (GO) database and informatics resource. Nucleic Acids Res 32(Database issue): D258–61
Lybrand TP (2002) Protein-ligand interactions. In: Naray-Szabo G, Warshel A (eds) Computational approaches to biochemical reactivity. Springer, Boston, pp 363–374
Metzker ML (2010) Sequencing technologies—the next generation. Nat Rev Genet 11(1):31–46
Zhang J et al (2011) The impact of next-generation sequencing on genomics. J Genet Genomics 38(3):95–109
Juncker AS et al (2009) Sequence-based feature prediction and annotation of proteins. Genome Biol 10(2):206
Loewenstein Y et al (2009) Protein function annotation by homology-based inference. Genome Biol 10(2):207
Ahmad S, Sarai A (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 6:33
Hwang S, Gou Z, Kuznetsov IB (2007) DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 23(5):634–636
Chen P, Li J (2010) Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information. BMC Bioinformatics 11:402
Chen XW, Jeong JC (2009) Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5):585–591
Soding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951–960
Lopez G et al (2011) Firestar—advances in the prediction of functionally important residues. Nucleic Acids Res 39(Web Server issue): W235–41
Lord PW et al (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19(10):1275–1283
Schnoes AM et al (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 5(12):e1000605
Zhang QC et al (2011) PredUs: a web server for predicting protein interfaces using structural neighbors. Nucleic Acids Res 39(Web Server issue): W283–7
Brylinski M et al (2007) Prediction of functional sites based on the fuzzy oil drop model. PLoS Comput Biol 3(5):e94
Brylinski M et al (2007) Localization of ligand binding site in proteins identified in silico. J Mol Model 13(6–7):665–675
Dudev M, Lim C (2007) Discovering structural motifs using a structural alphabet: application to magnesium-binding sites. BMC Bioinformatics 8:106
Laskowski RA (1995) SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 13(5):323–30, 307–8
Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7(9):1884–1897
Levitt DG, Banaszak LJ (1992) POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph 10(4):229–234
Huang B, Schroeder M (2006) LIGSITEcsc: predicting ligand binding sites using the connolly surface and degree of conservation. BMC Struct Biol 6:19
Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10:168
Zhu H, Pisabarro MT (2011) MSPocket: an orientation-independent algorithm for the detection of ligand binding pockets. Bioinformatics 27(3):351–358
Huang B (2009) MetaPocket: a meta approach to improve protein ligand binding site prediction. OMICS 13(4):325–330
Skolnick J, Brylinski M (2009) FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinformatics 10(4):378–391
Wass MN, Kelley LA, Sternberg MJ (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 38(Web Server issue): W469–73
Brylinski M, Skolnick J (2008) A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci U S A 105(1):129–134
Roche DB, Tetchner SJ, McGuffin LJ (2011) FunFOLD: an improved automated method for the prediction of ligand binding residues using 3D models of proteins. BMC Bioinformatics 12:160
Brylinski M, Skolnick J (2011) FINDSITE-metal: integrating evolutionary information and machine learning for structure-based metal-binding site prediction at the proteome level. Proteins 79(3):735–751
Dror I et al (2011) Predicting nucleic acid binding interfaces from structural models of proteins. Proteins
Mukherjee S, Zhang Y (2011) Protein-protein complex structure predictions by multimeric threading and template recombination. Structure 19(7):955–966
Tyagi M et al (2012) Homology inference of protein–protein interactions via conserved binding sites. PLoS ONE 7(1):e28896
Pandit SB, Skolnick J (2008) Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinformatics 9:531
Ortiz AR, Strauss CE, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621
Russell RB, Sasieni PD, Sternberg MJ (1998) Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 282(4):903–918
Brylinski M, Skolnick J (2010) Comparison of structure-based and threading-based approaches to protein functional annotation. Proteins 78(1):118–134
Laurie AT, Jackson RM (2006) Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening. Curr Protein Pept Sci 7(5):395–406
Li YY, An J, Jones SJ (2006) A large-scale computational approach to drug repositioning. Genome Inform 17(2):239–247
Li YY, An J, Jones SJ (2011) A computational approach to finding novel targets for existing drugs. PLoS Comput Biol 7(9):e1002139
Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976
Brylinski M, Lingam D (2012) eThread: a highly optimized machine learning-based approach to meta-threading and the modeling of protein tertiary structures. PLoS ONE 7(11):e50200
Brylinski M, Feinstein WP (2012) Setting up a meta-threading pipeline for high-throughput structural bioinformatics: eThread software distribution, walkthrough and resource profiling. J Comput Sci Syst Biol 6(1):001–010
Wallach I, Lilien R (2009) The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 25(5):615–620
Wang G, Dunbrack RL Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591
Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57(4):702–710
Berman HM et al (2000) The protein data bank. Nucleic Acids Res 28(1):235–242
Bindewald E, Skolnick J (2005) A scoring function for docking ligands to low-resolution protein structures. J Comput Chem 26(4):374–383
Biegert A, Soding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA 106(10):3770–3775
Sadreyev R, Grishin N (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326(1):317–336
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14(9):755–763
Bucher P et al (1996) A flexible motif search technique based on generalized profiles. Comput Chem 20(1):3–23
Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25(14):1761–1767
Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12(2):95–107
Zhou H, Zhou Y (2005) SPARKS 2 and SP3 servers in CASP6. Proteins 61(Suppl 7):152–156
Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358(6381):86–89
Tanimoto TT (1958) An elementary mathematical theory of classification and prediction. IBM Internal Report
Guha R et al (2006) The blue obelisk-interoperability in chemical informatics. J Chem Inf Model 46(3):991–998
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd ed. Morgan Kaufmann Publishers, San Francisco
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5(4):725–738
Soga S et al (2007) Use of amino acid composition to predict ligand-binding sites. J Chem Inf Model 47(2):400–406
Marti-Renom MA et al (2007) The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics 8(Suppl 4):S4
Liu T, Altman RB (2009) Prediction of calcium-binding sites by combining loop-modeling with machine learning. BMC Struct Biol 9:72
Kawabata T (2010) Detection of multiscale pockets on protein surfaces using mathematical morphology. Proteins 78(5):1195–1211
Zhang Z et al (2011) Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics 27(15):2083–2088
Blattner FR et al (1997) The complete genome sequence of Escherichia coli K-12. Science 277(5331):1453–1462
Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815
Pandit SB, Zhang Y, Skolnick J (2006) TASSER-Lite: an automated tool for protein comparative modeling. Biophys J 91(11):4180–4190
Brylinski M, Skolnick J (2007) What is the relationship between the global structures of apo and holo proteins? Proteins 70(2):363–377
Chen X, Liu M, Gilson MK (2001) BindingDB: a web-accessible molecular recognition database. Comb Chem High Throughput Screen 4(8):719–725
Wang Y et al (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(Web Server issue): W623–33
Wishart DS et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue): D668–72
Jacquet E, Parmeggiani A (1988) Structure-function relationships in the GTP binding domain of EF-Tu: mutation of Val20, the residue homologous to position 12 in p21. EMBO J 7(9):2861–2867
Weijland A et al (1993) Asparagine-135 of elongation factor Tu is a crucial residue for the folding of the guanine nucleotide binding pocket. FEBS Lett 330(3):334–338
Gumusel F et al (1990) Mutagenesis of the NH2-terminal domain of elongation factor Tu. Biochim Biophys Acta 1050(1–3):215–221
Stebbins JW et al (1992) Arginine 54 in the active site of Escherichia coli aspartate transcarbamoylase is critical for catalysis: a site-specific mutagenesis, NMR, and X-ray crystallographic study. Protein Sci 1(11):1435–1446
Waldrop GL et al (1992) The contribution of threonine 55 to catalysis in aspartate transcarbamoylase. Biochemistry 31(28):6592–6597
Jin L, Stec B, Kantrowitz ER (2000) A cis-proline to alanine mutant of E. coli aspartate transcarbamoylase: kinetic studies and three-dimensional crystal structures. Biochemistry 39(27):8058–8066
Kitano H (2002) Systems biology: a brief overview. Science 295(5560):1662–1664
Xue L et al (2003) Design and evaluation of a molecular fingerprint involving the transformation of property descriptor values into a binary classification scheme. J Chem Inf Comput Sci 43(4):1151–1157
Willett P (1998) Chemical similarity searching. J Chem Inf Model 38:983–996
Acknowledgments
This study was supported by the Louisiana Board of Regents through the Board of Regents Support Fund [contract LEQSF(2012–15)-RD-A-05] and Oak Ridge Associated Universities (ORAU) through the 2012 Ralph E. Powe Junior Faculty Enhancement Award. Portions of this research were conducted with high performance computational resources provided by Louisiana State University (http://www.hpc.lsu.edu).
Appendix
Molecular fingerprints are bit strings that represent the structural and chemical features of organic compounds (see the Daylight manual for details: http://www.daylight.com/dayhtml/doc/theory/index.pdf). The Tanimoto coefficient is the most popular measure of the similarity between two sets of bits (e.g. molecular fingerprints). The classical Tanimoto coefficient (TC) [60] is defined as:
$$\mathrm{TC} = \frac{c}{a + b + c}$$
where a is the count of bits set on in the first string but not in the second, b is the count of bits set on in the second string but not in the first, and c is the count of bits set on in both strings.
In addition to the classical Tanimoto coefficient, the overlap between two molecular fingerprints can be measured by the average Tanimoto coefficient (aveTC) [84]:
$$\mathrm{aveTC} = \frac{\mathrm{TC} + \mathrm{TC}'}{2}$$
where TC′ is the Tanimoto coefficient calculated for bit positions set off rather than set on.
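The two bit-string measures can be illustrated with a minimal Python sketch. It assumes the standard forms TC = c / (a + b + c) and aveTC = (TC + TC′) / 2, which match the definitions of a, b, c and TC′ given here. The function names, the representation of fingerprints as strings of "0" and "1", and the value returned when no bits are shared or set are illustrative assumptions, not part of eFindSite.

```python
def tanimoto(fp1: str, fp2: str, on: str = "1") -> float:
    """Classical Tanimoto coefficient, TC = c / (a + b + c).

    `on` selects which symbol counts as a set bit; passing "0" yields TC',
    the coefficient computed over positions that are set off.
    """
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    pairs = list(zip(fp1, fp2))
    a = sum(x == on and y != on for x, y in pairs)  # set only in the 1st string
    b = sum(x != on and y == on for x, y in pairs)  # set only in the 2nd string
    c = sum(x == on and y == on for x, y in pairs)  # set in both strings
    total = a + b + c
    # Returning 0.0 for an empty union is a convention chosen for this sketch.
    return c / total if total else 0.0


def average_tanimoto(fp1: str, fp2: str) -> float:
    """Average Tanimoto coefficient, aveTC = (TC + TC') / 2."""
    return 0.5 * (tanimoto(fp1, fp2, on="1") + tanimoto(fp1, fp2, on="0"))


if __name__ == "__main__":
    fp_a, fp_b = "11010000", "10011000"
    print(tanimoto(fp_a, fp_b))          # TC over bits set on
    print(tanimoto(fp_a, fp_b, on="0"))  # TC' over bits set off
    print(average_tanimoto(fp_a, fp_b))
```

For sparse fingerprints, TC′ is dominated by the many positions that are off in both strings, so aveTC tends to be higher than TC for the same pair.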
Furthermore, a version of the Tanimoto coefficient for continuous variables (conTC) [85] was developed:
$$\mathrm{conTC} = \frac{\sum_i x_{pi}\, x_{ci}}{\sum_i x_{pi}^2 + \sum_i x_{ci}^2 - \sum_i x_{pi}\, x_{ci}}$$
where $x_{pi}$ is the i-th descriptor of a fingerprint profile and $x_{ci}$ is the i-th descriptor of a query compound. The fingerprint profile is constructed from individual fingerprints for a set of compounds, e.g. template-bound ligands that were used to identify a putative binding site in the target structure.
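A minimal Python sketch of the continuous variant follows. It assumes the standard continuous Tanimoto form, in which the sum of products of the profile and query descriptors is divided by the sum of squared profile descriptors plus the sum of squared query descriptors minus that same sum of products. The text does not specify how the profile is built from individual fingerprints, so the per-position averaging in `fingerprint_profile` is purely an assumption for illustration, as are all function names.

```python
from typing import List, Sequence


def fingerprint_profile(fingerprints: Sequence[str]) -> List[float]:
    """Build a profile as the fraction of compounds with each bit set on.

    Assumption for this sketch: the source does not state how the profile
    is derived from the individual fingerprints.
    """
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    length = len(fingerprints[0])
    if any(len(fp) != length for fp in fingerprints):
        raise ValueError("fingerprints must have the same length")
    n = len(fingerprints)
    return [sum(fp[i] == "1" for fp in fingerprints) / n for i in range(length)]


def continuous_tanimoto(profile: Sequence[float], query: Sequence[float]) -> float:
    """Continuous Tanimoto coefficient between a profile and a query."""
    if len(profile) != len(query):
        raise ValueError("profile and query must have the same length")
    dot = sum(p * q for p, q in zip(profile, query))
    denom = sum(p * p for p in profile) + sum(q * q for q in query) - dot
    # The denominator is zero only when both vectors are all zeros;
    # returning 0.0 in that case is a convention chosen for this sketch.
    return dot / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical template-bound ligand fingerprints.
    profile = fingerprint_profile(["1100", "1010"])
    query = [1.0, 1.0, 0.0, 0.0]
    print(profile)
    print(continuous_tanimoto(profile, query))
```

When both inputs are binary vectors, this expression reduces to the classical TC, since the sum of products counts bits set in both and each sum of squares counts the bits set in one vector.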
Cite this article
Brylinski, M., Feinstein, W.P. eFindSite: Improved prediction of ligand binding sites in protein models using meta-threading, machine learning and auxiliary ligands. J Comput Aided Mol Des 27, 551–567 (2013). https://doi.org/10.1007/s10822-013-9663-5
DOI: https://doi.org/10.1007/s10822-013-9663-5