eFindSite: Improved prediction of ligand binding sites in protein models using meta-threading, machine learning and auxiliary ligands

Brylinski, Michal; Feinstein, Wei P.

doi:10.1007/s10822-013-9663-5

Published: 10 July 2013

Volume 27, pages 551–567, (2013)

Journal of Computer-Aided Molecular Design

Abstract

The molecular structures and functions of the majority of proteins across different species are yet to be identified. The much-needed functional annotation of these gene products often benefits from knowledge of protein–ligand interactions. Towards this goal, we developed eFindSite, an improved version of FINDSITE, designed to more efficiently identify ligand binding sites and residues using only weakly homologous templates. It employs a collection of effective algorithms, including highly sensitive meta-threading approaches, improved clustering techniques, advanced machine learning methods and reliable confidence estimation systems. Depending on the quality of target protein structures, eFindSite outperforms geometric pocket detection algorithms by 15–40% in binding site detection and by 5–35% in binding residue prediction. Moreover, compared to FINDSITE, it identifies 14% more binding residues in the most difficult cases. When multiple putative binding pockets are identified, the ranking accuracy is 75–78%, which can be further improved by 3–4% by including auxiliary information on binding ligands extracted from biomedical literature. As a first across-genome application, we describe structure modeling and binding site prediction for the entire proteome of Escherichia coli. Carefully calibrated confidence estimates strongly indicate that highly reliable ligand binding predictions are made for the majority of gene products; thus, eFindSite holds significant promise for large-scale genome annotation and drug development projects. eFindSite is freely available to the academic community at http://www.brylinski.org/efindsite.

References

1. Hoehndorf R, Kelso J, Herre H (2009) The ontology of biological sequences. BMC Bioinformatics 10:377
2. Stevens R, Goble CA, Bechhofer S (2000) Ontology-based knowledge representation for bioinformatics. Brief Bioinformatics 1(4):398–414
3. Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
4. Harris MA et al (2004) The gene ontology (GO) database and informatics resource. Nucleic Acids Res 32(Database issue):D258–61
5. Lybrand TP (2002) Protein-ligand interactions. In: Naray-Szabo G, Warshel A (eds) Computational approaches to biochemical reactivity. Springer, Boston, pp 363–374
6. Metzker ML (2010) Sequencing technologies—the next generation. Nat Rev Genet 11(1):31–46
7. Zhang J et al (2011) The impact of next-generation sequencing on genomics. J Genet Genomics 38(3):95–109
8. Juncker AS et al (2009) Sequence-based feature prediction and annotation of proteins. Genome Biol 10(2):206
9. Loewenstein Y et al (2009) Protein function annotation by homology-based inference. Genome Biol 10(2):207
10. Ahmad S, Sarai A (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 6:33
11. Hwang S, Gou Z, Kuznetsov IB (2007) DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics 23(5):634–636
12. Chen P, Li J (2010) Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information. BMC Bioinformatics 11:402
13. Chen XW, Jeong JC (2009) Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25(5):585–591
14. Soding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951–960
15. Lopez G et al (2011) Firestar—advances in the prediction of functionally important residues. Nucleic Acids Res 39(Web Server issue):W235–41
16. Lord PW et al (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19(10):1275–1283
17. Schnoes AM et al (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 5(12):e1000605
18. Zhang QC et al (2011) PredUs: a web server for predicting protein interfaces using structural neighbors. Nucleic Acids Res 39(Web Server issue):W283–7
19. Brylinski M et al (2007) Prediction of functional sites based on the fuzzy oil drop model. PLoS Comput Biol 3(5):e94
20. Brylinski M et al (2007) Localization of ligand binding site in proteins identified in silico. J Mol Model 13(6–7):665–675
21. Dudev M, Lim C (2007) Discovering structural motifs using a structural alphabet: application to magnesium-binding sites. BMC Bioinformatics 8:106
22. Laskowski RA (1995) SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph 13(5):323–30, 307–8
23. Liang J, Edelsbrunner H, Woodward C (1998) Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7(9):1884–1897
24. Levitt DG, Banaszak LJ (1992) POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph 10(4):229–234
25. Huang B, Schroeder M (2006) LIGSITEcsc: predicting ligand binding sites using the connolly surface and degree of conservation. BMC Struct Biol 6:19
26. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10:168
27. Zhu H, Pisabarro MT (2011) MSPocket: an orientation-independent algorithm for the detection of ligand binding pockets. Bioinformatics 27(3):351–358
28. Huang B (2009) MetaPocket: a meta approach to improve protein ligand binding site prediction. OMICS 13(4):325–330
29. Skolnick J, Brylinski M (2009) FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinformatics 10(4):378–391
30. Wass MN, Kelley LA, Sternberg MJ (2010) 3DLigandSite: predicting ligand-binding sites using similar structures. Nucleic Acids Res 38(Web Server issue):W469–73
31. Brylinski M, Skolnick J (2008) A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci USA 105(1):129–134
32. Roche DB, Tetchner SJ, McGuffin LJ (2011) FunFOLD: an improved automated method for the prediction of ligand binding residues using 3D models of proteins. BMC Bioinformatics 12:160
33. Brylinski M, Skolnick J (2011) FINDSITE-metal: integrating evolutionary information and machine learning for structure-based metal-binding site prediction at the proteome level. Proteins 79(3):735–751
34. Dror I et al (2011) Predicting nucleic acid binding interfaces from structural models of proteins. Proteins
35. Mukherjee S, Zhang Y (2011) Protein-protein complex structure predictions by multimeric threading and template recombination. Structure 19(7):955–966
36. Tyagi M et al (2012) Homology inference of protein–protein interactions via conserved binding sites. PLoS ONE 7(1):e28896
37. Pandit SB, Skolnick J (2008) Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinformatics 9:531
38. Ortiz AR, Strauss CE, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621
39. Russell RB, Sasieni PD, Sternberg MJ (1998) Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 282(4):903–918
40. Brylinski M, Skolnick J (2010) Comparison of structure-based and threading-based approaches to protein functional annotation. Proteins 78(1):118–134
41. Laurie AT, Jackson RM (2006) Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening. Curr Protein Pept Sci 7(5):395–406
42. Li YY, An J, Jones SJ (2006) A large-scale computational approach to drug repositioning. Genome Inform 17(2):239–247
43. Li YY, An J, Jones SJ (2011) A computational approach to finding novel targets for existing drugs. PLoS Comput Biol 7(9):e1002139
44. Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):972–976
45. Brylinski M, Lingam D (2012) eThread: a highly optimized machine learning-based approach to meta-threading and the modeling of protein tertiary structures. PLoS ONE 7(11):e50200
46. Brylinski M, Feinstein WP (2012) Setting up a meta-threading pipeline for high-throughput structural bioinformatics: eThread software distribution, walkthrough and resource profiling. J Comput Sci Syst Biol 6(1):001–010
47. Wallach I, Lilien R (2009) The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 25(5):615–620
48. Wang G, Dunbrack RL Jr (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591
49. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57(4):702–710
50. Berman HM et al (2000) The protein data bank. Nucleic Acids Res 28(1):235–242
51. Bindewald E, Skolnick J (2005) A scoring function for docking ligands to low-resolution protein structures. J Comput Chem 26(4):374–383
52. Biegert A, Soding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA 106(10):3770–3775
53. Sadreyev R, Grishin N (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326(1):317–336
54. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14(9):755–763
55. Bucher P et al (1996) A flexible motif search technique based on generalized profiles. Comput Chem 20(1):3–23
56. Lobley A, Sadowski MI, Jones DT (2009) pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics 25(14):1761–1767
57. Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci 12(2):95–107
58. Zhou H, Zhou Y (2005) SPARKS 2 and SP3 servers in CASP6. Proteins 61(Suppl 7):152–156
59. Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358(6381):86–89
60. Tanimoto TT (1958) An elementary mathematical theory of classification and prediction. IBM Internal Report
61. Guha R et al (2006) The blue obelisk-interoperability in chemical informatics. J Chem Inf Model 46(3):991–998
62. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd ed. Morgan Kaufmann Publishers, San Francisco
63. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
64. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5(4):725–738
65. Soga S et al (2007) Use of amino acid composition to predict ligand-binding sites. J Chem Inf Model 47(2):400–406
66. Marti-Renom MA et al (2007) The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics 8(Suppl 4):S4
67. Liu T, Altman RB (2009) Prediction of calcium-binding sites by combining loop-modeling with machine learning. BMC Struct Biol 9:72
68. Kawabata T (2010) Detection of multiscale pockets on protein surfaces using mathematical morphology. Proteins 78(5):1195–1211
69. Zhang Z et al (2011) Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics 27(15):2083–2088
70. Blattner FR et al (1997) The complete genome sequence of Escherichia coli K-12. Science 277(5331):1453–1462
71. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815
72. Pandit SB, Zhang Y, Skolnick J (2006) TASSER-Lite: an automated tool for protein comparative modeling. Biophys J 91(11):4180–4190
73. Brylinski M, Skolnick J (2007) What is the relationship between the global structures of apo and holo proteins? Proteins 70(2):363–377
74. Chen X, Liu M, Gilson MK (2001) BindingDB: a web-accessible molecular recognition database. Comb Chem High Throughput Screen 4(8):719–725
75. Wang Y et al (2009) PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(Web Server issue):W623–33
76. Wishart DS et al (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue):D668–72
77. Jacquet E, Parmeggiani A (1988) Structure-function relationships in the GTP binding domain of EF-Tu: mutation of Val20, the residue homologous to position 12 in p21. EMBO J 7(9):2861–2867
78. Weijland A et al (1993) Asparagine-135 of elongation factor Tu is a crucial residue for the folding of the guanine nucleotide binding pocket. FEBS Lett 330(3):334–338
79. Gumusel F et al (1990) Mutagenesis of the NH2-terminal domain of elongation factor Tu. Biochim Biophys Acta 1050(1–3):215–221
80. Stebbins JW et al (1992) Arginine 54 in the active site of Escherichia coli aspartate transcarbamoylase is critical for catalysis: a site-specific mutagenesis, NMR, and X-ray crystallographic study. Protein Sci 1(11):1435–1446
81. Waldrop GL et al (1992) The contribution of threonine 55 to catalysis in aspartate transcarbamoylase. Biochemistry 31(28):6592–6597
82. Jin L, Stec B, Kantrowitz ER (2000) A cis-proline to alanine mutant of E. coli aspartate transcarbamoylase: kinetic studies and three-dimensional crystal structures. Biochemistry 39(27):8058–8066
83. Kitano H (2002) Systems biology: a brief overview. Science 295(5560):1662–1664
84. Xue L et al (2003) Design and evaluation of a molecular fingerprint involving the transformation of property descriptor values into a binary classification scheme. J Chem Inf Comput Sci 43(4):1151–1157
85. Willett P (1998) Chemical similarity searching. J Chem Inf Model 38:983–996

Acknowledgments

This study was supported by the Louisiana Board of Regents through the Board of Regents Support Fund [contract LEQSF(2012–15)-RD-A-05] and Oak Ridge Associated Universities (ORAU) through the 2012 Ralph E. Powe Junior Faculty Enhancement Award. Portions of this research were conducted with high performance computational resources provided by Louisiana State University (http://www.hpc.lsu.edu).

Author information

Authors and Affiliations

Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
Michal Brylinski & Wei P. Feinstein
Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, 70803, USA
Michal Brylinski

Corresponding author

Correspondence to Michal Brylinski.

Appendix

Molecular fingerprints are bit strings that represent the structural and chemical features of organic compounds (see the Daylight manual for details: http://www.daylight.com/dayhtml/doc/theory/index.pdf). The Tanimoto coefficient is the most popular measure for quantifying the similarity of two sets of bits (e.g. molecular fingerprints). The classical Tanimoto coefficient (TC) [60] is defined as:

$$ TC = \frac{c}{a + b + c} \tag{2} $$

where a is the number of bits set on in the first string but not in the second, b is the number of bits set on in the second string but not in the first, and c is the number of bits set on in both strings.
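As an illustration of the definition above, the following minimal Python sketch computes TC for two fingerprints given as equal-length strings of "0" and "1". The function name `tanimoto`, the string representation and the convention of returning 0.0 when no bit is on in either string are assumptions made for this example; they are not taken from eFindSite.

```python
def tanimoto(fp1: str, fp2: str) -> float:
    """Classical Tanimoto coefficient, TC = c / (a + b + c), for two bit strings."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    pairs = list(zip(fp1, fp2))
    a = sum(1 for x, y in pairs if x == "1" and y == "0")  # on in 1st only
    b = sum(1 for x, y in pairs if x == "0" and y == "1")  # on in 2nd only
    c = sum(1 for x, y in pairs if x == "1" and y == "1")  # on in both
    total = a + b + c
    # Assumption: two all-zero fingerprints get TC = 0.0 to avoid dividing by zero.
    return c / total if total else 0.0


# Here a = 1, b = 1 and c = 2, so TC = 2 / 4.
print(tanimoto("110100", "100110"))  # 0.5
```

Positions where both bits are off do not enter TC at all, which is why two sparse fingerprints can score low even when they agree on most positions.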

In addition to the classical Tanimoto coefficient, the overlap between two molecular fingerprints can be measured by the average Tanimoto coefficient (aveTC) [84]:

$$ aveTC = \frac{TC + TC^{'}}{2} \tag{3} $$

where TC′ is the Tanimoto coefficient calculated for bit positions set off rather than set on.
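A short Python sketch of this average, under the same assumed representation (equal-length strings of "0" and "1"), is given below. The helper `tanimoto_for_state` and the function `average_tanimoto` are hypothetical names introduced for illustration only. TC′ is obtained by applying the classical formula to the off bits instead of the on bits.

```python
def tanimoto_for_state(fp1: str, fp2: str, state: str) -> float:
    """Tanimoto coefficient counting positions whose bit equals `state`."""
    both = sum(1 for x, y in zip(fp1, fp2) if x == state and y == state)
    either = sum(1 for x, y in zip(fp1, fp2) if x == state or y == state)
    # Assumption: return 0.0 when no position carries `state` in either string.
    return both / either if either else 0.0


def average_tanimoto(fp1: str, fp2: str) -> float:
    """aveTC = (TC + TC') / 2, with TC over on bits and TC' over off bits."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    tc_on = tanimoto_for_state(fp1, fp2, "1")
    tc_off = tanimoto_for_state(fp1, fp2, "0")
    return (tc_on + tc_off) / 2


# TC = 2/4 over the on bits and TC' = 4/6 over the off bits, so aveTC = 7/12.
print(average_tanimoto("11110000", "11000000"))
```

Unlike the classical coefficient, this average also rewards agreement on bits that are off in both fingerprints.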

Furthermore, a version of the Tanimoto coefficient for continuous variables (conTC) [85] was developed:

$$ conTC = \frac{\sum_i x_{pi} x_{ci}}{\sum_i x_{pi}^{2} + \sum_i x_{ci}^{2} - \sum_i x_{pi} x_{ci}} \tag{4} $$

where x_pi is the i-th descriptor of a fingerprint profile and x_ci is the i-th descriptor of a query compound. The fingerprint profile is constructed from individual fingerprints for a set of compounds, e.g. template-bound ligands that were used to identify a putative binding site in the target structure.
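The continuous variant can be sketched in Python as follows. The text does not specify how the profile is built from the individual fingerprints, so this example assumes, purely for illustration, that each profile descriptor is the fraction of compounds with that bit set on. The names `build_profile` and `continuous_tanimoto` are likewise hypothetical.

```python
from typing import List, Sequence


def build_profile(fingerprints: Sequence[str]) -> List[float]:
    """Per-position fraction of on bits (an assumed way to build a profile)."""
    if not fingerprints:
        raise ValueError("at least one fingerprint is required")
    length = len(fingerprints[0])
    if any(len(fp) != length for fp in fingerprints):
        raise ValueError("fingerprints must have the same length")
    n = len(fingerprints)
    return [sum(fp[i] == "1" for fp in fingerprints) / n for i in range(length)]


def continuous_tanimoto(profile: Sequence[float], query: Sequence[float]) -> float:
    """conTC = sum(p*c) / (sum(p^2) + sum(c^2) - sum(p*c))."""
    if len(profile) != len(query):
        raise ValueError("profile and query must have the same length")
    dot = sum(p * c for p, c in zip(profile, query))
    denom = sum(p * p for p in profile) + sum(c * c for c in query) - dot
    # Assumption: return 0.0 when both vectors are all zeros.
    return dot / denom if denom else 0.0


profile = build_profile(["1100", "1010"])   # [1.0, 0.5, 0.5, 0.0]
query = [1.0, 1.0, 0.0, 0.0]                # query fingerprint "1100"
print(continuous_tanimoto(profile, query))  # 1.5 / (1.5 + 2.0 - 1.5) = 0.75
```

When both vectors are binary, the expression reduces to the classical coefficient of Eq. 2, since the dot product equals c and the two sums of squares equal a + c and b + c.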

About this article

Cite this article

Brylinski, M., Feinstein, W.P. eFindSite: Improved prediction of ligand binding sites in protein models using meta-threading, machine learning and auxiliary ligands. J Comput Aided Mol Des 27, 551–567 (2013). https://doi.org/10.1007/s10822-013-9663-5

Received: 06 April 2013
Accepted: 01 July 2013
Published: 10 July 2013
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10822-013-9663-5
