J Chem Inf Model. 2016 Feb 22;56(2):275-85.
doi: 10.1021/acs.jcim.5b00555. Epub 2016 Jan 19.

Open Source Bayesian Models. 3. Composite Models for Prediction of Binned Responses


Alex M Clark et al. J Chem Inf Model. 2016.

Abstract

Bayesian models constructed from structure-derived fingerprints have been a popular and useful method for drug discovery research when applied to bioactivity measurements that can be effectively classified as active or inactive. The results can be used to rank candidate structures according to their probability of activity, and this ranking benefits from the high degree of interpretability when structure-based fingerprints are used, making the results chemically intuitive. Besides selecting an activity threshold, building a Bayesian model is fast and requires few or no parameters or user intervention. The method also does not suffer from such acute overtraining problems as quantitative structure-activity relationships or quantitative structure-property relationships (QSAR/QSPR). This makes it an approach highly suitable for automated workflows that are independent of user expertise or prior knowledge of the training data. We now describe a new method for creating a composite group of Bayesian models to extend the method to work with multiple states, rather than just binary. Incoming activities are divided into bins, each covering a mutually exclusive range of activities. For each of these bins, a Bayesian model is created to model whether or not the compound belongs in the bin. Analyzing putative molecules using the composite model involves making a prediction for each bin and examining the relative likelihood for each assignment, for example, highest value wins. The method has been evaluated on a collection of hundreds of data sets extracted from ChEMBL v20 and validated data sets for ADME/Tox and bioactivity.
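The composite scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It makes three assumptions: fingerprints are represented as plain sets of hashable features, each per-bin model uses the commonly cited Laplacian-corrected contribution ln((A + 1) / (T·P + 1)), and raw per-bin scores are compared directly without any calibration. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def bin_index(activity, boundaries):
    """Index of the bin containing `activity`; `boundaries` are ascending cut points."""
    return sum(activity >= b for b in boundaries)


def train_bin_model(feature_sets, in_bin):
    """Two-state model for one bin: 'in bin' versus 'not in bin'.

    Assumed contribution per feature: ln((A + 1) / (T * P + 1)), where
    A = in-bin molecules with the feature, T = all molecules with the
    feature, and P = overall fraction of molecules in the bin.
    """
    total = defaultdict(int)
    active = defaultdict(int)
    for feats, flag in zip(feature_sets, in_bin):
        for f in feats:
            total[f] += 1
            if flag:
                active[f] += 1
    p = sum(in_bin) / len(in_bin)
    return {f: math.log((active[f] + 1) / (total[f] * p + 1)) for f in total}


def train_composite(feature_sets, activities, boundaries):
    """Build one two-state model per bin."""
    bins = [bin_index(a, boundaries) for a in activities]
    nbins = len(boundaries) + 1
    return [train_bin_model(feature_sets, [b == k for b in bins])
            for k in range(nbins)]


def predict_bin(models, feats):
    """Score a molecule against every bin model; the highest score wins."""
    scores = [sum(m.get(f, 0.0) for f in feats) for m in models]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data (hypothetical): feature sets stand in for structure-derived fingerprints.
toy_features = [{"a", "b"}, {"a", "c"}, {"d", "e"}, {"d", "f"}, {"g", "h"}, {"g", "i"}]
toy_activities = [4.1, 4.5, 6.2, 6.8, 8.3, 8.9]
toy_boundaries = [5.0, 7.0]  # three mutually exclusive activity ranges

toy_models = train_composite(toy_features, toy_activities, toy_boundaries)
toy_best, toy_scores = predict_bin(toy_models, {"d", "x"})
print(toy_best, toy_scores)
```

In this sketch, features never seen in training contribute zero, and a tie between bins goes to the lowest bin index. A real workflow would likely normalize or calibrate the per-bin scores before comparing them.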


Conflict of interest statement

The authors declare the following competing financial interests: S.E. is a consultant for Collaborative Drug Discovery, Inc. A.M.C. is the founder of Molecular Materials Informatics, Inc.

Figures

Figure 1
Visual example of the binning system: Four groups of molecular structures are divided according to an activity boundary. For each bin, a conventional two-state Laplacian-corrected naïve Bayesian model is constructed using in bin versus not in bin as the classifier.
Figure 2
Demonstration of separation of structure activity groups by population integral (red line) or by boundaries between different categories of structure–activity relationships (green line).
Figure 3
Example of a data set containing activities against aminopeptidase N-protein with 651 measurements, of which 65 were reserved for use in the testing set. The correctness matrix is shown in (a), while (b) shows enrichment statistics, and (c–i) show examples of molecular predictions from the test set.
Figure 4
Three examples of recall rates: (a) Caspase (980 rows), (b) Death kinase (926 rows), and (c) dihydrofolate reductase (1056 rows). In each case, 10% of the structures were retained for use as the testing set.
Figure 5
Unsupervised model building based on 1843 data sets extracted from ChEMBL v20. Results are divided into bin sizes (columns). Each point corresponds to the ratio of correctly predicted bins versus chance of random guessing (enrichment), with a purple line indicating the null hypothesis. The average and standard deviation are marked on the Y-axis. Training set size is shown on the X-axis. The testing sets were made up of 10% of each total data set.
Figure 6
Three examples of correlating raw Laplacian-modified naïve Bayesian predictions with activity: (a) solubility (test set = 649, R² = 0.513), (b) mouse epoxide hydrolase (test set = 328, R² = 0.582), and (c) activity against Chagas disease (test set = 371, R² = 0.393).
Figure 7
Analysis for three data sets: aqueous solubility, mouse epoxide hydrolase, and Chagas disease. The correctness matrix and enrichment statistics for the training sets (a, c, e) are shown on the top; those for the test sets (b, d, f) are shown below.
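The correctness matrix and enrichment statistics named in the captions above can be illustrated with a short Python sketch. It assumes that the correctness matrix counts actual bin against predicted bin, and that enrichment is the fraction of correctly predicted bins divided by a uniform chance rate of 1 / (number of bins). The paper may define chance differently, for example from bin populations. The function names and toy labels are hypothetical.

```python
def correctness_matrix(actual, predicted, nbins):
    """Count matrix with rows = actual bin and columns = predicted bin."""
    matrix = [[0] * nbins for _ in range(nbins)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def enrichment(matrix):
    """Ratio of observed accuracy to the assumed uniform chance rate 1/nbins."""
    nbins = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(nbins))
    # (correct / total) / (1 / nbins), rearranged to avoid rounding noise
    return correct * nbins / total


# Toy labels (hypothetical): six test molecules across three bins.
toy_actual = [0, 0, 1, 1, 2, 2]
toy_predicted = [0, 1, 1, 1, 2, 0]

toy_matrix = correctness_matrix(toy_actual, toy_predicted, 3)
toy_enrichment = enrichment(toy_matrix)
print(toy_matrix)      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(toy_enrichment)  # 2.0
```

Under this definition a value of 1.0 corresponds to random guessing, which matches the role of the null-hypothesis line described for Figure 5.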


