Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks

Marwin H. S. Segler et al. ACS Cent Sci. 2018 Jan 24;4(1):120-131. doi: 10.1021/acscentsci.7b00512. Epub 2017 Dec 28.
Abstract

In de novo drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target. In this work, we show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model. In order to enrich libraries with molecules active toward a given biological target, we propose to fine-tune the model with small sets of molecules, which are known to be active against that target. Against Staphylococcus aureus, the model reproduced 14% of 6051 hold-out test molecules that medicinal chemists designed, whereas against Plasmodium falciparum (Malaria), it reproduced 28% of 1240 test molecules. When coupled with a scoring function, our model can perform the complete de novo drug design cycle to generate large sets of novel molecules for drug discovery.
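As a rough orientation, the following is a minimal sketch of a character-level SMILES language model of the kind described above, written in PyTorch. The vocabulary handling, layer sizes, and training step are illustrative assumptions and not the authors' exact architecture or hyperparameters.

    # Minimal sketch of a character-level SMILES language model (PyTorch).
    # Layer sizes and the training step are illustrative, not the paper's setup.
    import torch
    import torch.nn as nn

    class SmilesLM(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
            self.head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, state=None):
            emb = self.embed(x)                  # (batch, seq, embed_dim)
            out, state = self.lstm(emb, state)   # (batch, seq, hidden_dim)
            return self.head(out), state         # logits over the next symbol

    def train_step(model, optimizer, batch):
        # batch: (batch, seq) integer-encoded SMILES; the model predicts each next symbol
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits, _ = model(inputs)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Fine-tuning toward a target then amounts to continuing this training on the small set of known actives, as sketched further below.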


Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Examples of molecules and their SMILES representation. To correctly create SMILES, the model has to learn long-term dependencies, for example, to close rings (indicated by numbers) and brackets.
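The ring-closure digits and branch parentheses mentioned in the caption impose long-range constraints on valid SMILES. The toy check below is not part of the paper and ignores two-digit ring labels (%nn) and digits inside bracket atoms; it only illustrates what the generator has to get right.

    # Toy check of SMILES long-range constraints: every opened branch "(" and
    # ring-bond digit must be closed later in the string. Simplified: ignores
    # %-labels and bracket atoms.
    def balanced_smiles(smiles: str) -> bool:
        open_branches = 0
        open_rings = set()
        for ch in smiles:
            if ch == "(":
                open_branches += 1
            elif ch == ")":
                open_branches -= 1
                if open_branches < 0:
                    return False
            elif ch.isdigit():
                # ring-closure label: first occurrence opens, second closes
                if ch in open_rings:
                    open_rings.remove(ch)
                else:
                    open_rings.add(ch)
        return open_branches == 0 and not open_rings

    print(balanced_smiles("c1ccccc1"))  # benzene -> True
    print(balanced_smiles("c1ccccc"))   # unclosed ring -> False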
Figure 2
(a) Recursively defined RNN. (b) The same RNN, unrolled. The parameters θ (the weight matrices of the neural network) are shared over all time steps.
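The weight sharing shown in Figure 2 can be expressed as the plain recurrence below (NumPy, illustrative only; in practice gated cells such as LSTMs are used rather than this vanilla update).

    # Sketch of the recurrence in Figure 2: the same parameters (W_xh, W_hh, b)
    # are applied at every time step; only the hidden state h changes.
    import numpy as np

    def rnn_unroll(xs, W_xh, W_hh, b, h0):
        """xs: list of input vectors x_1..x_T; returns hidden states h_1..h_T."""
        h, hs = h0, []
        for x in xs:
            h = np.tanh(W_xh @ x + W_hh @ h + b)  # h_t = f(x_t, h_{t-1}; theta)
            hs.append(h)
        return hs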
Figure 3
Symbol generation and sampling process. We start with a random seed symbol s1, here c, which gets converted into a one-hot vector x1 and input into the model. The model then updates its internal state h0 to h1 and outputs y1, which is the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later.
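The symbol-by-symbol procedure in the caption corresponds to a sampling loop like the one sketched here, assuming the SmilesLM sketch above and hypothetical stoi/itos mappings between symbols and integer indices.

    # Sketch of the sampling loop in Figure 3 (assumes the SmilesLM sketch above;
    # stoi/itos are placeholder symbol <-> index mappings).
    import torch

    def sample_smiles(model, stoi, itos, start="c", eol="\n", max_len=120):
        model.eval()
        idx = torch.tensor([[stoi[start]]])        # seed symbol s1
        state, symbols = None, [start]
        with torch.no_grad():
            for _ in range(max_len):
                logits, state = model(idx, state)             # output y_t
                probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over next symbol
                nxt = torch.multinomial(probs, 1).item()      # sample s_{t+1}
                sym = itos[nxt]
                if sym == eol:                                # stop at the EOL symbol
                    break
                symbols.append(sym)
                idx = torch.tensor([[nxt]])
        return "".join(symbols)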
Figure 4
A few randomly selected, generated molecules. Ad = Adamantyl.
Figure 5
t-SNE projection of 7 physicochemical descriptors of random molecules from ChEMBL (blue) and molecules generated with the neural network trained on ChEMBL (green), to two unitless dimensions. The distributions of both sets overlap significantly.
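A projection like Figure 5 can be reproduced in outline as follows; the caption does not list the seven descriptors used in the paper, so the ones below are common physicochemical descriptors chosen purely for illustration (RDKit plus scikit-learn).

    # Sketch of a descriptor + t-SNE projection in the spirit of Figure 5.
    # Descriptor choice is an illustrative assumption.
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Crippen
    from sklearn.manifold import TSNE

    def descriptor_vector(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return [Descriptors.MolWt(mol), Crippen.MolLogP(mol),
                Descriptors.TPSA(mol), Descriptors.NumHDonors(mol),
                Descriptors.NumHAcceptors(mol), Descriptors.NumRotatableBonds(mol),
                Descriptors.RingCount(mol)]

    def tsne_embedding(smiles_list):
        X = np.array([v for v in map(descriptor_vector, smiles_list) if v is not None])
        return TSNE(n_components=2).fit_transform(X)  # two unitless dimensions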
Figure 6
Epochs of fine-tuning vs ratio of actives.
Figure 7
Nearest-neighbor Tanimoto similarity distribution of the generated molecules for 5-HT2A after n epochs of fine-tuning against the known actives. The generated molecules are distributed over the whole similarity range. Generated molecules with a medium similarity can be interesting for scaffold-hopping.
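The nearest-neighbor similarities in Figures 7 and 12 can be computed along these lines with RDKit, using Morgan fingerprints of radius 2 as the usual stand-in for ECFP4; the bit-vector size and other settings here are assumptions.

    # Sketch of nearest-neighbor Tanimoto similarity with RDKit Morgan fingerprints.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

    def nearest_neighbor_similarities(generated, known_actives):
        ref_fps = [fingerprint(s) for s in known_actives]
        sims = []
        for s in generated:
            fp = fingerprint(s)
            sims.append(max(DataStructs.TanimotoSimilarity(fp, r) for r in ref_fps))
        return sims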
Figure 8
t-SNE plot of the pIC50 > 9 test set (blue) and the de novo molecules predicted to be active (green). The language model populates chemical space around the test molecules.
Figure 9
Different training strategies on the Staphylococcus aureus data set with 1000 training and 6051 test examples. Fine-tuning the pretrained model performs better than training from scratch (lower test loss [cross entropy] is better).
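The comparison in Figure 9 amounts to reloading the weights of the model pretrained on the large general-purpose corpus and continuing training on the small target-specific set. A sketch, assuming the SmilesLM and train_step sketches above; the checkpoint filename, learning rate, and active_batches iterable are hypothetical placeholders.

    # Sketch of fine-tuning a pretrained SMILES language model on a small set of
    # known actives (reuses SmilesLM and train_step from the sketch above).
    import torch

    def fine_tune(model, batches, epochs=20, lr=1e-4):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for batch in batches:          # batches of integer-encoded known actives
                train_step(model, optimizer, batch)
        return model

    model = SmilesLM(vocab_size=64)
    model.load_state_dict(torch.load("chembl_pretrained.pt"))  # hypothetical checkpoint
    model = fine_tune(model, active_batches)                   # placeholder data iterable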
Figure 10
Scheme of our de novo design cycle. Molecules are generated by the chemical language model and then scored with the target prediction model (TPM). The inactives are filtered out, and the RNN is retrained. Here, the TPM is a machine learning model, but it could also be a robot conducting synthesis and biological assays, or a docking program.
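The cycle in Figure 10 can be written as a simple loop. In the sketch below, generate, score, and retrain are placeholders for the chemical language model sampler, the target prediction model, and the fine-tuning step; they are not specific APIs from the paper.

    # Sketch of the de novo design cycle in Figure 10 with placeholder callables.
    def design_cycle(generate, score, retrain, model,
                     rounds=4, n_samples=1000, threshold=0.5):
        """generate(model, n) -> list of SMILES;
        score(smiles) -> estimated activity in [0, 1];
        retrain(model, smiles_list) -> refocused model."""
        library = []
        for _ in range(rounds):
            candidates = generate(model, n_samples)
            actives = [s for s in candidates if score(s) >= threshold]  # filter inactives
            library.extend(actives)
            model = retrain(model, actives)   # refocus the language model on the hits
        return library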
Figure 11
Histogram of Levenshtein (string edit) distances of the SMILES of the reproduced molecules to their nearest neighbor in the training set (Staphylococcus aureus, model retrained on 50 actives). While in many cases the model makes changes of a few symbols in the SMILES, resembling the typical modifications applied when exploring series of compounds, the distribution of the distances indicates that the RNN also performs more complex changes by introducing larger moieties or generating molecules that are structurally different, but isofunctional to the training set.
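The string edit distance used in Figure 11 is the standard Levenshtein distance; a compact dynamic-programming implementation for SMILES strings looks like this.

    # Standard Levenshtein (string edit) distance between two SMILES strings.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("c1ccccc1", "c1ccncc1"))  # 1: a single symbol substituted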
Figure 12
Violin plot of the nearest-neighbor ECFP4-Tanimoto similarity distribution of the 50 training molecules against the rediscovered molecules in Table 3, entry 2. The distribution suggests that the model has learned to make typical small functional group replacements, but can also reproduce molecules which are not too similar to the training data.
