Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks

Marwin H. S. Segler et al. ACS Cent Sci. 2018 Jan 24;4(1):120-131. doi: 10.1021/acscentsci.7b00512. Epub 2017 Dec 28.
Abstract

In de novo drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target. In this work, we show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model. In order to enrich libraries with molecules active toward a given biological target, we propose to fine-tune the model with small sets of molecules, which are known to be active against that target. Against Staphylococcus aureus, the model reproduced 14% of 6051 hold-out test molecules that medicinal chemists designed, whereas against Plasmodium falciparum (Malaria), it reproduced 28% of 1240 test molecules. When coupled with a scoring function, our model can perform the complete de novo drug design cycle to generate large sets of novel molecules for drug discovery.
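As a rough orientation, the following is a minimal sketch of a character-level SMILES language model of the kind described above, written in PyTorch. The vocabulary handling, layer sizes, and training step are illustrative assumptions and not the authors' exact architecture or hyperparameters.

    # Minimal sketch of a character-level SMILES language model (PyTorch).
    # Layer sizes and the training step are illustrative, not the paper's setup.
    import torch
    import torch.nn as nn

    class SmilesLM(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
            self.head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, x, state=None):
            emb = self.embed(x)                  # (batch, seq, embed_dim)
            out, state = self.lstm(emb, state)   # (batch, seq, hidden_dim)
            return self.head(out), state         # logits over the next symbol

    def train_step(model, optimizer, batch):
        # batch: (batch, seq) integer-encoded SMILES; the model predicts each next symbol
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits, _ = model(inputs)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Fine-tuning toward a target then amounts to continuing this training on the small set of known actives, as sketched further below.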


Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Examples of molecules and their SMILES representation. To correctly create SMILES, the model has to learn long-term dependencies, for example, to close rings (indicated by numbers) and brackets.
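The ring-closure digits and branch parentheses mentioned in the caption impose long-range constraints on valid SMILES. The toy check below is not part of the paper and ignores two-digit ring labels (%nn) and digits inside bracket atoms; it only illustrates what the generator has to get right.

    # Toy check of SMILES long-range constraints: every opened branch "(" and
    # ring-bond digit must be closed later in the string. Simplified: ignores
    # %-labels and bracket atoms.
    def balanced_smiles(smiles: str) -> bool:
        open_branches = 0
        open_rings = set()
        for ch in smiles:
            if ch == "(":
                open_branches += 1
            elif ch == ")":
                open_branches -= 1
                if open_branches < 0:
                    return False
            elif ch.isdigit():
                # ring-closure label: first occurrence opens, second closes
                if ch in open_rings:
                    open_rings.remove(ch)
                else:
                    open_rings.add(ch)
        return open_branches == 0 and not open_rings

    print(balanced_smiles("c1ccccc1"))  # benzene -> True
    print(balanced_smiles("c1ccccc"))   # unclosed ring -> False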
Figure 2
(a) Recursively defined RNN. (b) The same RNN, unrolled. The parameters θ (the weight matrices of the neural network) are shared over all time steps.
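The weight sharing shown in Figure 2 can be expressed as the plain recurrence below (NumPy, illustrative only; in practice gated cells such as LSTMs are used rather than this vanilla update).

    # Sketch of the recurrence in Figure 2: the same parameters (W_xh, W_hh, b)
    # are applied at every time step; only the hidden state h changes.
    import numpy as np

    def rnn_unroll(xs, W_xh, W_hh, b, h0):
        """xs: list of input vectors x_1..x_T; returns hidden states h_1..h_T."""
        h, hs = h0, []
        for x in xs:
            h = np.tanh(W_xh @ x + W_hh @ h + b)  # h_t = f(x_t, h_{t-1}; theta)
            hs.append(h)
        return hs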
Figure 3
Symbol generation and sampling process. We start with a random seed symbol s1, here c, which gets converted into a one-hot vector x1 and input into the model. The model then updates its internal state h0 to h1 and outputs y1, which is the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later.
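The symbol-by-symbol procedure in the caption corresponds to a sampling loop like the one sketched here, assuming the SmilesLM sketch above and hypothetical stoi/itos mappings between symbols and integer indices.

    # Sketch of the sampling loop in Figure 3 (assumes the SmilesLM sketch above;
    # stoi/itos are placeholder symbol <-> index mappings).
    import torch

    def sample_smiles(model, stoi, itos, start="c", eol="\n", max_len=120):
        model.eval()
        idx = torch.tensor([[stoi[start]]])        # seed symbol s1
        state, symbols = None, [start]
        with torch.no_grad():
            for _ in range(max_len):
                logits, state = model(idx, state)             # output y_t
                probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over next symbol
                nxt = torch.multinomial(probs, 1).item()      # sample s_{t+1}
                sym = itos[nxt]
                if sym == eol:                                # stop at the EOL symbol
                    break
                symbols.append(sym)
                idx = torch.tensor([[nxt]])
        return "".join(symbols)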
Figure 4
A few randomly selected, generated molecules. Ad = Adamantyl.
Figure 5
t-SNE projection of 7 physicochemical descriptors of random molecules from ChEMBL (blue) and molecules generated with the neural network trained on ChEMBL (green), to two unitless dimensions. The distributions of both sets overlap significantly.
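A projection like Figure 5 can be reproduced in outline as follows; the caption does not list the seven descriptors used in the paper, so the ones below are common physicochemical descriptors chosen purely for illustration (RDKit plus scikit-learn).

    # Sketch of a descriptor + t-SNE projection in the spirit of Figure 5.
    # Descriptor choice is an illustrative assumption.
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Crippen
    from sklearn.manifold import TSNE

    def descriptor_vector(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return None
        return [Descriptors.MolWt(mol), Crippen.MolLogP(mol),
                Descriptors.TPSA(mol), Descriptors.NumHDonors(mol),
                Descriptors.NumHAcceptors(mol), Descriptors.NumRotatableBonds(mol),
                Descriptors.RingCount(mol)]

    def tsne_embedding(smiles_list):
        X = np.array([v for v in map(descriptor_vector, smiles_list) if v is not None])
        return TSNE(n_components=2).fit_transform(X)  # two unitless dimensions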
Figure 6
Epochs of fine-tuning vs ratio of actives.
Figure 7
Nearest-neighbor Tanimoto similarity distribution of the generated molecules for 5-HT2A after n epochs of fine-tuning against the known actives. The generated molecules are distributed over the whole similarity range. Generated molecules with a medium similarity can be interesting for scaffold-hopping.
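The nearest-neighbor similarities in Figures 7 and 12 can be computed along these lines with RDKit, using Morgan fingerprints of radius 2 as the usual stand-in for ECFP4; the bit-vector size and other settings here are assumptions.

    # Sketch of nearest-neighbor Tanimoto similarity with RDKit Morgan fingerprints.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def fingerprint(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

    def nearest_neighbor_similarities(generated, known_actives):
        ref_fps = [fingerprint(s) for s in known_actives]
        sims = []
        for s in generated:
            fp = fingerprint(s)
            sims.append(max(DataStructs.TanimotoSimilarity(fp, r) for r in ref_fps))
        return sims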
Figure 8
t-SNE plot of the pIC50 > 9 test set (blue) and the de novo molecules predicted to be active (green). The language model populates chemical space around the test molecules.
Figure 9
Different training strategies on the Staphylococcus aureus data set with 1000 training and 6051 test examples. Fine-tuning the pretrained model performs better than training from scratch (lower test loss [cross entropy] is better).
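The comparison in Figure 9 amounts to reloading the weights of the model pretrained on the large general-purpose corpus and continuing training on the small target-specific set. A sketch, assuming the SmilesLM and train_step sketches above; the checkpoint filename, learning rate, and active_batches iterable are hypothetical placeholders.

    # Sketch of fine-tuning a pretrained SMILES language model on a small set of
    # known actives (reuses SmilesLM and train_step from the sketch above).
    import torch

    def fine_tune(model, batches, epochs=20, lr=1e-4):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for batch in batches:          # batches of integer-encoded known actives
                train_step(model, optimizer, batch)
        return model

    model = SmilesLM(vocab_size=64)
    model.load_state_dict(torch.load("chembl_pretrained.pt"))  # hypothetical checkpoint
    model = fine_tune(model, active_batches)                   # placeholder data iterable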
Figure 10
Scheme of our de novo design cycle. Molecules are generated by the chemical language model and then scored with the target prediction model (TPM). The inactives are filtered out, and the RNN is retrained. Here, the TPM is a machine learning model, but it could also be a robot conducting synthesis and biological assays, or a docking program.
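The cycle in Figure 10 can be written as a simple loop. In the sketch below, generate, score, and retrain are placeholders for the chemical language model sampler, the target prediction model, and the fine-tuning step; they are not specific APIs from the paper.

    # Sketch of the de novo design cycle in Figure 10 with placeholder callables.
    def design_cycle(generate, score, retrain, model,
                     rounds=4, n_samples=1000, threshold=0.5):
        """generate(model, n) -> list of SMILES;
        score(smiles) -> estimated activity in [0, 1];
        retrain(model, smiles_list) -> refocused model."""
        library = []
        for _ in range(rounds):
            candidates = generate(model, n_samples)
            actives = [s for s in candidates if score(s) >= threshold]  # filter inactives
            library.extend(actives)
            model = retrain(model, actives)   # refocus the language model on the hits
        return library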
Figure 11
Histogram of Levenshtein (string edit) distances of the SMILES of the reproduced molecules to their nearest neighbor in the training set (Staphylococcus aureus, model retrained on 50 actives). While in many cases the model makes changes of a few symbols in the SMILES, resembling the typical modifications applied when exploring series of compounds, the distribution of the distances indicates that the RNN also performs more complex changes by introducing larger moieties or generating molecules that are structurally different, but isofunctional to the training set.
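The string edit distance used in Figure 11 is the standard Levenshtein distance; a compact dynamic-programming implementation for SMILES strings looks like this.

    # Standard Levenshtein (string edit) distance between two SMILES strings.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("c1ccccc1", "c1ccncc1"))  # 1: a single symbol substituted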
Figure 12
Violin plot of the nearest-neighbor ECFP4-Tanimoto similarity distribution of the 50 training molecules against the rediscovered molecules in Table 3, entry 2. The distribution suggests that the model has learned to make typical small functional group replacements, but can also reproduce molecules which are not too similar to the training data.
