Application of Generative Autoencoder in De Novo Molecular Design

Thomas Blaschke et al. Mol Inform. 2018 Jan;37(1-2):1700123. doi: 10.1002/minf.201700123. Epub 2017 Dec 13.

Abstract

A major challenge in computational chemistry is the generation of novel molecular structures with desirable pharmacological and physicochemical properties. In this work, we investigate the potential use of autoencoders, a deep learning methodology, for de novo molecular design. Various generative autoencoders were used to map molecular structures into a continuous latent space and vice versa, and their performance as structure generators was assessed. Our results show that the latent space preserves the chemical similarity principle and can therefore be used for the generation of analogue structures. Furthermore, the latent spaces created by the autoencoders were searched systematically to generate novel compounds with predicted activity against dopamine receptor type 2, and compounds similar to known actives that were not included in the training set were identified.

Keywords: Autoencoder; chemoinformatics; de novo molecular design; deep learning; inverse QSAR.


Figures

Figure 1
An autoencoder is a coordinated pair of neural networks. The encoder converts a high-dimensional input, e.g. a molecule, into a continuous numerical representation with fixed dimensionality. The decoder reconstructs the input from the numerical representation.
Figure 2
Encoding and decoding of a molecule using a variational autoencoder. The encoder deterministically maps a molecular structure to the mean and variance of a Gaussian distribution. Given the generated mean and variance, a new point is sampled and fed into the decoder. The decoder then generates a new molecule from the sampled point.
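The sampling step in Figure 2 is commonly implemented with the reparameterization trick. A minimal pure-Python sketch, assuming a hypothetical 3-dimensional latent space and made-up mean/log-variance vectors (a real encoder network would produce these):

```python
import math
import random

def sample_latent(mu, log_var, rng=random):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1),
    # so gradients can flow through mu and log_var during training.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# Hypothetical latent encoding of one molecule (illustrative values only).
mu = [0.1, -0.4, 2.0]
log_var = [-2.0, -2.0, -2.0]  # small variance keeps samples close to mu
z = sample_latent(mu, log_var, random.Random(0))
```

Because the variance is small here, repeated samples stay near the mean, which is why decoding points sampled around a molecule's latent vector tends to yield close analogues.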
Figure 3
Sequence generation using teacher forcing. The last decoder layer, trained with teacher forcing, receives two inputs: the output of the previous layer and a character from the previous time step. In training mode, the previous character is the corresponding character from the input sequence, regardless of the probability output. In generation mode, the decoder samples a new character at each time step based on the output probabilities and uses it as input for the next time step.
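The two modes in Figure 3 differ only in which character is fed back at the next step. A toy sketch, assuming a made-up next-character probability table in place of a real decoder (which would also condition on the latent vector and its hidden state):

```python
import random

# Illustrative next-character model: probability of the next token given
# only the previous one. "^" marks start-of-sequence, "$" end-of-sequence.
NEXT_PROBS = {
    "^": {"C": 0.7, "N": 0.3},
    "C": {"C": 0.5, "N": 0.2, "$": 0.3},
    "N": {"C": 0.6, "$": 0.4},
}

def decode(target=None, max_len=10, rng=random):
    """Teacher forcing when `target` is given, free sampling otherwise."""
    prev, out = "^", []
    for t in range(max_len):
        chars, weights = zip(*NEXT_PROBS[prev].items())
        char = rng.choices(chars, weights=weights)[0]
        out.append(char)
        if char == "$":
            break
        # Training mode: feed the ground-truth character regardless of
        # what was sampled. Generation mode: feed the sampled character.
        prev = target[t] if target is not None else char
    return "".join(out)
```

In training mode the sampled output never contaminates the next step's input, which stabilizes learning; in generation mode the model must run on its own outputs.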
Figure 4
Learning process of an adversarial autoencoder. The encoder converts a molecule directly into a numerical representation. During training the output is not only fed into the decoder but also into a discriminator. The discriminator is trained to distinguish between the output of the encoder and a randomly sampled point from a prior distribution. The encoder is trained to “fool” the discriminator by mimicking the target prior distribution.
Figure 5
Different representations of 4‐(bromomethyl)‐1H‐pyrazole. Exemplary generation of the one‐hot representation derived from the SMILES. For simplicity only a reduced vocabulary is shown here, while in practice a larger vocabulary that covers all tokens present in the training data is used.
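The one-hot construction in Figure 5 can be sketched in a few lines. This uses a hypothetical reduced vocabulary and character-level tokenization for brevity (a real tokenizer treats multi-character tokens such as "Br" as single symbols, and the vocabulary covers every token in the training SMILES); pyridine serves as a simple stand-in molecule:

```python
# Hypothetical reduced vocabulary; illustrative only.
VOCAB = ["(", ")", "1", "=", "B", "C", "H", "N", "[", "]", "c", "n", "r"]

def one_hot(smiles):
    index = {ch: i for i, ch in enumerate(VOCAB)}
    matrix = []
    for ch in smiles:  # character-level tokenization for brevity
        row = [0] * len(VOCAB)
        row[index[ch]] = 1  # exactly one bit set per character
        matrix.append(row)
    return matrix

# Pyridine as a simple stand-in molecule.
matrix = one_hot("c1ccncc1")
```

Each row of the resulting matrix is a probability-like target for one position of the sequence, which is what the decoder's softmax output is trained against.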
Figure 6
Sampled structures at the latent vector corresponding to Celecoxib. The structures are sorted by the relative generation frequencies in descending order from left to right.
Figure 7
(a) Chemical similarity (Tanimoto, ECFP6) of generated structures to Celecoxib in relation to the distance in the latent space. (b) Fraction of valid SMILES generated during the reconstruction.
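The Tanimoto similarity used in Figures 7, 8 and 10 reduces to a set operation once fingerprints are represented as sets of "on" bit indices. A minimal sketch with made-up bit sets (producing real ECFP6 bits requires a cheminformatics toolkit such as RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    'on' bit indices: |intersection| / |union|."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

# Two hypothetical fingerprints sharing two of their four bits each:
sim = tanimoto({1, 5, 9, 12}, {5, 9, 20, 33})  # 2 / 6
```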
Figure 8
Results without Celecoxib in the training set. (a) Chemical similarity (Tanimoto, ECFP6) of generated structures to Celecoxib in relation to the distance in the latent space. (b) Fraction of valid SMILES generated during the reconstruction.
Figure 9
Searching for DRD2 active compounds using the Uniform AAE. The first 100 iterations are randomly sampled points while the next 500 iterations are determined by Bayesian optimization.
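The search strategy in Figure 9, random initialization followed by model-guided proposals, can be sketched with a toy surrogate in place of a real Bayesian optimization library. Everything here is illustrative: the objective stands in for predicted DRD2 activity at a latent point, and the acquisition is a crude nearest-neighbour heuristic, not a Gaussian-process expected improvement:

```python
import random

def score(z):
    # Stand-in for the expensive objective (e.g. predicted activity of
    # molecules decoded at latent point z); purely illustrative.
    return -sum((zi - 0.5) ** 2 for zi in z)

def suggest(history, rng, n_candidates=50):
    # Toy acquisition: among random candidates, favour points whose
    # nearest evaluated neighbour scored well (exploitation) plus a
    # small bonus for being far from evaluated points (exploration).
    def acq(z):
        d, s = min(
            (sum((a - b) ** 2 for a, b in zip(z, zp)) ** 0.5, sp)
            for zp, sp in history
        )
        return s + 0.1 * d
    cands = [[rng.random(), rng.random()] for _ in range(n_candidates)]
    return max(cands, key=acq)

rng = random.Random(0)
history = []
for _ in range(10):   # random initialization phase
    z = [rng.random(), rng.random()]
    history.append((z, score(z)))
for _ in range(20):   # model-guided phase
    z = suggest(history, rng)
    history.append((z, score(z)))
best_z, best_s = max(history, key=lambda p: p[1])
```

The paper's setup follows the same two-phase shape at larger scale: 100 random latent points, then 500 iterations proposed by the optimizer.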
Figure 10
Generated structures from BO compared to the nearest neighbour from the set of validated actives. The validated actives were not present in the training set of the autoencoder. The Tanimoto similarity is calculated using the ECFP6 fingerprint.
Figure 11
The relationship between the fraction of generated active compounds at specific latent points and the BO score. The fraction of generated actives is the number of actives divided by all 500 reconstruction attempts. The set “Random” corresponds to the randomly selected latent points.
