A repository for various introductory tutorials on deep learning for chemistry.
NOTE: We have stripped all the code in this repo of comments and docstrings. Adding them back is left as an exercise for those new to machine learning in chemistry, since it is a good way to test whether you understand what the code is doing. Give it a go!
There is often a gap between computer science and chemistry coursework and research. Most classes now support Jupyter Notebooks or Google Colab environments that are simple to install and set up and often require running only small blocks of code. While these are useful and didactic, we find there is a missing course explaining how students can structure repositories for new research projects so that they can organize new experiments, try different model settings, and move quickly.
This repository is an opinionated attempt to show several ways to structure these repositories for basic tasks we expect any researcher at the intersection of machine learning and chemistry to implement. Specifically:
- Molecular property prediction with feed forward networks
- Molecular property prediction with graph neural networks
- Molecular generation with a SMILES LSTM
We recommend two ways to use this repository:
- Reattempting tasks
Consider attempting the tasks described from scratch, then compare your solution to ours.
- Adding documentation
We recognize that attempting these may be too time consuming for shorter onboarding periods. As an alternative, we provide versions of the code with no documentation at github.com/coleygroup/dl-chem-101-stripped. As a useful exercise, we recommend forking the repo, running the code, and then adding documentation to each function (i.e., docstrings). Such docstrings should specify:
- What the function / class does
- The type and shape of the inputs and outputs
- Any complex details within the function using inline comments
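As an illustration of the three points above, here is a minimal sketch of a documented function. The function `tanimoto_similarity` is a hypothetical example chosen for this sketch and is not taken from the repository; the docstring layout (Args / Returns / Raises) is one common convention, not a requirement.

```python
from typing import List


def tanimoto_similarity(fp_a: List[int], fp_b: List[int]) -> float:
    """Compute the Tanimoto similarity between two binary fingerprints.

    Hypothetical example function, used only to illustrate docstring style.

    Args:
        fp_a: Binary fingerprint of the first molecule, shape (n_bits,),
            with every entry equal to 0 or 1.
        fp_b: Binary fingerprint of the second molecule, shape (n_bits,).

    Returns:
        Similarity in [0, 1]; 0.0 by convention when both fingerprints
        have no bits set.

    Raises:
        ValueError: If the two fingerprints differ in length.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("Fingerprints must have the same length.")
    # Count bits set in both fingerprints (intersection) and in either (union).
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    # Two all-zero fingerprints have an empty union; avoid dividing by zero.
    return both / either if either else 0.0
```

The docstring states what the function does and the type and shape of each input and output, while the inline comments cover the one non-obvious detail (the empty-union case).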
- How to structure an ML-for-chemistry repository
- How to launch experiments for various model parameters and configurations
- How to separate analysis from model predictions
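One possible way to launch experiments over several model settings is to expand a hyperparameter grid into one command per configuration, each writing to its own output directory. The sketch below assumes a training script called `train.py` that accepts flags such as `--hidden-size` and `--save-dir`; these names are hypothetical and are not necessarily what this repository uses.

```python
import itertools
import shlex
from typing import Dict, List, Sequence


def build_commands(script: str, grid: Dict[str, Sequence], save_root: str) -> List[str]:
    """Expand a hyperparameter grid into one shell command per configuration."""
    names = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[name] for name in names)):
        # Encode the settings in the output directory so results stay traceable.
        tag = "_".join(f"{n}-{v}" for n, v in zip(names, values))
        args = [
            f"--{n.replace('_', '-')} {shlex.quote(str(v))}"
            for n, v in zip(names, values)
        ]
        out = shlex.quote(f"{save_root}/{tag}")
        commands.append(" ".join(["python", script, *args, "--save-dir", out]))
    return commands


# Hypothetical script name and hyperparameters, for illustration only.
grid = {"hidden_size": [64, 256], "learning_rate": [1e-3, 1e-4], "seed": [0]}
commands = build_commands("train.py", grid, "results")
for cmd in commands:
    print(cmd)
```

Generating the commands as plain strings keeps the launcher independent of where they run: the same list can be executed locally one at a time or written into a cluster submission script.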
For those interested in attempting this on their own before viewing our solutions and structures, we provide the following guiding prompts and references.
In this repository, we will use a feed-forward neural network (FFN) to predict a molecular property relevant to drug discovery, Caco-2 cell permeability, from molecular fingerprints (originally demonstrated in Wang et al. (2016)).
We use data available for download via the Therapeutics Data Commons (TDC) (original paper introducing the TDC from Huang et al. (2021)).
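To make the shape of this task concrete, the following is a dependency-free sketch of the pipeline: a molecule is turned into a fixed-length bit vector, which a feed-forward network maps to a single predicted value. Two things here are assumptions made for illustration: `toy_fingerprint` hashes SMILES substrings and is only a stand-in for a real molecular fingerprint, and `TinyFFN` is an untrained, pure-Python network rather than the model used in this repository. Training and data loading are omitted.

```python
import math
import random
import zlib
from typing import List


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> List[float]:
    """Hash SMILES substrings into a bit vector (stand-in for a real fingerprint)."""
    bits = [0.0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            bits[zlib.crc32(fragment.encode()) % n_bits] = 1.0
    return bits


class TinyFFN:
    """One hidden ReLU layer followed by a scalar linear output (regression)."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(n_in)
        self.w1 = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-scale, scale) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x: List[float]) -> float:
        # Hidden layer: affine transform followed by ReLU.
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Output layer: a single linear unit for the predicted property.
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


fingerprint = toy_fingerprint("CCO")
model = TinyFFN(n_in=len(fingerprint), n_hidden=16)
prediction = model.forward(fingerprint)
print(f"Untrained prediction for CCO: {prediction:.3f}")
```

Because the weights are random, the printed value carries no chemical meaning; the point is only the data flow from molecule to vector to scalar.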
This repository repeats the above task but uses graph neural networks that operate on molecular graphs directly, rather than on vectorized fingerprints.
Some foundational papers in graph neural network development for property prediction are Gilmer et al. (2017) and Duvenaud et al. (2015).
Several groups have compared performance between graph and fingerprint-based neural networks (e.g., MoleculeNet (Wu et al. (2017)) and ChemProp (Yang et al. (2019))).
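To show what operating on a molecular graph directly can look like, here is a minimal sketch of one round of neighbor aggregation followed by a sum readout. It is a simplification under stated assumptions: atoms carry one-hot element features, bonds are untyped, and there are no learned weights, whereas the models in the papers above apply learned transformations at each step. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def message_passing_step(
    node_feats: Sequence[Vector], bonds: Sequence[Tuple[int, int]]
) -> List[Vector]:
    """Add to each atom's features the sum of its bonded neighbors' features."""
    # Start each atom's update from its own current features.
    updated = [list(feat) for feat in node_feats]
    for i, j in bonds:
        # Bonds are undirected, so messages flow in both directions.
        for k, value in enumerate(node_feats[j]):
            updated[i][k] += value
        for k, value in enumerate(node_feats[i]):
            updated[j][k] += value
    return updated


def sum_readout(node_feats: Sequence[Vector]) -> Vector:
    """Pool atom features into one molecule-level vector by summing."""
    return [sum(column) for column in zip(*node_feats)]


# Ethanol (CCO): atoms 0 and 1 are carbon, atom 2 is oxygen.
# Features are one-hot over [C, O].
atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bonds = [(0, 1), (1, 2)]

hidden = message_passing_step(atoms, bonds)
molecule_vector = sum_readout(hidden)
print(hidden)
print(molecule_vector)
```

After one step, each atom's vector reflects its immediate neighborhood; repeating the step widens that neighborhood, and the readout gives a fixed-length vector that a feed-forward head could map to a property.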
In this repository, we will go through the process of training a SMILES long short-term memory (LSTM) network for molecular design tasks. At a high level, the model "sees" examples of valid molecular SMILES strings and learns to generate new strings from the same distribution by progressively predicting the next token in the string. These models have a long history in natural language processing, where neural networks are trained to complete sentences given a set of starting words.
We recommend reviewing both Segler et al. (2018) and Bjerrum (2017), two of the earliest examples of such models.
Code for this example was adapted from the SMILES LSTM implementation in the Molecular AI REINVENT repository and structured as a stand-alone package.
Here, we train only on a smaller 50K subset of SMILES strings from the ZINC dataset available from the TDC. We also show how to run our model training script both on a local GPU and on an MIT/Lincoln Lab-specific cluster, SuperCloud (using the Slurm-based LLSub system).
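The next-token objective described for the SMILES LSTM can be illustrated without any model at all: a SMILES string is split into tokens, and every prefix becomes an input whose target is the token that follows it. The sketch below is an assumption-laden simplification. The regular expression handles only bracket atoms, `Cl`, `Br`, and two-digit ring closures as multi-character tokens, and the start and end markers are hypothetical names; neither is claimed to match the vocabulary used in this repository or in REINVENT.

```python
import re
from typing import List, Tuple

# Multi-character tokens must be tried before the single-character fallback.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Hypothetical start and end markers, for illustration only.
START, END = "<s>", "</s>"


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens with a simplified regex."""
    return TOKEN_PATTERN.findall(smiles)


def next_token_pairs(smiles: str) -> List[Tuple[List[str], str]]:
    """Return (prefix, next token) pairs covering the whole string."""
    tokens = [START] + tokenize(smiles) + [END]
    # Each prefix is the context; the token right after it is the target.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


acetyl_chloride_tokens = tokenize("CC(=O)Cl")
pairs = next_token_pairs("CCl")
for prefix, target in pairs:
    print(prefix, "->", target)
```

During training, a model would be asked to assign high probability to each target given its prefix; at generation time, it would start from the start marker and sample tokens until it emits the end marker.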