Generative Pre-Training from Molecules

An autoregressive transformer language model for drug discovery, pretrained on a large SMILES corpus and evaluated on molecular property prediction and low-data de novo design tasks.

Installation

Clone the repository, then set up conda and create a new environment from environment.yml (if needed, edit the file first for GPU compatibility).

git clone https://github.com/sanjaradylov/smiles-gpt.git
cd smiles-gpt
conda env create -f environment.yml
conda activate smiles-gpt

Benchmark

Notebooks

notebooks/language-modeling.ipynb pretrains GPT-2 on 10M PubChem SMILES strings.
notebooks/selfies-anygpt introduces AnyGPT for pretraining on 1D molecular data.

Checkpoints

checkpoints/ stores the serialized model, tokenizer, and configuration. Do not modify them. Use the from_pretrained method to load the HuggingFace objects, for example:

from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

checkpoint = "checkpoints/benchmark-5m"

config = GPT2Config.from_pretrained(checkpoint)
model = GPT2LMHeadModel.from_pretrained(checkpoint)
tokenizer = PreTrainedTokenizerFast.from_pretrained(checkpoint)

Data

data stores the Blood-Brain Barrier Penetration (BBBP) classification dataset and a 10K subset of ChemBERTa's PubChem-10M. See Examples.

Output

output stores generated SMILES strings.
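
A quick way to inspect generated strings is to count how many are distinct. The following is a minimal sketch, not part of the repository: the helper name summarize_generated is hypothetical, and it assumes the strings are available one per line, which may not match the actual layout of the files in output.

```python
def summarize_generated(lines):
    """Drop blank lines and report how many distinct strings were generated."""
    smiles = [line.strip() for line in lines if line.strip()]
    # dict.fromkeys deduplicates while keeping first-seen order.
    unique = list(dict.fromkeys(smiles))
    fraction = len(unique) / len(smiles) if smiles else 0.0
    return unique, fraction


# Illustrative strings only, not actual model output.
generated = ["CCO\n", "CCO\n", "c1ccccc1\n", "\n", "CC(=O)O\n"]
unique, fraction = summarize_generated(generated)
print(unique, fraction)
```

This compares strings literally, so two different SMILES spellings of the same molecule count as distinct; canonicalizing with a cheminformatics toolkit first would give a stricter uniqueness figure.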

Examples

Adapter training for molecular property prediction (replace the data/bbbp.csv and p_np arguments with your dataset and task name(s), respectively):

python3 scripts/classification.py checkpoints/benchmark-5m data/bbbp.csv p_np
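
Before pointing the script at your own dataset, it can help to confirm that the CSV contains the columns you intend to pass as task names. The sketch below is an illustration under stated assumptions rather than part of the repository: check_dataset is a hypothetical helper, and the SMILES column name "smiles" is assumed, so adjust it to whatever your file and the script actually use.

```python
import csv
import io


def check_dataset(csv_file, task_names, smiles_column="smiles"):
    """Verify the required columns exist and return the number of data rows.

    smiles_column="smiles" is an assumed name, not taken from the repository.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in [smiles_column, *task_names] if name not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    return sum(1 for _ in reader)


# Tiny in-memory stand-in for a real CSV file; rows and labels are made up.
sample = io.StringIO("smiles,p_np\nCCO,1\nc1ccccc1,0\n")
n_rows = check_dataset(sample, ["p_np"])
print(n_rows)
```

For a file on disk, open it with open(path, newline="") and pass the file object together with the same task names you give on the command line.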

For language model pretraining, see notebooks.

Citation

If you use smiles-gpt in your research, please consider citing:

https://doi.org/10.33774/chemrxiv-2021-5fwjd
