Skip to content

adamoyoung/chemnlp

 
 

Repository files navigation

Contributor Covenant

ChemNLP project 🧪🚀

The ChemNLP project aims to

  1. create an extensive chemistry dataset and
  2. use it to train large language models (LLMs) that can leverage the data for a wide range of chemistry applications.

For more details see our information material section below.

Information material

Community

Feel free to join our #chemnlp channel on our OpenBioML discord server to start the discussion in more detail.

Contributing

ChemNLP is an open-source project - your involvement is warmly welcome! If you're excited to join us, we recommend the following steps:

Note on the "ChemNLP" name

Our OpenBioML ChemNLP project is not afiliated to the ChemNLP library from NIST and we use "ChemNLP" as a general term to highlight our project focus. The datasets and models we create through our project will have a unique and recognizable name when we release them.

About OpenBioML.org

See https://openbioml.org, especially our approach and partners.

Installation and set-up

Create a new conda environment with Python 3.8:

conda create -n chemnlp python=3.8
conda activate chemnlp

To install the chemnlp package (and required dependencies):

pip install chemnlp

If working on developing the python package:

pip install -e "chemnlp[dev]"  # to install development dependencies

If extra dependencies are required (e.g. for dataset creation) but are not needed for the main package please add to the pyproject.toml in the dataset_creation variable and ensure this is reflected in the conda.yml file.

Then, please run

pre-commit install

to install the pre-commit hooks. These will automatically format and lint your code upon every commit. There might be some warnings, e.g., by flake8. If you struggle with them, do not hestiate to contact us.

Note

If working on model training, request access to the wandb project chemnlp and log-in to wandb with your API key per here.

Adding a new dataset (to the model training pipline)

We specify datasets by creating a new function here which is named per the dataset on Hugging Face. At present the function must accept a tokenizer and return back the tokenized train and validation datasets.

Installing submodules

In order to ensure you also clone and install the required submodules (i.e. gpt-neox) you will have to do one of the following;

  • Recursively clone the submodule from GitHub

     # using ssh (if you have your ssh key on GitHub)
    git clone --recurse-submodules --remote-submodules git@github.com:OpenBioML/chemnlp.git
    
     # using https (if you use personal access token)
    git clone --recurse-submodules --remote-submodules [git@github.com:OpenBioML/chemnlp.git ](https://github.com/OpenBioML/chemnlp.git)
    

    This will automatically initialize and update each submodule in the repository, including nested submodules if any of the submodules in the repository have submodules themselve

  • Initialise and install the submodule after cloning

    git submodule init # registers submodule
    git submodule update # clones and updates submodule
    

Experiments

Follow the guidelines here for more information about running experiments on the Stability AI cluster.

About

ChemNLP project

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 56.8%
  • Jupyter Notebook 41.1%
  • Shell 2.1%