One of the first jobs of somebody entering MLOps is to convert their manual scripts or notebooks into pipelines that can be deployed on the cloud. This work is tedious and time-consuming. For example, one has to think about:
- Breaking the code down into step functions
- Type-annotating the steps properly
- Connecting the steps together in a pipeline
- Creating the appropriate YAML files to configure the pipeline
- Developing a Dockerfile or equivalent to encapsulate the environment
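To make this concrete, here is a minimal sketch of what the end result typically looks like in recent ZenML versions. The step logic is a stand-in; the point is the decorators, type annotations, and wiring:

```python
from zenml import pipeline, step


@step
def load_data() -> dict:
    """Load a toy dataset; a real step would read from your data source."""
    return {"features": [[1.0], [2.0]], "labels": [0, 1]}


@step
def train_model(data: dict) -> float:
    """Stand-in for training; returns a dummy score."""
    return float(len(data["labels"]))


@pipeline
def my_pipeline():
    # Steps are connected by passing outputs as inputs.
    data = load_data()
    train_model(data)


if __name__ == "__main__":
    my_pipeline()  # runs on the active ZenML stack
```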
Frameworks like ZenML go a long way toward alleviating this burden by abstracting much of the complexity away. However, recent advances in Large Language Model-based copilots offer hope that even the more repetitive aspects of this task can be automated.
Unfortunately, most models, whether open source or proprietary ones like GitHub Copilot, often lag behind the most recent versions of ML libraries, and therefore produce erroneous or outdated syntax when given even simple prompts.
The goal of this project is to fine-tune an open-source LLM so that it outperforms off-the-shelf solutions at producing correct output for the latest version of ZenML.
For the purpose of this project, we are going to be leveraging the excellent work of Sourab Mangrulkar and Sayak Paul, who fine-tuned the StarCoder model on the latest version of the HuggingFace codebase. They summarized their work in this blog post on HuggingFace.
Our data generation pipeline is based on the codegen repository, and the training pipeline is based on this script. All credit to Sourab and Sayak for putting this work together!
The work presented in this repository can easily be extended to codebases and use cases beyond ML engineering. You can easily modify the pipelines to point to other private codebases and train a personal copilot on your own code!
See the data generation pipeline as a starting point.
Now, we could take the code above and run it as scripts on some chosen ZenML repositories. But just to make it a bit more fun, we're going to build ZenML pipelines to achieve this task!
That way we write ZenML pipelines to train a model that can produce ZenML pipelines 🐍. Sounds fun.
Specifically, we aim to create three pipelines:
- The data generation pipeline (here) that scrapes a chosen set of GitHub repositories using the latest ZenML syntax, and pushes the resulting dataset to HuggingFace.
- The training pipeline (here) that loads the dataset from the previous pipeline and launches a training job on a cloud provider to train the model.
- The deployment pipeline (here) that deploys the model to HuggingFace Inference Endpoints.
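To give a feel for the first pipeline's shape, here is a hedged sketch; the step names and bodies are illustrative stand-ins, not the repository's actual implementation (see the linked pipeline for that). Pointing it at different repositories is just a matter of changing `repo_urls`:

```python
import subprocess
from pathlib import Path

from datasets import Dataset
from zenml import pipeline, step


@step
def mirror_repositories(repo_urls: list) -> str:
    """Shallow-clone the chosen repositories into a local mirror directory."""
    mirror_dir = Path("cloned_repos")
    mirror_dir.mkdir(exist_ok=True)
    for url in repo_urls:
        subprocess.run(["git", "clone", "--depth", "1", url], cwd=mirror_dir, check=False)
    return str(mirror_dir)


@step
def prepare_dataset(mirror_dir: str) -> Dataset:
    """Collect the Python sources into a HuggingFace dataset."""
    files = list(Path(mirror_dir).rglob("*.py"))
    return Dataset.from_dict(
        {
            "repo_path": [str(f) for f in files],
            "content": [f.read_text(errors="ignore") for f in files],
        }
    )


@step
def push_to_hub(dataset: Dataset, dataset_id: str) -> None:
    """Push the dataset to the HuggingFace Hub (requires HF credentials)."""
    dataset.push_to_hub(dataset_id)


@pipeline
def generate_code_dataset(repo_urls: list, dataset_id: str):
    mirror_dir = mirror_repositories(repo_urls)
    dataset = prepare_dataset(mirror_dir)
    push_to_hub(dataset, dataset_id)
```

Passing a `Dataset` between steps assumes a materializer for it is available (ZenML's huggingface integration provides one).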
The three pipelines can be run using the CLI:
```shell
# Data generation
python run.py --feature-engineering --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --feature-engineering --config generate_code_dataset.yaml

# Training
python run.py --training-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --training-pipeline --config finetune_gcp.yaml

# Deployment
python run.py --deployment-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --deployment-pipeline --config deployment_a100.yaml
```
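For reference, here is a hypothetical sketch of how `run.py` might map these flags onto the pipelines; the actual script may be organized differently, and the `pipelines` import path is assumed:

```python
import click

# Assumed import path; the repository's actual module layout may differ.
from pipelines import deployment_pipeline, generate_code_dataset, training_pipeline


@click.command()
@click.option("--feature-engineering", is_flag=True)
@click.option("--training-pipeline", "training", is_flag=True)
@click.option("--deployment-pipeline", "deployment", is_flag=True)
@click.option("--config", required=True, help="Name of a YAML file in configs/")
def main(feature_engineering: bool, training: bool, deployment: bool, config: str):
    config_path = f"configs/{config}"
    if feature_engineering:
        generate_code_dataset.with_options(config_path=config_path)()
    elif training:
        training_pipeline.with_options(config_path=config_path)()
    elif deployment:
        deployment_pipeline.with_options(config_path=config_path)()


if __name__ == "__main__":
    main()
```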
The `feature_engineering` and `deployment` pipelines can be run with the `default` stack, but the stack used by the training pipeline depends on the config. The `deployment` pipeline requires the `training_pipeline` to have run beforehand.
We have created a custom ZenML model deployer for deploying models to HuggingFace Inference Endpoints. The code for the custom deployer lives in the `huggingface` folder.
To run the deployment pipeline, we create a custom ZenML stack. Since we are using a custom model deployer, we have to register both the flavor and the model deployer, and then update the stack to use them.
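For orientation, here is a hedged sketch of the general shape such a flavor class takes in ZenML; the real implementation is the one in the `huggingface` folder, and the import path for the deployer class below is assumed:

```python
from zenml.model_deployers.base_model_deployer import (
    BaseModelDeployerConfig,
    BaseModelDeployerFlavor,
)


class HuggingFaceModelDeployerConfig(BaseModelDeployerConfig):
    """Config fields mirror the CLI flags used at registration time."""

    token: str
    namespace: str


class HuggingFaceModelDeployerFlavor(BaseModelDeployerFlavor):
    @property
    def name(self) -> str:
        return "hfendpoint"

    @property
    def config_class(self):
        return HuggingFaceModelDeployerConfig

    @property
    def implementation_class(self):
        # Assumed module path: the actual deployer class (a BaseModelDeployer
        # subclass that talks to the Inference Endpoints API) lives in the
        # repo's `huggingface` folder.
        from huggingface.hf_model_deployer import HuggingFaceModelDeployer

        return HuggingFaceModelDeployer
```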
```shell
zenml init
zenml stack register zencoder_hf_stack -o default -a default
zenml stack set zencoder_hf_stack
export HUGGINGFACE_USERNAME=<here>
export HUGGINGFACE_TOKEN=<here>
export NAMESPACE=<here>
zenml secret create huggingface_creds --username=$HUGGINGFACE_USERNAME --token=$HUGGINGFACE_TOKEN
zenml model-deployer flavor register huggingface.hf_model_deployer_flavor.HuggingFaceModelDeployerFlavor
```
Afterward, you should see the new flavor in the list of available flavors:
```shell
zenml model-deployer flavor list
```
Register the model deployer component and add it to the current stack:
```shell
zenml model-deployer register hfendpoint --flavor=hfendpoint --token=$HUGGINGFACE_TOKEN --namespace=$NAMESPACE
zenml stack update zencoder_hf_stack -d hfendpoint
```
Run the deployment pipeline using the CLI:
```shell
# Deployment
python run.py --deployment-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --deployment-pipeline --config deployment_a100.yaml
```
A working prototype was trained and deployed as of Jan 19, 2024. The model was fine-tuned on minimal data using QLoRA and PEFT, with a single A100 GPU on the cloud:
- Training dataset Link
- PEFT Model Link
- Fully merged model (Ready to deploy on HuggingFace Inference Endpoints) Link
The Weights & Biases logs for the latest training runs are available here: Link
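Since the prototype was fine-tuned with QLoRA and PEFT, here is a minimal, hedged sketch of what such a setup looks like using `peft` with 4-bit quantization via `bitsandbytes`; the hyperparameters and target modules below are illustrative, not the values from the actual run:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",  # base model; the project builds on StarCoder
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters on top of the frozen 4-bit weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj", "q_attn"],  # illustrative for StarCoder
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```

After training, the adapters can be merged back into the base model to produce a fully merged checkpoint like the one linked above.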
ZenML Pro was used to manage the pipelines, models, and deployments. Here are some screenshots of the process:
This project recently put out a call for volunteers. The TODO list below can serve as a source of collaboration. If you want to work on any of the following, please create an issue on this repository and assign it to yourself!
- Create a functioning data generation pipeline (initial dataset with the core ZenML repo scraped and pushed here).
- Deploy the model on a HuggingFace Inference Endpoint and use it in the VS Code extension using a deployment pipeline.
- Create a functioning training pipeline.
- Curate a set of 5-10 repositories that use the latest ZenML syntax and use the data generation pipeline to push the dataset to HuggingFace.
- Create a Dockerfile for the training pipeline with all requirements installed, including ZenML, torch, CUDA etc. Currently I am having trouble creating this in the config file; it might make sense to create a Docker image with the right CUDA version and requirements, including ZenML. See here: https://sdkdocs.zenml.io/0.54.0/integration_code_docs/integrations-aws/#zenml.integrations.aws.flavors.sagemaker_step_operator_flavor.SagemakerStepOperatorSettings
- Test the trained model on various metrics.
- Create a custom model deployer that deploys a HuggingFace model from the Hub to a HuggingFace Inference Endpoint. This would involve creating a custom model deployer and editing the deployment pipeline accordingly.
While the work here is solely based on the task of finetuning the model for the ZenML library, the pipeline can be changed with minimal effort to point to any set of repositories on GitHub. Theoretically, one could extend this work to point to proprietary codebases to learn from them for any use-case.
For example, see how VMWare fine-tuned StarCoder to learn their style.
Also, make sure to join our Slack Community to become part of the ZenML family!