One of the first jobs of somebody entering MLOps is to convert their manual scripts or notebooks into pipelines that can be deployed on the cloud. This work is tedious and time-consuming. For example, one has to think about:
- Breaking the code down into step functions
- Type-annotating the steps properly
- Connecting the steps together in a pipeline
- Creating the appropriate YAML files to configure the pipeline
- Developing a Dockerfile or equivalent to encapsulate the environment
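To make this concrete, here is a minimal sketch of what the end result typically looks like in recent ZenML versions. The step logic is a stand-in; the point is the decorators, type annotations, and wiring:

```python
from zenml import pipeline, step


@step
def load_data() -> dict:
    """Load a toy dataset; a real step would read from your data source."""
    return {"features": [[1.0], [2.0]], "labels": [0, 1]}


@step
def train_model(data: dict) -> float:
    """Stand-in for training; returns a dummy score."""
    return float(len(data["labels"]))


@pipeline
def my_pipeline():
    # Steps are connected by passing outputs as inputs.
    data = load_data()
    train_model(data)


if __name__ == "__main__":
    my_pipeline()  # runs on the active ZenML stack
```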
Frameworks like ZenML go a long way toward alleviating this burden by abstracting much of the complexity away. However, recent advances in Large Language Model-based copilots offer hope that even the more repetitive aspects of this task can be automated.
Unfortunately, most models, whether open source or proprietary ones like GitHub Copilot, often lag behind the most recent versions of ML libraries, and therefore produce erroneous or outdated syntax when given even simple prompts.
The goal of this project is to fine-tune an open-source LLM so that it outperforms off-the-shelf solutions at producing correct output for the latest version of ZenML.
For the purpose of this project, we are going to be leveraging the excellent work of Sourab Mangrulkar and Sayak Paul, who fine-tuned the StarCoder model on the latest version of the HuggingFace codebase. They summarized their work in this blog post on HuggingFace.
Our data generation pipeline is based on the codegen repository, and the training pipeline is based on this script. All credit to Sourab and Sayak for putting this work together!
The work presented in this repository can easily be extended to codebases and use cases beyond ML engineering. You can easily modify the pipelines to point to other private codebases and train a personal copilot on your own code!
See the data generation pipeline as a starting point.
Now, we could take the code above and run it as scripts on some chosen ZenML repositories. But just to make it a bit more fun, we're going to build ZenML pipelines to achieve this task!
That way we write ZenML pipelines to train a model that can produce ZenML pipelines 🐍. Sounds fun.
Specifically, we aim to create three pipelines:
- The data generation pipeline (here) that scrapes a chosen set of GitHub repositories using the latest ZenML syntax, and pushes the resulting dataset to HuggingFace.
- The training pipeline (here) that loads the dataset from the previous pipeline and launches a training job on a cloud provider to train the model.
- The deployment pipeline (here) that deploys the model to HuggingFace Inference Endpoints.
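To give a feel for the first pipeline's shape, here is a hedged sketch; the step names and bodies are illustrative stand-ins, not the repository's actual implementation (see the linked pipeline for that). Pointing it at different repositories is just a matter of changing `repo_urls`:

```python
import subprocess
from pathlib import Path

from datasets import Dataset
from zenml import pipeline, step


@step
def mirror_repositories(repo_urls: list) -> str:
    """Shallow-clone the chosen repositories into a local mirror directory."""
    mirror_dir = Path("cloned_repos")
    mirror_dir.mkdir(exist_ok=True)
    for url in repo_urls:
        subprocess.run(["git", "clone", "--depth", "1", url], cwd=mirror_dir, check=False)
    return str(mirror_dir)


@step
def prepare_dataset(mirror_dir: str) -> Dataset:
    """Collect the Python sources into a HuggingFace dataset."""
    files = list(Path(mirror_dir).rglob("*.py"))
    return Dataset.from_dict(
        {
            "repo_path": [str(f) for f in files],
            "content": [f.read_text(errors="ignore") for f in files],
        }
    )


@step
def push_to_hub(dataset: Dataset, dataset_id: str) -> None:
    """Push the dataset to the HuggingFace Hub (requires HF credentials)."""
    dataset.push_to_hub(dataset_id)


@pipeline
def generate_code_dataset(repo_urls: list, dataset_id: str):
    mirror_dir = mirror_repositories(repo_urls)
    dataset = prepare_dataset(mirror_dir)
    push_to_hub(dataset, dataset_id)
```

Passing a `Dataset` between steps assumes a materializer for it is available (ZenML's huggingface integration provides one).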
The three pipelines can be run using the CLI:
```shell
# Data generation
python run.py --feature-engineering --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --feature-engineering --config generate_code_dataset.yaml

# Training
python run.py --training-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --training-pipeline --config finetune_gcp.yaml

# Deployment
python run.py --deployment-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --deployment-pipeline --config deployment_a100.yaml
```
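For reference, here is a hypothetical sketch of how `run.py` might map these flags onto the pipelines; the actual script may be organized differently, and the `pipelines` import path is assumed:

```python
import click

# Assumed import path; the repository's actual module layout may differ.
from pipelines import deployment_pipeline, generate_code_dataset, training_pipeline


@click.command()
@click.option("--feature-engineering", is_flag=True)
@click.option("--training-pipeline", "training", is_flag=True)
@click.option("--deployment-pipeline", "deployment", is_flag=True)
@click.option("--config", required=True, help="Name of a YAML file in configs/")
def main(feature_engineering: bool, training: bool, deployment: bool, config: str):
    config_path = f"configs/{config}"
    if feature_engineering:
        generate_code_dataset.with_options(config_path=config_path)()
    elif training:
        training_pipeline.with_options(config_path=config_path)()
    elif deployment:
        deployment_pipeline.with_options(config_path=config_path)()


if __name__ == "__main__":
    main()
```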
The `feature_engineering` and `deployment` pipelines can be run with the `default` stack, but the stack used by the training pipeline depends on the config. The `deployment` pipeline requires the `training_pipeline` to have run beforehand.
We have created a custom ZenML model deployer for deploying models to HuggingFace Inference Endpoints. The code for the custom deployer lives in the `huggingface` folder.
To run the deployment pipeline, we create a custom ZenML stack. Since we are using a custom model deployer, we have to register both the flavor and the model deployer, and then update the stack to use them.
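For orientation, here is a hedged sketch of the general shape such a flavor class takes in ZenML; the real implementation is the one in the `huggingface` folder, and the import path for the deployer class below is assumed:

```python
from zenml.model_deployers.base_model_deployer import (
    BaseModelDeployerConfig,
    BaseModelDeployerFlavor,
)


class HuggingFaceModelDeployerConfig(BaseModelDeployerConfig):
    """Config fields mirror the CLI flags used at registration time."""

    token: str
    namespace: str


class HuggingFaceModelDeployerFlavor(BaseModelDeployerFlavor):
    @property
    def name(self) -> str:
        return "hfendpoint"

    @property
    def config_class(self):
        return HuggingFaceModelDeployerConfig

    @property
    def implementation_class(self):
        # Assumed module path: the actual deployer class (a BaseModelDeployer
        # subclass that talks to the Inference Endpoints API) lives in the
        # repo's `huggingface` folder.
        from huggingface.hf_model_deployer import HuggingFaceModelDeployer

        return HuggingFaceModelDeployer
```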
```shell
zenml init
zenml stack register zencoder_hf_stack -o default -a default
zenml stack set zencoder_hf_stack
export HUGGINGFACE_USERNAME=<here>
export HUGGINGFACE_TOKEN=<here>
export NAMESPACE=<here>
zenml secret create huggingface_creds --username=$HUGGINGFACE_USERNAME --token=$HUGGINGFACE_TOKEN
zenml model-deployer flavor register huggingface.hf_model_deployer_flavor.HuggingFaceModelDeployerFlavor
```
Afterward, you should see the new flavor in the list of available flavors:
```shell
zenml model-deployer flavor list
```
Register the model deployer component and add it to the current stack:
```shell
zenml model-deployer register hfendpoint --flavor=hfendpoint --token=$HUGGINGFACE_TOKEN --namespace=$NAMESPACE
zenml stack update zencoder_hf_stack -d hfendpoint
```
Run the deployment pipeline using the CLI:
```shell
# Deployment
python run.py --deployment-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --deployment-pipeline --config deployment_a100.yaml
```
A working prototype was trained and deployed as of Jan 19, 2024. The model was fine-tuned on minimal data using QLoRA and PEFT, with a single A100 GPU on the cloud:
- Training dataset Link
- PEFT Model Link
- Fully merged model (Ready to deploy on HuggingFace Inference Endpoints) Link
The Weights & Biases logs for the latest training runs are available here: Link
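Since the prototype was fine-tuned with QLoRA and PEFT, here is a minimal, hedged sketch of what such a setup looks like using `peft` with 4-bit quantization via `bitsandbytes`; the hyperparameters and target modules below are illustrative, not the values from the actual run:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",  # base model; the project builds on StarCoder
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters on top of the frozen 4-bit weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn", "c_proj", "q_attn"],  # illustrative for StarCoder
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```

After training, the adapters can be merged back into the base model to produce a fully merged checkpoint like the one linked above.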
ZenML Pro was used to manage the pipelines, models, and deployments. Here are some screenshots of the process:
This project recently put out a call for volunteers. The TODO list below can serve as a source of collaboration. If you want to work on any of the following, please create an issue on this repository and assign it to yourself!
- Create a functioning data generation pipeline (initial dataset with the core ZenML repo scraped and pushed here).
- Deploy the model on a HuggingFace Inference Endpoint and use it in the VS Code extension using a deployment pipeline.
- Create a functioning training pipeline.
- Curate a set of 5-10 repositories that use the latest ZenML syntax and use the data generation pipeline to push the dataset to HuggingFace.
- Create a Dockerfile for the training pipeline with all requirements installed, including ZenML, torch, CUDA etc. Currently I am having trouble creating this in the config file; it might make sense to create a Docker image with the right CUDA version and requirements, including ZenML. See here: https://sdkdocs.zenml.io/0.54.0/integration_code_docs/integrations-aws/#zenml.integrations.aws.flavors.sagemaker_step_operator_flavor.SagemakerStepOperatorSettings
- Test the trained model on various metrics.
- Create a custom model deployer that deploys a HuggingFace model from the Hub to a HuggingFace Inference Endpoint. This would involve creating a custom model deployer and editing the deployment pipeline accordingly.
While the work here is solely based on the task of finetuning the model for the ZenML library, the pipeline can be changed with minimal effort to point to any set of repositories on GitHub. Theoretically, one could extend this work to point to proprietary codebases to learn from them for any use-case.
For example, see how VMWare fine-tuned StarCoder to learn their style.
Also, make sure to join our Slack Community to become part of the ZenML family!