TurboPilot is a self-hosted copilot clone which uses the library behind llama.cpp to run the 6 Billion Parameter Salesforce Codegen model in 4GiB of RAM. It is heavily based on and inspired by the fauxpilot project.
NB: This is a proof of concept right now rather than a stable tool. Autocompletion is quite slow in this version of the project. Feel free to play with it, but your mileage may vary.
NEW: As of v0.0.5, TurboPilot supports CUDA inference, which greatly accelerates suggestions when working with longer prompts (i.e. longer existing code files).
PRs to this project and the corresponding GGML fork are very welcome.
Make a fork, make your changes and then open a PR.
The easiest way to try the project out is to grab the pre-processed models and then run the server in docker.
You have two options for getting the model:
You can download the pre-converted, pre-quantized models from Huggingface.
The `multi` flavour models can provide auto-complete suggestions for `C`, `C++`, `Go`, `Java`, `JavaScript`, and `Python`.
The `mono` flavour models can provide auto-complete suggestions for `Python` only (but the quality of Python-specific suggestions may be higher).
Pre-converted and pre-quantized models are available for download from here:
Model Name | RAM Requirement | Supported Languages | Direct Download | HF Project Link |
---|---|---|---|---|
CodeGen 350M multi | ~800MiB | C, C++, Go, Java, JavaScript, Python | ⬇️ | 🤗 |
CodeGen 350M mono | ~800MiB | Python | ⬇️ | 🤗 |
CodeGen 2B multi | ~4GiB | C, C++, Go, Java, JavaScript, Python | ⬇️ | 🤗 |
CodeGen 2B mono | ~4GiB | Python | ⬇️ | 🤗 |
CodeGen 6B multi | ~8GiB | C, C++, Go, Java, JavaScript, Python | ⬇️ | 🤗 |
CodeGen 6B mono | ~8GiB | Python | ⬇️ | 🤗 |
Follow this guide if you want to experiment with quantizing the models yourself.
Download the latest binary and extract it to the root project folder. If a binary is not provided for your OS, or you'd prefer to build it yourself, follow the build instructions.
Run:
```bash
./codegen-serve -m ./models/codegen-6B-multi-ggml-4bit-quant.bin
```
The application should start a server on port 18080.
If you have a multi-core system, you can control how many CPUs are used with the `-t` option. For example, on my AMD Ryzen 5000, which has 6 cores/12 threads, I use:
```bash
./codegen-serve -t 6 -m ./models/codegen-6B-multi-ggml-4bit-quant.bin
```
You can also run TurboPilot from the pre-built docker image supplied here.
You will still need to download the models separately, then you can run:
```bash
docker run --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL="/models/codegen-2B-multi-ggml-4bit-quant.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:latest
```
As of release v0.0.5, TurboPilot supports CUDA inference. In order to run the CUDA-enabled container you will need to have nvidia-docker enabled, use the cuda-tagged versions, and pass `--gpus=all` to docker to give it access to your GPU, like so:
```bash
docker run --gpus=all --rm -it \
  -v ./models:/models \
  -e THREADS=6 \
  -e MODEL="/models/codegen-2B-multi-ggml-4bit-quant.bin" \
  -p 18080:18080 \
  ghcr.io/ravenscroftj/turbopilot:v0.0.5-cuda
```
You will need CUDA 11 or later to run this container. You should be able to see `/app/codegen-serve` listed when you run `nvidia-smi`.
As of v0.0.5, a CUDA version of the Linux executable is available. It requires that libcublas 11 be installed on the machine. I might build Ubuntu debs at some point, but for now running in docker may be more convenient if you want to use a CUDA GPU.
Support for the official VS Code Copilot plugin is underway (see ticket #11). The API should now be broadly compatible with the OpenAI API.
To use the API from VSCode, I recommend the vscode-fauxpilot plugin. Once you install it, you will need to change a few settings in your settings.json file.
- Open settings (CTRL/CMD + SHIFT + P) and select `Preferences: Open User Settings (JSON)`
- Add the following values:
```json
{
    ... // other settings
    "fauxpilot.enabled": true,
    "fauxpilot.server": "http://localhost:18080/v1/engines",
}
```
Now you can enable fauxpilot with CTRL + SHIFT + P and select `Enable Fauxpilot`.
The plugin will send API calls to the running codegen-serve
process when you make a keystroke. It will then wait for each request to complete before sending further requests.
You can make requests to `http://localhost:18080/v1/engines/codegen/completions`, which will behave just like the equivalent Copilot endpoint.
For example:
```bash
curl --request POST \
  --url http://localhost:18080/v1/engines/codegen/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "codegen",
    "prompt": "def main():",
    "max_tokens": 100
  }'
```
Should get you something like this:
```json
{
  "choices": [
    {
      "logprobs": null,
      "index": 0,
      "finish_reason": "length",
      "text": "\n \"\"\"Main entry point for this script.\"\"\"\n logging.getLogger().setLevel(logging.INFO)\n logging.basicConfig(format=('%(levelname)s: %(message)s'))\n\n parser = argparse.ArgumentParser(\n description=__doc__,\n formatter_class=argparse.RawDescriptionHelpFormatter,\n epilog=__doc__)\n "
    }
  ],
  "created": 1681113078,
  "usage": {
    "total_tokens": 105,
    "prompt_tokens": 3,
    "completion_tokens": 102
  },
  "object": "text_completion",
  "model": "codegen",
  "id": "01d7a11b-f87c-4261-8c03-8c78cbe4b067"
}
```
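If you would rather call the endpoint from code than from curl, here is a minimal sketch using Python's `requests` library against the same endpoint and request fields shown above. The helper name, timeout, and error handling are illustrative additions of mine, not part of TurboPilot itself.

```python
import requests

# Minimal client for the completions endpoint demonstrated above.
# The generous timeout is because CPU-only suggestions can take tens of seconds.
def get_completion(prompt: str, max_tokens: int = 100) -> str:
    response = requests.post(
        "http://localhost:18080/v1/engines/codegen/completions",
        json={"model": "codegen", "prompt": prompt, "max_tokens": max_tokens},
        timeout=120,
    )
    response.raise_for_status()
    # The response mirrors the OpenAI-style completions shape shown above.
    return response.json()["choices"][0]["text"]

if __name__ == "__main__":
    prompt = "def main():"
    print(prompt + get_completion(prompt))
```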
Again, I want to set expectations: this is a proof-of-concept project. With that in mind, here are some current known limitations.
As of v0.0.2:
- The models can be quite slow - especially the 6B ones. It can take ~30-40s to make suggestions across 4 CPU cores.
- I've only tested the system on Ubuntu 22.04 but I am now supplying ARM docker images and soon I'll be providing ARM binary releases.
- Sometimes suggestions get truncated in nonsensical places - e.g. part way through a variable name or string. This is due to a hard limit of 2048 on the context length (prompt + suggestion); see the sketch after this list for one client-side way to budget around it.
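One client-side mitigation is to trim the prompt before sending it, so that the prompt plus the requested completion stays within the context window. The sketch below is a rough illustration only: it assumes roughly four characters per token as an estimate, which is not how the model's real tokenizer counts.

```python
# Rough, client-side prompt trimming to stay inside the 2048 context limit.
# CHARS_PER_TOKEN is a crude estimate, not the model's actual tokenizer.
CONTEXT_LIMIT = 2048
CHARS_PER_TOKEN = 4

def trim_prompt(prompt: str, max_tokens: int = 100) -> str:
    """Keep only the tail of the prompt so prompt + completion fit in the context."""
    budget_chars = (CONTEXT_LIMIT - max_tokens) * CHARS_PER_TOKEN
    # The most recent code is usually the most relevant context, so keep the end.
    return prompt[-budget_chars:] if len(prompt) > budget_chars else prompt
```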
- This project would not have been possible without Georgi Gerganov's work on GGML and llama.cpp
- It was completely inspired by fauxpilot, which I experimented with for a little while before deciding to try to make the models work without a GPU.
- The frontend of the project is powered by Venthe's vscode-fauxpilot plugin
- The project uses the Salesforce Codegen models.
- Thanks to Moyix for his work on converting the Salesforce models to run in a GPT-J architecture. Not only does this confer some speed benefits, but it also made it much easier for me to port the models to GGML using the existing gpt-j example code.
- The model server uses CrowCPP to serve suggestions.
- Check out the original scientific paper for CodeGen for more info.