OpenLLM helps developers run any open-source LLMs, such as Llama 2 and Mistral, as OpenAI-compatible API endpoints, locally and in the cloud, optimized for serving throughput and production deployment.
- 🚂 Support a wide range of open-source LLMs including LLMs fine-tuned with your own data
- ⛓️ OpenAI-compatible API endpoints for seamless transition from your LLM app to open-source LLMs
- 🔥 State-of-the-art serving and inference performance
- 🎯 Simplified cloud deployment via BentoML
For starters, we provide two ways to quickly try out OpenLLM:
Try this OpenLLM tutorial in Google Colab: Serving Phi 3 with OpenLLM.
The following provides instructions for how to get started with OpenLLM locally.
You have installed Python 3.9 (or later) and pip. We highly recommend using a virtual environment to prevent package conflicts.
Install OpenLLM by using pip as follows:
pip install openllm
To verify the installation, run:
$ openllm -h
OpenLLM allows you to quickly spin up an LLM server using openllm start. For example, to start a Phi-3 server, run the following:
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
To interact with the server, you can visit the web UI at http://0.0.0.0:3000/ or send a request using curl. You can also use OpenLLM's built-in Python client to interact with the server:
import openllm
client = openllm.HTTPClient('http://localhost:3000')
client.generate('Explain to me the difference between "further" and "farther"')
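If you prefer raw HTTP (for example, the curl route mentioned above), here is a minimal sketch using requests against the OpenAI-compatible endpoints described later in this document; the request and response shapes are assumed to follow the standard OpenAI chat schema:

import requests

# Discover the served model via the OpenAI-compatible /v1/models endpoint.
served_model = requests.get('http://localhost:3000/v1/models').json()['data'][0]['id']

# Send a chat completion request to /v1/chat/completions.
response = requests.post(
    'http://localhost:3000/v1/chat/completions',
    json={
        'model': served_model,
        'messages': [{'role': 'user', 'content': 'What are large language models?'}],
        'max_tokens': 64,
    },
)
print(response.json()['choices'][0]['message']['content'])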
OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example:
openllm start <model_id> --<options>
OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below.
Baichuan
Run the following command to quickly spin up a Baichuan server:
openllm start baichuan-inc/baichuan-7b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Baichuan variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Baichuan-compatible models.
You can specify any of the following Baichuan models via openllm start:
ChatGLM
Run the following command to quickly spin up a ChatGLM server:
openllm start thudm/chatglm-6b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any ChatGLM variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more ChatGLM-compatible models.
You can specify any of the following ChatGLM models via openllm start:
Cohere
Run the following command to quickly spin up a Cohere server:
openllm start CohereForAI/c4ai-command-r-plus --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Cohere variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Cohere-compatible models.
You can specify any of the following Cohere models via openllm start:
Dbrx
Run the following command to quickly spin up a Dbrx server:
openllm start databricks/dbrx-instruct --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Dbrx variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Dbrx-compatible models.
You can specify any of the following Dbrx models via openllm start:
DollyV2
Run the following command to quickly spin up a DollyV2 server:
openllm start databricks/dolly-v2-3b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any DollyV2 variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more DollyV2-compatible models.
You can specify any of the following DollyV2 models via openllm start:
Falcon
Run the following command to quickly spin up a Falcon server:
openllm start tiiuae/falcon-7b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Falcon variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Falcon-compatible models.
You can specify any of the following Falcon models via openllm start:
Gemma
Run the following command to quickly spin up a Gemma server:
openllm start google/gemma-7b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Gemma variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Gemma-compatible models.
You can specify any of the following Gemma models via openllm start:
GPTNeoX
Run the following command to quickly spin up a GPTNeoX server:
openllm start eleutherai/gpt-neox-20b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any GPTNeoX variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more GPTNeoX-compatible models.
You can specify any of the following GPTNeoX models via openllm start:
Llama
Run the following command to quickly spin up a Llama server:
openllm start NousResearch/llama-2-7b-hf --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Llama variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Llama-compatible models.
You can specify any of the following Llama models via openllm start:
- meta-llama/Llama-2-70b-chat-hf
- meta-llama/Llama-2-13b-chat-hf
- meta-llama/Llama-2-7b-chat-hf
- meta-llama/Llama-2-70b-hf
- meta-llama/Llama-2-13b-hf
- meta-llama/Llama-2-7b-hf
- NousResearch/llama-2-70b-chat-hf
- NousResearch/llama-2-13b-chat-hf
- NousResearch/llama-2-7b-chat-hf
- NousResearch/llama-2-70b-hf
- NousResearch/llama-2-13b-hf
- NousResearch/llama-2-7b-hf
Mistral
Run the following command to quickly spin up a Mistral server:
openllm start mistralai/Mistral-7B-Instruct-v0.1 --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Mistral variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Mistral-compatible models.
You can specify any of the following Mistral models via openllm start:
Mixtral
Run the following command to quickly spin up a Mixtral server:
openllm start mistralai/Mixtral-8x7B-Instruct-v0.1 --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Mixtral variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Mixtral-compatible models.
You can specify any of the following Mixtral models via openllm start:
MPT
Run the following command to quickly spin up an MPT server:
openllm start mosaicml/mpt-7b-instruct --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any MPT variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more MPT-compatible models.
You can specify any of the following MPT models via openllm start:
OPT
Run the following command to quickly spin up an OPT server:
openllm start facebook/opt-1.3b
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any OPT variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more OPT-compatible models.
You can specify any of the following OPT models via openllm start:
Phi
Run the following command to quickly spin up a Phi server:
openllm start microsoft/Phi-3-mini-4k-instruct --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Phi variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Phi-compatible models.
You can specify any of the following Phi models via openllm start:
Qwen
Run the following command to quickly spin up a Qwen server:
openllm start qwen/Qwen-7B-Chat --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Qwen variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Qwen-compatible models.
You can specify any of the following Qwen models via openllm start:
StableLM
Run the following command to quickly spin up a StableLM server:
openllm start stabilityai/stablelm-tuned-alpha-3b --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any StableLM variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more StableLM-compatible models.
You can specify any of the following StableLM models via openllm start:
StarCoder
Run the following command to quickly spin up a StarCoder server:
openllm start bigcode/starcoder --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any StarCoder variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more StarCoder-compatible models.
You can specify any of the following StarCoder models via openllm start:
Yi
Run the following command to quickly spin up a Yi server:
openllm start 01-ai/Yi-6B --trust-remote-code
You can run the following code in a different terminal to interact with the server:
import openllm_client
client = openllm_client.HTTPClient('http://localhost:3000')
client.generate('What are large language models?')
Note: Any Yi variants can be deployed with OpenLLM. Visit the HuggingFace Model Hub to see more Yi-compatible models.
You can specify any of the following Yi models via openllm start:
More models will be integrated with OpenLLM, and we welcome your contributions if you want to incorporate your custom LLMs into the ecosystem. Check out the Adding a New Model Guide to learn more.
Quantization is a technique to reduce the storage and computation requirements for machine learning models, particularly during inference. By approximating floating-point numbers as integers (quantized values), quantization allows for faster computations, reduced memory footprint, and can make it feasible to deploy large models on resource-constrained devices.
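As a purely illustrative sketch of the idea (not how any of the techniques listed below work internally), symmetric int8 quantization of a weight tensor can be written in a few lines:

import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)     # original float32 weights
scale = np.abs(weights).max() / 127.0                   # one scale factor for the whole tensor
q_weights = np.round(weights / scale).astype(np.int8)   # weights stored as 8-bit integers
dequantized = q_weights.astype(np.float32) * scale      # approximate reconstruction at inference time
print(np.abs(weights - dequantized).max())              # small reconstruction error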
OpenLLM supports the following quantization techniques:
- AWQ: Activation-aware Weight Quantization.
- GPTQ: Accurate Post-Training Quantization.
- SqueezeLLM: Dense-and-Sparse Quantization.
Note: Make sure to use pre-quantized model weights when using them with openllm start.
OpenLLM is not just a standalone product; it's a building block designed to integrate easily with other powerful tools. We currently offer integrations with OpenAI-compatible endpoints, LlamaIndex, and LangChain.
The OpenLLM server can be used as a drop-in replacement for OpenAI's API. Simply set base_url to your LLM endpoint followed by /v1 and you are good to go:
import openai
client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na')  # Here the server is running on 0.0.0.0:3000
model = client.models.list().data[0].id  # use the model currently being served
completions = client.chat.completions.create(
    messages=[{'role': 'user', 'content': 'Write me a tag line for an ice cream shop.'}],
    model=model, max_tokens=64,
)
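Streaming works through the same endpoint; here is a minimal sketch that reuses the client and model from above:

stream = client.chat.completions.create(
    messages=[{'role': 'user', 'content': 'Write me a tag line for an ice cream shop.'}],
    model=model, max_tokens=64, stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or '', end='', flush=True)  # print tokens as they arrive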
The compatible endpoints support /chat/completions and /models.
Note: You can find OpenAI example clients under the examples folder.
You can use llama_index.llms.openllm.OpenLLMAPI to interact with a running LLM server:
from llama_index.llms.openllm import OpenLLMAPI
Note: All synchronous and asynchronous APIs from llama_index.llms.openllm.OpenLLMAPI are supported. Make sure to install llama-index-llms-openllm to use the supported class.
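As a quick, illustrative sketch of querying a running server through this remote client (the address argument name and the response handling are assumptions that may differ across llama-index versions):

from llama_index.llms.openllm import OpenLLMAPI

# Hypothetical usage: point the remote client at a running OpenLLM server.
remote_llm = OpenLLMAPI(address='http://localhost:3000')
print(remote_llm.complete('What are large language models?'))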
Spin up an OpenLLM server, and connect to it by specifying its URL:
from langchain.llms import OpenLLMAPI
llm = OpenLLMAPI(server_url='http://44.23.123.1:3000')
llm.invoke('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
# streaming
for it in llm.stream('What is the difference between a duck and a goose? And why there are so many Goose in Canada?'):
print(it, flush=True, end='')
# async context
await llm.ainvoke('What is the difference between a duck and a goose? And why there are so many Goose in Canada?')
# async streaming
async for it in llm.astream('What is the difference between a duck and a goose? And why there are so many Goose in Canada?'):
print(it, flush=True, end='')
There are several ways to deploy your LLMs:
- Building a Bento: With OpenLLM, you can easily build a Bento for a specific model, like mistralai/Mistral-7B-Instruct-v0.1, using the build command:
  openllm build mistralai/Mistral-7B-Instruct-v0.1
  A Bento, in BentoML, is the unit of distribution. It packages your program's source code, models, files, artifacts, and dependencies.
- Containerize your Bento:
  bentoml containerize <name:version>
  This generates an OCI-compatible Docker image that can be deployed anywhere Docker runs. For the best scalability and reliability of your LLM service in production, we recommend deploying with BentoCloud.
Deploy OpenLLM with BentoCloud, the inference platform for fast-moving AI teams.
- Create a BentoCloud account: sign up here.
- Log in to your BentoCloud account:
  bentoml cloud login --api-token <your-api-token> --endpoint <bento-cloud-endpoint>
  Note: Replace <your-api-token> and <bento-cloud-endpoint> with your specific API token and the BentoCloud endpoint respectively.
- Building a Bento: With OpenLLM, you can easily build a Bento for a specific model, such as mistralai/Mistral-7B-Instruct-v0.1:
  openllm build mistralai/Mistral-7B-Instruct-v0.1
- Pushing a Bento: Push your freshly built Bento service to BentoCloud via the push command:
  bentoml push <name:version>
- Deploying a Bento: Deploy your LLMs to BentoCloud with a single bentoml deployment create command, following the deployment instructions.
Engage with like-minded individuals passionate about LLMs, AI, and more on our Discord!
OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 Join our Slack community!
We welcome contributions! If you're interested in enhancing OpenLLM's capabilities or have any questions, don't hesitate to reach out in our Discord channel.
Check out our Developer Guide if you wish to contribute to OpenLLM's codebase.
If you use OpenLLM in your research, we provide a citation to use:
@software{Pham_OpenLLM_Operating_LLMs_2023,
author = {Pham, Aaron and Yang, Chaoyu and Sheng, Sean and Zhao, Shenyang and Lee, Sauyon and Jiang, Bo and Dong, Fog and Guan, Xipeng and Ming, Frost},
license = {Apache-2.0},
month = jun,
title = {{OpenLLM: Operating LLMs in production}},
url = {https://github.com/bentoml/OpenLLM},
year = {2023}
}