OpenLLM lets developers run any open-source LLMs as OpenAI-compatible API endpoints with a single command.
- 🔬 Built for fast and production usage
- 🚂 Supports a wide range of open-source LLMs, including Llama 3, Qwen2, Gemma, and more, as well as their fine-tuned or quantized versions
- ⛓️ OpenAI-compatible APIs, easy to integrate with any app that uses the OpenAI client
- 💬 Built-in ChatGPT-like UI to interact with the LLM directly
- 🔥 Accelerated LLM decoding with state-of-the-art inference backends
- 🌥️ Optionally deploy to K8S or BentoCloud for an enterprise-grade LLM API service
Install OpenLLM with pip. OpenLLM currently supports Python 3.9 and above:
```bash
pip install openllm
```
You can then run the following command to explore models interactively:

```bash
openllm hello
```

It will guide you through running LLMs either locally or in the cloud.
OpenLLM supports a variety of state-of-the-art LLMs. Here are some of the models supported by OpenLLM, each listed with a commonly used model size.
| Model   | Parameters | Quantization | Required VRAM | Start a Server                  |
| ------- | ---------- | ------------ | ------------- | ------------------------------- |
| Llama 3 | 8B         | -            | 24G           | `openllm serve llama3:8b`       |
| Llama 3 | 8B         | AWQ 4bit     | 12G           | `openllm serve llama3:8b-4bit`  |
| Llama 3 | 70B        | AWQ 4bit     | 80G           | `openllm serve llama3:70b-4bit` |
| Llama 2 | 7B         | -            | 16G           | `openllm serve llama2:7b`       |
| Llama 2 | 7B         | AWQ 4bit     | 12G           | `openllm serve llama2:7b-4bit`  |
| Mistral | 7B         | -            | 24G           | `openllm serve mistral:7b`      |
| Qwen2   | 1.5B       | -            | 12G           | `openllm serve qwen2:1.5b`      |
| Gemma   | 7B         | -            | 24G           | `openllm serve gemma:7b`        |
| Phi3    | 3.8B       | -            | 12G           | `openllm serve phi3:3.8b`       |
...
For the full model list, see the OpenLLM models repository.
To start an LLM server locally, use the `openllm serve` command and specify the model version.

```bash
openllm serve llama3:8b
```
The server will be accessible at http://localhost:3000, providing OpenAI-compatible APIs for interaction. You can call the endpoints with different frameworks and tools that support OpenAI-compatible APIs. Typically, you may need to specify the following:
- The API host address: By default, the LLM is hosted at http://localhost:3000.
- The model name: The name can be different depending on the tool you use.
- The API key: The API key used for client authentication. This is optional.
Here are some examples:
OpenAI Python client
```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following to list the available models:
# model_list = client.models.list()
# print(model_list)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)

for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")
```
LlamaIndex
```python
from llama_index.llms.openai import OpenAI

llm = OpenAI(api_base="http://localhost:3000/v1", model="meta-llama/Meta-Llama-3-8B-Instruct", api_key="dummy")
...
```
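Python `requests`

Because the endpoints follow the OpenAI API schema, you can also call them with a plain HTTP client. The snippet below is a minimal sketch using the `requests` library; it assumes the server started above is running at http://localhost:3000 with no API key enforced (if a key is set, add an `Authorization: Bearer <key>` header).

```python
import requests

# Call the OpenAI-compatible chat completions endpoint directly.
response = requests.post(
    "http://localhost:3000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain superconductors like I'm five years old"}
        ],
        "stream": False,
    },
    timeout=120,
)
response.raise_for_status()

# Print the generated reply.
print(response.json()["choices"][0]["message"]["content"])
```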
OpenLLM provides a chat user interface (UI) at the `/chat` endpoint of an LLM server. You can visit the chat UI at http://localhost:3000/chat and start different conversations with the model.
To start a chat conversation in the CLI, use the `openllm run` command and specify the model version.

```bash
openllm run llama3:8b
```
A model repository in OpenLLM represents a catalog of available LLMs that you can run. OpenLLM provides a default model repository that includes the latest open-source LLMs like Llama 3, Mistral, and Qwen2, hosted at this GitHub repository. To see all available models from the default and any added repository, use:
```bash
openllm model list
```
To ensure your local list of models is synchronized with the latest updates from all connected repositories, run:
```bash
openllm repo update
```
To review a model’s information, run:
```bash
openllm model get llama3:8b
```
You can contribute to the default model repository by adding new models that others can use. This involves creating and submitting a Bento of the LLM. For more information, check out this example pull request.
You can add your own repository to OpenLLM with custom models. To do so, follow the format of the default OpenLLM model repository, which uses a `bentos` directory to store custom LLMs. You need to build your Bentos with BentoML and submit them to your model repository.
First, prepare your custom models in a `bentos` directory following the guidelines provided by BentoML to build Bentos. Check out the default model repository for an example and read the Developer Guide for details.
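For orientation only, here is a rough, hypothetical skeleton of what a BentoML service wrapping a custom LLM might look like, assuming BentoML's `@bentoml.service`/`@bentoml.api` decorators. The class name, API signature, and model-loading logic are placeholders; the exact Bento structure OpenLLM expects is the one shown in the default model repository and the Developer Guide.

```python
import bentoml

# Hypothetical skeleton only; follow the default model repository and the
# Developer Guide for the exact structure OpenLLM expects.
@bentoml.service(resources={"gpu": 1})
class MyCustomLLM:
    def __init__(self) -> None:
        # Placeholder: load your fine-tuned or quantized weights here,
        # e.g. with vLLM or transformers.
        self.model = None

    @bentoml.api
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # Placeholder: run inference with the loaded model and return the text.
        return prompt[:max_tokens]
```

You would then package the service as a Bento (for example with `bentoml build`) and place it in your repository's `bentos` directory.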
Then, register your custom model repository with OpenLLM:
```bash
openllm repo add <repo-name> <repo-url>
```
Note: Currently, OpenLLM only supports adding public repositories.
OpenLLM supports LLM cloud deployment via BentoML, the unified model serving framework, and BentoCloud, an AI inference platform for enterprise AI teams. BentoCloud provides fully managed infrastructure optimized for LLM inference with autoscaling, model orchestration, observability, and more, allowing you to run any AI model in the cloud.
Sign up for BentoCloud for free and log in. Then, run `openllm deploy` to deploy a model to BentoCloud:

```bash
openllm deploy llama3:8b
```
Once the deployment is complete, you can run model inference from the BentoCloud console.
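The deployment also exposes the same OpenAI-compatible API over HTTP, so the client code shown earlier should work against it as well. As a rough sketch (the URL and token below are placeholders), point `base_url` at the deployment's exposed URL from the BentoCloud console and, if the endpoint is protected, pass your BentoCloud API token as the key:

```python
from openai import OpenAI

# Placeholders: replace with your deployment's URL (shown on the BentoCloud
# console) and, if the endpoint is protected, your BentoCloud API token.
client = OpenAI(
    base_url="https://<your-deployment-url>/v1",
    api_key="<your-bentocloud-api-token>",
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```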
OpenLLM is actively maintained by the BentoML team. Feel free to reach out and join us in our pursuit to make LLMs more accessible and easy to use 👉 Join our Slack community!
Also, engage with like-minded individuals passionate about LLMs, AI, and more on the OpenLLM Discord!
As an open-source project, we welcome contributions of all kinds, such as new features, bug fixes, and documentation. Here are some of the ways to contribute:
- Report a bug by creating a GitHub issue.
- Submit a pull request or help review other developers’ pull requests.
- Add an LLM to the OpenLLM default model repository so that other users can run your model. See the pull request template.
- Check out the Developer Guide to learn more.
This project uses the following open-source projects:
- bentoml/bentoml for production-level model serving
- blrchen/chatgpt-lite for a fancy web chat UI
- chujiezheng/chat_templates
- astral-sh/uv for blazing-fast installation of model requirements
We are grateful to the developers and contributors of these projects for their hard work and dedication.