refactor(breaking): unify LLM API #283

Merged: 9 commits, Sep 1, 2023
3 changes: 1 addition & 2 deletions .github/SECURITY.md
@@ -8,8 +8,7 @@ are backward compatible. We are more lenient with patch as the development can
move quickly.

If you are just using public API, then feel free to always upgrade. Whenever
-there is a breaking policies, it will become a `DeprecationWarning` with a
-period of 12 months before becoming broken.
+there is a breaking policies, it will be announced and will be broken.

> [!WARNING]
> Everything package under `openllm` that has an underscore prefixes
7 changes: 1 addition & 6 deletions CHANGELOG.md
@@ -230,7 +230,7 @@ No significant changes.

```bash
docker run --rm --gpus all -it -v /home/ubuntu/.local/share/bentoml:/tmp/bentoml -e BENTOML_HOME=/tmp/bentoml \
-  -e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_LLAMA_FRAMEWORK=vllm ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 start llama --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug
+  -e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_BACKEND=vllm ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 start llama --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug
```

In conjunction with this, OpenLLM now also have a set of small CLI utilities via ``openllm ext`` for ease-of-use
@@ -721,9 +721,6 @@ No significant changes.
`openllm start` now support `--quantize int8` and `--quantize int4` `GPTQ`
quantization support is on the roadmap and currently being worked on.

-`openllm start` now also support `--bettertransformer` to use
-`BetterTransformer` for serving.
-
Refactored `openllm.LLMConfig` to be able to use with `__getitem__`:
`openllm.DollyV2Config()['requirements']`.

@@ -732,8 +729,6 @@ No significant changes.

Added `towncrier` workflow to easily generate changelog entries

-Added `use_pipeline`, `bettertransformer` flag into ModelSettings
-
`LLMConfig` now supported `__dataclass_transform__` protocol to help with
type-checking

11 changes: 8 additions & 3 deletions README.md
@@ -407,24 +407,29 @@ pip install "openllm[baichuan]"
### Runtime Implementations (Experimental)

Different LLMs may have multiple runtime implementations. For instance, they
-might use Pytorch (`pt`), Tensorflow (`tf`), or Flax (`flax`).
+might use Pytorch (`pt`), Tensorflow (`tf`), Flax (`flax`) or vLLM (`vllm`).

If you wish to specify a particular runtime for a model, you can do so by
-setting the `OPENLLM_{MODEL_NAME}_FRAMEWORK={runtime}` environment variable
+setting the `OPENLLM_BACKEND={runtime}` environment variable
before running `openllm start`.

For example, if you want to use the Tensorflow (`tf`) implementation for the
`flan-t5` model, you can use the following command:

```bash
-OPENLLM_FLAN_T5_FRAMEWORK=tf openllm start flan-t5
+OPENLLM_BACKEND=tf openllm start flan-t5
+
+openllm start flan-t5 --backend tf
```

> [!NOTE]
> For GPU support on Flax, refers to
> [Jax's installation](https://github.com/google/jax#pip-installation-gpu-cuda-installed-via-pip-easier)
> to make sure that you have Jax support for the corresponding CUDA version.

+> [!IMPORTANT]
+> To use vLLM backend, at least a GPU with Ampere or newer architecture and CUDA 11.8 is required.
+
### Quantisation

OpenLLM supports quantisation with
20 changes: 20 additions & 0 deletions changelog.d/283.breaking.md
@@ -0,0 +1,20 @@
All environment variables are now simplified and no longer need a model-specific prefix.

For example, `OPENLLM_LLAMA_GENERATION_MAX_NEW_TOKENS` now becomes `OPENLLM_GENERATION_MAX_NEW_TOKENS`.

Some miscellaneous environment variables are unified as well. To switch between backends, use `--backend` with both `start` and `build`:

```bash
openllm start llama --backend vllm
```

or set the environment variable `OPENLLM_BACKEND`:

```bash
OPENLLM_BACKEND=vllm openllm start llama
```

`openllm.Runner` now tries to download the model on first use if it is not available locally, and uses the copy cached in the model store on subsequent runs.

Model serialisation has been updated to a new API version with clearer naming. Users should run `openllm prune -y --include-bentos` and then upgrade to the current version of openllm.
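
A minimal sketch of how a deployment script might rewrite old variable names under the renaming described above. `migrate_env` and `KNOWN_MODEL_PREFIXES` are hypothetical names and are not part of OpenLLM; the prefix list is an assumption for illustration only. The rule that an already-present new-style variable wins over a legacy one is also an assumption, not documented behaviour.

```python
from typing import Dict, Mapping

# Hypothetical list of model prefixes (assumption); the real set depends on
# which models the installed OpenLLM version supports.
KNOWN_MODEL_PREFIXES = ("LLAMA", "FLAN_T5", "DOLLY_V2")


def migrate_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env with model-prefixed OPENLLM_* names rewritten."""
    migrated: Dict[str, str] = {}
    for name, value in env.items():
        new_name = name
        for prefix in KNOWN_MODEL_PREFIXES:
            legacy = f"OPENLLM_{prefix}_"
            if name.startswith(legacy):
                suffix = name[len(legacy):]
                # The per-model FRAMEWORK variable maps to the global OPENLLM_BACKEND.
                new_name = "OPENLLM_BACKEND" if suffix == "FRAMEWORK" else f"OPENLLM_{suffix}"
                break
        if new_name == name:
            # Not a legacy name: copy as-is, overriding any migrated legacy value.
            migrated[name] = value
        else:
            # Legacy name: only fill in if the new-style name is not already set.
            migrated.setdefault(new_name, value)
    return migrated
```

For instance, `migrate_env({"OPENLLM_FLAN_T5_FRAMEWORK": "tf"})` yields a mapping with `OPENLLM_BACKEND` set to `tf`, matching the README example in this change.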
1 change: 0 additions & 1 deletion hatch.toml
@@ -31,7 +31,6 @@ check-stubs = [
inplace-changelog = "towncrier build --version main --keep"
quality = [
"./tools/dependencies.py",
-"./tools/update-readme.py",
"- ./tools/update-brew-tap.py",
"bash ./tools/sync-readme.sh",
"check-stubs",
8 changes: 4 additions & 4 deletions openllm-client/src/openllm_client/_base.py
@@ -28,7 +28,7 @@
import transformers

from openllm_core._typing_compat import DictStrAny
-from openllm_core._typing_compat import LiteralRuntime
+from openllm_core._typing_compat import LiteralBackend

logger = logging.getLogger(__name__)

@@ -98,7 +98,7 @@ def _hf_agent(self) -> transformers.HfAgent:
      raise RuntimeError(
          "transformers is required to use HF agent. Install with 'pip install \"openllm-client[agents]\"'.")
    if not self.supports_hf_agent:
-     raise RuntimeError(f'{self.model_name} ({self.framework}) does not support running HF agent.')
+     raise RuntimeError(f'{self.model_name} ({self.backend}) does not support running HF agent.')
    if not is_transformers_supports_agent():
      raise RuntimeError(
          "Current 'transformers' does not support Agent. Make sure to upgrade to at least 4.29: 'pip install -U \"transformers>=4.29\"'"
@@ -125,9 +125,9 @@ def model_id(self) -> str:
      raise RuntimeError('Malformed service endpoint. (Possible malicious)') from None

  @property
- def framework(self) -> LiteralRuntime:
+ def backend(self) -> LiteralBackend:
    try:
-     return self._metadata['framework']
+     return self._metadata['backend']
    except KeyError:
      raise RuntimeError('Malformed service endpoint. (Possible malicious)') from None
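
Because the metadata key is renamed from `framework` to `backend`, a client written against the new key would raise on metadata that still carries the old one. The sketch below shows one way a caller could tolerate both; `resolve_backend` is a hypothetical helper and is not part of `openllm_client`, and falling back to the legacy key is an assumption rather than behaviour this change implements.

```python
from typing import Any, Mapping


def resolve_backend(metadata: Mapping[str, Any]) -> str:
    """Read the backend name from endpoint metadata, accepting the legacy key."""
    # 'backend' is the key used after this change; 'framework' is the old name.
    # Checking 'backend' first means the new key wins if both are present.
    for key in ("backend", "framework"):
        if key in metadata:
            return metadata[key]
    raise RuntimeError("Malformed service endpoint: no 'backend' or 'framework' key in metadata.")
```

Calling `resolve_backend({"framework": "pt"})` returns `"pt"`, while metadata with neither key raises `RuntimeError`, mirroring the error path of the property in the diff.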

Empty file.
@@ -31,10 +31,10 @@
(s/def ::model_id (s/coll-of string? :kind vector?)) ;; model_id is a vector of all models for a given model_type
(s/def ::url string?) ;; url to the model's page
(s/def ::requires_gpu boolean?) ;; whether the model requires a gpu
-(s/def ::runtime_impl ::vec-of-runtimes?) ;; supported runtimes
+(s/def ::backend ::vec-of-runtimes?) ;; supported runtimes
(s/def ::installation string?) ;; installation instructions (pip command)
(s/def ::model-spec (s/keys :req-un [::model_id ::url ::requires_gpu ;; the spec for a single model (aggregates all the above)
-::runtime_impl ::installation]))
+::backend ::installation]))
(s/def ::all-models #(or loading-text ;; -- this is the case when the file with the model data has not been loaded yet by the ::set-model-data effect
(s/map-of keyword? ::model-spec))) ;; map of all models
