refactor(breaking): unify LLM API (#283)
Signed-off-by: aarnphm-ec2-dev <29749331+aarnphm@users.noreply.github.com>
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
aarnphm authored Sep 1, 2023
1 parent 35601da commit 3e45530
Showing 50 changed files with 880 additions and 1,231 deletions.
3 changes: 1 addition & 2 deletions .github/SECURITY.md
@@ -8,8 +8,7 @@ are backward compatible. We are more lenient with patch as the development can
move quickly.

If you are just using public API, then feel free to always upgrade. Whenever
-there is a breaking policies, it will become a `DeprecationWarning` with a
-period of 12 months before becoming broken.
+there is a breaking policy, it will be announced and will be broken.

> [!WARNING]
> Every package under `openllm` that has an underscore prefix
7 changes: 1 addition & 6 deletions CHANGELOG.md
@@ -230,7 +230,7 @@ No significant changes.
```bash
docker run --rm --gpus all -it -v /home/ubuntu/.local/share/bentoml:/tmp/bentoml -e BENTOML_HOME=/tmp/bentoml \
--e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_LLAMA_FRAMEWORK=vllm ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 start llama --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug`
+-e OPENLLM_USE_LOCAL_LATEST=True -e OPENLLM_BACKEND=vllm ghcr.io/bentoml/openllm:2b5e96f90ad314f54e07b5b31e386e7d688d9bb2 start llama --model-id meta-llama/Llama-2-7b-chat-hf --workers-per-resource conserved --debug`
```

In conjunction with this, OpenLLM now also has a set of small CLI utilities via ``openllm ext`` for ease-of-use
@@ -721,9 +721,6 @@ No significant changes.
`openllm start` now supports `--quantize int8` and `--quantize int4`. `GPTQ`
quantization support is on the roadmap and is currently being worked on.

-`openllm start` now also support `--bettertransformer` to use
-`BetterTransformer` for serving.

Refactored `openllm.LLMConfig` so it can be used with `__getitem__`:
`openllm.DollyV2Config()['requirements']`.
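
A minimal usage sketch of that index-style access, assuming only that `openllm` is importable; the `requirements` key is the one shown above.

```python
import openllm

config = openllm.DollyV2Config()
# Index-style access provided by __getitem__, mirroring attribute lookup on the config.
print(config['requirements'])
```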

@@ -732,8 +729,6 @@ No significant changes.

Added `towncrier` workflow to easily generate changelog entries

-Added `use_pipeline`, `bettertransformer` flag into ModelSettings

`LLMConfig` now supports the `__dataclass_transform__` protocol to help with
type-checking
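
For context, a hypothetical sketch of what the `__dataclass_transform__` (PEP 681) protocol provides; the `config` decorator and its attrs-based body are illustrative assumptions, not the actual `LLMConfig` implementation.

```python
import typing as t

import attr
from typing_extensions import dataclass_transform

_T = t.TypeVar('_T')

# dataclass_transform marks this decorator so type checkers synthesise a
# field-checked __init__ for every class passed through it.
@dataclass_transform(field_specifiers=(attr.field,))
def config(cls: t.Type[_T]) -> t.Type[_T]:
  return attr.define(cls)

@config
class ExampleConfig:
  max_new_tokens: int = 128

cfg = ExampleConfig(max_new_tokens=256)  # keyword arguments are now type-checked
```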

11 changes: 8 additions & 3 deletions README.md
@@ -407,24 +407,29 @@ pip install "openllm[baichuan]"
### Runtime Implementations (Experimental)

Different LLMs may have multiple runtime implementations. For instance, they
-might use Pytorch (`pt`), Tensorflow (`tf`), or Flax (`flax`).
+might use Pytorch (`pt`), Tensorflow (`tf`), Flax (`flax`) or vLLM (`vllm`).

If you wish to specify a particular runtime for a model, you can do so by
-setting the `OPENLLM_{MODEL_NAME}_FRAMEWORK={runtime}` environment variable
+setting the `OPENLLM_BACKEND={runtime}` environment variable
before running `openllm start`.

For example, if you want to use the Tensorflow (`tf`) implementation for the
`flan-t5` model, you can use the following command:

```bash
-OPENLLM_FLAN_T5_FRAMEWORK=tf openllm start flan-t5
+OPENLLM_BACKEND=tf openllm start flan-t5

+openllm start flan-t5 --backend tf
```

> [!NOTE]
> For GPU support on Flax, refer to
> [Jax's installation](https://github.com/google/jax#pip-installation-gpu-cuda-installed-via-pip-easier)
> to make sure that you have Jax support for the corresponding CUDA version.
+> [!IMPORTANT]
+> To use the vLLM backend, a GPU with Ampere or newer architecture and CUDA 11.8 is required.

### Quantisation

OpenLLM supports quantisation with
20 changes: 20 additions & 0 deletions changelog.d/283.breaking.md
@@ -0,0 +1,20 @@
All environment variables are now simplified, without the need for the model-specific prefix.

For example, `OPENLLM_LLAMA_GENERATION_MAX_NEW_TOKENS` now becomes `OPENLLM_GENERATION_MAX_NEW_TOKENS`.

Miscellaneous environment variables have also been unified. To switch between backends, use `--backend` for both `start` and `build`:

```bash
openllm start llama --backend vllm
```

or the environment variable `OPENLLM_BACKEND`

```bash
OPENLLM_BACKEND=vllm openllm start llama
```

`openllm.Runner` now defaults to downloading the model on first use if it is not already available, and subsequently uses the cached copy from the model store.
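
A small sketch of that behaviour using the usual BentoML wiring; the service name and model id below are placeholders, and the download into the model store is handled internally by OpenLLM.

```python
import bentoml
import openllm

# On first use, the runner pulls the model into the BentoML model store if it is
# not already cached; later runs reuse the stored copy.
llm_runner = openllm.Runner('llama', model_id='meta-llama/Llama-2-7b-chat-hf')

svc = bentoml.Service(name='llm-llama-service', runners=[llm_runner])
```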

Model serialisation has been updated to a new API version with clearer naming. We kindly ask users to run `openllm prune -y --include-bentos` and update to
the current version of openllm.
1 change: 0 additions & 1 deletion hatch.toml
@@ -31,7 +31,6 @@ check-stubs = [
inplace-changelog = "towncrier build --version main --keep"
quality = [
"./tools/dependencies.py",
-"./tools/update-readme.py",
"- ./tools/update-brew-tap.py",
"bash ./tools/sync-readme.sh",
"check-stubs",
8 changes: 4 additions & 4 deletions openllm-client/src/openllm_client/_base.py
@@ -28,7 +28,7 @@
import transformers

from openllm_core._typing_compat import DictStrAny
-from openllm_core._typing_compat import LiteralRuntime
+from openllm_core._typing_compat import LiteralBackend

logger = logging.getLogger(__name__)

@@ -98,7 +98,7 @@ def _hf_agent(self) -> transformers.HfAgent:
raise RuntimeError(
"transformers is required to use HF agent. Install with 'pip install \"openllm-client[agents]\"'.")
if not self.supports_hf_agent:
-raise RuntimeError(f'{self.model_name} ({self.framework}) does not support running HF agent.')
+raise RuntimeError(f'{self.model_name} ({self.backend}) does not support running HF agent.')
if not is_transformers_supports_agent():
raise RuntimeError(
"Current 'transformers' does not support Agent. Make sure to upgrade to at least 4.29: 'pip install -U \"transformers>=4.29\"'"
@@ -125,9 +125,9 @@ def model_id(self) -> str:
raise RuntimeError('Malformed service endpoint. (Possible malicious)') from None

@property
-def framework(self) -> LiteralRuntime:
+def backend(self) -> LiteralBackend:
try:
-return self._metadata['framework']
+return self._metadata['backend']
except KeyError:
raise RuntimeError('Malformed service endpoint. (Possible malicious)') from None
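
To illustrate the rename on the client side, a hedged sketch follows; it assumes `HTTPClient` is exported from `openllm_client` and that a server is reachable on localhost:3000.

```python
from openllm_client import HTTPClient  # assumed export; adjust to your installed version

client = HTTPClient('http://localhost:3000')
# `backend` (formerly `framework`) is read from the server's metadata endpoint.
print(client.model_name, client.backend)
```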

Empty file.
@@ -31,10 +31,10 @@
(s/def ::model_id (s/coll-of string? :kind vector?)) ;; model_id is a vector of all models for a given model_type
(s/def ::url string?) ;; url to the model's page
(s/def ::requires_gpu boolean?) ;; whether the model requires a gpu
-(s/def ::runtime_impl ::vec-of-runtimes?) ;; supported runtimes
+(s/def ::backend ::vec-of-runtimes?) ;; supported runtimes
(s/def ::installation string?) ;; installation instructions (pip command)
(s/def ::model-spec (s/keys :req-un [::model_id ::url ::requires_gpu ;; the spec for a single model (aggregates all the above)
-::runtime_impl ::installation]))
+::backend ::installation]))
(s/def ::all-models #(or loading-text ;; -- this is the case when the file with the model data has not been loaded yet by the ::set-model-data effect
(s/map-of keyword? ::model-spec))) ;; map of all models
