
[LLM] Example for deploying the Mixtral model (#2857)
* Add example for mixtral

* update the serve.yaml

* install dependencies

* remove useless file

* simplify port

* mixtral official doc

* fix model name

* Fix setup

* Add sky serve up and docs

* Use 2 replica instead

* Fix typo

* Add official doc

* fix

* fix

* nits

* Update llm/mixtral/README.md

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Update llm/mixtral/README.md

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Update sky/cli.py

Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>

* Add method for accessing the models

* Address comment

* Add optional:

* Fix news

* Update README.md

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* Add skyserve docs link

* lint

* Fix readme and adopt changes from #2860

* fix mixtral link

* fix link

---------

Co-authored-by: cblmemo <cblmemo@gmail.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
3 people authored Dec 15, 2023
1 parent 7e2c249 commit 23c780f
Showing 3 changed files with 137 additions and 1 deletion.
4 changes: 3 additions & 1 deletion README.md
@@ -27,6 +27,7 @@

----
:fire: *News* :fire:
- [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high-quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**single replica**](https://docs.mistral.ai/self-deployment/skypilot/); [**multiple replicas**](./llm/mixtral/).
- [Nov, 2023] Example: Using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
@@ -136,7 +137,8 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest

Runnable examples:
- LLMs on SkyPilot
- [Mixtral 8x7B](./llm/mixtral/)
- [Mistral 7B](https://docs.mistral.ai/self-deployment/skypilot/) (from official Mistral team)
- [vLLM: Serving LLM 24x Faster On the Cloud](./llm/vllm/) (from official vLLM team)
- [Vicuna chatbots: Training & Serving](./llm/vicuna/) (from official Vicuna team)
- [Train your own Vicuna on Llama-2](./llm/vicuna-llama-2/)
90 changes: 90 additions & 0 deletions llm/mixtral/README.md
@@ -0,0 +1,90 @@
# Serving Mixtral from Mistral.ai

Mistral AI released Mixtral 8x7B, a high-quality sparse mixture of experts model (SMoE) with open weights. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. Mistral.ai uses SkyPilot as [the default way](https://docs.mistral.ai/self-deployment/skypilot) to distribute their new model. This folder contains the code to serve Mixtral on any cloud with SkyPilot.

There are three ways to serve the model:

## 1. Serve with a single instance

SkyPilot can help you serve Mixtral by automatically finding available resources on any cloud, provisioning the VM, opening the ports, and serving the model. To serve Mixtral with a single instance, run the following command:

```bash
sky launch -c mixtral ./serve.yaml
```
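Once the launch completes, the cluster can be managed with standard SkyPilot commands; a quick sketch, assuming the cluster name `mixtral` from the command above:

```bash
# Show the cluster's status and the resources it landed on.
sky status

# Stream the logs of the serving job on the cluster.
sky logs mixtral

# Tear down the cluster when done, to stop incurring charges.
sky down mixtral
```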

Note that we specify the following resources, so that SkyPilot can automatically find any of the listed GPUs by [failing over](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) through all the candidates, in order of price:

```yaml
resources:
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
```
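To see which clouds offer these accelerators and at what hourly price (which determines the failover order), you can consult the catalog with `sky show-gpus`; a small sketch, where the exact output format may vary by SkyPilot version:

```bash
# List clouds, instance types, and hourly prices for the candidate GPUs.
sky show-gpus A100
sky show-gpus A100-80GB
```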
### Accessing the model
We can now access the model through the OpenAI API with the IP and port:
```bash
IP=$(sky status --ip mixtral)

curl -L http://$IP:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
    }'
```
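Since the served model is instruction-tuned, the same vLLM server also exposes the OpenAI chat completions endpoint; for example, mirroring the readiness probe request used in the multi-replica setup below:

```bash
# Send a chat-style request to the same server.
curl -L http://$IP:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
      "messages": [
        {"role": "user", "content": "Hello! What is your name?"}
      ],
      "max_tokens": 25
    }'
```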

## 2. Serve with multiple instances

When scaling up is required, [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), the serving library built on top of SkyPilot, can scale the service up with multiple instances while still providing a single endpoint. To serve Mixtral with multiple instances, run the following command:

```bash
sky serve up -n mixtral ./serve.yaml
```

The additional `service` section specifies how to check the health of the service and how to auto-restart it when unexpected failures happen:
```yaml
service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: mistralai/Mixtral-8x7B-Instruct-v0.1
      messages:
        - role: user
          content: Hello! What is your name?
    initial_delay_seconds: 1200
  replica_policy:
    min_replicas: 2
    auto_restart: true
```
Optional: To further cut costs by 3-4x, we can use spot instances as the replicas. SkyServe will automatically manage the spot instances, monitor prices and preemptions, and restart replicas when needed.
To do so, add `use_spot: true` to the `resources` field, i.e.:
```yaml
resources:
  use_spot: true
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
```

### Accessing the model

After `sky serve up` finishes, the service exposes a single endpoint. We can access the model through the OpenAI API with that endpoint:

```bash
ENDPOINT=$(sky serve status --endpoint mixtral)
curl -L http://$ENDPOINT/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
      "prompt": "My favourite condiment is",
      "max_tokens": 25
    }'
```
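To inspect the replicas or tear the service down afterwards, the usual SkyServe commands apply; a brief sketch, assuming the service name `mixtral` from `sky serve up -n mixtral`:

```bash
# Show the service's status, including each replica's health.
sky serve status mixtral

# Tear down the service and all of its replicas when done.
sky serve down mixtral
```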

## 3. Official guide from Mistral AI

Mistral.ai also includes a guide for launching the Mixtral 8x7B model with SkyPilot in their official docs. Please refer to [this link](https://docs.mistral.ai/self-deployment/skypilot) for more details.

> Note: the Docker image referenced in the official doc may not be updated yet, which can cause a failure where vLLM complains about missing support for the model. Feel free to build a new Docker image with the setup commands in our [serve.yaml](./serve.yaml) file instead.
44 changes: 44 additions & 0 deletions llm/mixtral/serve.yaml
@@ -0,0 +1,44 @@
# An example YAML for serving the Mixtral model from Mistral.ai with an OpenAI-compatible API.
# Usage:
#  1. Launch on a single instance: `sky launch -c mixtral ./serve.yaml`
#  2. Scale up to multiple instances with a single endpoint:
#     `sky serve up -n mixtral ./serve.yaml`
service:
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: mistralai/Mixtral-8x7B-Instruct-v0.1
      messages:
        - role: user
          content: Hello! What is your name?
    initial_delay_seconds: 1200
  replica_policy:
    min_replicas: 2
    auto_restart: true

resources:
  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
  ports: 8000
  disk_tier: high

setup: |
  conda activate mixtral
  if [ $? -ne 0 ]; then
    conda create -n mixtral -y python=3.10
    conda activate mixtral
  fi
  # We have to manually install Torch, otherwise apex & xformers won't build.
  pip list | grep torch || pip install "torch>=2.0.0"
  pip list | grep vllm || pip install "git+https://github.com/vllm-project/vllm.git"
  pip install git+https://github.com/huggingface/transformers
  pip list | grep megablocks || pip install megablocks

run: |
  conda activate mixtral
  export PATH=$PATH:/sbin
  python -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 | tee ~/openai_api_server.log
