skypilot-org · Michaelvll · Dec 15, 2023 · Dec 11, 2023 · Dec 11, 2023 · Dec 11, 2023
diff --git a/README.md b/README.md
@@ -27,6 +27,7 @@
 
 ----
 :fire: *News* :fire:
+- [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high-quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**official Mistral docs**](https://docs.mistral.ai/cloud-deployment/skypilot/) (may not work yet because of an outdated Docker image), [**our SkyServe Example**](./llm/mixtral/).
 - [Nov, 2023] Example: Using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
 - [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/cloud-deployment/skypilot/)
 - [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
@@ -136,6 +137,7 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest
 
 Runnable examples:
 - LLMs on SkyPilot
+  - [Mixtral 8x7B](./llm/mixtral/)
   - [Mistral 7B](https://docs.mistral.ai/cloud-deployment/skypilot/) (from official Mistral team)
   - [vLLM: Serving LLM 24x Faster On the Cloud](./llm/vllm/) (from official vLLM team)
   - [Vicuna chatbots: Training & Serving](./llm/vicuna/) (from official Vicuna team)

diff --git a/llm/mixtral/README.md b/llm/mixtral/README.md
@@ -0,0 +1,90 @@
+# Serving Mixtral from Mistral.ai
+
+Mistral AI released Mixtral 8x7B, a high-quality sparse mixture-of-experts (SMoE) model with open weights. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. Mistral.ai uses SkyPilot as [the default way](https://docs.mistral.ai/self-deployment/skypilot) to distribute their new model. This folder contains the code to serve Mixtral on any cloud with SkyPilot.
+
+There are three ways to serve the model:
+
+## 1. Serve with a single instance
+
+SkyPilot can help you serve Mixtral by automatically finding available resources on any cloud, provisioning the VM, opening the ports, and serving the model. To serve Mixtral with a single instance, run the following command:
+
+```bash
+sky launch -c mixtral ./serve.yaml
+```
+
+Note that we specify the following resources, so that SkyPilot automatically [fails over](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) through all the candidate GPUs, in order of price, until it finds one that is available:
+
+```yaml
+resources:
+  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+```
+
+### Accessing the model
+
+We can now access the model through the OpenAI-compatible API with the IP and port:
+
+```bash
+IP=$(sky status --ip mixtral)
+
+curl -L http://$IP:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
+      "prompt": "My favourite condiment is",
+      "max_tokens": 25
+  }'
+```
+
+## 2. Serve with multiple instances
+
+When scaling up is required, SkyServe, a library built on top of SkyPilot, can help you scale the service across multiple instances while still exposing a single endpoint. To serve Mixtral with multiple instances, run the following command:
+
+```bash
+sky serve up -n mixtral ./serve.yaml
+```
+
+The additional `service` section specifies how to check the health of the service and enables automatic restarts when a replica fails unexpectedly:
+```yaml
+service:
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: mistralai/Mixtral-8x7B-Instruct-v0.1
+      messages:
+        - role: user
+          content: Hello! What is your name?
+    initial_delay_seconds: 1200
+  replica_policy:
+    min_replicas: 2
+    auto_restart: true
+```
+
+Optional: To cut the cost further by 3-4x, we can use spot instances as the replicas. SkyServe will automatically manage the spot instances, monitor prices and preemptions, and restart a replica when needed.
+To do so, we can add `use_spot: true` to the `resources` field, i.e.:
+```yaml
+resources:
+  use_spot: true
+  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+```
+
+### Accessing the model
+
+After the `sky serve up` command, there will be a single endpoint for the service. We can access the model through the OpenAI-compatible API at that endpoint:
+
+```bash
+ENDPOINT=$(sky serve status --endpoint mixtral)
+
+curl -L http://$ENDPOINT/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
+      "prompt": "My favourite condiment is",
+      "max_tokens": 25
+  }'
+```
+
+## 3. Official guide from Mistral AI
+
+Mistral.ai also provides a guide for launching the Mixtral 8x7B model with SkyPilot in their official docs. Please refer to [this link](https://docs.mistral.ai/self-deployment/skypilot) for more details.
+
+> Note: the Docker image in the official docs may not be updated yet, in which case vLLM fails with an error about missing support for the model. If that happens, feel free to build a new Docker image using the setup commands in our [serve.yaml](./serve.yaml) file instead.
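Reviewer sketch (not part of the patch): the `curl` calls in the README above can also be issued from Python. The block below is a minimal sketch using only the standard library, assuming the server exposes the OpenAI-style `/v1/completions` route and response shape that the README's `curl` example targets. The helper names `build_completion_request` and `complete` are hypothetical and exist only for illustration.

```python
import json
import urllib.request

# Model name passed to vLLM via --model in serve.yaml.
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"


def build_completion_request(endpoint: str, prompt: str,
                             max_tokens: int = 25) -> urllib.request.Request:
    """Build a POST request for the /v1/completions route.

    `endpoint` is "host:port", e.g. "$IP:8000" for a single instance or the
    value printed by `sky serve status --endpoint mixtral` for a service.
    """
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(endpoint: str, prompt: str, max_tokens: int = 25) -> str:
    """Send the request and return the first completion's text."""
    request = build_completion_request(endpoint, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumes the OpenAI completions response shape: choices[0].text.
    return body["choices"][0]["text"]
```

For example, `complete("1.2.3.4:8000", "My favourite condiment is")` would send the same payload as the `curl` command, with a placeholder address standing in for the real endpoint.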
diff --git a/llm/mixtral/serve.yaml b/llm/mixtral/serve.yaml
@@ -0,0 +1,44 @@
+# An example YAML for serving the Mixtral model from Mistral.ai with an OpenAI-compatible API.
+# Usage:
+#  1. Launch on a single instance: `sky launch -c mixtral ./serve.yaml`
+#  2. Scale up to multiple instances with a single endpoint:
+#     `sky serve up -n mixtral ./serve.yaml`
+service:
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: mistralai/Mixtral-8x7B-Instruct-v0.1
+      messages:
+        - role: user
+          content: Hello! What is your name?
+    initial_delay_seconds: 1200
+  replica_policy:
+    min_replicas: 2
+    auto_restart: true
+
+resources:
+  accelerators: {A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+  ports: 8000
+  disk_tier: high
+
+setup: |
+  conda activate mixtral
+  if [ $? -ne 0 ]; then
+    conda create -n mixtral -y python=3.10
+    conda activate mixtral
+  fi
+  # We have to manually install Torch otherwise apex & xformers won't build
+  pip list | grep torch || pip install "torch>=2.0.0"
+
+  pip list | grep vllm || pip install "git+https://github.com/vllm-project/vllm.git"
+  pip install git+https://github.com/huggingface/transformers
+  pip list | grep megablocks || pip install megablocks
+
+run: |
+  conda activate mixtral
+  export PATH=$PATH:/sbin
+  python -u -m vllm.entrypoints.openai.api_server \
+                --host 0.0.0.0 \
+                --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
+                --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE | tee ~/openai_api_server.log
+
diff --git a/sky/cli.py b/sky/cli.py
@@ -4278,13 +4278,12 @@ def serve_up(
         with ux_utils.print_exception_no_traceback():
             raise ValueError('Service section not found in the YAML file. '
                              'To fix, add a valid `service` field.')
-    assert len(task.resources) == 1
-    requested_resources = list(task.resources)[0]
-    if requested_resources.ports is None or len(requested_resources.ports) != 1:
-        with ux_utils.print_exception_no_traceback():
-            raise ValueError(
-                'Must only specify one port in resources. Each replica '
-                'will use the port specified as application ingress port.')
+    for requested_resources in task.resources:
+        if requested_resources.ports is None or len(requested_resources.ports) != 1:
+            with ux_utils.print_exception_no_traceback():
+                raise ValueError(
+                    'Must only specify one port in resources. Each replica '
+                    'will use the port specified as application ingress port.')
 
     click.secho('Service Spec:', fg='cyan')
     click.echo(task.service)
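Reviewer sketch (not part of the patch): the hunk above replaces the single-resource assertion with a loop, so that a task listing several accelerator candidates passes as long as every candidate declares exactly one port. The block below is a minimal standalone illustration of that check. The `Resources` dataclass is a hypothetical stand-in for SkyPilot's resources object and models only the fields used here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Resources:
    """Hypothetical stand-in for a SkyPilot resources candidate."""
    accelerators: str
    ports: Optional[List[str]] = None


def validate_single_port(candidates: Iterable[Resources]) -> None:
    """Raise ValueError unless every candidate specifies exactly one port."""
    for requested_resources in candidates:
        if (requested_resources.ports is None or
                len(requested_resources.ports) != 1):
            raise ValueError(
                'Must only specify one port in resources. Each replica '
                'will use the port specified as application ingress port.')


# Two accelerator candidates sharing one ingress port pass the check.
validate_single_port([
    Resources(accelerators='A100:4', ports=['8000']),
    Resources(accelerators='A100-80GB:2', ports=['8000']),
])
```

A candidate without `ports`, or with more than one port, makes the whole task fail validation, which matches the per-candidate loop in the patch.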

diff --git a/sky/serve/core.py b/sky/serve/core.py
@@ -56,13 +56,16 @@ def up(
         with ux_utils.print_exception_no_traceback():
             raise RuntimeError('Service section not found.')
 
-    assert len(task.resources) == 1, task
-    requested_resources = list(task.resources)[0]
-    if requested_resources.ports is None or len(requested_resources.ports) != 1:
-        with ux_utils.print_exception_no_traceback():
-            raise ValueError(
-                'Must only specify one port in resources. Each replica '
-                'will use the port specified as application ingress port.')
+    requested_cloud = None
+    for requested_resources in task.resources:
+        if requested_resources.ports is None or len(requested_resources.ports) != 1:
+            with ux_utils.print_exception_no_traceback():
+                raise ValueError(
+                    'Must only specify one port in resources. Each replica '
+                    'will use the port specified as application ingress port.')
+        if requested_cloud is None:
+            requested_cloud = requested_resources.cloud
+
 
     controller_utils.maybe_translate_local_file_mounts_and_sync_up(task,
                                                                    path='serve')
@@ -102,8 +105,10 @@ def up(
         controller_exist = (
             global_user_state.get_cluster_from_name(controller_name)
             is not None)
+
+
         controller_cloud = (
-            requested_resources.cloud if not controller_exist and
+            requested_cloud if not controller_exist and
             controller_resources.cloud is None else controller_resources.cloud)
         # TODO(tian): Probably run another sky.launch after we get the load
         # balancer port from the controller? So we don't need to open so many
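Reviewer sketch (not part of the patch): the two `sky/serve/core.py` hunks above first record a cloud from the task's resource candidates, then use it as the controller's cloud only when no controller exists yet and the controller resources do not pin a cloud. The block below restates that logic in isolation. `Resources` is a hypothetical stand-in, clouds are modeled as plain strings for illustration, and the function names are not from SkyPilot.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Resources:
    """Hypothetical stand-in; clouds are plain strings in this sketch."""
    cloud: Optional[str] = None


def first_requested_cloud(candidates: Iterable[Resources]) -> Optional[str]:
    """Return the first non-None cloud seen while iterating the candidates."""
    requested_cloud = None
    for requested_resources in candidates:
        if requested_cloud is None:
            requested_cloud = requested_resources.cloud
    return requested_cloud


def pick_controller_cloud(requested_cloud: Optional[str],
                          controller_cloud: Optional[str],
                          controller_exist: bool) -> Optional[str]:
    """Mirror the conditional expression in the second hunk."""
    if not controller_exist and controller_cloud is None:
        return requested_cloud
    return controller_cloud
```

One point worth checking in review: the removed code called `list(task.resources)[0]`, which suggests `task.resources` may be an unordered collection. If so, which cloud counts as "first" depends on iteration order whenever the candidates name different clouds.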