[Doc] Customize KubeRay container commands (ray-project#41651)

Update doc for ray-project/kuberay#1704.

---------

Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
ShuN6211 · Dec 8, 2023 · 25d16d3
1 parent abf6fd2
commit 25d16d3
Showing 1 changed file with 70 additions and 103 deletions.
diff --git a/doc/source/cluster/kubernetes/user-guides/pod-command.md b/doc/source/cluster/kubernetes/user-guides/pod-command.md
@@ -1,18 +1,74 @@
 (kuberay-pod-command)=
 
 # Specify container commands for Ray head/worker Pods
-You can execute commands on the head/worker pods at two timings:
 
-* (1) **Before `ray start`**: As an example, you can set up some environment variables that will be used by `ray start`.
+KubeRay generates a `ray start` command for each Ray Pod.
+You may want to execute commands before or after the generated `ray start` command, or define the container's command yourself.
+This document shows you how to do that.
 
-* (2) **After `ray start` (RayCluster is ready)**: As an example, you can launch a Ray serve deployment when the RayCluster is ready.
+## Part 1: Specify a custom container command, optionally including the generated `ray start` command
 
-## Current KubeRay operator behavior for container commands
-* The current behavior for container commands is not finalized, and **may be updated in the future**.
-* See [code](https://github.com/ray-project/kuberay/blob/47148921c7d14813aea26a7974abda7cf22bbc52/ray-operator/controllers/ray/common/pod.go#L301-L326) for more details.
+Starting with KubeRay v1.1.0, if you add the annotation `ray.io/overwrite-container-cmd: "true"` to a RayCluster, KubeRay respects the container `command` and `args` that you provide and doesn't add any generated command, including the `ulimit` and `ray start` commands. KubeRay stores the generated `ray start` command in the environment variable `KUBERAY_GEN_RAY_START_CMD`.
 
-## Timing 1: Before `ray start`
-Currently, for timing (1), we can set the container's `Command` and `Args` in RayCluster specification to reach the goal.
+```yaml
+apiVersion: ray.io/v1
+kind: RayCluster
+metadata:
+  annotations:
+    # If this annotation is set to "true", KubeRay will respect the container `command` and `args`.
+    ray.io/overwrite-container-cmd: "true"
+  ...
+spec:
+  headGroupSpec:
+    rayStartParams: {}
+    # Pod template
+    template:
+      spec:
+        containers:
+        - name: ray-head
+          image: rayproject/ray:2.8.0
+          # Because the annotation "ray.io/overwrite-container-cmd" is set to "true",
+          # KubeRay replaces the generated container command with the `command` and
+          # `args` below. Hence, you need to specify the `ulimit` command
+          # yourself to avoid Ray scalability issues.
+          command: ["/bin/bash", "-lc", "--"]
+          # Starting from v1.1.0, KubeRay injects the environment variable `KUBERAY_GEN_RAY_START_CMD`
+          # into the Ray container. This variable can be used to retrieve the generated Ray start command.
+          # Note that this environment variable does not include the `ulimit` command.
+          args: ["ulimit -n 65536; echo head; $KUBERAY_GEN_RAY_START_CMD"]
+          ...
+```
+
+The preceding example YAML is a part of [ray-cluster.overwrite-command.yaml](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.overwrite-command.yaml).
+
+* `metadata.annotations.ray.io/overwrite-container-cmd: "true"`: This annotation tells KubeRay to respect the container `command` and `args` as provided by the users, without including any generated command.
+Refer to Part 2 for the default behavior if you set the annotation to "false" or don't set it at all.
+
+* `ulimit -n 65536`: This command is necessary to avoid Ray scalability issues caused by running out of file descriptors.
+If you don't set the annotation, KubeRay automatically injects the `ulimit` command into the container.
+
+* `$KUBERAY_GEN_RAY_START_CMD`: Starting from KubeRay v1.1.0, KubeRay injects the environment variable `KUBERAY_GEN_RAY_START_CMD` into the Ray container for both head and worker Pods to store the `ray start` command generated by KubeRay.
+Note that this environment variable doesn't include the `ulimit` command.
+  ```sh
+  # Example of the environment variable `KUBERAY_GEN_RAY_START_CMD` in the head Pod.
+  ray start --head  --dashboard-host=0.0.0.0  --num-cpus=1  --block  --metrics-export-port=8080  --memory=2147483648
+  ``` 
+
+The head Pod's `command` and `args` look like the following:
+
+```yaml
+Command:
+  /bin/bash
+  -lc
+  --
+Args:
+  ulimit -n 65536; echo head; $KUBERAY_GEN_RAY_START_CMD
+```
+
+## Part 2: Execute commands before the generated `ray start` command
+
+If you only want to execute commands before the generated command, you don't need to set the annotation `ray.io/overwrite-container-cmd: "true"`.
+Some users employ this method to set up environment variables used by `ray start`.
 
 ```yaml
 # https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
@@ -23,7 +79,7 @@ Currently, for timing (1), we can set the container's `Command` and `Args` in Ra
       spec:
         containers:
         - name: ray-head
-          image: rayproject/ray:2.5.0
+          image: rayproject/ray:2.8.0
           resources:
             ...
           ports:
@@ -33,12 +89,11 @@ Currently, for timing (1), we can set the container's `Command` and `Args` in Ra
           args: ["456"]
 ```
 
-* Ray head Pod
-    * `spec.containers.0.command` is hardcoded with `["/bin/bash", "-lc", "--"]`.
-    * `spec.containers.0.args` contains two parts:
-        * (Part 1) **user-specified command**: A string concatenates `headGroupSpec.template.spec.containers.0.command` from RayCluster and `headGroupSpec.template.spec.containers.0.args` from RayCluster together.
-        * (Part 2) **ray start command**: The command is created based on `rayStartParams` specified in RayCluster. The command will look like `ulimit -n 65536; ray start ...`.
-        * To summarize, `spec.containers.0.args` will be `$(user-specified command) && $(ray start command)`.
+* `spec.containers.0.command`: KubeRay hardcodes `["/bin/bash", "-lc", "--"]` as the container's command.
+* `spec.containers.0.args` contains two parts:
+  * **user-specified command**: A string that concatenates `headGroupSpec.template.spec.containers.0.command` and `headGroupSpec.template.spec.containers.0.args`.
+  * **ray start command**: KubeRay creates the command based on `rayStartParams` specified in RayCluster. The command looks like `ulimit -n 65536; ray start ...`.
+  * To summarize, `spec.containers.0.args` is `$(user-specified command) && $(ray start command)`.
 
 * Example
     ```sh
@@ -63,91 +118,3 @@ Currently, for timing (1), we can set the container's `Command` and `Args` in Ra
     # Args:
     #    echo 123  456  && ulimit -n 65536; ray start --head  --dashboard-host=0.0.0.0  --num-cpus=1  --block  --metrics-export-port=8080  --memory=2147483648
     ```
-
-
-## Timing 2: After `ray start` (RayCluster is ready)
-We have two solutions to execute commands after the RayCluster is ready. The main difference between these two solutions is users can check the logs via `kubectl logs` with Solution 1.
-
-### Solution 1: Container command (Recommended)
-As we mentioned in the section "Timing 1: Before `ray start`", user-specified command will be executed before the `ray start` command. Hence, we can execute the `ray_cluster_resources.sh` in background by updating `headGroupSpec.template.spec.containers.0.command` in `ray-cluster.head-command.yaml`.
-
-```yaml
-# https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
-# Parentheses for the command is required.
-command: ["(/home/ray/samples/ray_cluster_resources.sh&)"]
-
-# ray_cluster_resources.sh
-apiVersion: v1
-kind: ConfigMap
-metadata:
-  name: ray-example
-data:
-  ray_cluster_resources.sh: |
-    #!/bin/bash
-
-    # wait for ray cluster to finish initialization
-    while true; do
-        ray health-check 2>/dev/null
-        if [ "$?" = "0" ]; then
-            break
-        else
-            echo "INFO: waiting for ray head to start"
-            sleep 1
-        fi
-    done
-
-    # Print the resources in the ray cluster after the cluster is ready.
-    python -c "import ray; ray.init(); print(ray.cluster_resources())"
-
-    echo "INFO: Print Ray cluster resources"
-```
-
-* Example
-    ```sh
-    # (1) Update `command` to ["(/home/ray/samples/ray_cluster_resources.sh&)"]
-    # (2) Comment out `postStart` and `args`.
-    kubectl apply -f ray-cluster.head-command.yaml
-
-    # Check ${RAYCLUSTER_HEAD_POD}
-    kubectl get pod -l ray.io/node-type=head
-
-    # Check the logs
-    kubectl logs ${RAYCLUSTER_HEAD_POD}
-
-    # INFO: waiting for ray head to start
-    # .
-    # . => Cluster initialization
-    # .
-    # 2023-02-16 18:44:43,724 INFO worker.py:1231 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
-    # 2023-02-16 18:44:43,724 INFO worker.py:1352 -- Connecting to existing Ray cluster at address: 10.244.0.26:6379...
-    # 2023-02-16 18:44:43,735 INFO worker.py:1535 -- Connected to Ray cluster. View the dashboard at http://10.244.0.26:8265
-    # {'object_store_memory': 539679129.0, 'node:10.244.0.26': 1.0, 'CPU': 1.0, 'memory': 2147483648.0}
-    # INFO: Print Ray cluster resources
-    ```
-
-### Solution 2: postStart hook
-```yaml
-# https://github.com/ray-project/kuberay/ray-operator/config/samples/ray-cluster.head-command.yaml
-lifecycle:
-  postStart:
-    exec:
-      command: ["/bin/sh","-c","/home/ray/samples/ray_cluster_resources.sh"]
-```
-
-* We execute the script `ray_cluster_resources.sh` via the postStart hook. Based on [this document](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks), there is no guarantee that the hook will execute before the container ENTRYPOINT. Hence, we need to wait for RayCluster to finish initialization in `ray_cluster_resources.sh`.
-
-* Example
-    ```sh
-    kubectl apply -f ray-cluster.head-command.yaml
-
-    # Check ${RAYCLUSTER_HEAD_POD}
-    kubectl get pod -l ray.io/node-type=head
-
-    # Forward the port of Dashboard
-    kubectl port-forward --address 0.0.0.0 ${RAYCLUSTER_HEAD_POD} 8265:8265
-
-    # Open the browser and check the Dashboard (${YOUR_IP}:8265/#/job).
-    # You should see a SUCCEEDED job with the following Entrypoint:
-    #
-    # `python -c "import ray; ray.init(); print(ray.cluster_resources())"`
-    ```
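
As a rough illustration of the Part 2 rule in the updated guide, where `spec.containers.0.args` is `$(user-specified command) && $(ray start command)`, the following shell sketch assembles a string of that shape. `compose_args` is a hypothetical helper written for this note, not KubeRay code, and the `ray start` flags passed to it are placeholders.

```sh
# Hypothetical helper, not KubeRay code: joins a user-specified command with a
# generated `ray start` command in the shape the guide describes.
compose_args() {
  # $1: user-specified command (container command and args, already joined)
  # $2: generated ray start command (placeholder flags in this sketch)
  printf '%s && ulimit -n 65536; %s\n' "$1" "$2"
}

# Print the composed string for an example input.
compose_args "echo 123 456" "ray start --head --block"
```

Because `;` separates command lists in the shell, a failing user-specified command in a string of this shape skips only `ulimit -n 65536`, and the `ray start` part still runs.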