
[RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay #1831

Merged: 2 commits merged into ray-project:master on Jan 12, 2024

Conversation

@kevin85421 (Member) commented Jan 11, 2024

Why are these changes needed?

If a user creates a RayJob with a buggy entrypoint that causes the submitter K8s Job to fail consistently, the Job does not respect the ttlSecondsAfterFinished value set by the user.

The submitter Job sends a request to the Ray head to create a Ray job. Due to the buggy entrypoint, the Ray job fails immediately, and KubeRay transitions the RayJob status to Complete. After ttlSecondsAfterFinished seconds following the transition of the RayJob to Complete, the RayCluster will be deleted.

The submitter K8s Job will retry 3 times by default. If the submitter Job attempts to send a request to the Ray head after the Ray head has been deleted, it will result in a request timeout and failure. I haven't figured out the exact timeout value, but based on my observation, it's around 1 to 2 minutes. Therefore, the K8s Job may be deleted after the RayCluster is deleted, typically within (x-1) * time_to_fail + n * timeout_seconds + ttlSecondsAfterFinished seconds.

  • x: The count of Ray jobs successfully submitted by KubeRay to the Ray head.
  • n: The count of attempts made by the K8s Job to submit the Ray job to the RayCluster when the Ray head is unavailable. Currently, n is less than or equal to 2.
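The bound above can be sketched in Go. The helper name and the concrete durations (time to fail, request timeout) below are illustrative assumptions, not measured constants; only the formula itself comes from the description above:

```go
package main

import "fmt"

// worstCaseDeletionDelay computes the bound from the PR description:
// (x-1)*timeToFail + n*timeoutSeconds + ttlSecondsAfterFinished (all in seconds).
// x: Ray jobs successfully submitted; n: attempts made after the Ray head is gone.
func worstCaseDeletionDelay(x, n, timeToFail, timeoutSeconds, ttl int) int {
	return (x-1)*timeToFail + n*timeoutSeconds + ttl
}

func main() {
	// Assumed values: x=2, n=1, ~30s for a fast failure, ~110s request timeout
	// (within the observed "1 to 2 minutes"), ttlSecondsAfterFinished=10.
	fmt.Println(worstCaseDeletionDelay(2, 1, 30, 110, 10)) // 150
}
```

With these assumed inputs the bound comes to 150 seconds, the same ballpark as the experiment described below.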

This PR stops relying on the K8s Job's built-in ttlSecondsAfterFinished and instead has KubeRay control the entire lifecycle of the K8s submitter Job.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

I performed an experiment based on this gist.

  • entrypoint: python /home/ray/samples/sample_code_1.py => sample_code_1.py doesn't exist
  • ttlSecondsAfterFinished: 10

In my experiment, with x as 2 and n as 1, the K8s Job is deleted 150 seconds after the RayCluster is deleted.

@kevin85421 kevin85421 marked this pull request as ready for review January 12, 2024 00:29
@kevin85421 (Member Author):

cc @andrewsykim

```go
if !job.DeletionTimestamp.IsZero() {
	// A deletion is already in progress for the submitter Job; do not delete it again.
	r.Log.Info("The Job deletion is ongoing.", "RayJob", rayJobInstance.Name, "Submitter K8s Job", job.Name)
} else {
	// Delete the submitter Job; background propagation lets Kubernetes garbage-collect its Pods.
	if err := r.Client.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); err != nil {
```
Collaborator:

For future consideration: since we are controlling the lifecycle of the submitter job, it might be worthwhile to allow users to tune the TTL of only the submitter job. This way you can keep the logs from the job around without having to keep the cluster around.

Collaborator:

(a common problem with RayJobs that are deleted immediately is that it can be difficult to troubleshoot long-running jobs, because everything is cleaned up right away)

@kevin85421 (Member Author):

> For future consideration: since we are controlling the lifecycle of the submitter job, it might be worthwhile to allow users to tune the TTL of only the submitter job. This way you can keep the logs from the job around without having to keep the cluster around.

This makes sense to me. Are you interested in opening a PR for this after this PR is merged? Thanks!

Collaborator:

Sure thing!

@kevin85421 (Member Author):

Thanks! Would you mind opening an issue for it, so that I can assign it to you? (If I open the issue, I can't assign it to you unless you comment on it.)

@architkulkarni (Contributor) left a comment:

Looks good to me!

@architkulkarni (Contributor):

> Therefore, the K8s Job may be deleted after the RayCluster is deleted, typically within (x-1) * time_to_fail + n * timeout_seconds + ttlSecondsAfterFinished seconds.
>
> x: The count of Ray jobs successfully submitted by KubeRay to the Ray head.

I don't understand why it depends on x, is that just something you experimentally observed or do you know the explanation?

@kevin85421 (Member Author):

> I don't understand why it depends on x, is that just something you experimentally observed or do you know the explanation?

  1. The submitter Job successfully submits the request to the Ray head.
  2. (1) The submitter Job fails the first time. (2) The Ray job within the RayCluster enters Failed, a terminal state.
  3. The KubeRay operator transitions the RayJob CR to Complete.
  4. The TTL is reached. The deletion of the RayCluster begins.
  5. The submitter Job creates a new Pod and attempts to submit the request to the Ray head again. Since the Pods have not been completely terminated, the Ray head successfully receives the request.
  6. The submitter Job fails for the second time.
  7. All Ray Pods have been completely deleted.
  8. The submitter Job creates a new Pod and attempts to submit the request to the Ray head again. However, as the head Pod no longer exists, it will not fail immediately but will wait until the request times out.
  9. The Job fails three times, reaching the backoffLimit.
  10. The TTL is reached. The deletion of the Job begins.
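Under assumed durations (fast failures while the head is alive, a long timeout once it is gone), the retry sequence above can be sketched as a small simulation; simulateSubmitterRetries and all concrete durations are illustrative, not operator code:

```go
package main

import "fmt"

// simulateSubmitterRetries models the sequence above: attempts made while the
// Ray head still exists fail quickly (timeToFail seconds), and attempts made
// after the head Pod is gone hang until the request times out (timeoutSeconds).
// It returns the total seconds the submitter spends failing before exhausting
// backoffLimit.
func simulateSubmitterRetries(backoffLimit, headAliveAttempts, timeToFail, timeoutSeconds int) int {
	total := 0
	for attempt := 1; attempt <= backoffLimit; attempt++ {
		if attempt <= headAliveAttempts {
			total += timeToFail // steps 2 and 6: head reachable, submission fails fast
		} else {
			total += timeoutSeconds // step 8: head gone, request waits for the timeout
		}
	}
	return total
}

func main() {
	// Default backoffLimit of 3, head alive for the first 2 attempts,
	// assuming ~30s to fail fast and ~110s request timeout.
	fmt.Println(simulateSubmitterRetries(3, 2, 30, 110)) // 170
}
```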

@kevin85421 kevin85421 merged commit c55f3cc into ray-project:master Jan 12, 2024
24 checks passed
@architkulkarni (Contributor):

Ah thanks, makes sense. I misunderstood and thought x referred to all past successful RayJobs (including completely unrelated RayJobs)
