
[RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay #1831

Merged: 2 commits merged into ray-project:master on Jan 12, 2024

Conversation

@kevin85421 (Member) commented Jan 11, 2024

Why are these changes needed?

If a user creates a RayJob with a buggy entrypoint that causes the submitter K8s Job to fail consistently, the Job does not respect the ttlSecondsAfterFinished value set by the user.

The submitter Job sends a request to the Ray head to create a Ray job. Due to the buggy entrypoint, the Ray job fails immediately, and KubeRay transitions the RayJob status to Complete. After ttlSecondsAfterFinished seconds following the transition of the RayJob to Complete, the RayCluster will be deleted.

The submitter K8s Job will retry 3 times by default. If the submitter Job attempts to send a request to the Ray head after the Ray head has been deleted, it will result in a request timeout and failure. I haven't figured out the exact timeout value, but based on my observation, it's around 1 to 2 minutes. Therefore, the K8s Job may be deleted after the RayCluster is deleted, typically within (x-1) * time_to_fail + n * timeout_seconds + ttlSecondsAfterFinished seconds.

  • x: The count of Ray jobs successfully submitted by KubeRay to the Ray head.
  • n: The count of attempts made by the K8s Job to submit the Ray job to the RayCluster when the Ray head is unavailable. Currently, n is less than or equal to 2.
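The bound above can be sketched in Go. The helper name and the concrete durations (time to fail, request timeout) below are illustrative assumptions, not measured constants; only the formula itself comes from the description above:

```go
package main

import "fmt"

// worstCaseDeletionDelay computes the bound from the PR description:
// (x-1)*timeToFail + n*timeoutSeconds + ttlSecondsAfterFinished (all in seconds).
// x: Ray jobs successfully submitted; n: attempts made after the Ray head is gone.
func worstCaseDeletionDelay(x, n, timeToFail, timeoutSeconds, ttl int) int {
	return (x-1)*timeToFail + n*timeoutSeconds + ttl
}

func main() {
	// Assumed values: x=2, n=1, ~30s for a fast failure, ~110s request timeout
	// (within the observed "1 to 2 minutes"), ttlSecondsAfterFinished=10.
	fmt.Println(worstCaseDeletionDelay(2, 1, 30, 110, 10)) // 150
}
```

With these assumed inputs the bound comes to 150 seconds, the same ballpark as the experiment described below.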

This PR stops relying on the K8s Job's built-in ttlSecondsAfterFinished and instead has KubeRay control the entire lifecycle of the K8s submitter Job.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

I performed an experiment based on this gist.

  • entrypoint: python /home/ray/samples/sample_code_1.py => sample_code_1.py doesn't exist
  • ttlSecondsAfterFinished: 10

In my experiment, with x as 2 and n as 1, the K8s Job is deleted 150 seconds after the RayCluster is deleted.

@kevin85421 kevin85421 marked this pull request as ready for review January 12, 2024 00:29
@kevin85421 (Member Author):

cc @andrewsykim

```go
if !job.DeletionTimestamp.IsZero() {
	// A deletion is already in progress for the submitter Job; do not delete it again.
	r.Log.Info("The Job deletion is ongoing.", "RayJob", rayJobInstance.Name, "Submitter K8s Job", job.Name)
} else {
	// Delete the submitter Job; background propagation lets Kubernetes garbage-collect its Pods.
	if err := r.Client.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); err != nil {
```
Collaborator:

For future consideration: since we are controlling the lifecycle of the submitter job, it might be worthwhile to allow users to tune the TTL of only the submitter job. This way you can keep the logs from the job around without having to keep the cluster around.

Collaborator:

(a common problem with RayJobs that are deleted immediately is that it can be difficult to troubleshoot long-running jobs, because everything is cleaned up right away)

@kevin85421 (Member Author):

> For future consideration: since we are controlling the lifecycle of the submitter job, it might be worthwhile to allow users to tune the TTL of only the submitter job. This way you can keep the logs from the job around without having to keep the cluster around.

This makes sense to me. Are you interested in opening a PR for this after this PR is merged? Thanks!

Collaborator:

Sure thing!

@kevin85421 (Member Author):

Thanks! Would you mind opening an issue for it, so that I can assign it to you? (If I open the issue, I can't assign it to you unless you comment on it.)

@architkulkarni (Contributor) left a comment:

Looks good to me!

@architkulkarni (Contributor):

> Therefore, the K8s Job may be deleted after the RayCluster is deleted, typically within (x-1) * time_to_fail + n * timeout_seconds + ttlSecondsAfterFinished seconds.
>
> x: The count of Ray jobs successfully submitted by KubeRay to the Ray head.

I don't understand why it depends on x, is that just something you experimentally observed or do you know the explanation?

@kevin85421 (Member Author):

> I don't understand why it depends on x, is that just something you experimentally observed or do you know the explanation?

  1. The submitter Job successfully submits the request to the Ray head.
  2. (1) The submitter Job fails the first time. (2) The Ray job within the RayCluster enters Failed, a terminal state.
  3. The KubeRay operator transitions the RayJob CR to Complete.
  4. The TTL is reached. The deletion of the RayCluster begins.
  5. The submitter Job creates a new Pod and attempts to submit the request to the Ray head again. Since the Pods have not been completely terminated, the Ray head successfully receives the request.
  6. The submitter Job fails for the second time.
  7. All Ray Pods have been completely deleted.
  8. The submitter Job creates a new Pod and attempts to submit the request to the Ray head again. However, as the head Pod no longer exists, it will not fail immediately but will wait until the request times out.
  9. The Job fails three times, reaching the backoffLimit.
  10. The TTL is reached. The deletion of the Job begins.
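Under assumed durations (fast failures while the head is alive, a long timeout once it is gone), the retry sequence above can be sketched as a small simulation; simulateSubmitterRetries and all concrete durations are illustrative, not operator code:

```go
package main

import "fmt"

// simulateSubmitterRetries models the sequence above: attempts made while the
// Ray head still exists fail quickly (timeToFail seconds), and attempts made
// after the head Pod is gone hang until the request times out (timeoutSeconds).
// It returns the total seconds the submitter spends failing before exhausting
// backoffLimit.
func simulateSubmitterRetries(backoffLimit, headAliveAttempts, timeToFail, timeoutSeconds int) int {
	total := 0
	for attempt := 1; attempt <= backoffLimit; attempt++ {
		if attempt <= headAliveAttempts {
			total += timeToFail // steps 2 and 6: head reachable, submission fails fast
		} else {
			total += timeoutSeconds // step 8: head gone, request waits for the timeout
		}
	}
	return total
}

func main() {
	// Default backoffLimit of 3, head alive for the first 2 attempts,
	// assuming ~30s to fail fast and ~110s request timeout.
	fmt.Println(simulateSubmitterRetries(3, 2, 30, 110)) // 170
}
```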

@kevin85421 kevin85421 merged commit c55f3cc into ray-project:master Jan 12, 2024
24 checks passed
@architkulkarni (Contributor):

Ah thanks, makes sense. I misunderstood and thought x referred to all past successful RayJobs (including completely unrelated RayJobs)
