[RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay #1831
Conversation
cc @andrewsykim
```go
if !job.DeletionTimestamp.IsZero() {
	r.Log.Info("The Job deletion is ongoing.", "RayJob", rayJobInstance.Name, "Submitter K8s Job", job.Name)
} else {
	if err := r.Client.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground)); err != nil {
```
For future consideration: since we are controlling the lifecycle of the submitter job, it might be worthwhile to allow users to tune the TTL of only the submitter job. This way you can keep the logs from the job around without having to keep the cluster around.
(A common problem with RayJobs that are deleted immediately is that it can be difficult to troubleshoot long-running jobs, because everything is cleaned up immediately.)
> For future consideration: since we are controlling the lifecycle of the submitter job, it might be worthwhile to allow users to tune the TTL of only the submitter job. This way you can keep the logs from the job around without having to keep the cluster around.
This makes sense to me. Are you interested in opening a PR for this after this PR is merged? Thanks!
Sure thing!
Thanks! Would you mind opening an issue for it, so that I can assign it to you? (If I open the issue, I can't assign it to you unless you comment on it.)
Looks good to me!
I don't understand why it depends on

Ah thanks, makes sense. I misunderstood and thought `x` referred to all past successful RayJobs (including completely unrelated RayJobs).
Why are these changes needed?
If users create a RayJob with a buggy `entrypoint`, causing the submitter K8s Job to consistently fail, it won't respect the `ttlSecondsAfterFinished` value set by users. The submitter Job sends a request to the Ray head to create a Ray job. Due to the buggy `entrypoint`, the Ray job fails immediately, and KubeRay transitions the RayJob status to `Complete`. `ttlSecondsAfterFinished` seconds after the RayJob transitions to `Complete`, the RayCluster will be deleted.

The submitter K8s Job will retry 3 times by default. If the submitter Job attempts to send a request to the Ray head after the Ray head has been deleted, the request will time out and fail. I haven't figured out the exact timeout value, but based on my observation it's around 1 to 2 minutes. Therefore, the K8s Job may be deleted after the RayCluster is deleted, typically within `(x-1) * time_to_fail + n * timeout_seconds + ttlSecondsAfterFinished` seconds.

- `x`: the number of Ray jobs successfully submitted by KubeRay to the Ray head.
- `n`: the number of attempts made by the K8s Job to submit the Ray job to the RayCluster when the Ray head is unavailable. Currently, `n` is less than or equal to 2.

This PR decides not to use the K8s Job's built-in `ttlSecondsAfterFinished`, and instead has KubeRay control the entire lifecycle of the K8s submitter Job.

Related issue number
Checks
I performed an experiment based on this gist.

- `entrypoint: python /home/ray/samples/sample_code_1.py` => `sample_code_1.py` doesn't exist
- `ttlSecondsAfterFinished: 10`

In my experiment, with `x` as 2 and `n` as 1, the K8s Job is deleted 150 seconds after the RayCluster is deleted.
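The timing formula above can be sanity-checked against the observed delay. This is a rough back-of-the-envelope sketch, not an exact model: the `timeToFail` (~20 s) and `timeoutSeconds` (~120 s) values below are assumptions based on the "around 1 to 2 minutes" timeout noted above.

```go
package main

import "fmt"

// deletionDelaySeconds implements the formula from the PR description:
// (x-1) * time_to_fail + n * timeout_seconds + ttlSecondsAfterFinished.
func deletionDelaySeconds(x, n, timeToFail, timeoutSeconds, ttl int) int {
	return (x-1)*timeToFail + n*timeoutSeconds + ttl
}

func main() {
	// Experiment values: x = 2, n = 1, ttlSecondsAfterFinished = 10.
	// timeToFail and timeoutSeconds are assumed, not measured.
	fmt.Println(deletionDelaySeconds(2, 1, 20, 120, 10)) // prints 150
}
```

With these assumed constants the formula yields 150 seconds, consistent with the observed deletion of the K8s Job 150 seconds after the RayCluster.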