[RayJob][Status][12/n] Resume suspended RayJob #1783
Conversation
// Since the name of the Kubernetes Job is the same as the RayJob, we need to set the TTL of the Kubernetes Job to clean
// up the Kubernetes Job and its Pods when suspending, and a new submitter Kubernetes Job must be created to resubmit the
// Ray job if the RayJob is resumed.
if isSuspend {
We can't use `Spec.Suspend` because it may be set to true when the RayJob is in a `Complete` state, which is not allowed to transition to `Suspended`.
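For illustration, the kind of guard this implies might look like the following (a hypothetical sketch, not the actual PR code; the `rayv1.JobDeploymentStatusComplete` constant and the surrounding variable names are assumptions):

```go
// Hypothetical sketch: derive the cleanup decision from the status transition the
// controller is about to perform, rather than from Spec.Suspend alone, since
// Spec.Suspend may be true while the RayJob is already Complete, and Complete must
// not transition to Suspended.
isSuspend := rayJob.Spec.Suspend &&
	rayJob.Status.JobDeploymentStatus != rayv1.JobDeploymentStatusComplete
if isSuspend {
	// Set the submitter Job's TTL so the Job and its Pods are cleaned up before resuming.
}
```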
// Refresh the RayJob status
rayJob = GetRayJob(test, rayJob.Namespace, rayJob.Name)

test.T().Logf("Resume the RayJob by updating `suspend` to false.")
@astefanutti Do you have any recommendations on whether we should split `Suspend` and `Resume` into two tests and run those two tests in order, or is it acceptable to run both of them in a single test? Thanks!
I will merge this PR to move forward. If you have any recommendations, I can address them in subsequent PRs for further improvements. Thanks!
@kevin85421 sorry for the delay, I've just returned from PTO.
This looks like a very good e2e test already, though I could suggest to have:
- A separate test that would closely cover the Kueue happy-path scenario, i.e., creating a suspended RayJob, that is, with `.spec.suspend = true` at creation time (this is done by Kueue with a webhook), resuming it manually, and checking successful completion (a rough sketch follows below).
- Some other tests that would cover the other allowed transitions from / to the suspended state.
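A rough sketch of that Kueue happy-path test could look like the following (hypothetical; helper names that are not quoted elsewhere in this review, such as `With`, `NewTestNamespace`, `newRayJob`, `waitForRayJobDeploymentStatus`, and `waitForRayJobSucceeded`, are assumptions rather than the actual test-support API):

```go
// Hypothetical e2e test: create the RayJob already suspended (as Kueue's webhook would),
// resume it manually, and check that it completes successfully.
func TestRayJobSuspendedAtCreation(t *testing.T) {
	test := With(t)
	namespace := test.NewTestNamespace()

	rayJob := newRayJob(namespace.Name) // hypothetical helper building a minimal RayJob
	rayJob.Spec.Suspend = true
	rayJob, err := test.Client().Ray().RayV1().RayJobs(namespace.Name).Create(test.Ctx(), rayJob, metav1.CreateOptions{})
	if err != nil {
		test.T().Fatal(err)
	}

	// Wait for the Suspended status, then resume by flipping `suspend` back to false.
	waitForRayJobDeploymentStatus(test, rayJob, rayv1.JobDeploymentStatusSuspended) // hypothetical helper
	rayJob = GetRayJob(test, rayJob.Namespace, rayJob.Name)
	rayJob.Spec.Suspend = false
	if _, err := test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{}); err != nil {
		test.T().Fatal(err)
	}

	// Check successful completion.
	waitForRayJobSucceeded(test, rayJob) // hypothetical helper
}
```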
Thanks! I have already opened #1800 to track the progress.
		return err
	}
}
job.Spec.TTLSecondsAfterFinished = pointer.Int32(0)
I have currently set this to zero to avoid some race conditions. We can set a non-zero value after the `suspend` operation becomes irreversible. Additionally, the original `suspend` support (0.5.0?) also does not include a grace period.
Looks good to me!
Just to confirm:
- The underlying pod is not deleted in this PR, right? What's the plan for how to delete the pod in the followup?
- Thanks for adding the test. Does the test fail without the code changes in this PR?
No. The K8s Job and the Pods created by the K8s Job will be deleted when we suspend the RayJob.
Yes.
A couple of notes that could be useful:
- The default deletion policy of a K8s Job is `orphanDependents`. Hence, if you delete the Job via the client-go API, the associated Pods will not be deleted. If you use kubectl to delete the Job, the related Pods will be deleted. It's possible to define the deletion propagation using the `PropagationPolicy` option from `metav1.DeleteOptions`, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on `TTLSecondsAfterFinished` (see the sketch after this list).
- Kueue does not admit a RayJob with `shutdownAfterJobFinishes = false`, so at the KubeRay level, we may want to consider this an invalid configuration for suspended RayJobs.
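For example, a cascading delete with an explicit propagation policy might look like this (a minimal sketch; `k8sClient` is assumed to be a client-go `kubernetes.Interface`, and the namespace/name values are placeholders):

```go
// Minimal sketch: delete the submitter Job and ask the API server to cascade the
// deletion to its Pods, instead of relying on ttlSecondsAfterFinished.
policy := metav1.DeletePropagationBackground
err := k8sClient.BatchV1().Jobs(rayJob.Namespace).Delete(ctx, rayJob.Name, metav1.DeleteOptions{
	PropagationPolicy: &policy,
})
if err != nil && !errors.IsNotFound(err) {
	return err
}
```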
test.T().Logf("Suspend the RayJob %s/%s", rayJob.Namespace, rayJob.Name) | ||
rayJob.Spec.Suspend = true | ||
// TODO (kevin85421): We may need to retry `Update` if 409 conflict makes the test flaky. | ||
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{}) |
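If the 409 conflicts mentioned in the TODO show up, client-go's conflict-retry helper is one option (a sketch reusing the client calls quoted above; `retry` is `k8s.io/client-go/util/retry`):

```go
// Sketch: re-read the RayJob and retry the update whenever the API server returns
// a 409 Conflict, instead of letting the test flake.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	rayJob := GetRayJob(test, rayJob.Namespace, rayJob.Name)
	rayJob.Spec.Suspend = true
	_, err := test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{})
	return err
})
```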
A patch operation in that case could avoid dealing with conflicts.
I am not familiar with `patch`. As I understand it, since it is not protected by `resourceVersion`, it could potentially lead to race conditions. Is it safe to use the `patch` operation in our scenario?
That depends on the type of the patch request. Server-side apply has a conflict-resolution mechanism relying on managed fields, but for the others, like JSON patch, merge patch, or strategic merge patch, there is no optimistic locking mechanism in place (such as checking `resourceVersion`).
That being said, in our scenario here, I think it's safe because 1) the patch is localised to `.spec.suspend`, and 2) the operator is not supposed to mutate the RayJob spec often, and even if/when it does, it would either do an update operation that would fail because of a `resourceVersion` mismatch, or do patch operations on fields other than `.spec.suspend`.
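For reference, a merge patch scoped to `.spec.suspend` with the generated RayJob client might look roughly like this (a sketch; `types` is `k8s.io/apimachinery/pkg/types`):

```go
// Sketch: resume the RayJob with a merge patch that only touches .spec.suspend.
// There is no resourceVersion check, which is acceptable here because the patch is
// localised to this single field.
patch := []byte(`{"spec":{"suspend":false}}`)
rayJob, err := test.Client().Ray().RayV1().RayJobs(namespace.Name).Patch(
	test.Ctx(), rayJob.Name, types.MergePatchType, patch, metav1.PatchOptions{})
if err != nil {
	test.T().Fatal(err)
}
```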
Thanks! #1791 has already updated
Good to know! Thanks!
Oh I missed that 👍🏼. Thanks!
Why are these changes needed?
#1782 mentions that if users don't set `shutdownAfterJobFinishes`, the submitter Job will become `Complete` after suspension. When we resume the RayJob, a new Kubernetes Job will not be created because one already exists. Note that we currently use the RayJob name for the submitter Job to avoid duplicate creation.

There are a few ways to clean up the submitter Kubernetes Job:

- Delete the Job directly: the default deletion policy of a K8s Job is `orphanDependents`. Hence, if you delete the Job via the client-go API, the associated Pods will not be deleted. If you use `kubectl` to delete the Job, the related Pods will be deleted.
- `ttlSecondsAfterFinished`: set the TTL on the finished Job so Kubernetes garbage-collects the Job and its Pods.
- `suspend`: when you suspend a Job, any running Pods that don't have a status of Completed will be terminated with a SIGTERM signal. However, in our case, the submitter Job will become `Complete` after the KubeRay operator sends the `StopJob()` request to the Ray head. Hence, suspending the K8s Job and then resuming the Job doesn't work.

`ttlSecondsAfterFinished` is a more k8s-native solution. This PR deletes the Kubernetes Job when it is suspended by updating its `ttlSecondsAfterFinished`.
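The resulting suspend path therefore only updates the existing submitter Job; roughly (a sketch, not the exact diff; `k8sClient` and the error handling are illustrative):

```go
// Sketch: when the RayJob is being suspended, set the finished submitter Job's TTL to 0
// so the Job and its Pods are garbage-collected, allowing a fresh submitter Job to be
// created when the RayJob is resumed.
job.Spec.TTLSecondsAfterFinished = pointer.Int32(0)
if _, err := k8sClient.BatchV1().Jobs(rayJob.Namespace).Update(ctx, job, metav1.UpdateOptions{}); err != nil {
	return err
}
```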
Follow-up: Currently, `suspend` is reversible, meaning that users can set `suspend` to true while in `Initializing` and `Running`, and then revert it to false before the status becomes `Suspended`. However, this may lead to unexpected side effects. Therefore, we should make this operation irreversible.

Related issue number
Checks