
[RayJob][Status][12/n] Resume suspended RayJob #1783

Merged: 2 commits into ray-project:master on Dec 29, 2023

Conversation

kevin85421 (Member) commented on Dec 29, 2023

Why are these changes needed?

#1782 mentions that if users don't set shutDownAfterFinishes, the submitter Kubernetes Job will become Complete after suspension. When we resume the RayJob, a new Kubernetes Job will not be created because one already exists. Note that we currently use the RayJob name as the submitter Job's name to avoid duplicate creation.

  • Kubernetes Job cleanup options considered:
    • Default deletion policy: the default deletion policy of a K8s Job is orphanDependents. Hence, if you delete the Job via the client-go API, the associated Pods will not be deleted; if you use kubectl to delete the Job, the related Pods will be deleted.
    • ttlSecondsAfterFinished: the approach taken in this PR.
    • suspend: when you suspend a Job, any running Pods that don't have a status of Completed will be terminated with a SIGTERM signal. However, in our case, the submitter Job becomes Complete after the KubeRay operator sends the StopJob() request to the Ray head, so suspending the K8s Job and then resuming it doesn't work.
    • Manually delete both the K8s Job and its Pods: this works, but ttlSecondsAfterFinished is a more K8s-native solution.

This PR deletes the submitter Kubernetes Job when the RayJob is suspended by updating the Job's ttlSecondsAfterFinished.
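
For illustration, a minimal sketch of that approach, assuming a controller-runtime client and illustrative package/function names (the actual reconciler code in this PR is structured differently):

package rayjob // illustrative package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteSubmitterJobOnSuspend is a sketch: setting TTLSecondsAfterFinished to 0 asks the
// TTL-after-finished controller to delete the finished submitter Job and, cascadingly, its
// Pods, so a fresh submitter Job can be created when the RayJob is resumed.
func deleteSubmitterJobOnSuspend(ctx context.Context, c client.Client, job *batchv1.Job) error {
	job.Spec.TTLSecondsAfterFinished = pointer.Int32(0)
	return c.Update(ctx, job)
}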

Follow-up: Currently, suspend is reversible, meaning that users can set suspend to true while the RayJob is Initializing or Running and then revert it to false before the status becomes Suspended. However, this may lead to unexpected side effects, so we should make this operation irreversible.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

// Since the name of the Kubernetes Job is the same as the RayJob, we need to set the TTL of the Kubernetes Job to clean
// up the Kubernetes Job and its Pods when suspending, and a new submitter Kubernetes Job must be created to resubmit the
// Ray job if the RayJob is resumed.
if isSuspend {
kevin85421 (Member, Author):

We can't use Spec.Suspend because it may be set to true when the RayJob is in a Complete state, which is not allowed to transition to Suspended.

// Refresh the RayJob status
rayJob = GetRayJob(test, rayJob.Namespace, rayJob.Name)

test.T().Logf("Resume the RayJob by updating `suspend` to false.")
kevin85421 (Member, Author):

@astefanutti Do you have any recommendations on whether we should split Suspend and Resume into two tests that run in order, or is it acceptable to exercise both in a single test? Thanks!

kevin85421 (Member, Author):

I will merge this PR to move forward. If you have any recommendations, I can address them in subsequent PRs for further improvements. Thanks!

astefanutti (Contributor):

@kevin85421 sorry for the delay, I've just returned from PTO.

This looks like a very good e2e test already, though I could suggest having:

  • A separate test that closely covers the Kueue happy-path scenario, i.e., creating a suspended RayJob (with .spec.suspend = true at creation time, as Kueue does via a webhook), resuming it manually, and checking successful completion (see the sketch after this list).
  • Some other tests that cover the other allowed transitions from / to the suspended state.
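
A hedged sketch of the first suggested test, reusing only the client calls and the GetRayJob helper visible in this PR's test snippets; newSuspendedRayJob and waitForJobDeploymentStatus are hypothetical placeholders, not the actual e2e framework API:

// Sketch of the Kueue happy-path e2e test. Helpers marked "hypothetical" do not
// necessarily exist in the KubeRay test framework under these names.
test.T().Logf("Create a RayJob with `suspend` set to true at creation time, as Kueue's webhook would.")
rayJob := newSuspendedRayJob(namespace.Name) // hypothetical: builds a RayJob with Spec.Suspend = true
rayJob, err := test.Client().Ray().RayV1().RayJobs(namespace.Name).Create(test.Ctx(), rayJob, metav1.CreateOptions{})
test.Expect(err).NotTo(HaveOccurred())

// Wait until the operator reports the Suspended state (hypothetical wait helper).
waitForJobDeploymentStatus(test, rayJob.Namespace, rayJob.Name, "Suspended")

test.T().Logf("Resume the RayJob by updating `suspend` to false.")
rayJob = GetRayJob(test, rayJob.Namespace, rayJob.Name)
rayJob.Spec.Suspend = false
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{})
test.Expect(err).NotTo(HaveOccurred())

// Check that the RayJob runs to successful completion (hypothetical wait helper).
waitForJobDeploymentStatus(test, rayJob.Namespace, rayJob.Name, "Complete")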

kevin85421 (Member, Author):

Thanks! I have already opened #1800 to track the progress.

kevin85421 marked this pull request as ready for review on December 29, 2023, 02:22.
return err
}
}
job.Spec.TTLSecondsAfterFinished = pointer.Int32(0)
kevin85421 (Member, Author):

I have currently set this to zero to avoid some race conditions. We can set a non-zero value after the suspend operation becomes irreversible. Additionally, the original suspend support (0.5.0?) also does not include a grace period.

architkulkarni (Contributor) left a comment:

Looks good to me!
Just to confirm:

  • The underlying pod is not deleted in this PR, right? What's the plan for how to delete the pod in the followup?
  • Thanks for adding the test. Does the test fail without the code changes in this PR?

kevin85421 (Member, Author):

> The underlying pod is not deleted in this PR, right? What's the plan for how to delete the pod in the followup?

No. The K8s Job and the Pods created by it will be deleted when we suspend a RayJob because we set ttlSecondsAfterFinished. The official Kubernetes docs say that "When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together with the Job.".

> Does the test fail without the code changes in this PR?

Yes.

kevin85421 merged commit 349068d into ray-project:master on Dec 29, 2023 (24 checks passed).
astefanutti (Contributor) left a comment:

A couple of notes that could be useful:

  • The default deletion policy of K8s Job is orphanDependents. Hence, if you delete the Job via the client-go API, the associated Pods will not be deleted. If you use kubectl to delete the Job, the related Pods will be deleted.

    It's possible to define the deletion propagation using the PropagationPolicy option from metav1.DeleteOptions, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on TTLSecondsAfterFinished (see the sketch after this list).

  • Kueue does not admit RayJobs with shutDownAfterFinishes = false, so at the KubeRay level, we may want to consider this an invalid configuration for suspended RayJobs.
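
For illustration, a minimal sketch of the first note using the controller-runtime client (the package and function names are assumptions, not KubeRay's actual code):

package rayjob // illustrative package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteSubmitterJob deletes the submitter Job and lets Kubernetes garbage-collect its Pods
// by setting an explicit deletion propagation policy instead of relying on TTLSecondsAfterFinished.
func deleteSubmitterJob(ctx context.Context, c client.Client, job *batchv1.Job) error {
	return c.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground))
}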

test.T().Logf("Suspend the RayJob %s/%s", rayJob.Namespace, rayJob.Name)
rayJob.Spec.Suspend = true
// TODO (kevin85421): We may need to retry `Update` if 409 conflict makes the test flaky.
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{})
astefanutti (Contributor):

A patch operation in that case could avoid dealing with conflicts.

kevin85421 (Member, Author):

I am not familiar with patch. As I understand it, since a patch is not protected by resourceVersion, it could potentially lead to race conditions. Is it safe to use the patch operation in our scenario?

astefanutti (Contributor):

That depends on the type of the patch request. Server-side apply has a conflict resolution mechanism relying on managed fields, but for the others, like JSON patch, merge patch, or strategic merge patch, there is no optimistic locking mechanism in place (such as checking resourceVersion).

That being said, in our scenario here, I think it's safe because 1) the patch is localised to .spec.suspend, and 2) the operator is not supposed to mutate the RayJob spec often, and even if/when it does, it would either do an update operation that would fail because of a resourceVersion mismatch, or do patch operations on fields other than .spec.suspend.
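
As a hedged illustration of the discussion above, a merge patch limited to .spec.suspend, written in the style of this PR's test snippets (assumes the standard client-gen Patch signature and the imports "k8s.io/apimachinery/pkg/types" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"):

// Resume the RayJob with a merge patch that touches only .spec.suspend, so there is no
// resourceVersion check and no 409 conflict to retry.
patch := []byte(`{"spec":{"suspend":false}}`)
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Patch(
	test.Ctx(), rayJob.Name, types.MergePatchType, patch, metav1.PatchOptions{})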

kevin85421 (Member, Author):

> It's possible to define the deletion propagation using the PropagationPolicy option from metav1.DeleteOptions, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on TTLSecondsAfterFinished.

Thanks! #1791 has already updated TTLSecondsAfterFinished to PropagationPolicy(metav1.DeletePropagationBackground). cc @rueian

> Kueue does not admit RayJobs with shutDownAfterFinishes = false, so at the KubeRay level, we may want to consider this an invalid configuration for suspended RayJobs.

Good to know! Thanks!

kevin85421 mentioned this pull request on Jan 3, 2024.
astefanutti (Contributor):

> It's possible to define the deletion propagation using the PropagationPolicy option from metav1.DeleteOptions, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on TTLSecondsAfterFinished.

> Thanks! #1791 has already updated TTLSecondsAfterFinished to PropagationPolicy(metav1.DeletePropagationBackground). cc @rueian

Oh I missed that 👍🏼. Thanks!
