
[RayJob][Status][12/n] Resume suspended RayJob #1783

Merged: 2 commits into ray-project:master on Dec 29, 2023

Conversation

kevin85421 (Member) commented on Dec 29, 2023

Why are these changes needed?

#1782 mentions that if users don't set shutDownAfterFinishes, the submitter Kubernetes Job will become Complete after suspension. When we resume the RayJob, a new Kubernetes Job will not be created because one already exists. Note that we currently use the RayJob name as the submitter Job's name to avoid duplicate creation.

  • Kubernetes Job cleanup options considered:
    • Default deletion policy: the default deletion policy of a K8s Job is orphanDependents. Hence, if you delete the Job via the client-go API, the associated Pods will not be deleted; if you use kubectl to delete the Job, the related Pods will be deleted.
    • ttlSecondsAfterFinished: the approach taken in this PR.
    • suspend: when you suspend a Job, any running Pods that don't have a status of Completed will be terminated with a SIGTERM signal. However, in our case, the submitter Job becomes Complete after the KubeRay operator sends the StopJob() request to the Ray head, so suspending the K8s Job and then resuming it doesn't work.
    • Manually delete both the K8s Job and its Pods: this works, but ttlSecondsAfterFinished is a more K8s-native solution.

This PR deletes the submitter Kubernetes Job when the RayJob is suspended by updating the Job's ttlSecondsAfterFinished.
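
For illustration, a minimal sketch of that approach, assuming a controller-runtime client and illustrative package/function names (the actual reconciler code in this PR is structured differently):

package rayjob // illustrative package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteSubmitterJobOnSuspend is a sketch: setting TTLSecondsAfterFinished to 0 asks the
// TTL-after-finished controller to delete the finished submitter Job and, cascadingly, its
// Pods, so a fresh submitter Job can be created when the RayJob is resumed.
func deleteSubmitterJobOnSuspend(ctx context.Context, c client.Client, job *batchv1.Job) error {
	job.Spec.TTLSecondsAfterFinished = pointer.Int32(0)
	return c.Update(ctx, job)
}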

Follow-up: Currently, suspend is reversible, meaning that users can set suspend to true while the RayJob is Initializing or Running and then revert it to false before the status becomes Suspended. However, this may lead to unexpected side effects, so we should make this operation irreversible.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

// Since the name of the Kubernetes Job is the same as the RayJob, we need to set the TTL of the Kubernetes Job to clean
// up the Kubernetes Job and its Pods when suspending, and a new submitter Kubernetes Job must be created to resubmit the
// Ray job if the RayJob is resumed.
if isSuspend {
kevin85421 (Member, Author):

We can't use Spec.Suspend because it may be set to true when the RayJob is in a Complete state, which is not allowed to transition to Suspended.

// Refresh the RayJob status
rayJob = GetRayJob(test, rayJob.Namespace, rayJob.Name)

test.T().Logf("Resume the RayJob by updating `suspend` to false.")
kevin85421 (Member, Author):

@astefanutti Do you have any recommendations on whether we should split Suspend and Resume into two tests that run in order, or is it acceptable to exercise both in a single test? Thanks!

kevin85421 (Member, Author):

I will merge this PR to move forward. If you have any recommendations, I can address them in subsequent PRs for further improvements. Thanks!

astefanutti (Contributor):

@kevin85421 sorry for the delay, I've just returned from PTO.

This looks like a very good e2e test already, though I could suggest having:

  • A separate test that closely covers the Kueue happy-path scenario, i.e., creating a suspended RayJob (with .spec.suspend = true at creation time, as Kueue does via a webhook), resuming it manually, and checking successful completion (see the sketch after this list).
  • Some other tests that cover the other allowed transitions from / to the suspended state.
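
A hedged sketch of the first suggested test, reusing only the client calls and the GetRayJob helper visible in this PR's test snippets; newSuspendedRayJob and waitForJobDeploymentStatus are hypothetical placeholders, not the actual e2e framework API:

// Sketch of the Kueue happy-path e2e test. Helpers marked "hypothetical" do not
// necessarily exist in the KubeRay test framework under these names.
test.T().Logf("Create a RayJob with `suspend` set to true at creation time, as Kueue's webhook would.")
rayJob := newSuspendedRayJob(namespace.Name) // hypothetical: builds a RayJob with Spec.Suspend = true
rayJob, err := test.Client().Ray().RayV1().RayJobs(namespace.Name).Create(test.Ctx(), rayJob, metav1.CreateOptions{})
test.Expect(err).NotTo(HaveOccurred())

// Wait until the operator reports the Suspended state (hypothetical wait helper).
waitForJobDeploymentStatus(test, rayJob.Namespace, rayJob.Name, "Suspended")

test.T().Logf("Resume the RayJob by updating `suspend` to false.")
rayJob = GetRayJob(test, rayJob.Namespace, rayJob.Name)
rayJob.Spec.Suspend = false
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{})
test.Expect(err).NotTo(HaveOccurred())

// Check that the RayJob runs to successful completion (hypothetical wait helper).
waitForJobDeploymentStatus(test, rayJob.Namespace, rayJob.Name, "Complete")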

kevin85421 (Member, Author):

Thanks! I have already opened #1800 to track the progress.

kevin85421 marked this pull request as ready for review on December 29, 2023, 02:22.
return err
}
}
job.Spec.TTLSecondsAfterFinished = pointer.Int32(0)
kevin85421 (Member, Author):

I have currently set this to zero to avoid some race conditions. We can set a non-zero value after the suspend operation becomes irreversible. Additionally, the original suspend support (0.5.0?) also does not include a grace period.

architkulkarni (Contributor) left a comment:

Looks good to me!
Just to confirm:

  • The underlying pod is not deleted in this PR, right? What's the plan for how to delete the pod in the followup?
  • Thanks for adding the test. Does the test fail without the code changes in this PR?

kevin85421 (Member, Author):

> The underlying pod is not deleted in this PR, right? What's the plan for how to delete the pod in the followup?

No. The K8s Job and the Pods created by it will be deleted when we suspend a RayJob because we set ttlSecondsAfterFinished. The official Kubernetes docs say that "When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together with the Job.".

> Does the test fail without the code changes in this PR?

Yes.

kevin85421 merged commit 349068d into ray-project:master on Dec 29, 2023 (24 checks passed).
astefanutti (Contributor) left a comment:

A couple of notes that could be useful:

  • The default deletion policy of K8s Job is orphanDependents. Hence, if you delete the Job via the client-go API, the associated Pods will not be deleted. If you use kubectl to delete the Job, the related Pods will be deleted.

    It's possible to define the deletion propagation using the PropagationPolicy option from metav1.DeleteOptions, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on TTLSecondsAfterFinished (see the sketch after this list).

  • Kueue does not admit RayJobs with shutDownAfterFinishes = false, so at the KubeRay level, we may want to consider this an invalid configuration for suspended RayJobs.
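
For illustration, a minimal sketch of the first note using the controller-runtime client (the package and function names are assumptions, not KubeRay's actual code):

package rayjob // illustrative package name

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteSubmitterJob deletes the submitter Job and lets Kubernetes garbage-collect its Pods
// by setting an explicit deletion propagation policy instead of relying on TTLSecondsAfterFinished.
func deleteSubmitterJob(ctx context.Context, c client.Client, job *batchv1.Job) error {
	return c.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground))
}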

test.T().Logf("Suspend the RayJob %s/%s", rayJob.Namespace, rayJob.Name)
rayJob.Spec.Suspend = true
// TODO (kevin85421): We may need to retry `Update` if 409 conflict makes the test flaky.
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Update(test.Ctx(), rayJob, metav1.UpdateOptions{})
astefanutti (Contributor):

A patch operation in that case could avoid dealing with conflicts.

kevin85421 (Member, Author):

I am not familiar with patch. As I understand it, since a patch is not protected by resourceVersion, it could potentially lead to race conditions. Is it safe to use the patch operation in our scenario?

astefanutti (Contributor):

That depends on the type of the patch request. Server-side apply has a conflict resolution mechanism relying on managed fields, but for the others, like JSON patch, merge patch, or strategic merge patch, there is no optimistic locking mechanism in place (such as checking resourceVersion).

That being said, in our scenario here, I think it's safe because 1) the patch is localised to .spec.suspend, and 2) the operator is not supposed to mutate the RayJob spec often, and even if/when it does, it would either do an update operation that would fail because of a resourceVersion mismatch, or do patch operations on fields other than .spec.suspend.
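
As a hedged illustration of the discussion above, a merge patch limited to .spec.suspend, written in the style of this PR's test snippets (assumes the standard client-gen Patch signature and the imports "k8s.io/apimachinery/pkg/types" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"):

// Resume the RayJob with a merge patch that touches only .spec.suspend, so there is no
// resourceVersion check and no 409 conflict to retry.
patch := []byte(`{"spec":{"suspend":false}}`)
rayJob, err = test.Client().Ray().RayV1().RayJobs(namespace.Name).Patch(
	test.Ctx(), rayJob.Name, types.MergePatchType, patch, metav1.PatchOptions{})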

kevin85421 (Member, Author):

> It's possible to define the deletion propagation using the PropagationPolicy option from metav1.DeleteOptions, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on TTLSecondsAfterFinished.

Thanks! #1791 has already updated TTLSecondsAfterFinished to PropagationPolicy(metav1.DeletePropagationBackground). cc @rueian

> Kueue does not admit RayJobs with shutDownAfterFinishes = false, so at the KubeRay level, we may want to consider this an invalid configuration for suspended RayJobs.

Good to know! Thanks!

kevin85421 mentioned this pull request on Jan 3, 2024.
astefanutti (Contributor):

> It's possible to define the deletion propagation using the PropagationPolicy option from metav1.DeleteOptions, either with the client-go (generated) client or the controller-runtime client. It may provide cleaner semantics than relying on TTLSecondsAfterFinished.

> Thanks! #1791 has already updated TTLSecondsAfterFinished to PropagationPolicy(metav1.DeletePropagationBackground). cc @rueian

Oh I missed that 👍🏼. Thanks!
