Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob] Fix RayJob status reconciliation #1539

Merged
merged 6 commits into from
Oct 24, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
[RayJob]: Set RayJob as the owner of the batch Job
astefanutti committed Oct 23, 2023
commit 456188f4c7b5180a611dd1235bdd158b997eb24d
6 changes: 3 additions & 3 deletions ray-operator/controllers/ray/rayjob_controller.go
Original file line number Diff line number Diff line change
@@ -343,7 +343,7 @@ func (r *RayJobReconciler) getOrCreateK8sJob(ctx context.Context, rayJobInstance
r.Log.Error(err, "failed to get submitter template")
return "", false, err
}
return r.createNewK8sJob(ctx, rayJobInstance, submitterTemplate, rayClusterInstance)
return r.createNewK8sJob(ctx, rayJobInstance, submitterTemplate)
}

// Some other error occurred while trying to get the Job
@@ -395,7 +395,7 @@ func (r *RayJobReconciler) getSubmitterTemplate(rayJobInstance *rayv1.RayJob, ra
}

// createNewK8sJob creates a new Kubernetes Job. It returns the Job's name and a boolean indicating whether a new Job was created.
func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *rayv1.RayJob, submitterTemplate v1.PodTemplateSpec, rayClusterInstance *rayv1.RayCluster) (string, bool, error) {
func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *rayv1.RayJob, submitterTemplate v1.PodTemplateSpec) (string, bool, error) {
job := &batchv1.Job{
ObjectMeta: metav1.ObjectMeta{
Name: rayJobInstance.Name,
@@ -407,7 +407,7 @@ func (r *RayJobReconciler) createNewK8sJob(ctx context.Context, rayJobInstance *
}

// Set the ownership in order to do the garbage collection by k8s.
if err := ctrl.SetControllerReference(rayClusterInstance, job, r.Scheme); err != nil {
if err := ctrl.SetControllerReference(rayJobInstance, job, r.Scheme); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might implement retry logic for RayJob in the future. If RayJob retries, it will spawn a new RayCluster and a new K8s Job. Therefore, the controller for the K8s Job should be the RayCluster.

Copy link
Contributor Author

@astefanutti astefanutti Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will require more thoughts, depending on the evolution of the retry mechanism, and the RayCluster reuse case. Yet at the moment, as a user, I find it surprising that the Job is not cascade-deleted when I delete the RayJob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yet at the moment, as a user, I find it surprising that the Job is not cascade-deleted when I delete the RayJob.

Really? It should be deleted. If not, it is a bug. RayJob is the controller reference of the RayCluster, and the RayCluster is the controller reference of the Kubernetes Job. Hence, I expected that the K8s Job should be removed if RayJob is deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the case the RayJob does not own the RayCluster, that is when it's created with a clusterSelector field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2023-10-19 at 9 53 36 AM

I conducted an experiment, and I do not observe the behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the case the RayJob does not own the RayCluster, that is when it's created with a clusterSelector field.

Got it. We do not encourage users to use that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may deprecate it in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ignore the comment at https://github.com/ray-project/kuberay/pull/1539/files#r1365854531. I hadn't seen your reply when I wrote mine.

r.Log.Error(err, "failed to set controller reference")
return "", false, err
}