[RayJob][Status][14/n] Decouple the Initializing status and Running status #1801

kevin85421 · 2024-01-03T20:17:02Z

Why are these changes needed?

Currently, the code paths for Initializing and Running are strongly coupled. This PR decouples them to facilitate further refactoring and to build a well-defined state machine.

Initializing is responsible for creating the RayCluster, setting DashboardURL, and creating the submitter K8s Job.
Running is responsible for monitoring JobStatus.
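For illustration, the intended split can be sketched as a switch in which each status owns exactly one set of responsibilities. This is a minimal sketch, not the controller code: the `rayJob` struct, its fields, the `reconcile` helper, and the placeholder dashboard URL are all hypothetical stand-ins.

```go
package main

import "fmt"

// JobDeploymentStatus is a simplified, hypothetical stand-in for the
// deployment status type used by the RayJob controller.
type JobDeploymentStatus string

const (
	StatusInitializing JobDeploymentStatus = "Initializing"
	StatusRunning      JobDeploymentStatus = "Running"
)

// rayJob is a hypothetical, trimmed-down view of the state this sketch needs.
type rayJob struct {
	Status         JobDeploymentStatus
	ClusterCreated bool
	DashboardURL   string
	SubmitterReady bool
	Polls          int // how many times JobStatus was checked while Running
}

// reconcile performs one step for the current status and nothing else.
func reconcile(job *rayJob) {
	switch job.Status {
	case StatusInitializing:
		// Initializing: create the RayCluster, set DashboardURL,
		// create the submitter K8s Job, then hand over to Running.
		job.ClusterCreated = true
		job.DashboardURL = "http://example-head-svc:8265" // placeholder value
		job.SubmitterReady = true
		job.Status = StatusRunning
	case StatusRunning:
		// Running: only monitor JobStatus.
		job.Polls++
	}
}

func main() {
	job := &rayJob{Status: StatusInitializing}
	reconcile(job) // Initializing -> Running
	reconcile(job) // Running: one monitoring pass
	fmt.Println(job.Status, job.Polls)
}
```

Keeping each case free of the other's work is what makes later transitions easier to add without touching the existing branches.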

Related issue number

Checks

All existing e2e tests cover this code path.

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

gvspraveen · 2024-01-03T22:41:15Z

ray-operator/controllers/ray/rayjob_controller.go

+		if rayClusterInstance, err = r.getOrCreateRayClusterInstance(ctx, rayJobInstance); err != nil {
+			return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
+		}
+		// If there is no cluster instance and no error, suspend the job deployment


When would this case happen? Why wouldn't getOrCreateRayClusterInstance fail with an err if no cluster instance was created?

Good question. Currently, the only scenario in which getOrCreateRayClusterInstance returns nil, nil is when suspend is set to true. That is not very straightforward, so I may refactor it in follow-up PRs.

kuberay/ray-operator/controllers/ray/rayjob_controller.go

Lines 578 to 581 in 448e33d

	// special case: don't create a cluster instance and don't return an error if the suspend flag of the job is true
	if rayJobInstance.Spec.Suspend {
		return nil, nil
	}
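To make the nil, nil case concrete, here is a minimal sketch of how a caller has to distinguish three outcomes: an error, a suspended job (no instance and no error), and a usable cluster. The `cluster` and `jobSpec` types, `getOrCreateCluster`, the `failCreate` switch, and `describe` are hypothetical names used only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster and jobSpec are hypothetical stand-ins for RayCluster and RayJob.Spec.
type cluster struct{ Name string }

type jobSpec struct {
	Name    string
	Suspend bool
}

var errCreateFailed = errors.New("create failed")

// getOrCreateCluster mimics the contract discussed above: it returns
// (nil, nil) only when the job is suspended. failCreate simulates an
// API error so the error path can be exercised.
func getOrCreateCluster(spec jobSpec, failCreate bool) (*cluster, error) {
	if spec.Suspend {
		// Special case: no cluster and no error.
		return nil, nil
	}
	if failCreate {
		return nil, errCreateFailed
	}
	return &cluster{Name: spec.Name + "-raycluster"}, nil
}

// describe classifies the three outcomes the caller must tell apart.
func describe(c *cluster, err error) string {
	switch {
	case err != nil:
		return "requeue with error"
	case c == nil:
		return "suspended"
	default:
		return "cluster ready: " + c.Name
	}
}

func main() {
	fmt.Println(describe(getOrCreateCluster(jobSpec{Name: "demo", Suspend: true}, false)))
	fmt.Println(describe(getOrCreateCluster(jobSpec{Name: "demo"}, true)))
	fmt.Println(describe(getOrCreateCluster(jobSpec{Name: "demo"}, false)))
}
```

The error check has to come first; checking for a nil instance before checking err would misreport a failed create as a suspension.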

gvspraveen · 2024-01-03T22:42:19Z

ray-operator/controllers/ray/rayjob_controller.go

+		}
+
+		// Ensure k8s job has been created
+		if err := r.createK8sJobIfNeed(ctx, rayJobInstance, rayClusterInstance); err != nil {


we should probably log the err here

All errors (err) are already logged immediately after they are caught. See createK8sJobIfNeed for more details. In addition, if the Reconcile function returns a non-nil err, the reconciler catches the error internally and logs it, and you will see a stack trace similar to the one below.

github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile
	/workspace/controllers/ray/rayjob_controller.go:280
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227

gvspraveen · 2024-01-03T22:46:55Z

ray-operator/controllers/ray/rayjob_controller.go

+		r.Log.Info("Both RayCluster and the submitter K8s Job are created. Transition the status from `Initializing` to `Running`.",
+			"RayJob", rayJobInstance.Name, "RayCluster", rayJobInstance.Status.RayClusterName)
+		if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusRunning); err != nil {
+			return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err


In this case, what happens if the reconciler runs again after RayJobDefaultRequeueDuration? Would we create a duplicate cluster at line 135?

There is a queue that stores resource events capable of triggering reconciliation. RequeueAfter: RayJobDefaultRequeueDuration means that an event for this RayJob custom resource is placed back in the queue, so reconciliation is triggered again once RayJobDefaultRequeueDuration has elapsed.

Would we create a duplicate cluster at line 135?

There is a delay between our local cache and the Kubernetes API server. This means the RayCluster may already exist in the Kubernetes API server while our local cache has not finished synchronizing. In that situation, the controller does not see the RayCluster in the cache and calls the Create function again.

To avoid creating the RayCluster twice, we use the same RayCluster name for the same RayJob custom resource in every reconciliation loop. Hence, if we call the Create function twice, the second call is rejected by the Kubernetes API server because a RayCluster with that name already exists.
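The effect of a stable name can be shown with a small simulation. This is only a sketch under stated assumptions: `fakeAPIServer` is an in-memory stand-in for the Kubernetes API server, the `-raycluster` suffix in `clusterNameFor` is a made-up naming scheme, and treating "already exists" as a harmless outcome in `ensureCluster` is a choice made for this example rather than a description of what the controller does with that error.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// fakeAPIServer is a hypothetical in-memory stand-in for the API server.
// It rejects a second Create for a name that is already stored.
type fakeAPIServer struct{ objects map[string]bool }

func (s *fakeAPIServer) Create(name string) error {
	if s.objects[name] {
		return errAlreadyExists
	}
	s.objects[name] = true
	return nil
}

// clusterNameFor derives the cluster name from the RayJob name alone,
// so every reconciliation loop computes the same name (hypothetical scheme).
func clusterNameFor(rayJobName string) string {
	return rayJobName + "-raycluster"
}

// ensureCluster simulates one reconcile pass. cache may lag behind the
// API server, in which case Create is attempted again and rejected.
// It reports whether this pass actually created the object.
func ensureCluster(api *fakeAPIServer, cache map[string]bool, rayJobName string) (bool, error) {
	name := clusterNameFor(rayJobName)
	if cache[name] {
		return false, nil // cache is up to date; nothing to do
	}
	err := api.Create(name)
	if errors.Is(err, errAlreadyExists) {
		return false, nil // assumption of this sketch: a duplicate is harmless
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	api := &fakeAPIServer{objects: map[string]bool{}}
	staleCache := map[string]bool{} // never synchronized in this simulation

	for pass := 1; pass <= 2; pass++ {
		created, err := ensureCluster(api, staleCache, "demo")
		fmt.Println("pass", pass, "created:", created, "err:", err)
	}
	fmt.Println("clusters on the server:", len(api.objects))
}
```

With real client code, the "already exists" condition would typically be detected with `apierrors.IsAlreadyExists` from `k8s.io/apimachinery/pkg/api/errors` rather than a sentinel error.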

aah cool. Thanks for confirming that

update

bdd9e85

kevin85421 changed the title ~~[RayJob][Status][14/n] Decouple the Initializing status and Running status~~ [WIP][RayJob][Status][14/n] Decouple the Initializing status and Running status Jan 3, 2024

fix lint

9e63b6e

kevin85421 changed the title ~~[WIP][RayJob][Status][14/n] Decouple the Initializing status and Running status~~ [RayJob][Status][14/n] Decouple the Initializing status and Running status Jan 3, 2024

kevin85421 marked this pull request as ready for review January 3, 2024 22:09

kevin85421 requested a review from gvspraveen January 3, 2024 22:14

kevin85421 assigned gvspraveen Jan 3, 2024

gvspraveen reviewed Jan 3, 2024

gvspraveen approved these changes Jan 3, 2024

kevin85421 merged commit a9c7abb into ray-project:master Jan 3, 2024
24 checks passed

kevin85421 mentioned this pull request Jan 5, 2024

[RayJob][Status][15/n] Unify the codepath for the status transition to Suspended #1805

Merged

4 tasks
