Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob][Status][14/n] Decouple the Initializing status and Running status #1801

Merged
merged 2 commits into from
Jan 3, 2024

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jan 3, 2024

Why are these changes needed?

Currently, the code paths for Initializing and Running are strongly coupled. This PR decouples them to facilitate further refactoring and to build a well-defined state machine.

  • Initializing is responsible for creating the RayCluster, setting DashboardURL, and creating the submitter K8s Job.
  • Running is responsible for monitoring JobStatus.

Related issue number

Checks

All existing e2e tests cover this code path.

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title [RayJob][Status][14/n] Decouple the Initializing status and Running status [WIP][RayJob][Status][14/n] Decouple the Initializing status and Running status Jan 3, 2024
@kevin85421 kevin85421 changed the title [WIP][RayJob][Status][14/n] Decouple the Initializing status and Running status [RayJob][Status][14/n] Decouple the Initializing status and Running status Jan 3, 2024
@kevin85421 kevin85421 marked this pull request as ready for review January 3, 2024 22:09
@kevin85421 kevin85421 requested a review from gvspraveen January 3, 2024 22:14
if rayClusterInstance, err = r.getOrCreateRayClusterInstance(ctx, rayJobInstance); err != nil {
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
// If there is no cluster instance and no error, suspend the job deployment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would this case happen? Why wouldnt getOrCreateRayClusterInstance fail with err if no cluster instance was created

Copy link
Member Author

@kevin85421 kevin85421 Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Currently, the only scenario in which the function getOrCreateRayClusterInstance returns nil, nil is when suspend is set to true. It is not that straight-forward. I may refactor it in the following PRs.

// special case: don't create a cluster instance and don't return an error if the suspend flag of the job is true
if rayJobInstance.Spec.Suspend {
return nil, nil
}

}

// Ensure k8s job has been created
if err := r.createK8sJobIfNeed(ctx, rayJobInstance, rayClusterInstance); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably log the err here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All errors (err) have already been logged immediately after they are caught. See createK8sJobIfNeed and createK8sJobIfNeed for more details. In addition, if the Reconcile function returns a non-nil err, the reconciler will catch the error internally, and you will see an error message similar to the one below.

github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile
	/workspace/controllers/ray/rayjob_controller.go:280
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227

r.Log.Info("Both RayCluster and the submitter K8s Job are created. Transition the status from `Initializing` to `Running`.",
"RayJob", rayJobInstance.Name, "RayCluster", rayJobInstance.Status.RayClusterName)
if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusRunning); err != nil {
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
Copy link
Contributor

@gvspraveen gvspraveen Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case what happens if reconciler runs after RayJobDefaultRequeueDuration. Would we create duplicate cluster in line 135 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a queue that stores resource events capable of triggering reconciliation. RequeueAfter: RayJobDefaultRequeueDuration implies that an event for this RayJob custom resource is placed in the queue to trigger reconciliation after RayJobDefaultRequeueDuration seconds.

Would we create duplicate cluster in line 135 ?

There is a delay between our local cache and the Kubernetes API server. This means the RayCluster may have already been created in the Kubernetes API server, but our local cache hasn't completed synchronization yet. Therefore, we call the Create function again to create the RayCluster.

To avoid the RayCluster double creation, we use the same RayCluster name for the same RayJob custom resource for each reconciliation loop. Hence, if we call the Create function twice, the second one will be denied by the Kubernetes API server because the RayCluster already exists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah cool. Thanks for confirming that

@kevin85421 kevin85421 merged commit a9c7abb into ray-project:master Jan 3, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants