-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayJob][Status][14/n] Decouple the Initializing status and Running status #1801
Conversation
if rayClusterInstance, err = r.getOrCreateRayClusterInstance(ctx, rayJobInstance); err != nil { | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err | ||
} | ||
// If there is no cluster instance and no error, suspend the job deployment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would this case happen? Why wouldnt getOrCreateRayClusterInstance
fail with err if no cluster instance was created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Currently, the only scenario in which the function getOrCreateRayClusterInstance
returns nil, nil
is when suspend
is set to true. It is not that straight-forward. I may refactor it in the following PRs.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 578 to 581 in 448e33d
// special case: don't create a cluster instance and don't return an error if the suspend flag of the job is true | |
if rayJobInstance.Spec.Suspend { | |
return nil, nil | |
} |
} | ||
|
||
// Ensure k8s job has been created | ||
if err := r.createK8sJobIfNeed(ctx, rayJobInstance, rayClusterInstance); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably log the err here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All errors (err
) have already been logged immediately after they are caught. See createK8sJobIfNeed and createK8sJobIfNeed for more details. In addition, if the Reconcile
function returns a non-nil err
, the reconciler will catch the error internally, and you will see an error message similar to the one below.
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayJobReconciler).Reconcile
/workspace/controllers/ray/rayjob_controller.go:280
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227
r.Log.Info("Both RayCluster and the submitter K8s Job are created. Transition the status from `Initializing` to `Running`.", | ||
"RayJob", rayJobInstance.Name, "RayCluster", rayJobInstance.Status.RayClusterName) | ||
if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusRunning); err != nil { | ||
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case what happens if reconciler runs after RayJobDefaultRequeueDuration
. Would we create duplicate cluster in line 135 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a queue that stores resource events capable of triggering reconciliation. RequeueAfter: RayJobDefaultRequeueDuration
implies that an event for this RayJob custom resource is placed in the queue to trigger reconciliation after RayJobDefaultRequeueDuration
seconds.
Would we create duplicate cluster in line 135 ?
There is a delay between our local cache and the Kubernetes API server. This means the RayCluster may have already been created in the Kubernetes API server, but our local cache hasn't completed synchronization yet. Therefore, we call the Create function again to create the RayCluster.
To avoid the RayCluster double creation, we use the same RayCluster name for the same RayJob custom resource for each reconciliation loop. Hence, if we call the Create
function twice, the second one will be denied by the Kubernetes API server because the RayCluster already exists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aah cool. Thanks for confirming that
Why are these changes needed?
Currently, the code paths for
Initializing
andRunning
are strongly coupled. This PR decouples them to facilitate further refactoring and to build a well-defined state machine.Initializing
is responsible for creating the RayCluster, setting DashboardURL, and creating the submitter K8s Job.Running
is responsible for monitoringJobStatus
.Related issue number
Checks
All existing e2e tests cover this code path.