-
Notifications
You must be signed in to change notification settings - Fork 438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls #1743
[RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls #1743
Conversation
12251df
to
a7c6f9f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is a welcome refactor!
In this PR we chose to remove a lot of JobDeploymentStatus values because those values are never used by KubeRay to make a decision. However, at first glance they might still have some value for the user when debugging a job. Are we making an explicit tradeoff here for a simpler state machine, at the expense of some debuggability? Or is it generally a best practice to only define status values if the controller uses the status to make a decision?
In any case, I think it's a fair tradeoff, since the user will typically just look at the logs anyway for debugging, so it's not that important to have fine-grained statuses. So I think the current PR is fine.
@@ -645,11 +633,6 @@ func (r *RayJobReconciler) getOrCreateRayClusterInstance(ctx context.Context, ra | |||
return nil, err | |||
} | |||
|
|||
// special case: is the job is complete status and cluster has been recycled. | |||
if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) && rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we need this anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to explain it in the PR description.
-
If the status is
JobDeploymentStatusComplete
, KubeRay will skip the reconciliation right at the start of the reconcile process.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 128 to 131 in f56c66f
if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete { r.Log.Info("rayjob is complete, skip reconciliation", "rayjob", rayJobInstance.Name) return ctrl.Result{}, nil } -
If the status updates to
JobDeploymentStatusComplete
successfully, KubeRay will skip the reconciliation immediately.
kuberay/ray-operator/controllers/ray/rayjob_controller.go
Lines 138 to 161 in f56c66f
if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) { // If the function `updateState` updates the JobStatus to Complete successfully, we can skip the reconciliation. rayClusterInstance := &rayv1.RayCluster{} rayClusterNamespacedName := types.NamespacedName{ Namespace: rayJobInstance.Namespace, Name: rayJobInstance.Status.RayClusterName, } if err := r.Get(ctx, rayClusterNamespacedName, rayClusterInstance); err != nil { if !errors.IsNotFound(err) { return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err } if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusComplete, nil); err != nil { return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err } return ctrl.Result{}, nil } if rayClusterInstance.DeletionTimestamp != nil { if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusComplete, nil); err != nil { return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err } return ctrl.Result{}, nil } }
These are the only two occurrences of JobDeploymentStatusComplete
, except for the deletion at L649. Therefore, the condition rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete
cannot be true, meaning the if
statement will never be executed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, makes sense!
For improved observability, we could add a new string field (e.g.,
In my opinion, each status should have a specific associated goal, and for different statuses to lead to different actions. The spark-on-k8s-operator has a similar pattern as RayJob, and its state machine is here. In the state machine of the Spark operator, the function |
Got it, thanks a lot for the additional context! |
Why are these changes needed?
It is impossible to have a stable RayJob controller without a well-defined state machine. However, it is pretty hard to implement a state machine in the current code base because: (1) There are a lot of undefined
JobDeploymentStatus
(2)updateState
is called everywhere.This PR removes some
JobDeploymentStatus
:JobDeploymentStatusFailedToGetOrCreateRayCluster
: KubeRay doesn't use the status to make any decision.JobDeploymentStatusWaitForK8sJob
: KubeRay doesn't use the status to make any decision.JobDeploymentStatusFailedJobDeploy
: KubeRay doesn't use the status to make any decision.JobDeploymentStatusFailedToGetJobStatus
: It is used by a functionshouldUpdateJobStatus
. However, the check seems to be unnecessary. BothjobInfo.JobStatus
andrayv1.JobDeploymentStatusRunning
are the source of truth.This PR updates
updateState(..., err error)
toupdateState(..., err error)
.JobDeploymentStatus
mentioned above, only 1updateState
function call'serr
is not explicitly set tonil
. However, if you trace the code,err
must benil
inr.updateState(...rayv1.JobDeploymentStatusSuspended, err)
.This PR removes numerous
updateState
function calls. It's preferable to minimize the number of places where the CR status is updated.Related issue number
Checks