Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls #1743

Merged
merged 6 commits into from
Dec 14, 2023

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Dec 13, 2023

Why are these changes needed?

It is impossible to have a stable RayJob controller without a well-defined state machine. However, it is pretty hard to implement a state machine in the current code base because: (1) There are a lot of undefined JobDeploymentStatus (2) updateState is called everywhere.

  • This PR removes some JobDeploymentStatus:

    • JobDeploymentStatusFailedToGetOrCreateRayCluster: KubeRay doesn't use the status to make any decision.
    • JobDeploymentStatusWaitForK8sJob: KubeRay doesn't use the status to make any decision.
    • JobDeploymentStatusFailedJobDeploy: KubeRay doesn't use the status to make any decision.
    • JobDeploymentStatusFailedToGetJobStatus: It is used by a function shouldUpdateJobStatus. However, the check seems to be unnecessary. Both jobInfo.JobStatus and rayv1.JobDeploymentStatusRunning are the source of truth.
      if r.shouldUpdateJobStatus(rayJobInstance.Status.JobStatus, rayJobInstance.Status.JobDeploymentStatus, jobInfo) {
        err = r.updateState(ctx, rayJobInstance, jobInfo, jobInfo.JobStatus, rayv1.JobDeploymentStatusRunning, nil)
      }
  • This PR updates updateState(..., err error) to updateState(..., err error).

    • After I remove the JobDeploymentStatus mentioned above, only 1 updateState function call's err is not explicitly set to nil. However, if you trace the code, err must be nil in r.updateState(...rayv1.JobDeploymentStatusSuspended, err).
  • This PR removes numerous updateState function calls. It's preferable to minimize the number of places where the CR status is updated.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title [RayJob][Status][4/n] Remove JobDeploymentStatusFailedToGetOrCreateRayCluster [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls Dec 13, 2023
@kevin85421 kevin85421 marked this pull request as ready for review December 13, 2023 21:41
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is a welcome refactor!

In this PR we chose to remove a lot of JobDeploymentStatus values because those values are never used by KubeRay to make a decision. However, at first glance they might still have some value for the user when debugging a job. Are we making an explicit tradeoff here for a simpler state machine, at the expense of some debuggability? Or is it generally a best practice to only define status values if the controller uses the status to make a decision?

In any case, I think it's a fair tradeoff, since the user will typically just look at the logs anyway for debugging, so it's not that important to have fine-grained statuses. So I think the current PR is fine.

@@ -645,11 +633,6 @@ func (r *RayJobReconciler) getOrCreateRayClusterInstance(ctx context.Context, ra
return nil, err
}

// special case: is the job is complete status and cluster has been recycled.
if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) && rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need this anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to explain it in the PR description.

  1. If the status is JobDeploymentStatusComplete, KubeRay will skip the reconciliation right at the start of the reconcile process.

    if rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete {
    r.Log.Info("rayjob is complete, skip reconciliation", "rayjob", rayJobInstance.Name)
    return ctrl.Result{}, nil
    }

  2. If the status updates to JobDeploymentStatusComplete successfully, KubeRay will skip the reconciliation immediately.

    if isJobSucceedOrFail(rayJobInstance.Status.JobStatus) {
    // If the function `updateState` updates the JobStatus to Complete successfully, we can skip the reconciliation.
    rayClusterInstance := &rayv1.RayCluster{}
    rayClusterNamespacedName := types.NamespacedName{
    Namespace: rayJobInstance.Namespace,
    Name: rayJobInstance.Status.RayClusterName,
    }
    if err := r.Get(ctx, rayClusterNamespacedName, rayClusterInstance); err != nil {
    if !errors.IsNotFound(err) {
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
    }
    if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusComplete, nil); err != nil {
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
    }
    return ctrl.Result{}, nil
    }
    if rayClusterInstance.DeletionTimestamp != nil {
    if err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusComplete, nil); err != nil {
    return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
    }
    return ctrl.Result{}, nil
    }
    }

These are the only two occurrences of JobDeploymentStatusComplete, except for the deletion at L649. Therefore, the condition rayJobInstance.Status.JobDeploymentStatus == rayv1.JobDeploymentStatusComplete cannot be true, meaning the if statement will never be executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, makes sense!

@kevin85421
Copy link
Member Author

In this PR we chose to remove a lot of JobDeploymentStatus values because those values are never used by KubeRay to make a decision. However, at first glance they might still have some value for the user when debugging a job. Are we making an explicit tradeoff here for a simpler state machine, at the expense of some debuggability?

For improved observability, we could add a new string field (e.g., Details) to provide more detailed descriptions. I may add some states back if I find it necessary. At this moment, I just want to simplify the states as much as possible before I start to work on the new state machine.

Or is it generally a best practice to only define status values if the controller uses the status to make a decision?

In my opinion, each status should have a specific associated goal, and for different statuses to lead to different actions. The spark-on-k8s-operator has a similar pattern as RayJob, and its state machine is here. In the state machine of the Spark operator, the function syncSparkApplication makes the decision based on the status, and I plan to build a similar but simplified one for RayJob as the first step. We can add some new states when we decide to support some new features like retry. The following is my note about the state machine. There are three key functions:

Screen Shot 2023-12-13 at 4 58 35 PM

@architkulkarni
Copy link
Contributor

Got it, thanks a lot for the additional context!

@kevin85421 kevin85421 merged commit 62bbc13 into ray-project:master Dec 14, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants