Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob][Status][15/n] Unify the codepath for the status transition to Suspended #1805

Merged
merged 2 commits into from
Jan 5, 2024

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jan 4, 2024

Why are these changes needed?

  • Currently, there are two paths to transition to the Suspended status.

    • Initializing -> Suspended
    • Running -> Suspending -> Suspended
  • The case "Initializing -> Suspended" is pretty tricky:

    • Case 1: Set suspend to true before creating the RayCluster.
      • The status will transition to Suspended directly.
    • Case 2: Set suspend to true after the RayCluster is created and before the K8s Job creation.
      • The status will first transition to Running and then to Suspending. Additionally, the user must wait until both the RayCluster and the Kubernetes Job are ready to transition to Running, which may take a considerable amount of time (depending on the time of pulling images).
  • This PR unifies the codepath for the status transition to Suspended.

    • Initializing -> Suspending -> Suspended
      • Users don't need to wait until the RayCluster and K8s Job are ready.
    • Running -> Suspending -> Suspended
  • Note

    • If we don't add WithStatusSubresource to the fake client, the function r.Status().Update(...) will complain that the RayJob cannot be found.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 changed the title WIP [RayJob][Status][15/n] Unify the codepath for the status transition to Suspended Jan 5, 2024
@kevin85421 kevin85421 marked this pull request as ready for review January 5, 2024 00:42
@kevin85421 kevin85421 requested a review from gvspraveen January 5, 2024 00:42
@@ -588,11 +570,6 @@ func (r *RayJobReconciler) getOrCreateRayClusterInstance(ctx context.Context, ra
return nil, err
}

// special case: don't create a cluster instance and don't return an error if the suspend flag of the job is true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1801 (comment) for more details.

err = r.updateState(ctx, rayJobInstance, nil, rayJobInstance.Status.JobStatus, rayv1.JobDeploymentStatusSuspending)
return ctrl.Result{RequeueAfter: RayJobDefaultRequeueDuration}, err
}
if shouldUpdate, err := r.updateStatusToSuspendingIfNeeded(ctx, rayJobInstance); shouldUpdate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks so much more cleaner now. thanks for making this change

@kevin85421 kevin85421 merged commit f654665 into ray-project:master Jan 5, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants