Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayJob] Support ActiveDeadlineSeconds #1933

Merged
merged 5 commits into from
Feb 21, 2024

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Feb 21, 2024

Why are these changes needed?

  • Initially, StartTime refers to the time when the Ray job starts running on the RayCluster. This PR redefines StartTime to the time when JobDeploymentStatus transitioned from New to Initializing.
  • ActiveDeadlineSeconds: Similar to ActiveDeadlineSeconds in K8s Job https://kubernetes.io/docs/concepts/workloads/controllers/job/.
  • We don't check ActiveDeadlineSeconds if the RayJob is in New, Complete, Suspending, and Suspended.
  • We only check ActiveDeadlineSeconds if the RayJob is in the Initializing or Running state, which may occupy some computing resources.
    • If the RayJob passes ActiveDeadlineSeconds, KubeRay will transition the status to Complete.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>

// TODO (kevin85421): For light-weight mode, calculate the number of failed retries and transition
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may consider implementing backoffLimit if ActiveDeadlineSeconds proves to be insufficient.

@kevin85421 kevin85421 marked this pull request as ready for review February 21, 2024 03:26
@kevin85421 kevin85421 requested a review from jjyao February 21, 2024 03:27
@kevin85421
Copy link
Member Author

cc @andrewsykim would you mind reviewing this PR?

@@ -114,7 +115,6 @@ type RayJobStatus struct {
RayClusterStatus RayClusterStatus `json:"rayClusterStatus,omitempty"`
// observedGeneration is the most recent generation observed for this RayJob. It corresponds to the
// RayJob's generation, which is updated on mutation by the API Server.
// +optional
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found that it is useless. There is no change in the CRD YAML after I remove the +optional.

@@ -142,6 +142,11 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request)
if shouldUpdate := r.updateStatusToSuspendingIfNeeded(ctx, rayJobInstance); shouldUpdate {
break
}
if pastActiveDeadline(rayJobInstance) {
r.Log.Info("The RayJob has passed the activeDeadlineSeconds. Transition the status to `Complete`.", "StartTime", rayJobInstance.Status.StartTime, "ActiveDeadlineSeconds", *rayJobInstance.Spec.ActiveDeadlineSeconds)
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusComplete
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think exceeding deadline seconds warrants a new status DeadlineExceeded?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not -- if user sees status Complete, are the ray-operator logs the only way to know if a job successfully completed or timed out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will improve the observability of Complete before KubeRay v1.1.0 release. Last week, I chatted with the Flyte community, and they require sufficient observability to distinguish between "Succeeded" and "Failed" statuses, including the reasons for each (e.g. backoffLimit, app-level bug, DeadlineExceeded).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a Reason field sgtm 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yicheng-Lu-llll will take this issue.

@kevin85421 kevin85421 merged commit 0074129 into ray-project:master Feb 21, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants