-
Notifications
You must be signed in to change notification settings - Fork 430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayJob] Support ActiveDeadlineSeconds #1933
[RayJob] Support ActiveDeadlineSeconds #1933
Conversation
|
||
// TODO (kevin85421): For light-weight mode, calculate the number of failed retries and transition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may consider implementing backoffLimit if ActiveDeadlineSeconds proves to be insufficient.
cc @andrewsykim would you mind reviewing this PR? |
@@ -114,7 +115,6 @@ type RayJobStatus struct { | |||
RayClusterStatus RayClusterStatus `json:"rayClusterStatus,omitempty"` | |||
// observedGeneration is the most recent generation observed for this RayJob. It corresponds to the | |||
// RayJob's generation, which is updated on mutation by the API Server. | |||
// +optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found that it is useless. There is no change in the CRD YAML after I remove the +optional
.
@@ -142,6 +142,11 @@ func (r *RayJobReconciler) Reconcile(ctx context.Context, request ctrl.Request) | |||
if shouldUpdate := r.updateStatusToSuspendingIfNeeded(ctx, rayJobInstance); shouldUpdate { | |||
break | |||
} | |||
if pastActiveDeadline(rayJobInstance) { | |||
r.Log.Info("The RayJob has passed the activeDeadlineSeconds. Transition the status to `Complete`.", "StartTime", rayJobInstance.Status.StartTime, "ActiveDeadlineSeconds", *rayJobInstance.Spec.ActiveDeadlineSeconds) | |||
rayJobInstance.Status.JobDeploymentStatus = rayv1.JobDeploymentStatusComplete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think exceeding deadline seconds warrants a new status DeadlineExceeded
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not -- if user sees status Complete
, are the ray-operator logs the only way to know if a job successfully completed or timed out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will improve the observability of Complete
before KubeRay v1.1.0 release. Last week, I chatted with the Flyte community, and they require sufficient observability to distinguish between "Succeeded" and "Failed" statuses, including the reasons for each (e.g. backoffLimit, app-level bug, DeadlineExceeded).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a Reason field sgtm 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yicheng-Lu-llll will take this issue.
Why are these changes needed?
StartTime
refers to the time when the Ray job starts running on the RayCluster. This PR redefinesStartTime
to the time whenJobDeploymentStatus
transitioned fromNew
toInitializing
.ActiveDeadlineSeconds
: Similar toActiveDeadlineSeconds
in K8s Job https://kubernetes.io/docs/concepts/workloads/controllers/job/.ActiveDeadlineSeconds
if the RayJob is inNew
,Complete
,Suspending
, andSuspended
.ActiveDeadlineSeconds
if the RayJob is in theInitializing
orRunning
state, which may occupy some computing resources.ActiveDeadlineSeconds
, KubeRay will transition the status toComplete
.Related issue number
Checks