KEP-3322: add a new field maxRestartTimesOnFailure to podSpec #3339
Conversation
cc @wojtek-t PTAL, thanks a lot.

We're generally doing PRR once you already have SIG approval.

cc @dchen1107 for sig-node side review, also cc @hex108
This KEP is helpful, especially for pods that hold a large resource set, such as JVM-based pods. We give these kinds of pods a high limit threshold to speed up their startup; a restartPolicy of Always can make this worse, even crashing the node. In the old days, daemon control tools like supervisorctl had a startretries mechanism to limit the maximum number of startup retries, but Kubernetes has no replacement for it.
> Pros:
> * BackoffLimitPerIndex can reuse this functionality and no longer needs to consider the restart times per index. Specifically, it can avoid using the annotation, and works as a high-level control by watching the pod status.
I disagree. We will always need the annotation because Pods can be recreated, and we want to count failures across recreations. But the Job controller can definitely benefit from this feature by setting a different `maxRestarts` every time a pod is recreated.
I think a better "pro" is the fact that a container can be restarted in the kubelet much faster than a pod can be recreated.
OK, the implementation details of `BackoffLimitPerIndex` are beyond the scope of the KEP.
I actually agree with Aldo.
It's not an implementation detail - it's a fundamental thing of "pod recreations". If we want to track something across pod recreations (which is the case for jobs), maxRestarts won't solve it on its own - but actually may help with it.
> OK, the implementation details of BackoffLimitPerIndex is something beyond the KEP.
It's not currently stated in the Non-Goals. But I think doing so would be a mistake. Even if we don't have the full semantics for how the pod level restart limit would work with backoffLimit or backoffLimitPerIndex, we should have some idea of how they could work together. Otherwise we could end up in another scenario where features work independently, but when paired together are mostly useless or have very strange semantics that are hard for users to understand, and hard to maintain.
> Otherwise we could end up in another scenario where features work independently, but when paired together are mostly useless or have very strange semantics that are hard for users to understand, and hard to maintain.
I like this analysis.
> Cons:
> * If we support max restart times with `restartPolicy=Never`, it semantically means _we can never
Why is this a con? We can just validate that `restartPolicy != Never`.
This is talking about the cons of supporting `restartPolicy=Never`.
> #### Caveats
>
> - When a Job's template is set with restartPolicy=OnFailure, and `backoffLimit` is greater than `maxRestartTimesOnFailure`,
I guess you are talking about the case when parallelism=1.
I'm not liking these semantics, even if they are new. The backoffLimit in Job is already not accurate when restartPolicy=OnFailure: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy and kubernetes/kubernetes#109870
Note that when a Pod is marked as Failed (as it will be when the Pod reaches `maxRestartTimes`), the logic would count the number of failures as 1, regardless of the number of container restarts. I know, it's ugly, but it's the legacy behavior. This might not give the desired outcome for users of Job.
What if, when the job controller observes the failure reason/condition for "reached max restarts", it adds this to the number of failures and counts it towards backoffLimit? You need to consider finalizers, but maybe it's doable?
We could also consider not allowing users of Job to set `maxRestartTimes` and let the job controller add it, on behalf of the user, when creating a Pod. But how does a user tell the Job API whether to use the old counting method or `maxRestartTimes`?
So, overall, I think this section needs to consider some alternatives for dealing with Job backoffLimit, for different levels of parallelism.
I have a different proposal - when a user specifies `maxRestartTimes`, then `backoffLimit` should only count pods when they are `Failed`, instead of checking individual containers. This puts a clear separation of concerns between the kubelet and the Job controller, eliminating the issue kubernetes/kubernetes#109870.
In this proposal we could support counting across pod recreations with a dedicated Pod annotation which specifies how many pod restarts happened up to this point. The Job controller could set it based on the previous pod, and the kubelet would subtract it from `maxRestartTimes`.
The issue I see with setting `maxRestartTimes` based on `backoffLimit` is that it is per-pod, while `backoffLimit` is global.
So, another proposal would be that the Job controller sets `maxRestartTimes` based on `backoffLimitPerIndex` (and the annotation counting how many failures happened before recreation).
> I have a different proposal - when a user specifies maxRestartTimes, then backoffLimit should only count pod when they are Failed, instead of checking individual containers.

I think this is the easiest solution. But would that be useful to users of Job? That I'm not sure about.

> So, another proposal would be that Job controller sets maxRestartTimes but based on the backoffLimitPerIndex (and the annotation counting how many failures happened before recreation).

In the case of backoffLimitPerIndex it's much easier. The job controller would create the first Pod with full maxPodRestarts. If the Pod is preempted/evicted/etc., the replacement pod has a lower maxPodRestarts.
I'm not too sure what to do with regular backoffLimit... Maybe going with the easiest solution above is the best we can do?
Actually, with this approach both `backoffLimit` and `backoffLimitPerIndex` can be implemented very similarly to how they currently work when `restartPolicy=Never`. We only count failed pods (not container restarts); we don't even need the annotation. Simply, the user sets `maxRestartTimes`, and once the pod reaches the `Failed` phase it is counted towards the backoff limits.
EDIT: with `maxRestartTimes` set, we know that the pod will eventually be terminal, which isn't the case currently.
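A minimal sketch of the counting rule proposed in this thread (all type and function names here are illustrative, not the Job controller's actual code): the controller would increment its backoff counter only when a pod reaches the `Failed` phase, never by inspecting per-container restart counts, leaving enforcement of the restart limit entirely to the kubelet.

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for the real Kubernetes pod phase.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// countTowardsBackoffLimit counts one failure per terminal (Failed) pod,
// regardless of how many container restarts happened inside each pod.
func countTowardsBackoffLimit(phases []PodPhase) int {
	failed := 0
	for _, p := range phases {
		if p == PodFailed {
			failed++
		}
	}
	return failed
}

func main() {
	phases := []PodPhase{PodFailed, PodRunning, PodFailed}
	fmt.Println(countTowardsBackoffLimit(phases)) // 2
}
```

The design point the sketch illustrates: the counter's input is pod phases only, so the Job controller stays decoupled from the kubelet's in-place restart bookkeeping.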
I like the idea that the only difference between `backoffLimit` and `backoffLimitPerIndex` is that one counts failures globally, and the other per index.
I'm supportive of the idea of a global `backoffLimit` and a local `maxRestarts`, which allows granular control over your job. For Indexed Jobs where you set `backoffLimitPerIndex`, it seems reasonable that we'll copy that value into the new field `maxRestarts` to get rid of the extra counting we currently have in the job controller. On the other hand, we clearly excluded `OnFailure` from that other KEP, but the discussions here are around having a global mechanism, so how this affects `backoffLimitPerIndex` is still to be decided once we have a crisp idea of how this proposal finishes. #3850 is currently in alpha; I'm inclined to block GA of it until we have this feature in alpha or ideally beta.
Indeed, the number of restarts when `backoffLimitPerIndex` is specified along with `maxRestartTimes` would be: `maxRestartTimes * (backoffLimitPerIndex+1)`.
The problem is that this is not always accurate. If a Pod is evicted before it completes its `maxRestartTimes`, the equation no longer applies. So overall I think this makes the feature highly unpredictable.
I like the idea of making #3850 GA depend on beta for #3322. But I really see it as the job controller using the field `maxRestarts`, rather than letting the user set it up.
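The worst-case arithmetic discussed above can be sketched as follows (the helper name is hypothetical, not an actual API): an index may run `backoffLimitPerIndex+1` pods (the original plus retries), and each pod may restart its containers in place up to `maxRestartTimes` times.

```go
package main

import "fmt"

// worstCaseRestarts computes the upper bound on container restarts for one
// index when both limits are set. It is an upper bound only: eviction or
// preemption can terminate a pod before it uses up its in-place restarts.
func worstCaseRestarts(maxRestartTimes, backoffLimitPerIndex int) int {
	// backoffLimitPerIndex+1 pods, each allowed maxRestartTimes restarts.
	return maxRestartTimes * (backoffLimitPerIndex + 1)
}

func main() {
	// 3 in-place restarts per pod, up to 2 pod recreations per index.
	fmt.Println(worstCaseRestarts(3, 2)) // 9
}
```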
In practice, what this means for the purpose of #3322 going to alpha is that we should have a validation rule in Job that forbids users from setting `maxRestarts` for Indexed Jobs (or something along those lines).
I'm not convinced about blocking graduation of #3850, which is technically scoped to `Never`, so there is no direct interaction, unless we want to extend its scope.
I see these approaches:
1. limit 3850 to `Never`, and support `maxRestarts` with `backoffLimitPerIndex` and `backoffLimit` when `OnFailure` in a dedicated KEP - that KEP could also enable it for `podFailurePolicy`
2. support `OnFailure` in 3850: `maxRestarts` with `backoffLimitPerIndex` when `OnFailure`, but without `backoffLimit` or `podFailurePolicy`
3. support `OnFailure` in 3850: `maxRestarts` with `backoffLimitPerIndex` when `OnFailure`, along with `backoffLimit` (and possibly `podFailurePolicy`).
My preference is for (1.) to keep KEPs small and graduating fast, but I'm ok with (2.). My preference for (1.) is that when approaching (2.) we may scope creep to (3.), and possibly pod failure policy.
Thus, I feel like there is room for a dedicated KEP focused on enabling `OnFailure` with `maxRestarts` for `backoffLimitPerIndex`, `backoffLimit` and pod failure policy. The need for pod failure policy was also discussed here: #3329 (comment).
It seems there is no SIG-level agreement on it - I made a quick pass and added some comments, but please ping me only once you have SIG-level approval.
> Add a new field `maxRestartTimesOnFailure` to the `pod.spec`. `maxRestartTimesOnFailure` only applies
+1 to Tim - I think that generalizing it to "Always" is natural and instead of making the API narrow, let's make it more generic.
> Because kubelet will not upgrade/downgrade until the api-server is ready, this will not impact
I don't understand it.
FWIW - this section isn't necessary for Alpha, so given the time bounds you may want to delete your answers from the rollout and monitoring sections, as their answers are controversial...
What I mean here is that when upgrading api-servers, we'll wait until all the api-servers are ready, then upgrade the kubelet. So if feature gates are enabled on only some api-servers, we'll do nothing. Is this reasonable? Or is what we want here all the possibilities, not the best practices, since it said "as paranoid as possible"? cc @wojtek-t
Removed for alpha
I would just comment out this section.
> reviewers:
> - TBD
> approvers:
> - TBD
We need to find an approver; without one defined, we are unlikely to be able to take this.
@mrunalp @derekwaynecarr @dchen1107 any of you want to take it?
> - If Pod's `RestartPolicy` is `Always` or `Never`, `maxRestartTimesOnFailure` defaults to nil, and will not apply.
> - If Pod's `RestartPolicy` is `OnFailure`, `maxRestartTimesOnFailure` also defaults to nil, which means infinite
Two questions:
1. If a Pod's `RestartPolicy` is `OnFailure` and `maxRestartTimesOnFailure` is 0, is that invalid, or does it mean `Never`?
2. Is `maxRestartTimesOnFailure` editable for the pod?
IMO:
1. If `restartPolicy` is "OnFailure" and `maxRestarts` is 0, it is effectively "Never". I don't think we need to special-case 0 to be a failure, but I don't feel strongly and it could be argued either way.
2. Let's start with "no" and see if there's really a need?
Yeah, it feels like never to me too.
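Putting the defaulting semantics discussed in this thread together in one place (a sketch only; the function and field names are illustrative, not kubelet code): nil keeps today's unlimited-restart behavior, and `OnFailure` with a limit of 0 behaves like `Never`.

```go
package main

import "fmt"

// shouldRestart sketches the proposed restart-allowance semantics:
//   - Always restarts unconditionally (the limit is not applied).
//   - Never never restarts.
//   - OnFailure restarts while restartCount is below the limit;
//     a nil limit preserves the current "restart forever" behavior.
func shouldRestart(restartPolicy string, maxRestartTimesOnFailure *int32, restartCount int32) bool {
	switch restartPolicy {
	case "Always":
		return true
	case "Never":
		return false
	case "OnFailure":
		if maxRestartTimesOnFailure == nil {
			return true // nil = unlimited, for backwards compatibility
		}
		return restartCount < *maxRestartTimesOnFailure
	}
	return false
}

func main() {
	zero, three := int32(0), int32(3)
	fmt.Println(shouldRestart("OnFailure", nil, 100))  // true: nil means unlimited
	fmt.Println(shouldRestart("OnFailure", &zero, 0))  // false: limit 0 acts like Never
	fmt.Println(shouldRestart("OnFailure", &three, 2)) // true: under the limit
	fmt.Println(shouldRestart("OnFailure", &three, 3)) // false: limit reached
}
```

This matches the consensus above: 0 is not rejected by validation, it is simply equivalent to `Never`.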
Deadline is in ~8 hours -- is this still hoping to land?

[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: drinktee, kerthcet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
> Pros:
> * Reduce the maintenance cost of the Job API
>
> Cons:
But it's not the same pod. Job is a higher-level construct.
Yes, it feels like a conflation of a pod-level `maxRestarts` and a job-level `maxRecreates` or something.
> restart times for backwards compatibility.
>
> In runtime, we'll check the sum of `RestartCount` of all containers [`Status`](https://github.com/kubernetes/kubernetes/blob/451e1fa8bcff38b87504eebd414948e505526d01/pkg/kubelet/container/runtime.go#L306-L335)
`Pod.spec.containers[].maxRestarts` reads well to me.
But it's a little weird because Containers also have a `RestartPolicy` and the only allowed value is `Always`. It would stop being intuitive how the attribute interactions actually work, because now we're intermingling pod-level attributes with container-level attributes.
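The check described in the quoted excerpt above (summing `RestartCount` across all containers and comparing against the pod-level limit) can be sketched roughly like this; the types are simplified stand-ins, not the real kubelet/container-runtime types:

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for the runtime container status
// referenced in the excerpt; only the field relevant to the check is kept.
type ContainerStatus struct {
	Name         string
	RestartCount int32
}

// exceededMaxRestarts aggregates restarts across every container in the pod
// and reports whether the pod-level limit has been reached.
func exceededMaxRestarts(statuses []ContainerStatus, maxRestartTimesOnFailure int32) bool {
	var total int32
	for _, s := range statuses {
		total += s.RestartCount // pod-level count is the sum over containers
	}
	return total >= maxRestartTimesOnFailure
}

func main() {
	statuses := []ContainerStatus{
		{Name: "app", RestartCount: 2},
		{Name: "sidecar", RestartCount: 1},
	}
	fmt.Println(exceededMaxRestarts(statuses, 3)) // true: 2+1 reaches the limit
	fmt.Println(exceededMaxRestarts(statuses, 5)) // false: still under the limit
}
```

Note how a pod-level sum sidesteps the container-level `RestartPolicy` intermingling raised in the comment above: the limit applies to the pod as a whole, not to any single container.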
> The kubelet version should be consistent with the api-server version, or this feature
> Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used?
> In other words, if we have a kubelet at 1.26 and a kube-apiserver at 1.29 with the feature enabled, what is the expected behavior?
> Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet.

I believe the answer to this should be no.
> When we set restartPolicy=OnFailure with a specific maxRestartTimesOnFailure number, but the Pod's restart count is not equal to that number. Or we can refer to the metric `pod_exceed_restart_times_size` for comparison.
Is this a new or existing metric?
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

hey, it's 2024, any update? /lifecycle frozen
@adampl: The In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

/remove-lifecycle rotten

@kerthcet Are you going to work on this?

FYI on a somewhat related KEP #4603

@alculquicondor I'm trying to see if this is a KEP that sig-node should help review for 1.32.

@lauralorenz presented the plan for CrashLoopBackOff. I think that any kind of capping of max restart time is out of scope for #4603.

Not yet; if you're interested, you can take it over. I have other work occupying me right now.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

/remove-lifecycle stale

Signed-off-by: kerthcet <kerthcet@gmail.com>