
KEP-3322: add a new field maxRestartTimesOnFailure to podSpec #3339

Open
wants to merge 6 commits into base: master
Conversation

kerthcet
Member

@kerthcet kerthcet commented Jun 6, 2022

Signed-off-by: kerthcet kerthcet@gmail.com

  • One-line PR description: Add a new field maxRestartTimes to podSpec, applied when restartPolicy is OnFailure (see the sketch below)
  • Other comments:
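For orientation, a minimal Go sketch of where the proposed field would sit in the API types; the exact name and semantics are what this KEP is negotiating, so treat this as illustrative only:

```go
// Illustrative fragment only: a trimmed PodSpec showing the proposed field.
// The field name follows the PR title; the final shape is subject to review.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

type PodSpec struct {
	// RestartPolicy for all containers within the pod.
	RestartPolicy RestartPolicy `json:"restartPolicy,omitempty"`

	// MaxRestartTimesOnFailure caps the total number of container restarts
	// when RestartPolicy is OnFailure. Nil preserves today's behavior
	// (unlimited restarts on failure).
	// +optional
	MaxRestartTimesOnFailure *int32 `json:"maxRestartTimesOnFailure,omitempty"`
}
```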

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 6, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 6, 2022
@wojtek-t wojtek-t self-assigned this Jun 6, 2022
@kerthcet kerthcet force-pushed the feat/add-maxRetries-to-podSpec branch 2 times, most recently from af1ce55 to d2e8b5d on June 6, 2022 11:07
@kerthcet kerthcet force-pushed the feat/add-maxRetries-to-podSpec branch from d2e8b5d to affd244 on June 6, 2022 14:57
@kerthcet kerthcet changed the title [WIP] Feat: add a new field maxRestartTimes to podSpec [WIP] KEP-3322: add a new field maxRestartTimes to podSpec Jun 6, 2022
@kerthcet kerthcet force-pushed the feat/add-maxRetries-to-podSpec branch 2 times, most recently from 525ea53 to c52de33 on June 7, 2022 05:34
@kerthcet kerthcet changed the title [WIP] KEP-3322: add a new field maxRestartTimes to podSpec KEP-3322: add a new field maxRestartTimes to podSpec Jun 7, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2022
@kerthcet
Member Author

kerthcet commented Jun 9, 2022

cc @wojtek-t PTAL, thanks a lot.

@wojtek-t
Member

wojtek-t commented Jun 9, 2022

cc @wojtek-t PTAL, thanks a lot.

We generally do PRR once you already have SIG approval.

@kerthcet
Member Author

cc @dchen1107 for sig-node side review, also cc @hex108

@scbizu

scbizu commented Jun 16, 2022

This KEP is helpful, especially for pods that hold a large resource set, such as JVM-based pods. We give these kinds of pods a high limit threshold to speed up their startup; a restart-always policy makes this worse and can even crash the node. In the old days, daemon control tools like supervisorctl had a startretries mechanism to limit the maximum startup retries, but for k8s deployments there is no replacement for it.


Pros:
* BackoffLimitPerIndex can reuse this functionality and no longer need to consider the restart times per index.
Specifically, it can avoid to use the annotation anymore, and works at a high level control by watching the pod status.
Member

I disagree. We will always need the annotation because Pods can be recreated, and we want to count failures across recreations.
But definitely the Job controller can benefit from this feature by setting different maxRestarts every time a pod is recreated.

I think a better "pro" is the fact that a container can be restarted in the kubelet much faster than a pod recreation.

Member Author
@kerthcet kerthcet Jun 12, 2023

OK, the implementation details of BackoffLimitPerIndex are beyond the scope of this KEP.

Member

I actually agree with Aldo.
It's not an implementation detail - it's a fundamental thing of "pod recreations". If we want to track something across pod recreations (which is the case for jobs), maxRestarts won't solve it on its own - but actually may help with it.

Member

OK, the implementation details of BackoffLimitPerIndex are beyond the scope of this KEP.

It's not currently stated in the Non-Goals. But I think doing so would be a mistake. Even if we don't have the full semantics for how the pod level restart limit would work with backoffLimit or backoffLimitPerIndex, we should have some idea of how they could work together. Otherwise we could end up in another scenario where features work independently, but when paired together are mostly useless or have very strange semantics that are hard for users to understand, and hard to maintain.

Member

Otherwise we could end up in another scenario where features work independently, but when paired together are mostly useless or have very strange semantics that are hard for users to understand, and hard to maintain.

I like this analysis.

Specifically, it can avoid to use the annotation anymore, and works at a high level control by watching the pod status.

Cons:
* If we support max restart times with `restartPolicy=Never`, it semantically means _we can never
Member

Why is this a con? We can just validate that restartPolicy!=Never.

Member Author
@kerthcet kerthcet Jun 12, 2023

This is talking about the cons of supporting restartPolicy=Never.


#### Caveats

- When a Job's template is set with restartPolicy=OnFailure, and `backoffLimit` is greater than `maxRestartTimesOnFailure`,
Member

I guess you are talking about the case when parallelism=1.

I'm not liking these semantics, even if they are new. The backoffLimit in Job is already not accurate when restartPolicy=OnFailure https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy and kubernetes/kubernetes#109870

Note that when Pod is marked as Failed (as it will be when the Pod reaches maxRestartTimes), the logic would count the number of failures as 1, regardless of the number of container restarts. I know, it's ugly, but it's the legacy behavior. This might not give the desired outcome for users of Job.

What if, when the job controller observes the failure reason/condition for "reached max restarts", it added this to the number of failures and counted it towards backoffLimit? You need to consider finalizers, but maybe it's doable?

We could also consider not allowing users of Job to set maxRestartTimes and letting the job controller add it, on behalf of the user, when creating a Pod. But how does a user tell the Job API whether to use the old counting method or use maxRestartTimes?

So, overall, I think this section needs to consider some alternatives to deal with Job backoffLimit, for different levels of parallelism.

Contributor
@mimowo mimowo Jun 9, 2023

I have a different proposal - when a user specifies maxRestartTimes, then backoffLimit should only count pods when they are Failed, instead of checking individual containers.

This puts a clear separation of concerns between the kubelet and the Job controller, eliminating the issue kubernetes/kubernetes#109870.

In this proposal we could support counting across pod recreations via a dedicated Pod annotation which specifies how many pod restarts have happened up to this point. The Job controller could set it based on the previous pod, and the kubelet would subtract it from maxRestartTimes.

The issue I see with setting maxRestartTimes based on backoffLimit is that it is per-pod, while backoffLimit is global.

So, another proposal would be that Job controller sets maxRestartTimes but based on the backoffLimitPerIndex (and the annotation counting how many failures happened before recreation).
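To make the annotation idea above concrete, a runnable sketch; the annotation key and helper below are invented for illustration and are not part of the proposal:

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical annotation key (invented for this sketch): the Job controller
// would stamp a replacement pod with the restart count already consumed by
// its predecessor.
const priorRestartsAnnotation = "example.k8s.io/prior-restart-count"

// remainingRestarts computes the restart budget the kubelet would enforce:
// maxRestartTimes minus the restarts spent by earlier incarnations of the pod.
func remainingRestarts(maxRestartTimes int32, annotations map[string]string) int32 {
	var prior int32
	if v, ok := annotations[priorRestartsAnnotation]; ok {
		if n, err := strconv.ParseInt(v, 10, 32); err == nil {
			prior = int32(n)
		}
	}
	if prior >= maxRestartTimes {
		return 0
	}
	return maxRestartTimes - prior
}

func main() {
	anns := map[string]string{priorRestartsAnnotation: "3"}
	fmt.Println(remainingRestarts(5, anns)) // prints 2
}
```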

Member

I have a different proposal - when a user specifies maxRestartTimes, then backoffLimit should only count pods when they are Failed, instead of checking individual containers.

I think this is the easiest solution. But would that be useful to users of Job? That I'm not sure about.

So, another proposal would be that Job controller sets maxRestartTimes but based on the backoffLimitPerIndex (and the annotation counting how many failures happened before recreation).

In the case of backoffLimitPerIndex it's much easier. The job controller would create the first Pod with full maxPodRestarts. If the Pod is preempted/evicted/etc, the replacement pod has a lower maxPodRestarts.

I'm not too sure what to do with regular backoffLimit... Maybe going with the easiest solution above is the best we can do?

Contributor
@mimowo mimowo Jun 9, 2023

Actually, with this approach both backoffLimit and backoffLimitPerIndex can be implemented very similarly to how they work today when restartPolicy=Never. We only count failed pods (not container restarts); we don't even need the annotation. Simply, the user sets maxRestartTimes, and once the pod reaches the Failed phase it is counted towards the backoff limits.

EDIT: with maxRestartTimes set, we know that the pod will eventually be terminal, which isn't the case currently.
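A sketch of the counting rule proposed here, using stand-in types rather than the actual Job controller code: with maxRestartTimes set, the controller counts terminal failed pods only and ignores per-container restart counts:

```go
// podPhase is a stand-in for v1.PodPhase; "Failed" matches the real value.
type podPhase string

const podFailed podPhase = "Failed"

// failuresTowardBackoffLimit counts one failure per pod that reached the
// Failed phase, regardless of how many container restarts happened inside it.
// With maxRestartTimes set, a failing pod is guaranteed to eventually become
// terminal, which is what makes this count well defined.
func failuresTowardBackoffLimit(phases []podPhase) int32 {
	var failures int32
	for _, p := range phases {
		if p == podFailed {
			failures++
		}
	}
	return failures
}
```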

Contributor

I like the idea that the only difference between backoffLimit and backoffLimitPerIndex is that one counts failures globally, and the other per index.

Contributor

I'm supportive of the idea of a global backoffLimit and a local maxRestarts, which allows granular control over your job. For Indexed Jobs where you set backoffLimitPerIndex, it seems reasonable that we'll copy that value into the new field maxRestarts to get rid of the extra counting we currently have in the job controller. On the other hand, we clearly excluded OnFailure from that other KEP, but the discussions here are around having a global mechanism, so how this affects backoffLimitPerIndex is still to be decided once we have a crisp idea of how this proposal finishes. #3850 is currently in alpha; I'm inclined to block GA of it until we have this feature in alpha, or ideally beta.

Member

Indeed, the number of restarts when backoffLimitPerIndex is specified along with maxRestartTimes would be: maxRestartTimes * (backoffLimitPerIndex+1)

The problem is that this is not always accurate. If a Pod is evicted before it completes its maxRestartTimes, then the equation no longer applies. So overall I think this makes the feature highly unpredictable.

I like the idea of making #3850 GA depend on beta for #3322. But I really see it as the job controller using the field maxRestarts, rather than letting the user set it up.

Member

In practice, what this means for the purpose of #3322 going to alpha, is that we should have a validation rule in Job that forbids users from setting maxRestarts for Indexed Jobs (or something along those lines).

Contributor
@mimowo mimowo Jun 15, 2023

I'm not convinced about blocking graduation of #3850, which is technically scoped to Never so there is no direct interaction, unless we want to extend its scope.

I see these approaches:

  1. limit 3850 to Never, and support maxRestarts with backoffLimitPerIndex and backoffLimit when OnFailure in a dedicated KEP - that KEP could also enable it for podFailurePolicy
  2. support OnFailure in 3850: maxRestarts with backoffLimitPerIndex when OnFailure, but without backoffLimit or podFailurePolicy
  3. support OnFailure in 3850: maxRestarts with backoffLimitPerIndex when OnFailure, along with backoffLimit (and possibly podFailurePolicy).

My preference is for (1.) to keep KEPs small and graduating fast, but I'm OK with (2.). The reason I prefer (1.) is that when approaching (2.) we may scope-creep to (3.), and possibly to pod failure policy.

Thus, I feel like there is room for a dedicated KEP focused on enabling OnFailure with maxRestarts for backoffLimitPerIndex, backoffLimit and pod failure policy. The need for pod failure policy was also discussed here: #3329 (comment).

Member
@wojtek-t wojtek-t left a comment

It seems there is no SIG-level agreement on it - I made a quick pass and added some comments, but please ping me only once you have SIG-level approval.




Add a new field `maxRestartTimesOnFailure` to the `pod.spec`. `maxRestartTimesOnFailure` only applies
Member

+1 to Tim - I think that generalizing it to "Always" is natural and instead of making the API narrow, let's make it more generic.


Because kubelet will not upgrade/downgrade until api-server ready, so this will not impact
Member

I don't understand it.

FWIW - this section isn't necessary for Alpha, so given time bounds you may want to delete your answers from the rollout and monitoring sections, as their answers are controversial...

Member Author

What I mean here is that when upgrading API servers, we'll wait until all the apiservers are ready, then upgrade the kubelet. So if feature gates are enabled on only some apiservers, we'll do nothing. Is this reasonable? Or is what we want here all the possibilities rather than the best practices, since it says "as paranoid as possible"? cc @wojtek-t

Member Author

Removed for alpha

Member

I would just comment out this section.

reviewers:
- TBD
approvers:
- TBD
Member

We need to find an approver. Without an approver defined, we're unlikely to be able to take it.

@mrunalp @derekwaynecarr @dchen1107 any of you want to take it?

```

- If Pod's `RestartPolicy` is `Always` or `Never`, `maxRestartTimesOnFailure` is default to nil, and will not apply.
- If Pod's `RestartPolicy` is `OnFailure`, `maxRestartTimesOnFailure` is also default to nil, which means infinite
Member

Two questions:

  1. if Pod's RestartPolicy is OnFailure and maxRestartTimesOnFailure is 0, invalid? or means Never?
  2. is the maxRestartTimesOnFailure editable for the pod?

Member

IMO:

  1. if restartPolicy is "OnFailure" and maxRestarts is 0, it is effectively "Never". I don't think we need to special-case 0 to be a failure, but I don't feel strongly and could be argued either way.
  2. let's start with "no" and see if there's really a need?

Member

Yeah, it feels like never to me too.
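Putting the reading in this thread into a sketch (not actual kubelet code): nil means unlimited restarts, and OnFailure with a value of 0 behaves like Never:

```go
// allowsRestart sketches the restart decision discussed in this thread.
func allowsRestart(restartPolicy string, maxRestarts *int32, restartsSoFar int32) bool {
	switch restartPolicy {
	case "Never":
		return false
	case "Always":
		return true // the cap is proposed to apply to OnFailure only
	case "OnFailure":
		if maxRestarts == nil {
			return true // nil keeps the unlimited, pre-existing behavior
		}
		// A value of 0 never permits a restart, i.e. effectively "Never".
		return restartsSoFar < *maxRestarts
	}
	return false
}
```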

@thockin
Member

thockin commented Jun 15, 2023

Deadline is in ~8 hours -- Is this still hoping to land?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: drinktee, kerthcet
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign derekwaynecarr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment



Pros:
* Reduce the maintenance cost of Job API

Cons:
Member

But it's not the same pod. Job is a higher-level construct.

Yes, it feels like a conflation of a pod-level maxRestarts and a job-level maxRecreates or something.

```

- If Pod's `RestartPolicy` is `Always` or `Never`, `maxRestartTimesOnFailure` is default to nil, and will not apply.
- If Pod's `RestartPolicy` is `OnFailure`, `maxRestartTimesOnFailure` is also default to nil, which means infinite
restart times for backwards compatibility.

In runtime, we'll check the sum of `RestartCount` of
all containers [`Status`](https://github.com/kubernetes/kubernetes/blob/451e1fa8bcff38b87504eebd414948e505526d01/pkg/kubelet/container/runtime.go#L306-L335)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pod.spec.containers[].maxRestarts reads well to me.

But it's a little weird because Containers also have a RestartPolicy and the only allowed value is Always. It would stop being intuitive how the attribute interactions actually work because now we're intermingling pod level attributes with container level attributes.
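The runtime check quoted in the excerpt above (summing RestartCount across all container statuses) might look roughly like this; the stand-in struct mirrors only the RestartCount field of ContainerStatus in k8s.io/api/core/v1:

```go
// containerStatus is a minimal stand-in for v1.ContainerStatus.
type containerStatus struct {
	RestartCount int32
}

// exceededMaxRestarts reports whether the pod's summed container restarts
// have reached maxRestartTimesOnFailure; nil means no cap, matching the
// backwards-compatible default described in the KEP.
func exceededMaxRestarts(statuses []containerStatus, maxRestartTimesOnFailure *int32) bool {
	if maxRestartTimesOnFailure == nil {
		return false
	}
	var total int32
	for _, s := range statuses {
		total += s.RestartCount
	}
	return total >= *maxRestartTimesOnFailure
}
```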


The kubelet version should be consistent with the api-server version, or this feature
Member

Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used?

In other words, if we have a kubelet at 1.26 and a kube-apiserver at 1.29 with the feature enabled, what is the expected behavior?

Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet.

I believe the answer to this should be no.


When we set the restartPolicy=OnFailure and set a specific maxRestartTimesOnFailure number,
but Pod restarts times is not equal to the number.
Or we can refer to the metric `pod_exceed_restart_times_size` for comparison.
Member

Is this a new or existing metric?


@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024
@kerthcet
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2024
@adampl

adampl commented May 24, 2024

hey, it's 2024, any update?

/lifecycle frozen

@k8s-ci-robot
Contributor

@adampl: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

hey, it's 2024, any update?

/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2024
@adampl

adampl commented Jun 24, 2024

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 24, 2024
@kannon92
Contributor

@kerthcet Are you going to work on this?

@alculquicondor
Member

FYI on a somewhat related KEP #4603

@kannon92
Contributor

@alculquicondor I'm trying to see if this is a KEP that sig-node should help review for 1.32.

@kannon92
Contributor

@lauralorenz presented the plan for CrashLoopBackOff.

I think that any kind of capping max restart time is out of scope for #4603.

@kerthcet
Member Author

Not yet. If you're interested, you can take it over; I have other work tying me up right now.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2024
@GEownt

GEownt commented Dec 5, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 5, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Needs Triage