
Add enforcedRollingUpdate strategy to statefulSet #3562

Open
wants to merge 2 commits into base: master

Conversation

@kerthcet (Member)

  • One-line PR description: Add enforcedRollingUpdate strategy to statefulSet
  • Other comments:

Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: kerthcet <kerthcet@gmail.com>
@kerthcet (Member, Author)

/sig apps

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 28, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kerthcet
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval by writing /assign @johnbelamaric in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 28, 2022
@kerthcet (Member, Author)

cc @smarterclayton do you have time to review this?

@kerthcet (Member, Author)

cc @kubernetes/sig-apps-feature-requests

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 30, 2022
@soltysh (Contributor) left a comment


/hold
I don't recall this problem being discussed during any of the sig-apps calls. I'd suggest first showing up for one of our bi-weekly Monday calls and discussing the problem and possible options for solving it before opening a KEP shortly before the freeze.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2022
cause any serious problems, so we should treat the rolling-update strategies case by case, and it is better to hand the choice over to end-users.

From the solutions adopted by community users, we have two ways to avoid this. One is to set the `podManagementPolicy` to `Parallel`, see [issue](https://github.com/jenkinsci/helm-charts/issues/688),
but then the pods are updated in parallel. The other is to use a controller that finds the broken pods and deletes them, see [PR](https://github.com/argoproj-labs/argocd-operator/pull/604).
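
For readers of this thread, here is a minimal, illustrative sketch of the first workaround mentioned above, setting `podManagementPolicy: Parallel`; the names and image are placeholders, not taken from the KEP:

```yaml
# Minimal sketch of the podManagementPolicy: Parallel workaround (placeholder names/image).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  # Parallel lets the controller create and delete pods without waiting for
  # lower-ordinal pods to become Running and Ready, so a single broken pod is
  # less likely to wedge the workload, at the cost of ordering guarantees.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25  # placeholder image
```
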
Contributor

Pod management policy should update in parallel up to the pod limit. If you set maxUnavailable to "1", that sounds like the same as what this KEP proposes? Or did I miss a key difference in the algorithm you are proposing?

I think I need more detail to understand what you expect to happen - if you could describe a sequence of events that results in a "stuck" scenario (steps 1-N), that would be very useful.


@smarterclayton can you look at this issue (kubernetes/kubernetes#67250), which is referenced in this KEP? The scenario the author is highlighting is a StatefulSet going into a broken state due to a rolling update with a bad image (as one example), which requires deleting the pod even after the image is fixed to the correct one; that behavior was probably intended to prevent data loss. Thoughts?
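
To restate the referenced scenario concretely (an illustrative sketch, not text from the issue or the KEP), the stuck state arises under the default settings roughly like this:

```yaml
# Default StatefulSet settings under which the issue describes getting stuck (illustrative).
spec:
  podManagementPolicy: OrderedReady   # default: ordered, one pod at a time
  updateStrategy:
    type: RollingUpdate               # default: replace pods from the highest ordinal down
  # Sequence described in kubernetes/kubernetes#67250:
  #   1. The pod template is updated to an image that can never become Ready.
  #   2. The controller recreates the highest-ordinal pod with the bad image.
  #   3. That pod never becomes Ready, so the rollout stops at that ordinal.
  #   4. Even after the template is fixed, the controller keeps waiting for the broken
  #      pod to become Ready instead of replacing it, so manual pod deletion is required.
```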

Contributor

Correct, we even explicitly explain that case in the docs, so this was intentional. I'm not saying we shouldn't expand that functionality, but I'd like to hear more during one of the sig-apps calls.


@soltysh This kind of (not) bugfixing seems strange to me: just because you document an issue (e.g. kubernetes/kubernetes#67250) doesn't make it go away.

The docs don't even mention a workaround to avoid this behaviour.

"To prevent data loss" - how do you want to recover the data if you are not able to replace the pod definition with a working one?

In kubernetes/kubernetes#109597 I mentioned another scenario that breaks an update of a statefulSet: requesting a non-existent secret.


Putting the known issue in the official docs without any workaround is not helping at all. Also, I would like to know exactly what the purpose of not updating the unhealthy pods is; what's the point of keeping those "dead" pods anyway?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2023
@kerthcet (Member, Author)

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 24, 2023
@kerthcet (Member, Author)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 25, 2023
@danbopes

@soltysh Has any progress been made on this, or have any discussions taken place? Right now I've got StatefulSets in a never-ending stuck state until I delete the pods, because there's no way to advance the image if the pods aren't healthy.

@vl-kp

vl-kp commented Aug 10, 2023

Any update? Why can't it be merged?

@kerthcet (Member, Author)

Some updates here to clear up the confusion: this proposal has only just been initiated. As suggested, I hope to see this topic discussed in the bi-weekly sig-apps meeting to make sure we're on the right track. I'm out of bandwidth right now, so if anyone is interested, please bring this to the community meeting. Thanks.

@vaskozl

vaskozl commented Nov 8, 2023

In my testing, podManagementPolicy: Parallel completely solves this issue. By default maxUnavailable appears to be 1, so Kubernetes restarts one pod at a time during updates (with truly parallel startup/removal when scaling replicas).

@kerthcet (Member, Author)

kerthcet commented Nov 9, 2023

> In my testing, podManagementPolicy: Parallel completely solves this issue. By default maxUnavailable appears to be 1, so Kubernetes restarts one pod at a time during updates (with truly parallel startup/removal when scaling replicas).

Under Parallel mode, yes; also see the description: https://github.com/kubernetes/enhancements/pull/3562/files#diff-1151d1efc62d73a39635cf501e30510a004b6c7e67c09e554a9ad3fd7ca87a81R211-R212

What we want to solve here is the sequential rolling-update case.
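
Purely as a reading aid, and not the KEP's actual API (that is defined in the files of this PR and may differ), the direction being proposed could be pictured roughly like this hypothetical snippet:

```yaml
# Hypothetical illustration only; the real field names and shape are defined in the KEP.
spec:
  podManagementPolicy: OrderedReady     # keep sequential, ordered updates
  updateStrategy:
    type: EnforcedRollingUpdate         # strategy name taken from the PR title (shape assumed)
    # Intent discussed in this thread: during a sequential rolling update, a pod that can
    # never become Ready (bad image, missing secret, ...) is still replaced with the new
    # template instead of blocking the rollout until someone deletes it manually.
```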

@okgolove

@vaskozl

Note: The maxUnavailable field is in Alpha stage and it is honored only by API servers that are running with the MaxUnavailableStatefulSet feature gate enabled.

Have you turned the feature gate on before testing?

@vaskozl

vaskozl commented Nov 15, 2023

No; on 1.28, at least, it seems maxUnavailable defaults to 1? Granted, I only tested StatefulSets with a few pods, and they always restarted one by one.
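
For anyone reproducing this, a minimal sketch of the relevant fields; note that, per the doc note quoted above, `maxUnavailable` is alpha and only honored when the `MaxUnavailableStatefulSet` feature gate is enabled (values below are examples):

```yaml
# StatefulSet rolling update knobs (example values).
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0        # pods with ordinal >= partition are updated
      maxUnavailable: 2   # alpha; defaults to 1, requires the MaxUnavailableStatefulSet feature gate
```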

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 14, 2024
@FloMedja

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 27, 2024
@reborn1867

Hi @kerthcet, I would like to know if there's any update on this issue. I've encountered a similar issue in our k8s landscape, and it came as a surprise to me because I thought the StatefulSet behaved similarly to a Deployment. It would be helpful if a StatefulSet could self-recover from a broken state.

@kerthcet (Member, Author)

Thanks for the concern @reborn1867, but this is not planned for v1.31 as I have other KEPs with higher priority. Sorry for that.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024
@mboutet

mboutet commented Jul 29, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2024
@kerthcet (Member, Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2024
Labels
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
  • sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  • size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage