
Add enforcedRollingUpdate strategy to statefulSet #3562

Open
wants to merge 2 commits into base: master

Conversation

@kerthcet (Member)

  • One-line PR description: Add enforcedRollingUpdate strategy to statefulSet
  • Other comments:

Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: kerthcet <kerthcet@gmail.com>
@kerthcet (Member, Author)

/sig apps

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 28, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kerthcet
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval by writing /assign @johnbelamaric in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 28, 2022
@kerthcet (Member, Author)

cc @smarterclayton do you have time to review this?

@kerthcet (Member, Author)

cc @kubernetes/sig-apps-feature-requests

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 30, 2022
@soltysh (Contributor) left a comment


/hold
I don't recall this problem being discussed during any of the sig-apps calls. I'd suggest first showing up for one of our bi-weekly Monday calls and discussing the problem and possible options for solving it before opening a KEP shortly before the freeze.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 4, 2022
cause any serious problems, so we should treat the rolling-update strategies case by case, and it is better to hand the choice over to end-users.

From the solutions adopted by community users, we have two ways to avoid this. One is to set the `podManagementPolicy` to `Parallel`, see [issue](https://github.com/jenkinsci/helm-charts/issues/688),
but then the pods are updated in parallel. The other is to use a controller that finds the broken pods and deletes them, see [PR](https://github.com/argoproj-labs/argocd-operator/pull/604).
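
For readers of this thread, here is a minimal, illustrative sketch of the first workaround mentioned above, setting `podManagementPolicy: Parallel`; the names and image are placeholders, not taken from the KEP:

```yaml
# Minimal sketch of the podManagementPolicy: Parallel workaround (placeholder names/image).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  # Parallel lets the controller create and delete pods without waiting for
  # lower-ordinal pods to become Running and Ready, so a single broken pod is
  # less likely to wedge the workload, at the cost of ordering guarantees.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25  # placeholder image
```
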
Contributor

Pod management policy should update in parallel up to the pod limit. If you set maxUnavailable to "1", that sounds like the same as what this KEP proposes? Or did I miss a key difference in the algorithm you are proposing?

I think I need more detail to understand what you expect to happen - if you could describe a sequence of events that results in a "stuck" scenario (steps 1-N), that would be very useful.


@smarterclayton can you look at this issue (kubernetes/kubernetes#67250), which is referenced in this KEP? The scenario the author is highlighting is a StatefulSet going into a broken state due to a rolling update with a bad image (as one example), which requires deleting the pod even after the image is fixed to the correct one; that behavior was probably intended to prevent data loss. Thoughts?
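
To restate the referenced scenario concretely (an illustrative sketch, not text from the issue or the KEP), the stuck state arises under the default settings roughly like this:

```yaml
# Default StatefulSet settings under which the issue describes getting stuck (illustrative).
spec:
  podManagementPolicy: OrderedReady   # default: ordered, one pod at a time
  updateStrategy:
    type: RollingUpdate               # default: replace pods from the highest ordinal down
  # Sequence described in kubernetes/kubernetes#67250:
  #   1. The pod template is updated to an image that can never become Ready.
  #   2. The controller recreates the highest-ordinal pod with the bad image.
  #   3. That pod never becomes Ready, so the rollout stops at that ordinal.
  #   4. Even after the template is fixed, the controller keeps waiting for the broken
  #      pod to become Ready instead of replacing it, so manual pod deletion is required.
```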

Contributor

Correct, we even explicitly explain that case in the docs, so this was intentional. I'm not saying we shouldn't expand that functionality, but I'd like to hear more during one of the sig-apps calls.


@soltysh This kind of (not) bugfixing seems strange to me: just because you document an issue (e.g. kubernetes/kubernetes#67250) doesn't make it go away.

The docs don't even mention a workaround to avoid this behaviour.

"To prevent data loss" - how do you want to recover the data if you are not able to replace the pod definition with a working one?

In kubernetes/kubernetes#109597 I mentioned another scenario that breaks an update of a statefulSet: requesting a non-existent secret.


Putting the known issue in the official docs without any workaround is not helping at all. Also, I would like to know exactly what the purpose of not updating the unhealthy pods is; what's the point of keeping those "dead" pods anyway?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2023
@kerthcet (Member, Author)

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 24, 2023
@kerthcet (Member, Author)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 25, 2023
@danbopes

@soltysh Has any progress been made on this, or have any discussions taken place? Right now I've got StatefulSets in a never-ending stuck state until I delete the pods, because there's no way to advance the image if the pods aren't healthy.

@vl-kp

vl-kp commented Aug 10, 2023

Any update? Why can't it be merged?

@kerthcet (Member, Author)

Some updates here to clear up the confusion: this proposal has only just been initiated. As suggested, I hope to see this topic discussed in the bi-weekly sig-apps meeting to make sure we're on the right track. I'm out of bandwidth right now, so if anyone is interested, please bring this to the community meeting. Thanks.

@vaskozl

vaskozl commented Nov 8, 2023

In my testing, podManagementPolicy: Parallel completely solves this issue. By default maxUnavailable appears to be 1, so Kubernetes restarts one pod at a time during updates (with truly parallel startup/removal when scaling replicas).

@kerthcet (Member, Author)

kerthcet commented Nov 9, 2023

> In my testing, podManagementPolicy: Parallel completely solves this issue. By default maxUnavailable appears to be 1, so Kubernetes restarts one pod at a time during updates (with truly parallel startup/removal when scaling replicas).

Under Parallel mode, yes; also see the description: https://github.com/kubernetes/enhancements/pull/3562/files#diff-1151d1efc62d73a39635cf501e30510a004b6c7e67c09e554a9ad3fd7ca87a81R211-R212

What we want to solve here is the sequential rolling-update case.
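
Purely as a reading aid, and not the KEP's actual API (that is defined in the files of this PR and may differ), the direction being proposed could be pictured roughly like this hypothetical snippet:

```yaml
# Hypothetical illustration only; the real field names and shape are defined in the KEP.
spec:
  podManagementPolicy: OrderedReady     # keep sequential, ordered updates
  updateStrategy:
    type: EnforcedRollingUpdate         # strategy name taken from the PR title (shape assumed)
    # Intent discussed in this thread: during a sequential rolling update, a pod that can
    # never become Ready (bad image, missing secret, ...) is still replaced with the new
    # template instead of blocking the rollout until someone deletes it manually.
```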

@okgolove

@vaskozl

Note: The maxUnavailable field is in Alpha stage and it is honored only by API servers that are running with the MaxUnavailableStatefulSet feature gate enabled.

Have you turned the feature gate on before testing?

@vaskozl

vaskozl commented Nov 15, 2023

No; on 1.28, at least, it seems maxUnavailable defaults to 1? Granted, I only tested StatefulSets with a few pods, and they always restarted one by one.
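
For anyone reproducing this, a minimal sketch of the relevant fields; note that, per the doc note quoted above, `maxUnavailable` is alpha and only honored when the `MaxUnavailableStatefulSet` feature gate is enabled (values below are examples):

```yaml
# StatefulSet rolling update knobs (example values).
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0        # pods with ordinal >= partition are updated
      maxUnavailable: 2   # alpha; defaults to 1, requires the MaxUnavailableStatefulSet feature gate
```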

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 14, 2024
@FloMedja

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 27, 2024
@reborn1867

Hi @kerthcet, I would like to know if there's any update on this issue. I've encountered a similar issue in our k8s landscape, and it came as a surprise to me because I thought the StatefulSet behaved similarly to a Deployment. It would be helpful if a StatefulSet could self-recover from a broken state.

@kerthcet (Member, Author)

Thanks for the concern @reborn1867, but this is not planned for v1.31 as I have other KEPs with higher priority. Sorry for that.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024
@mboutet

mboutet commented Jul 29, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2024
@kerthcet (Member, Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2024
Labels
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
  • kind/feature - Categorizes issue or PR as related to a new feature.
  • kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
  • sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  • size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage