Add enforcedRollingUpdate strategy to statefulSet #3562
base: master
Conversation
Signed-off-by: kerthcet <kerthcet@gmail.com>
/sig apps
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kerthcet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
cc @smarterclayton, do you have time to review this?
cc @kubernetes/sig-apps-feature-requests
/hold
I don't recall this problem being discussed during any of the sig-apps calls. I'd suggest first showing up for one of our bi-weekly Monday calls to discuss the problem and possible options for solving it, before opening a KEP shortly before the freeze.
…cause any serious problems, so we should treat the rolling-update strategies case by case; it is better to hand the choice over to end-users.
From the solutions adopted by community users, we have two choices to avoid this: one is to set the `podManagementPolicy` to `Parallel` (see this [issue](https://github.com/jenkinsci/helm-charts/issues/688)), but then the pods are updated in parallel. The other is to use a controller that finds the broken pods and deletes them (see this [PR](https://github.com/argoproj-labs/argocd-operator/pull/604)).
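To make the first workaround concrete, here is a minimal sketch of a StatefulSet with `podManagementPolicy: Parallel`; the names and image below are placeholders (not taken from the linked issue), and the trade-off is the one already noted above: pods are no longer handled strictly one at a time.

```yaml
# Hypothetical StatefulSet illustrating the first workaround: Parallel pod
# management instead of the default OrderedReady.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-sts                 # placeholder name
spec:
  serviceName: example-sts
  replicas: 3
  podManagementPolicy: Parallel     # default is OrderedReady
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v2   # placeholder image
```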
Pod management policy should update in parallel up to the pod limit. If you set maxUnavailable to “1”, that sounds like the same as what this KEP proposes? Or did I miss a key difference in the algorithm you are proposing?
I think I need more detail to understand what you expect to happen. If you could describe a sequence of events that results in a “stuck” scenario (1-N, each step), that would be very useful.
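For context, the combination described here presumably looks like the fragment below: `Parallel` pod management plus the alpha `maxUnavailable` rolling-update field, which is only honored when the `MaxUnavailableStatefulSet` feature gate is enabled. The values are illustrative, not a recommendation.

```yaml
# Fragment of a StatefulSet spec (not a complete manifest) combining Parallel
# pod management with the alpha maxUnavailable rolling-update setting.
spec:
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0          # update all ordinals
      maxUnavailable: 1     # alpha field; requires the MaxUnavailableStatefulSet gate
```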
@smarterclayton can you look at this issue (kubernetes/kubernetes#67250), which is referenced in this KEP? The scenario the author is highlighting is a StatefulSet going into a broken state due to a rolling update with a bad image (as one example), which then requires deleting the pod manually even after the image is fixed to the correct one; this was probably done to prevent data loss. Thoughts?
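As a hypothetical illustration of that scenario (assuming 3 replicas, the default `OrderedReady` pod management, and placeholder image names), the stuck state can be reproduced with a template change like the following:

```yaml
# Hypothetical template update that wedges a sequential rolling update:
#   1. The image is changed to a tag that never becomes Ready (bad binary,
#      typo'd tag, broken config, etc.).
#   2. The controller deletes the highest-ordinal pod and recreates it with
#      the new template; that pod crash-loops and never reports Ready.
#   3. The rollout stops there and the lower ordinals are never touched.
#   4. Fixing the image in the spec is not enough on its own: as documented
#      for StatefulSet rolling updates, the broken pod has to be deleted
#      manually before the controller continues with the corrected template.
spec:
  template:
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v2-broken   # placeholder bad tag
```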
Correct, we even explicitly explain that case in the docs, so this was intentional. I'm not saying we shouldn't expand that functionality, but I'd like to hear more during one of the sig-apps calls.
@soltysh This kind of (not) bugfixing seems strange to me: just because you document an issue (e.g. kubernetes/kubernetes#67250) doesn't make it go away.
The docs don't even mention a workaround to avoid this behaviour.
"To prevent data loss" - how do you want to recover the data if you are not able to replace the pod definition with a working one?
In kubernetes/kubernetes#109597 I mentioned another scenario that breaks an update of a StatefulSet: requesting a non-existent secret.
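A minimal sketch of that second scenario, with a placeholder secret name: the container references a Secret that does not exist, so the replacement pod sits in `CreateContainerConfigError`, never becomes Ready, and wedges a sequential rollout in the same way as a bad image.

```yaml
# Hypothetical pod-template fragment: envFrom points at a Secret that does
# not exist, so the updated pod never becomes Ready and the sequential
# StatefulSet rolling update stops.
spec:
  template:
    spec:
      containers:
      - name: app
        image: registry.example.com/app:v2     # placeholder image
        envFrom:
        - secretRef:
            name: missing-secret               # placeholder; Secret does not exist
```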
Putting the known issue in the official docs without any workaround is not helping at all. Also, I would like to know what exactly the purpose of not updating the unhealthy pods is; what's the point of keeping those "dead" pods anyway?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
@soltysh Has any progress been made on this, or have any discussions taken place? Right now I've got StatefulSets in a never-ending stuck state until I delete the pods, because there's no way to advance the image if the pods aren't healthy.
Any update? Why can't it be merged?
Some updates here to clear up the confusion: this proposal has only just been initiated. As suggested, I hope to see this topic discussed in the bi-weekly sig-apps meeting to make sure we're on the right track. I'm out of bandwidth right now, so if anyone is interested, please bring this to the community meeting. Thanks.
In my testing …
Under Parallel mode, yes; also see the description: https://github.com/kubernetes/enhancements/pull/3562/files#diff-1151d1efc62d73a39635cf501e30510a004b6c7e67c09e554a9ad3fd7ca87a81R211-R212. What we want to solve here is the sequential rolling update.
Have you turned the feature gate on before testing?
No. On 1.28, anyway, it seems maxUnavailable defaults to 1? Granted, I only tested StatefulSets with a few pods and they always restarted one by one.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
Hi @kerthcet, I would like to know if there's any update on this issue. I've encountered a similar issue in our k8s landscape, and it came as a surprise to me, as I thought a StatefulSet behaved similarly to a Deployment. It would be helpful if a StatefulSet could recover from a broken state on its own.
Thanks for the concern @reborn1867, but this is not planned for v1.31, as I have other KEPs with higher priority. Sorry for that.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
EnforcedRollingUpdate to statefulset rolling update strategies #3541