# add initial KEP for maxUnavailable in StatefulSets #678
---
title: Implement maxUnavailable for StatefulSets
authors:
  - "@krmayankk"
owning-sig: sig-apps
participating-sigs:
  - sig-apps
reviewers:
  - "@janetkuo"
approvers:
  - TBD
editor: TBD
creation-date: 2018-12-29
last-updated: 2018-12-29
status: provisional
see-also:
  - n/a
replaces:
  - n/a
superseded-by:
  - n/a
---

# Implement maxUnavailable in StatefulSet

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [User Stories](#user-stories)
* [Story 1](#story-1)
* [Implementation Details](#implementation-details)
* [Risks and Mitigations](#risks-and-mitigations)
* [Upgrades/Downgrades](#upgradesdowngrades)
* [Tests](#tests)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Drawbacks [optional]](#drawbacks-optional)
* [Alternatives](#alternatives)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

The purpose of this enhancement is to implement maxUnavailable for StatefulSets during RollingUpdate. When a StatefulSet's
`.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller deletes and recreates each Pod in the
StatefulSet. Today, Pods are updated one at a time. With support for `maxUnavailable`, the update will proceed
`maxUnavailable` Pods at a time. Note that maxUnavailable does not affect podManagementPolicy, which only applies during
scaling.

## Motivation

Consider the following scenarios:

1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling update creates a new Pod name, and the metrics
   published by these new Pods start a new time series, which makes tracking metrics for the application difficult. While this could be mitigated,
   it requires some tricks on the time series collection side. It would be much better if we could use a StatefulSet, so that Pod names don't
   change and all metrics go to a single time series. This is easier if StatefulSet is at feature parity with Deployments.
2. My container does some initial startup tasks, like loading a cache, that take a long time. With a StatefulSet, we can only go one
   Pod at a time, which results in a slow rolling update. If StatefulSet supported maxUnavailable with a value greater than 1, it would allow a
   much faster rollout.
3. My stateful clustered application has leaders and followers, with many more than one follower. My application can tolerate many followers going
   down at the same time. I want to be able to do faster rollouts by bringing down two or more followers at the same time. This is only possible if
   StatefulSet supports maxUnavailable in rolling updates.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through ReplicaSets and has its own nuances. Understanding
   that requires diving into the complexity of hashing and how ReplicaSets are named. On top of that, there have been issues with hash collisions which
   further complicate the situation (I know they were solved). StatefulSet introduced ControllerRevisions in 1.7, which I believe are easier to reason
   about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be much nicer if all the use cases of Deployments could be met and we
   could track revisions with ControllerRevisions.

With this feature in place, a user running a StatefulSet with maxUnavailable > 1 understands that this will not cause issues with their stateful
applications, which have per-Pod state and identity, while still providing all of the advantages listed above.

### Goals

The StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to control how many Pods will be brought down at a time
during a rolling update.

### Non-Goals

maxUnavailable is only implemented to affect the rolling update of a StatefulSet. Considering maxUnavailable for the Parallel Pod management policy is beyond
the purview of this KEP.

## Proposal

### User Stories

#### Story 1

As a user of Kubernetes, I should be able to update my StatefulSet more than one Pod at a time, in a rolling update fashion, if my stateful app can tolerate
more than one Pod being down, thus allowing my update to finish much faster.

### Implementation Details

#### API Changes

The following changes will be made to the rolling update strategy for StatefulSet.

```go
// RollingUpdateStatefulSetStrategy is used to communicate parameters for RollingUpdateStatefulSetStrategyType.
type RollingUpdateStatefulSetStrategy struct {
	// THIS IS AN EXISTING FIELD
	// Partition indicates the ordinal at which the StatefulSet should be
	// partitioned.
	// Default value is 0.
	// +optional
	Partition *int32 `json:"partition,omitempty" protobuf:"varint,1,opt,name=partition"`

	// NOTE: THIS IS THE NEW FIELD BEING PROPOSED
	// The maximum number of pods that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
	// Absolute number is calculated from percentage by rounding down.
	// Defaults to 1.
	// +optional
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"`

	...
}
```
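
For illustration only (this snippet is not part of the proposed API change), the following shows how an `intstr.IntOrString` value, whether an absolute number or a percentage, resolves to a count of unavailable Pods with rounding down, using the existing `intstr.GetValueFromIntOrPercent` helper:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := 5

	absolute := intstr.FromInt(2)          // absolute number of Pods
	percentage := intstr.FromString("30%") // percentage of desired Pods

	for _, mu := range []intstr.IntOrString{absolute, percentage} {
		mu := mu // take a stable copy before using its address
		// The final argument selects rounding: false rounds a percentage down,
		// matching the field comment above.
		allowed, err := intstr.GetValueFromIntOrPercent(&mu, replicas, false)
		if err != nil {
			panic(err)
		}
		fmt.Printf("maxUnavailable=%s with %d replicas allows %d unavailable Pods\n",
			mu.String(), replicas, allowed)
	}
}
```

With 5 replicas, `2` allows 2 unavailable Pods, while `"30%"` rounds down to 1.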

- By default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets will follow their old behavior. This
  also helps when upgrading from a release which doesn't support maxUnavailable to a release which supports this field.
- If maxUnavailable is specified, it cannot be greater than the total number of replicas.
- If maxUnavailable is specified and a partition is also specified, maxUnavailable cannot be greater than `replicas - partition`
  (a runnable sketch of these constraints follows this list).
- If a partition is specified, maxUnavailable only applies to the Pods which are staged by the partition, i.e. all Pods
  with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's `.spec.template` is updated. Say
  the total number of replicas is 5, partition is set to 2, and maxUnavailable is set to 2. If the image is changed in this scenario, the following
  are the behavior choices we have:
  - Pods with ordinals 4 and 3 go down at the same time (because of maxUnavailable). Once they are both running and ready, the Pod with
    ordinal 2 goes down. Pods with ordinals 0 and 1 remain untouched due to the partition.
  - Pods with ordinals 4 and 3 go down at the same time (because of maxUnavailable). When either 4 or 3 is running and ready, the Pod with
    ordinal 2 starts going down. This could violate ordering guarantees, since if 3 is running and ready, then both 4 and 2
    are terminating at the same time, out of order.
  - Pods with ordinals 4 and 3 go down at the same time (because of maxUnavailable). When 4 is running and ready, 2 goes down. At
    this point both 2 and 3 are terminating. If 3 is running and ready before 4, 2 won't go down, to preserve ordering semantics. So at
    that time, only 1 Pod is unavailable although we requested 2.
- NOTE: The goal is faster updates of an application. In some cases, people need both ordering and faster updates. In other cases
  they just need faster updates and they don't care about ordering as long as they get identity. We need to find out which one users care
  about more.
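
To make these rules concrete, here is a small self-contained sketch; `validateMaxUnavailable` and `firstDeletionBatch` are hypothetical helpers written only for this illustration, not proposed code. It applies the constraints above and shows which ordinals would be taken down first for replicas=5, partition=2, maxUnavailable=2 under the first (strict ordering) option:

```go
package main

import "fmt"

// validateMaxUnavailable applies the constraints listed above: maxUnavailable is
// assumed to be at least 1 (its default), no greater than replicas, and no greater
// than replicas-partition when a partition is set.
func validateMaxUnavailable(replicas, partition, maxUnavailable int) error {
	if maxUnavailable < 1 {
		return fmt.Errorf("maxUnavailable must be >= 1, got %d", maxUnavailable)
	}
	if maxUnavailable > replicas {
		return fmt.Errorf("maxUnavailable %d cannot exceed replicas %d", maxUnavailable, replicas)
	}
	if maxUnavailable > replicas-partition {
		return fmt.Errorf("maxUnavailable %d cannot exceed replicas-partition (%d)", maxUnavailable, replicas-partition)
	}
	return nil
}

// firstDeletionBatch returns the ordinals deleted in the first round under the
// strict-ordering option: highest ordinals first, never touching ordinals below
// the partition, and at most maxUnavailable Pods at a time.
func firstDeletionBatch(replicas, partition, maxUnavailable int) []int {
	var batch []int
	for ordinal := replicas - 1; ordinal >= partition && len(batch) < maxUnavailable; ordinal-- {
		batch = append(batch, ordinal)
	}
	return batch
}

func main() {
	replicas, partition, maxUnavailable := 5, 2, 2
	if err := validateMaxUnavailable(replicas, partition, maxUnavailable); err != nil {
		panic(err)
	}
	fmt.Println(firstDeletionBatch(replicas, partition, maxUnavailable)) // prints [4 3]
}
```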

#### Implementation

The following is based on the existing update logic at
https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset/stateful_set_control.go#L504, with the proposed change marked `NEW CODE HERE`:

```go
...
	podsDeleted := 0
	// we terminate the Pod with the largest ordinal that does not match the update revision.
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// delete the Pod if it is not already terminating and does not match the update revision.
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
			status.CurrentReplicas--

			// NEW CODE HERE
			// maxUnavailable is the proposed field resolved from its IntOrString
			// form (an absolute value, or a percentage of replicas rounded down).
			if podsDeleted < maxUnavailable {
				podsDeleted++
				continue
			}
			return &status, err
		}

		// wait for unhealthy Pods on update
		if !isHealthy(replicas[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			return &status, nil
		}
	}
...
```

> **Review comment:** We should describe how this feature can be implemented and how it interacts with existing features, such as partition and pod management policy, rather than a code block.
>
> **Author reply:** I think the implementation cannot be concretely defined until we agree on the semantics. I will update the doc with the semantics in the case of partition. Also, as you rightly mentioned earlier, maxUnavailable has nothing to do with pod management policy, so I will update the docs accordingly. Once we have agreed on the semantics, I will update the implementation section as well.

### Risks and Mitigations

We are proposing a new field called `maxUnavailable` whose default value will be 1. With the default, StatefulSet behaves exactly as it does today.
It is possible that we introduce a bug in the implementation. The mitigation is that the feature is disabled by default in the Alpha phase, so people can
try it out and give feedback.
In the Beta phase, when it is enabled by default, people will only see issues or bugs when `maxUnavailable` is set to something greater than 1. Since people
will have tried this feature in Alpha, we will have time to fix issues.
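
As a rough, non-authoritative sketch of the Alpha gating described above, the new behavior could sit behind a standard feature gate. The gate name `MaxUnavailableStatefulSet` and the use of `k8s.io/component-base/featuregate` here are illustrative assumptions, not something this KEP fixes:

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

// MaxUnavailableStatefulSet is a placeholder gate name used only for this sketch.
const MaxUnavailableStatefulSet featuregate.Feature = "MaxUnavailableStatefulSet"

func main() {
	gate := featuregate.NewFeatureGate()
	if err := gate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		// Alpha: disabled by default, so existing StatefulSets keep today's behavior.
		MaxUnavailableStatefulSet: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}

	// Controller code would only honor maxUnavailable when the gate is enabled.
	fmt.Println("honor maxUnavailable:", gate.Enabled(MaxUnavailableStatefulSet))

	// Operators would opt in via --feature-gates=MaxUnavailableStatefulSet=true; simulated here.
	if err := gate.Set("MaxUnavailableStatefulSet=true"); err != nil {
		panic(err)
	}
	fmt.Println("honor maxUnavailable:", gate.Enabled(MaxUnavailableStatefulSet))
}
```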

### Upgrades/Downgrades

- Upgrades
  When upgrading from a release without this feature to a release with maxUnavailable, we will set maxUnavailable to 1. This gives users the same default
  behavior they have come to expect from previous releases.
- Downgrades
  When downgrading from a release with this feature to a release without maxUnavailable, there are two cases:
  - If maxUnavailable is greater than 1: the user can see unexpected behavior. (Find out what the recommendation is here: a warning, disallowing the upgrade, dropping the field, etc.)
  - If maxUnavailable is less than or equal to 1: the user won't see any difference in behavior.
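
A minimal sketch of how that default could be applied on upgrade; the struct below is a stand-in for the real API type and `setDefaultMaxUnavailable` is a hypothetical helper, shown only to illustrate the intended upgrade behavior:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// RollingUpdateStatefulSetStrategy is a stand-in for the real API type with the
// proposed MaxUnavailable field.
type RollingUpdateStatefulSetStrategy struct {
	Partition      *int32
	MaxUnavailable *intstr.IntOrString
}

// setDefaultMaxUnavailable defaults MaxUnavailable to 1 when unset, e.g. for objects
// created before the upgrade, preserving today's one-Pod-at-a-time behavior.
func setDefaultMaxUnavailable(s *RollingUpdateStatefulSetStrategy) {
	if s.MaxUnavailable == nil {
		one := intstr.FromInt(1)
		s.MaxUnavailable = &one
	}
}

func main() {
	s := &RollingUpdateStatefulSetStrategy{}
	setDefaultMaxUnavailable(s)
	fmt.Println(s.MaxUnavailable.String()) // prints "1"
}
```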

### Tests

- maxUnavailable = 1: same behavior as today
- maxUnavailable greater than 1, without partition
- maxUnavailable greater than replicas, without partition
- maxUnavailable greater than 1, with partition and staged Pods fewer than maxUnavailable
- maxUnavailable greater than 1, with partition and staged Pods equal to maxUnavailable
- maxUnavailable greater than 1, with partition and staged Pods greater than maxUnavailable
- maxUnavailable greater than 1, with partition and maxUnavailable greater than replicas
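
The list above could be organized as a table-driven test. The skeleton below is illustrative only: the names are placeholders, the harness is not wired up, and the validity expectations are inferred from the validation rules earlier in this KEP:

```go
package statefulset

import "testing"

// TestMaxUnavailableRollingUpdate is a placeholder skeleton for the cases listed above.
func TestMaxUnavailableRollingUpdate(t *testing.T) {
	cases := []struct {
		name           string
		replicas       int32
		partition      int32
		maxUnavailable string // absolute number or percentage
		expectValid    bool
	}{
		{"default of 1 keeps current behavior", 5, 0, "1", true},
		{"greater than 1 without partition", 5, 0, "2", true},
		{"greater than replicas without partition", 5, 0, "10", false},
		{"partition with fewer staged pods than maxUnavailable", 5, 4, "2", false},
		{"partition with staged pods equal to maxUnavailable", 5, 3, "2", true},
		{"partition with more staged pods than maxUnavailable", 5, 1, "2", true},
		{"partition with maxUnavailable greater than replicas", 5, 2, "10", false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// Placeholder: wire up to the StatefulSet controller test fixtures
			// once the API change and validation land.
			t.Skip("not implemented: requires the proposed maxUnavailable field")
		})
	}
}
```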

## Graduation Criteria

- Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default.
- Beta: Enabled by default, with a default value of 1.

## Implementation History

- KEP started on 1/1/2019
- Implementation PR and unit tests by 3/15

## Drawbacks [optional]

Why should this KEP _not_ be implemented.

## Alternatives

- Users who need StatefulSet's stable identity and are OK with a slow rolling update will continue to use StatefulSets. Users who
  are not OK with a slow rolling update will continue to use Deployments, with workarounds for the scenarios mentioned in the Motivation
  section.
- Another alternative would be to use OnDelete and deploy your own custom controller on top of StatefulSet Pods. There you can implement
  your own logic for deleting more than one Pod in a specific order. This requires more work from the user but gives them ultimate flexibility.

> **Author:** @kow3ns @janetkuo @Kargakis @FillZpp I have added three options below for more discussion, please review.

> **Author:** Pinging again @kow3ns @janetkuo: who should I follow up with to get this merged, and should the follow-up discussion happen in a new PR or an issue for further design details?