add initial KEP for maxUnavailable in StatefulSets #678 (merged, 2 commits, Mar 30, 2019)
keps/sig-apps/20190226-maxunavailable-for-statefulsets.md (231 additions)
---
title: Implement maxUnavailable for StatefulSets
authors:
- "@krmayankk"
owning-sig: sig-apps
participating-sigs:
- sig-apps
reviewers:
- "@janetkuo"
approvers:
- TBD
editor: TBD
creation-date: 2018-12-29
last-updated: 2018-12-29
status: provisional
see-also:
- n/a
replaces:
- n/a
superseded-by:
- n/a
---

# Implement maxUnavailable in StatefulSet

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [User Stories [optional]](#user-stories-optional)
* [Story 1](#story-1)
* [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Drawbacks [optional]](#drawbacks-optional)
* [Alternatives [optional]](#alternatives-optional)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

The purpose of this enhancement is to implement maxUnavailable for StatefulSet during RollingUpdate. When a StatefulSet's
`.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller will delete and recreate each Pod
in the StatefulSet. The update currently happens one Pod at a time. With support for `maxUnavailable`, the update
will proceed up to `maxUnavailable` Pods at a time. Note that maxUnavailable does not affect podManagementPolicy, which
only applies during scaling.


## Motivation

Consider the following scenarios:

1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling update creates a new pod name, and hence the metrics
published by these new pods start a new time series, which makes tracking metrics for the application difficult. While this can be mitigated,
it requires some tricks on the time series collection side. It would be much better if we could use a StatefulSet, so object names don't
change and all metrics go to a single time series. This will be easier if StatefulSet reaches feature parity with Deployments.
2. My container does some initial startup tasks, like loading a cache, that take a lot of time. With a StatefulSet, we can only go one
pod at a time, which results in a slow rolling update. If StatefulSet supported maxUnavailable with a value greater than 1, it would allow for a
faster rollout.
3. My stateful clustered application has leaders and followers, with many more followers than leaders. My application can tolerate many followers going
down at the same time. I want to be able to do faster rollouts by bringing down 2 or more followers at the same time. This is only possible if StatefulSet
supports maxUnavailable in rolling updates.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through ReplicaSets, which has its own nuances. Understanding
that requires diving into the complexity of hashing and how ReplicaSets are named. On top of that, there were some issues with hash collisions which
further complicated the situation (I know they were solved). StatefulSet introduced ControllerRevisions in 1.7, which I believe are easier to think and reason
about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be much nicer if all the use cases of Deployments could be met and we
could track revisions with ControllerRevisions.

With this feature in place, a user running a StatefulSet with maxUnavailable > 1 understands that it will not cause issues for their stateful
applications, which have per-pod state and identity, while still providing all of the advantages written above.

### Goals
StatefulSet's RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to control how many Pods will be brought down at a time
during a rolling update.

### Non-Goals
maxUnavailable is only implemented to affect the rolling update of a StatefulSet. Considering maxUnavailable for the Parallel pod management policy is beyond
the purview of this KEP.

## Proposal

### User Stories

#### Story 1
As a user of Kubernetes, I should be able to update my StatefulSet more than one Pod at a time, in a rolling-update fashion, if my stateful app can tolerate
more than one pod being down, thus allowing my update to finish much faster.

### Implementation Details

#### API Changes

The following changes will be made to the rolling update strategy for StatefulSet.

```go
// RollingUpdateStatefulSetStrategy is used to communicate parameter for RollingUpdateStatefulSetStrategyType.
type RollingUpdateStatefulSetStrategy struct {
// THIS IS AN EXISTING FIELD
// Partition indicates the ordinal at which the StatefulSet should be
// partitioned.
// Default value is 0.
// +optional
Partition *int32 `json:"partition,omitempty" protobuf:"varint,1,opt,name=partition"`

// NOTE THIS IS THE NEW FIELD BEING PROPOSED
// The maximum number of pods that can be unavailable during the update.
// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
// Absolute number is calculated from percentage by rounding down.
// Defaults to 1.
// +optional
MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"`

...
}
```
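
For illustration, here is a minimal sketch (not part of the API change itself) of how the new field could be resolved to an absolute Pod count, assuming the `GetValueFromIntOrPercent` helper from `k8s.io/apimachinery/pkg/util/intstr` is used the same way Deployments resolve their own `maxUnavailable`:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// resolveMaxUnavailable turns the proposed IntOrString field into an absolute
// Pod count, rounding percentages down as described in the field comment, and
// defaulting to 1 when the field is unset.
func resolveMaxUnavailable(maxUnavailable *intstr.IntOrString, replicas int) (int, error) {
	if maxUnavailable == nil {
		return 1, nil // preserve today's one-Pod-at-a-time behavior
	}
	return intstr.GetValueFromIntOrPercent(maxUnavailable, replicas, false)
}

func main() {
	pct := intstr.FromString("30%")
	n, _ := resolveMaxUnavailable(&pct, 10)
	fmt.Println(n) // 3: 30% of 10 replicas, rounded down
}
```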

- By default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets will follow their old behavior. This
will also help while upgrading from a release which doesn't support maxUnavailable to a release which supports this field.
- If maxUnavailable is specified, it cannot be greater than the total number of replicas.
- If maxUnavailable is specified and partition is also specified, maxUnavailable cannot be greater than `replicas - partition`.
- If a partition is specified, maxUnavailable will only apply to the pods which are staged by the partition, meaning all Pods
with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's `.spec.template` is updated. Let's
say total replicas is 5, partition is set to 2, and maxUnavailable is set to 2. If the image is changed in this scenario, the following
are the behavior choices we have:
  - Pods with ordinal 4 and 3 will go down at the same time (because of maxUnavailable). Once they are both running and ready, the pod with
ordinal 2 will go down. Pods with ordinal 0 and 1 will remain untouched due to the partition.
  - Pods with ordinal 4 and 3 will go down at the same time (because of maxUnavailable). When either of 4 or 3 is running and ready, the pod
with ordinal 2 will start going down. This could violate ordering guarantees, since if 3 is running and ready, then both 4 and 2
are terminating at the same time, out of order.
  - Pods with ordinal 4 and 3 will go down at the same time (because of maxUnavailable). When 4 is running and ready, 2 will go down. At
this time both 2 and 3 are terminating. If 3 is running and ready before 4, 2 won't go down, to preserve ordering semantics. So at
that time, only 1 pod is unavailable although we requested 2.
  - NOTE: The goal is faster updates of an application. In some cases, people need both ordering and faster updates. In other cases
they just need faster updates and don't care about ordering as long as they get identity. We need to find out which one users care
about more. A sketch of the second choice appears below.

Review comments on this section:

> **@krmayankk** (author): @kow3ns @janetkuo @Kargakis @FillZpp I have added three options below for more discussions, please review.

> **@krmayankk** (author): Pinging again @kow3ns @janetkuo. Who should I follow up with to get this merged, and should the follow-up discussion happen in a new PR or an issue for further design details?
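
To make the second choice above concrete, here is a rough sketch of a budget-based loop, in the style of the controller fragment shown in the Implementation section below. This is an illustration, not the proposed implementation: `maxUnavailable` is assumed to already be resolved to an absolute number, and `isHealthy`, `getPodRevision`, and `isTerminating` are the existing helpers in the StatefulSet controller.

```go
// Sketch of choice 2: treat maxUnavailable as a budget of simultaneously
// unavailable Pods, without enforcing strict descending-ordinal ordering.
unavailable := 0
for i := len(replicas) - 1; i >= updateMin; i-- {
	// count Pods in the update range that are already unavailable
	if !isHealthy(replicas[i]) {
		unavailable++
	}
}
for target := len(replicas) - 1; target >= updateMin; target-- {
	if unavailable >= maxUnavailable {
		return &status, nil // budget exhausted; retry on the next sync
	}
	// delete the Pod if it is stale and not already terminating
	if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
		if err := ssc.podControl.DeleteStatefulPod(set, replicas[target]); err != nil {
			return &status, err
		}
		status.CurrentReplicas--
		unavailable++
	}
}
```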

#### Implementation

https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset/stateful_set_control.go#L504
```go
...
podsDeleted := 0
// we terminate the Pod with the largest ordinal that does not match the update revision.
for target := len(replicas) - 1; target >= updateMin; target-- {

	// delete the Pod if it is not already terminating and does not match the update revision.
	if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
		klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
			set.Namespace,
			set.Name,
			replicas[target].Name)
		err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
		if err != nil {
			return &status, err
		}
		status.CurrentReplicas--

		// NEW CODE HERE
		// maxUnavailable is the absolute number resolved from
		// set.Spec.UpdateStrategy.RollingUpdate.MaxUnavailable (see the API
		// Changes section); keep deleting stale Pods until that many have
		// been taken down in this pass.
		podsDeleted++
		if podsDeleted < maxUnavailable {
			continue
		}
		return &status, nil
	}

	// wait for unhealthy Pods on update
	if !isHealthy(replicas[target]) {
		klog.V(4).Infof(
			"StatefulSet %s/%s is waiting for Pod %s to update",
			set.Namespace,
			set.Name,
			replicas[target].Name)
		return &status, nil
	}

}
...
```

Review comments on this section:

> **@janetkuo** (member, Mar 19, 2019): We should describe how this feature can be implemented and how it interacts with existing features, such as partition and pod management policy, rather than a code block.

> **@krmayankk** (author): I think the implementation cannot be concretely defined until we agree on the semantics. I will update the doc with the semantics in the case of partition. Also, as you rightly mentioned earlier, maxUnavailable has nothing to do with pod management policy, so I will update the docs accordingly. Once we have agreed on the semantics I will update the implementation section as well.

### Risks and Mitigations
We are proposing a new field called `maxUnavailable` whose default value will be 1. With the default, StatefulSet will behave exactly as it does today.
It's possible we introduce a bug in the implementation. The mitigation is that the feature is disabled by default in the Alpha phase, for people to try out and give
feedback.
In the Beta phase, when it is enabled by default, people will only see issues or bugs when `maxUnavailable` is set to something greater than 1. Since people will
have tried this feature in Alpha, we would have time to fix issues.


### Upgrades/Downgrades

- Upgrades:
  When upgrading from a release without this feature to a release with maxUnavailable, we will set maxUnavailable to 1. This gives users the same default
  behavior they have come to expect in previous releases.
- Downgrades:
  When downgrading from a release with this feature to a release without maxUnavailable, there are two cases:
  - if maxUnavailable is greater than 1: the user can see unexpected behavior (find out what the recommendation is here: warning, disable upgrade, drop the field, etc.)
  - if maxUnavailable is less than or equal to 1: the user won't see any difference in behavior

### Tests

- maxUnavailable = 1: same behavior as today
- maxUnavailable greater than 1, without partition
- maxUnavailable greater than replicas, without partition
- maxUnavailable greater than 1, with partition and fewer staged pods than maxUnavailable
- maxUnavailable greater than 1, with partition and staged pods equal to maxUnavailable
- maxUnavailable greater than 1, with partition and more staged pods than maxUnavailable
- maxUnavailable greater than 1, with partition and maxUnavailable greater than replicas
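
These cases could be laid out as a table-driven test. Below is a sketch; `runRollingUpdateAndMeasure` is a hypothetical helper (it does not exist in the codebase) that would drive the controller through an update and report the peak number of Pods observed down at once, plus any validation error. It assumes the standard `testing` package and `k8s.io/apimachinery/pkg/util/intstr` imports.

```go
func TestStatefulSetMaxUnavailable(t *testing.T) {
	cases := []struct {
		name           string
		replicas       int32
		partition      int32
		maxUnavailable intstr.IntOrString
		expectValid    bool
		maxDownAtOnce  int
	}{
		{"default, same behavior as today", 3, 0, intstr.FromInt(1), true, 1},
		{"greater than 1, no partition", 5, 0, intstr.FromInt(2), true, 2},
		{"greater than replicas, no partition", 3, 0, intstr.FromInt(5), false, 0},
		{"greater than 1, with partition", 5, 2, intstr.FromInt(2), true, 2},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// runRollingUpdateAndMeasure is hypothetical: it drives a rolling
			// update and reports the peak number of simultaneously
			// unavailable Pods, or a validation error for invalid specs.
			peak, err := runRollingUpdateAndMeasure(tc.replicas, tc.partition, tc.maxUnavailable)
			if (err == nil) != tc.expectValid {
				t.Fatalf("validation: err=%v, expectValid=%v", err, tc.expectValid)
			}
			if tc.expectValid && peak > tc.maxDownAtOnce {
				t.Fatalf("saw %d Pods down at once, want at most %d", peak, tc.maxDownAtOnce)
			}
		})
	}
}
```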

## Graduation Criteria

- Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default.
- Beta: Enabled by default with default value of 1.
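
During Alpha, the new behavior would presumably be guarded by a feature gate. A minimal sketch follows; the gate name `StatefulSetMaxUnavailable` and its registration in `k8s.io/kubernetes/pkg/features` are assumptions, not part of this KEP's text, and `utilfeature` is `k8s.io/apiserver/pkg/util/feature`.

```go
// Hypothetical gate check in the controller: with the gate off, fall back to
// today's one-Pod-at-a-time rolling update regardless of the new field.
maxUnavailable := 1
if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetMaxUnavailable) {
	// resolvedMaxUnavailable is the absolute value computed from the new
	// field, as sketched in the API Changes section.
	maxUnavailable = resolvedMaxUnavailable
}
```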


## Implementation History

- KEP Started on 1/1/2019
- Implementation PR and UT by 3/15

## Drawbacks [optional]

Why should this KEP _not_ be implemented.

## Alternatives

- Users who need StatefulSets' stable identity and are OK with a slow rolling update will continue to use StatefulSets. Users who
are not OK with a slow rolling update will continue to use Deployments, with workarounds for the scenarios mentioned in the Motivation
section.
- Another alternative would be to use OnDelete and deploy your own custom controller on top of StatefulSet Pods, where you can implement
your own logic for deleting more than one Pod in a specific order. This requires more work from the user but gives them ultimate flexibility.