# add initial KEP for maxUnavailable in StatefulSets #678
---
title: Implement maxUnavailable for StatefulSets
authors:
  - "@krmayankk"
owning-sig: sig-apps
participating-sigs:
  - sig-apps
reviewers:
  - "@janetkuo"
approvers:
  - TBD
editor: TBD
creation-date: 2018-12-29
last-updated: 2018-12-29
status: provisional
see-also:
  - n/a
replaces:
  - n/a
superseded-by:
  - n/a
---

# Implement maxUnavailable in StatefulSet

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [User Stories](#user-stories)
* [Story 1](#story-1)
* [Implementation Details](#implementation-details)
* [Risks and Mitigations](#risks-and-mitigations)
* [Upgrades/Downgrades](#upgradesdowngrades)
* [Tests](#tests)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Drawbacks [optional]](#drawbacks-optional)
* [Alternatives](#alternatives)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

The purpose of this enhancement is to implement maxUnavailable for StatefulSets during RollingUpdate. When a StatefulSet's
`.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller deletes and recreates each Pod in the
StatefulSet. Today, Pods are updated one at a time. With support for `maxUnavailable`, the update will proceed
`maxUnavailable` Pods at a time. Note that maxUnavailable does not affect podManagementPolicy, which only applies during
scaling.

## Motivation

Consider the following scenarios:

1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling update creates a new Pod name, and the metrics
   published by these new Pods start a new time series, which makes tracking metrics for the application difficult. While this could be mitigated,
   it requires some tricks on the time series collection side. It would be much better if we could use a StatefulSet, so that Pod names don't
   change and all metrics go to a single time series. This is easier if StatefulSet is at feature parity with Deployments.
2. My container does some initial startup tasks, like loading a cache, that take a long time. With a StatefulSet, we can only go one
   Pod at a time, which results in a slow rolling update. If StatefulSet supported maxUnavailable with a value greater than 1, it would allow a
   much faster rollout.
3. My stateful clustered application has leaders and followers, with many more than one follower. My application can tolerate many followers going
   down at the same time. I want to be able to do faster rollouts by bringing down two or more followers at the same time. This is only possible if
   StatefulSet supports maxUnavailable in rolling updates.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through ReplicaSets and has its own nuances. Understanding
   that requires diving into the complexity of hashing and how ReplicaSets are named. On top of that, there have been issues with hash collisions which
   further complicate the situation (I know they were solved). StatefulSet introduced ControllerRevisions in 1.7, which I believe are easier to reason
   about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be much nicer if all the use cases of Deployments could be met and we
   could track revisions with ControllerRevisions.

With this feature in place, a user running a StatefulSet with maxUnavailable > 1 understands that this will not cause issues with their stateful
applications, which have per-Pod state and identity, while still providing all of the advantages listed above.

### Goals

The StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to control how many Pods will be brought down at a time
during a rolling update.

### Non-Goals

maxUnavailable is only implemented to affect the rolling update of a StatefulSet. Considering maxUnavailable for the Parallel Pod management policy is beyond
the purview of this KEP.

## Proposal

### User Stories

#### Story 1

As a user of Kubernetes, I should be able to update my StatefulSet more than one Pod at a time, in a rolling update fashion, if my stateful app can tolerate
more than one Pod being down, thus allowing my update to finish much faster.

### Implementation Details

#### API Changes

The following changes will be made to the rolling update strategy for StatefulSet.

```go
// RollingUpdateStatefulSetStrategy is used to communicate parameters for RollingUpdateStatefulSetStrategyType.
type RollingUpdateStatefulSetStrategy struct {
	// THIS IS AN EXISTING FIELD
	// Partition indicates the ordinal at which the StatefulSet should be
	// partitioned.
	// Default value is 0.
	// +optional
	Partition *int32 `json:"partition,omitempty" protobuf:"varint,1,opt,name=partition"`

	// NOTE: THIS IS THE NEW FIELD BEING PROPOSED
	// The maximum number of pods that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
	// Absolute number is calculated from percentage by rounding down.
	// Defaults to 1.
	// +optional
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"`

	...
}
```
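
For illustration only (this snippet is not part of the proposed API change), the following shows how an `intstr.IntOrString` value, whether an absolute number or a percentage, resolves to a count of unavailable Pods with rounding down, using the existing `intstr.GetValueFromIntOrPercent` helper:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	replicas := 5

	absolute := intstr.FromInt(2)          // absolute number of Pods
	percentage := intstr.FromString("30%") // percentage of desired Pods

	for _, mu := range []intstr.IntOrString{absolute, percentage} {
		mu := mu // take a stable copy before using its address
		// The final argument selects rounding: false rounds a percentage down,
		// matching the field comment above.
		allowed, err := intstr.GetValueFromIntOrPercent(&mu, replicas, false)
		if err != nil {
			panic(err)
		}
		fmt.Printf("maxUnavailable=%s with %d replicas allows %d unavailable Pods\n",
			mu.String(), replicas, allowed)
	}
}
```

With 5 replicas, `2` allows 2 unavailable Pods, while `"30%"` rounds down to 1.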

- By default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets will follow their old behavior. This
  also helps when upgrading from a release which doesn't support maxUnavailable to a release which supports this field.
- If maxUnavailable is specified, it cannot be greater than the total number of replicas.
- If maxUnavailable is specified and a partition is also specified, maxUnavailable cannot be greater than `replicas - partition`
  (a runnable sketch of these constraints follows this list).
- If a partition is specified, maxUnavailable only applies to the Pods which are staged by the partition, i.e. all Pods
  with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's `.spec.template` is updated. Say
  the total number of replicas is 5, partition is set to 2, and maxUnavailable is set to 2. If the image is changed in this scenario, the following
  are the behavior choices we have:
  - Pods with ordinals 4 and 3 go down at the same time (because of maxUnavailable). Once they are both running and ready, the Pod with
    ordinal 2 goes down. Pods with ordinals 0 and 1 remain untouched due to the partition.
  - Pods with ordinals 4 and 3 go down at the same time (because of maxUnavailable). When either 4 or 3 is running and ready, the Pod with
    ordinal 2 starts going down. This could violate ordering guarantees, since if 3 is running and ready, then both 4 and 2
    are terminating at the same time, out of order.
  - Pods with ordinals 4 and 3 go down at the same time (because of maxUnavailable). When 4 is running and ready, 2 goes down. At
    this point both 2 and 3 are terminating. If 3 is running and ready before 4, 2 won't go down, to preserve ordering semantics. So at
    that time, only 1 Pod is unavailable although we requested 2.
- NOTE: The goal is faster updates of an application. In some cases, people need both ordering and faster updates. In other cases
  they just need faster updates and they don't care about ordering as long as they get identity. We need to find out which one users care
  about more.
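
To make these rules concrete, here is a small self-contained sketch; `validateMaxUnavailable` and `firstDeletionBatch` are hypothetical helpers written only for this illustration, not proposed code. It applies the constraints above and shows which ordinals would be taken down first for replicas=5, partition=2, maxUnavailable=2 under the first (strict ordering) option:

```go
package main

import "fmt"

// validateMaxUnavailable applies the constraints listed above: maxUnavailable is
// assumed to be at least 1 (its default), no greater than replicas, and no greater
// than replicas-partition when a partition is set.
func validateMaxUnavailable(replicas, partition, maxUnavailable int) error {
	if maxUnavailable < 1 {
		return fmt.Errorf("maxUnavailable must be >= 1, got %d", maxUnavailable)
	}
	if maxUnavailable > replicas {
		return fmt.Errorf("maxUnavailable %d cannot exceed replicas %d", maxUnavailable, replicas)
	}
	if maxUnavailable > replicas-partition {
		return fmt.Errorf("maxUnavailable %d cannot exceed replicas-partition (%d)", maxUnavailable, replicas-partition)
	}
	return nil
}

// firstDeletionBatch returns the ordinals deleted in the first round under the
// strict-ordering option: highest ordinals first, never touching ordinals below
// the partition, and at most maxUnavailable Pods at a time.
func firstDeletionBatch(replicas, partition, maxUnavailable int) []int {
	var batch []int
	for ordinal := replicas - 1; ordinal >= partition && len(batch) < maxUnavailable; ordinal-- {
		batch = append(batch, ordinal)
	}
	return batch
}

func main() {
	replicas, partition, maxUnavailable := 5, 2, 2
	if err := validateMaxUnavailable(replicas, partition, maxUnavailable); err != nil {
		panic(err)
	}
	fmt.Println(firstDeletionBatch(replicas, partition, maxUnavailable)) // prints [4 3]
}
```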

#### Implementation

The following is based on the existing update logic at
https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset/stateful_set_control.go#L504, with the proposed change marked `NEW CODE HERE`:

```go
...
	podsDeleted := 0
	// we terminate the Pod with the largest ordinal that does not match the update revision.
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// delete the Pod if it is not already terminating and does not match the update revision.
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
			status.CurrentReplicas--

			// NEW CODE HERE
			// maxUnavailable is the proposed field resolved from its IntOrString
			// form (an absolute value, or a percentage of replicas rounded down).
			if podsDeleted < maxUnavailable {
				podsDeleted++
				continue
			}
			return &status, err
		}

		// wait for unhealthy Pods on update
		if !isHealthy(replicas[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			return &status, nil
		}
	}
...
```

> **Review comment:** We should describe how this feature can be implemented and how it interacts with existing features, such as partition and pod management policy, rather than a code block.
>
> **Author reply:** I think the implementation cannot be concretely defined until we agree on the semantics. I will update the doc with the semantics in the case of partition. Also, as you rightly mentioned earlier, maxUnavailable has nothing to do with pod management policy, so I will update the docs accordingly. Once we have agreed on the semantics, I will update the implementation section as well.

### Risks and Mitigations

We are proposing a new field called `maxUnavailable` whose default value will be 1. With the default, StatefulSet behaves exactly as it does today.
It is possible that we introduce a bug in the implementation. The mitigation is that the feature is disabled by default in the Alpha phase, so people can
try it out and give feedback.
In the Beta phase, when it is enabled by default, people will only see issues or bugs when `maxUnavailable` is set to something greater than 1. Since people
will have tried this feature in Alpha, we will have time to fix issues.
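
As a rough, non-authoritative sketch of the Alpha gating described above, the new behavior could sit behind a standard feature gate. The gate name `MaxUnavailableStatefulSet` and the use of `k8s.io/component-base/featuregate` here are illustrative assumptions, not something this KEP fixes:

```go
package main

import (
	"fmt"

	"k8s.io/component-base/featuregate"
)

// MaxUnavailableStatefulSet is a placeholder gate name used only for this sketch.
const MaxUnavailableStatefulSet featuregate.Feature = "MaxUnavailableStatefulSet"

func main() {
	gate := featuregate.NewFeatureGate()
	if err := gate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		// Alpha: disabled by default, so existing StatefulSets keep today's behavior.
		MaxUnavailableStatefulSet: {Default: false, PreRelease: featuregate.Alpha},
	}); err != nil {
		panic(err)
	}

	// Controller code would only honor maxUnavailable when the gate is enabled.
	fmt.Println("honor maxUnavailable:", gate.Enabled(MaxUnavailableStatefulSet))

	// Operators would opt in via --feature-gates=MaxUnavailableStatefulSet=true; simulated here.
	if err := gate.Set("MaxUnavailableStatefulSet=true"); err != nil {
		panic(err)
	}
	fmt.Println("honor maxUnavailable:", gate.Enabled(MaxUnavailableStatefulSet))
}
```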

### Upgrades/Downgrades

- Upgrades
  When upgrading from a release without this feature to a release with maxUnavailable, we will set maxUnavailable to 1. This gives users the same default
  behavior they have come to expect from previous releases.
- Downgrades
  When downgrading from a release with this feature to a release without maxUnavailable, there are two cases:
  - If maxUnavailable is greater than 1: the user can see unexpected behavior. (Find out what the recommendation is here: a warning, disallowing the upgrade, dropping the field, etc.)
  - If maxUnavailable is less than or equal to 1: the user won't see any difference in behavior.
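
A minimal sketch of how that default could be applied on upgrade; the struct below is a stand-in for the real API type and `setDefaultMaxUnavailable` is a hypothetical helper, shown only to illustrate the intended upgrade behavior:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/intstr"
)

// RollingUpdateStatefulSetStrategy is a stand-in for the real API type with the
// proposed MaxUnavailable field.
type RollingUpdateStatefulSetStrategy struct {
	Partition      *int32
	MaxUnavailable *intstr.IntOrString
}

// setDefaultMaxUnavailable defaults MaxUnavailable to 1 when unset, e.g. for objects
// created before the upgrade, preserving today's one-Pod-at-a-time behavior.
func setDefaultMaxUnavailable(s *RollingUpdateStatefulSetStrategy) {
	if s.MaxUnavailable == nil {
		one := intstr.FromInt(1)
		s.MaxUnavailable = &one
	}
}

func main() {
	s := &RollingUpdateStatefulSetStrategy{}
	setDefaultMaxUnavailable(s)
	fmt.Println(s.MaxUnavailable.String()) // prints "1"
}
```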

### Tests

- maxUnavailable = 1: same behavior as today
- maxUnavailable greater than 1, without partition
- maxUnavailable greater than replicas, without partition
- maxUnavailable greater than 1, with partition and staged Pods fewer than maxUnavailable
- maxUnavailable greater than 1, with partition and staged Pods equal to maxUnavailable
- maxUnavailable greater than 1, with partition and staged Pods greater than maxUnavailable
- maxUnavailable greater than 1, with partition and maxUnavailable greater than replicas
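
The list above could be organized as a table-driven test. The skeleton below is illustrative only: the names are placeholders, the harness is not wired up, and the validity expectations are inferred from the validation rules earlier in this KEP:

```go
package statefulset

import "testing"

// TestMaxUnavailableRollingUpdate is a placeholder skeleton for the cases listed above.
func TestMaxUnavailableRollingUpdate(t *testing.T) {
	cases := []struct {
		name           string
		replicas       int32
		partition      int32
		maxUnavailable string // absolute number or percentage
		expectValid    bool
	}{
		{"default of 1 keeps current behavior", 5, 0, "1", true},
		{"greater than 1 without partition", 5, 0, "2", true},
		{"greater than replicas without partition", 5, 0, "10", false},
		{"partition with fewer staged pods than maxUnavailable", 5, 4, "2", false},
		{"partition with staged pods equal to maxUnavailable", 5, 3, "2", true},
		{"partition with more staged pods than maxUnavailable", 5, 1, "2", true},
		{"partition with maxUnavailable greater than replicas", 5, 2, "10", false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// Placeholder: wire up to the StatefulSet controller test fixtures
			// once the API change and validation land.
			t.Skip("not implemented: requires the proposed maxUnavailable field")
		})
	}
}
```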

## Graduation Criteria

- Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default.
- Beta: Enabled by default, with a default value of 1.

## Implementation History

- KEP started on 1/1/2019
- Implementation PR and unit tests by 3/15

## Drawbacks [optional]

Why should this KEP _not_ be implemented.

## Alternatives

- Users who need StatefulSet's stable identity and are OK with a slow rolling update will continue to use StatefulSets. Users who
  are not OK with a slow rolling update will continue to use Deployments, with workarounds for the scenarios mentioned in the Motivation
  section.
- Another alternative would be to use OnDelete and deploy your own custom controller on top of StatefulSet Pods. There you can implement
  your own logic for deleting more than one Pod in a specific order. This requires more work from the user but gives them ultimate flexibility.

> **Author:** @kow3ns @janetkuo @Kargakis @FillZpp I have added three options below for more discussion, please review.

> **Author:** Pinging again @kow3ns @janetkuo: who should I follow up with to get this merged, and should the follow-up discussion happen in a new PR or an issue for further design details?