initial StatefulSet updates proposal #503
Conversation
network configuration of the application, I want to update Services and
container ports in a consistent way.
- As the administrator of a stateful application, when I scale my application
horizontally, I went associated PodDistruptionBudgets to be adjusted to
typo
- StatefulSet update must support singleton StatefulSets. However, an update in
this case will cause a temporary outage. This is acceptable as a single
process application is, by definition, not highly available.
- Disruption in Kubernetes is controlled by PodDisruptionBudgets. As
Disruption in Deployments and DaemonSets is controlled by a field in the Strategy of each controller: MaxUnavailable. There has been discussion (kubernetes/kubernetes#35318) about unifying that field with PDB but I don't think there is any consensus for this one yet. Based on kubernetes/kubernetes#35318 (comment) and kubernetes/kubernetes#35318 (comment) I think there is room for both.
The only issue for StatefulSet is that with PDBs if we ever want to move faster than one Pod at a time we will have to create evictions instead of deleting Pods.
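A minimal sketch of the eviction-based path, assuming the policy/v1beta1 Eviction type and the client-go `Evict` helper of this era (exact signatures varied by release):

```go
import (
	"k8s.io/api/core/v1"
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod asks the eviction subresource to remove the Pod so that any
// PodDisruptionBudget covering it is honored; a direct delete would bypass
// the budget entirely.
func evictPod(client kubernetes.Interface, pod *v1.Pod) error {
	return client.CoreV1().Pods(pod.Namespace).Evict(&policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
	})
}
```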
## Requirements
This design is based on the following requirements.
- Users must be able to update the containers of a StatefulSet's Pods.
- Updates to container commands, images, resources and configuration must be
Does this mean that other parts of the PodTemplateSpec will be gated?
I think that for a first pass we should support just the above and then (potentially) relax the allowable mutable portions of .Template. Right now you can update Replicas and Containers. I think that is good enough to support configuration, resources, and image updates. Granted, you can cause an update by adding some arbitrary annotation to a Container.
Hrm - I'm having trouble coming up with a reason to gate template mutation, since we don't do it on any other "updatable" controller.
The current behavior will fail mutation on validation if anything other than Replicas and Containers is updated. We can change it to open it up in a backward compatible way.
> The current behavior will fail mutation on validation if anything other than Replicas and Containers is updated. We can change it to open it up in a backward compatible way.

What I mean is that allowing for more mutable fields cannot break someone's existing StatefulSets. However, once we make something mutable, we can't take it back without breaking backward compatibility.
Why would we want to take it back? Eventually a StatefulSet should be updatable as any other controller.
If we allowed Volumes to be updated, what would that semantically mean? Is that an identity change?
The changes we are considering now would be mutation to the VolumeClaimsTemplate with respect to resizing the storage. It would not be an identity change because we would not mutate the contents of the storage.
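A minimal sketch of such a resize, assuming the apps/v1beta1 types and that PVC resizing is available; only the storage request changes, not the claim's identity or contents:

```go
import (
	appsv1beta1 "k8s.io/api/apps/v1beta1"
	"k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// growStorage bumps the storage request in every claim template. The data
// and the Pod identity are untouched; only capacity grows.
func growStorage(set *appsv1beta1.StatefulSet, size string) {
	for i := range set.Spec.VolumeClaimTemplates {
		requests := set.Spec.VolumeClaimTemplates[i].Spec.Resources.Requests
		requests[v1.ResourceStorage] = resource.MustParse(size)
	}
}
```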
Can we clarify here that the existing validation that is applied in the StatefulSet pod template will be lifted?
// CurrentTemplateRevision, if not nil, is the revision of the PodTemplate
// that was used to create Pods with ordinals in the sequence
// [0,CurrentRevisionReplicas).
TemplateRevision *int64 `json:"templateRevision,omitempty"`
CurrentTemplateRevision
Is this a usability issue with Deployments that we need to fix?
> Is this a usability issue with Deployments that we need to fix?

The fact that we don't output the revisions in the status, both the one we roll from and the one we roll to.
Hrm - why wouldn't this be generation?
@smarterclayton Daemon set uses TemplateGeneration. I will update the proposal such that StatefulSet will use the same.
@Kargakis I would think that, for all controllers, we would want to express the full current and target state as part of the status, but that it's more important for StatefulSet than for Deployment.
// RevisionHistoryDepth is the maximum number of PodTemplates that will
// be maintained in the StatefulSet's revision history. It must be at
// least two.
RevisionHisotryDepth int32 `json:historyRevisionDepth,omitempty`
RevisionHistoryLimit to be consistent with Deployments?
Makes sense
// Template.
TemplateRevision *int64 `json:"templateRevision"`

// RevisionParition paritions the Pods in the StatefulSet by ordinal such
typos
// all Pods with a greater or equal ordinal will be created from the
// PodTemplate that represents the target revision of the StatefulSet's
// revision history.
RevisionPartition *int32 `json:"revisionPartition,omitempty"`
Can you explain why this field is needed? Is it for running canaries? We want something similar for the rest of the workload controllers but it's going to differ from this one.
It's for canaries and staged roll outs. I think StatefulSet should be different from Deployments and ReplicaSets in this regard. Deployments and ReplicaSets are probabilistic in nature and StatefulSets are ordered and deterministic. I think that specifying a percentage makes more sense for Deployments and ReplicaSets while partitioning a StatefulSet based on ordinal makes the most sense. For a Deployment or ReplicaSet I want to say give me x% with a target configuration. For a StatefulSet I want to say update these members (which in this case live in a contiguous partition).
This feels like it is part of a strategy struct (like deployments). Can we imagine a different approach to update in the future? Custom rollout strategies? If so a strategy struct is a good hedge.
That has precedent in DaemonSet, and it is a small level of effort for the benefit of the hedge.
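For reference, a sketch of what that strategy struct hedge could look like, mirroring DaemonSetUpdateStrategy (names are illustrative, not final):

```go
// StatefulSetUpdateStrategyType names an update strategy.
type StatefulSetUpdateStrategyType string

// StatefulSetUpdateStrategy leaves room for future custom roll out
// strategies without breaking the API surface.
type StatefulSetUpdateStrategy struct {
	// Type indicates the strategy in use, e.g. "RollingUpdate" or "OnDelete".
	Type StatefulSetUpdateStrategyType `json:"type,omitempty"`
	// Partition, if set, bounds the ordinals the roll out may touch, for
	// canaries and phased roll outs.
	Partition *int32 `json:"partition,omitempty"`
}
```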
```go
// StatefulSetPodTemplateLabel is the label applied to a PodTemplate to allow
// the StatefulSet controller to select the PodTemplates in its revision
```
Why not rely on owner references?
To select them. OwnerReference is needed as well to avoid revision history overlap.
What if you apply the same labels to the PodTemplates that you do to your Pods, then you select them with your normal controller selector, and filter by ControllerRef to handle overlapping selectors?
I was assuming @enisoc's suggestion. +1 on that
SGTM
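A sketch of the agreed selection scheme, assuming era-appropriate client-go and apimachinery helpers (`metav1.GetControllerOf` is the stock accessor):

```go
import (
	appsv1beta1 "k8s.io/api/apps/v1beta1"
	"k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

// ownedTemplates lists PodTemplates with the controller's Pod selector and
// keeps only those whose ControllerRef points back at this StatefulSet, so
// overlapping selectors cannot mix revision histories.
func ownedTemplates(client kubernetes.Interface, set *appsv1beta1.StatefulSet,
	selector labels.Selector) ([]*v1.PodTemplate, error) {
	list, err := client.CoreV1().PodTemplates(set.Namespace).
		List(metav1.ListOptions{LabelSelector: selector.String()})
	if err != nil {
		return nil, err
	}
	var owned []*v1.PodTemplate
	for i := range list.Items {
		if ref := metav1.GetControllerOf(&list.Items[i]); ref != nil && ref.UID == set.UID {
			owned = append(owned, &list.Items[i])
		}
	}
	return owned, nil
}
```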
### Target State
The declared target state of a StatefulSet requires that all Pods in the
StatefulSet conform to exactly one or two PodTemplates in the StatefulSet's
Shouldn't we rely on the StatefulSet PodTemplateSpec for the desired state? Are PodTemplates going to be read-only?
In essence we still will. The target revision of the StatefulSet's history will be semantically, deeply equivalent to its Spec.Template with the exception that it will carry the annotation for revision history that is applied to the PodTemplate. That is the reason that the PodTemplate, and not the Spec.Template, will be used for creation.
When you say "conform to exactly one or two PodTemplates", do you mean two because some pods can be with the old template and some can be with the new template?
No. A Pod conforms to exactly one of two templates.
@kow3ns are you talking about the canary case?
1. The StatefulSet's `.Status.TemplateRevision` is equal to its
`.Status.TargetRevision`.
1. All Pods in the StatefulSet have been generated from the PodTemplate
labled with a `StatefulSetTemplateRevisionLabel` equal to its
typo
When the StatefulSet controller creates a PodTemplate for a StatefulSet, it will
do the following.

1. The controller will set the PodTemplate's `.PodTemplateSpec` field to the
This doesn't feel right. How is the controller creating the PodTemplate in the first place?
The controller generates PodTemplates based on its current Template, when modified, in order to keep track of the modifications made to the Template. The idea is that for every revision to the StatefulSet's .Spec.Template we will generate a new PodTemplate that represents that particular revision and add it to the controller's revision history.
The way this point is worded, it seems to me that the controller is mutating the StatefulSet.
Never mind, you are using this wording elsewhere. I can get used to it :)
I will make this more clear
I think that there should be at least a link between the Template Update and PodTemplate Creation sections to make it clear that a Template Update is what triggers PodTemplate Creation.
StatefulSet's `.Spec.Template` field.
1. The controller will create a ControllerRef object in the PodTemplate's
`.OwnerReferences` list to mediate selector overlapping.
1. The controller will label the PodTemplate with a
Why is the owner ref from the previous step not enough, such that an additional label is needed?
Are you suggesting that we just list all PodTemplates and then filter by OwnerRef?
Yes, basically that's what @enisoc did for all the rest of the controllers.
I have no objection to doing it that way. It seems more efficient to select based on a label to get only the subset that I care about, and then filter those based on ControllerRef. Is there a benefit to listing all PodTemplates? The only reason I can think of to select them all is to continue to consider PodTemplates that have had the label intentionally removed by the user, but if the user intentionally removes the label isn't she indicating by explicit action that the PodTemplate should be removed?
In case a PodTemplate A-1 is not selectable by a StatefulSet A anymore but A-1 has an owner reference to A, the StatefulSet controller needs to remove that owner reference.
https://github.com/kubernetes/community/blob/31d62c70214d66ca2c21716c02b83d9087fdf3ff/contributors/design-proposals/controller-ref.md#orphaning
The point above still applies. The only way this should happen is if the user explicitly mutates the label applied to the PodTemplate, or, if StatefulSet.Spec.Selector becomes mutable, by mutating the selector. Are we worried about orphans occurring due to unintentional user error? If so, let's not make the selector mutable.
Also, I was not planning on having controllers adopt PodTemplates in the same way they adopt orphan Pods. A controller taking ownership of an orphaned Pod makes sense; a controller taking ownership of another controller's revision history, imo, does not. I don't think we need to solve for the same constraints as we did for Pods.
object's revision history. In practice, these PodTemplates will be filtered out
prior to history maintenance.
1. If the PodTemplate's ControllerRef matches the StatefulSet, the
StatefulSet controller will delete the PodTemplate.
Shouldn't this be handled by the GC?
I wasn't sure if GC would work for PodTemplates, but it appears that it will, provided the OwnerReference is set. I will update to indicate that we will use GC. That is the direction I wanted to go eventually.
On second thought, PodTemplate is not really a resource on the API server, right? It's just an array/field inside the StatefulSet Spec. Not sure if GC will handle this @caesarxuchao?
I have the same question @krmayankk asked, isn't PodTemplate a field?
OwnerReferences is a field in ObjectMeta, which is only included in each top-level API object.
PodTemplates are top-level objects, but they are not used anywhere in Kubernetes today. PodTemplateSpec is the field that every controller has.
Thanks. It's good to know. Then GC can handle PodTemplate.
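A minimal sketch of wiring that up, using the stock `metav1.NewControllerRef` helper so the GC sees the PodTemplate as owned:

```go
import (
	appsv1beta1 "k8s.io/api/apps/v1beta1"
	"k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setControllerRef marks the PodTemplate as owned by the StatefulSet so the
// garbage collector deletes the revision when the set itself is deleted.
func setControllerRef(set *appsv1beta1.StatefulSet, template *v1.PodTemplate) {
	gvk := appsv1beta1.SchemeGroupVersion.WithKind("StatefulSet")
	template.OwnerReferences = []metav1.OwnerReference{
		*metav1.NewControllerRef(set, gvk),
	}
}
```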
// CurrentRevisionReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec indicated by CurrentTemplateRevision.
CurrentReplicas int32 `json:"currentReplicas,omitempty"`
This may be confusing because once a StatefulSet is updated and a new PodTemplate is created, CurrentReplicas is really NeedToBeReplacedReplicas. I can't think of a better name now.
I couldn't think of a better name either, but I'm very open to suggestion here. Also, do you have thoughts on the naming of PartitionRevision? I'd be happy to have a better name for this also.
// TargetRevisionReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec indicated by TargetTemplateRevision.
TargeReplicas int32 `json:"taretReplicas,omitempty"`
UpdatedReplicas to stay consistent with what we have in Deployments?
SGTM
v1.PodTemplateSpec `json:"template"`

// TemplateRevision is a monotonically increasing, 64 bit, integer used to
// indicate the version of the of the PodTemplateSpec. If nil, the
typo
Which component is going to increase this field?
API Server
Why isn't this a generation?
Will update for consistency with DaemonSet.
process application is, by definition, not highly available.
- Disruption in Kubernetes is controlled by PodDisruptionBudgets. As
StatefulSet updates progress one Pod at a time, and only occur when all
other Pods have a Status of Running and a Ready Condition, they cannot
In DaemonSet we check if the pod is Available (Ready + MinReadySeconds).
Is that done primarily to tolerate infant mortality?
You may want to read #478
Basically I am fine with not handling MinReadySeconds for StatefulSets since we are moving it in the PodSpec. Eventually, we should add the additional status field (StableReplicas).
Yeah stable pods will be supported, but I didn't think adding that to StatefulSetStatus was relevant for this proposal. You already have an issue open for consistent Status objects for controllers, and I thought that work could be performed under that issue, or a StatefulSet specific sub issue.
[maintenance of its revision history](#history-maintenance).

### StatefulSet Revision History
The StatefulSet controller will use labeled, versioned PodTemplates to keep a
Is this a final decision? In DaemonSets we haven't decided yet how to store the history. cc @Kargakis @janetkuo @smarterclayton
Yeah, we haven't yet. I think we need to get a quick list of options written up and discussed:
- PodTemplate (collision between use for different controller types)
- Specific type for each controller: DaemonSetHistory, StatefulSetHistory, etc (no reuse between controllers)
- Storing a limited subset of specs on the object (size limits)
- ???
I'm obviously partial to 1. My primary reason is that controllers should not be special and require magic. If we get TPRs right then PodTemplates can be leveraged by arbitrary operators/controllers, and they, therefore, provide a primitive upon which others can build.
So for pod template here's what we need to handle:
- Collisions across controllers by naming
- Idempotency of product versions (the same pod template in a controller object should map to exactly one PodTemplate across multiple upgrades)
- Human readability - a human should be able to find a PodTemplate
- Collisions across controllers by pod template spec - if two controllers deploy the exact same pod template (daemon set vs stateful set) the PodTemplate has two owners, not just one, so garbage collection won't work
- Future versions of controllers may involve renames of the controller resource name, so we would still have to be able to locate old PodTemplate revisions even if the resource name changes
- Immutability of PodTemplate - a template used in this fashion needs to prevent updates, or at least strongly discourage them (if they are mutable, we can't compare the pods against the pod template version).
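One hedged sketch of the idempotency point in the list above: hash the serialized PodTemplateSpec into the name, so the same template maps to the same PodTemplate across controller upgrades (the hash and serialization choices here are assumptions, not settled design):

```go
import (
	"encoding/json"
	"fmt"
	"hash/fnv"

	"k8s.io/api/core/v1"
)

// templateName derives an idempotent name for a revision: the same
// PodTemplateSpec always serializes and hashes to the same suffix, so
// re-creating history never duplicates a revision.
func templateName(parent string, spec *v1.PodTemplateSpec) (string, error) {
	data, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	hasher := fnv.New32a()
	hasher.Write(data)
	return fmt.Sprintf("%s-%x", parent, hasher.Sum32()), nil
}
```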
I think the real issue is 1 - the rest of the problems apply to ReplicaSets owned by Deployments too.
For clarity, I am talking about the UID of the parent.
@smarterclayton @janetkuo @Kargakis @lukaszo
Before we go forward with PodTemplate creation we should also discuss adding a History sub resource to the controllers (I believe that is something like option 3 above).
Having a per Kind History sub resource mitigates History overlap across Kinds with the same .Name in the same namespace. My argument about PodTemplate being a primitive that is usable without controller specific magic is valid (imo), but perhaps it's also a straw man.
Right now the category of controllers could loosely be defined like this.
A generative controller takes, as an argument, an Object that represents the declared target state. The Object must have the following properties:
- A `.Spec.Template` that is a template for Objects of the generated Kind.
- A `.Spec.Replicas` that expresses the desired cardinality of the generated Kind.
- A Generation that indicates the version of the declared target state.
- A `.Status.Replicas` that indicates the number of created replicas.
- If the Objects of the created Kind are not immediately available upon creation, a `.Status.ReadyReplicas` to indicate the number of available created resources.
- A `.Status.ObservedGeneration` that indicates the version of the target state specification that the controller has observed.

I confirmed with @foxish that the plan to support third party controllers and operators is to make these properties well supported. That is, we want them to create custom Kinds that contain these sub resources. Given that, adding a History sub resource might be the best fit.
We could say that controllers that implement revision history must have:
- A `History.Revision` where the type of this object is an `[]PodTemplate` or `[]PodTemplateSpec`.
- A `History.RevisionLimit` that indicates the number of stored revisions.
- A `Spec.TemplateGeneration` that is incremented by the API server upon mutation of the `.Spec.Template`.
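A hypothetical shape for that sub resource, under the constraints listed above (none of these names are settled):

```go
import "k8s.io/api/core/v1"

// History is a sketch of a per-controller revision history sub resource.
type History struct {
	// Revision holds the retained revisions of the Template, oldest first.
	Revision []v1.PodTemplateSpec `json:"revision,omitempty"`
	// RevisionLimit bounds how many revisions are retained.
	RevisionLimit int32 `json:"revisionLimit,omitempty"`
}
```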
There has been some discussion before and I think there is general agreement that we want a history subresource. I am not sure whether we need to have this discussion in this proposal or in a separate one since it 1. affects all of the workload controllers and 2. is not required for having StatefulSet upgrades.
I don't have a problem with putting uids in names of generated resources provided they are ONLY used for uniquification and NEVER used for discovery/searching.
Prevention of mutation and deletion is also desired in other contexts (e.g., addons): kubernetes/kubernetes#10179
reconfiguration will be performed destructively.
- Stateful applications are likely to evolve wire protocols and storage formats
between versions. In most cases, when updating the application's Pod's
containers, it will not be safe to roll back or forward to an arbitrary
This applies to deployments as well - maybe indicate it's more likely, but that we're doing exactly what deployments do here.
SGTM
Does this mean that, unlike Deployments, where you can roll back to any revision you want, a StatefulSet can only be rolled back to the previous revision?
It means you can only have one target version and one current version.
update to some portion of the fleet maintained by the StatefulSet prior to
updating the entire fleet. It is useful to support linear, geometric, and
exponential roll out of an update. Users can modify the
`.Spec.RevisionPartition` to allow the roll out to progress.
This is probably generally useful as part of programming by a higher level workflow, although I question whether most users wouldn't just do this with two stateful sets (the fact that we can't do this with stateful sets today is another question).
My thought is that if we have termination reason and local storage backed PVs (both future features) we probably need to support the ability to do staged roll out in this way. Mainly this would be to support operators that manage this type of workflow using StatefulSet as their primitive.
updated node.
It should be noted that this is already an issue for Node cordon and Pod eviction
(due to drain or taints), and applications can use the same mitigation as they
would for these events for StatefulSet update.
This is an often requested enhancement to the platform, and it really belongs to "termination reason".
I just want to call this out here. I'm going to offer a proposal for termination reason to, hopefully, be released concurrently with StatefulSet updates. There is a new contributor to sig-node who has offered to own this in the 1.7 time frame.
implementation, there is little value to implementing in place image updates,
and configuration and resource request/limit updates are not possible.
When [termination reason](https://github.com/kubernetes/kubernetes/issues/1462) is implemented we may modify
the behavior of StatefulSet update to only update, rather than delete and
There are other reasons you may want new pods - this sounds like something that would be a strategy option (and applies to daemonsets as well).
Agreed. For instance, if you resize a container to increase the memory and you have a JVM based application, you still (probably) want to restart it with new -Xmx -Xms.
Make reference to the fact that Deployment and StatefulSet both update from a specific revision to a specific revision.
1. The controller will label the PodTemplate with a
`StatefulSetPodTemplateLabel` set to the StatefulSet's `.Name` to allow for
selection of the PodTemplates that comprise the StatefulSet's revision history.
1. The controller will label the PodTemplate with a
When you say "label the PodTemplate", do you mean adding it to `PodTemplate.Metadata`, or `PodTemplate.Spec`?
- I like the approach of using a single selector.
- The selectable label is applied to the object and the revision is applied to the spec. The former is not included in the Pod and the latter is.
1. The controller will label the PodTemplate with a
`StatefulSetTemplateRevisionLabel` set to the StatefulSet's
`.Spec.TemplateRevision`.
1. The controller will set the Name of the PodTemplate to a concatenation of the
Is there any reason not to include a generated string to prevent name collisions with PodTemplates created by other controllers in the future? For SS pods, the names had to be deterministic since they act as locks. Do you need that property for PodTemplates too?
My reasoning is that if someone wanted to GET the PodTemplate for a StatefulSet at a particular revision to view it or to use it as a base when creating a new spec, its name is deterministic and therefore the object is retrievable. You could also select all appropriately labeled templates though. We need uniqueness for discrimination not for locking, but determinism is nice for inspection imo.
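A minimal sketch of that deterministic naming (the exact concatenation format is an assumption):

```go
import "fmt"

// podTemplateName is computable from the StatefulSet name and the revision
// number alone, so a user can GET a specific revision directly.
func podTemplateName(setName string, revision int64) string {
	return fmt.Sprintf("%s-%d", setName, revision)
}
```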
prior to history maintenance.
1. If the PodTemplate's ControllerRef matches the StatefulSet, the
StatefulSet controller will orphan the PodTemplate by removing its ControllerRef,
and it will allow the PodTemplate to be deleted via garbage collection.
The GC does not delete orphans. It only deletes things that have more than 0 OwnerReferences, all of which refer to objects that are deleted or deleting (have non-nil DeletionTimestamp).
If you want to delete something you own, you can just delete it directly without using the GC.
Right, if you want to cleanup older PodTemplates not conforming to RevisionHistoryLimit, you'll need to do it in the StatefulSet controller. Did I say something different on this topic yesterday?
I changed it to indicate orphaning to trigger GC but the generic GC implementation just requires a delete.
1. The controller will
[reconstruct the revision history](#history-reconsturction) of the
StatefulSet.
1. If the revision history of the StatefulSet contains a PodTemplate
Why do you do this step instead of always creating a new PodTemplate? It seems weird that `.Spec.TemplateRevision` is not going to equal `.Status.TargetTemplateRevision` if this step triggers.
We have to test for deep equality to do a rollback to a previous revision and avoid unintentional updates of Pods. For instance, if we generate a new PodTemplate and 90% of the fleet was generated from a PodTemplate that is semantically, deeply equivalent, we don't want to touch that portion of the fleet.
We could consider a more complicated approach where we update the label on the existing pod to conform to the equivalent revision. This might be difficult for users to reason about.
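A sketch of that equality test, using apimachinery's semantic deep-equality helper:

```go
import (
	appsv1beta1 "k8s.io/api/apps/v1beta1"
	"k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// findEquivalent returns a historical revision that is semantically, deeply
// equal to the live Template, if any, so a rollback reuses that revision and
// Pods already generated from it are never touched.
func findEquivalent(set *appsv1beta1.StatefulSet, history []v1.PodTemplate) *v1.PodTemplate {
	for i := range history {
		if apiequality.Semantic.DeepEqual(set.Spec.Template, history[i].Template) {
			return &history[i]
		}
	}
	return nil
}
```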
Users can use the following command to create the StatefulSet.

```shell
kubectl create -f web.yaml
```
If you plan to use `apply` later, you should use `apply` for the initial creation as well: https://kubernetes.io/docs/user-guide/kubectl/v1.6/#apply
deployment failure, but rolling forward is sometimes the only practical solution
for stateful applications (e.g. a user has a minor configuration error but has
already modified the storage format for the application). Users can use
sequential `kubectl apply`'s to update the `.Status.TargetRevision` of a
Users shouldn't modify anything in `Status`, right?
(due to drain or taints), and applications can use the same mitigation as they
would for these events for StatefulSet update.

### VolumeTemplatesSpec Updates
Wouldn't this also be valuable for the simpler case where your data is growing and you want to vertically scale each stateful pod by increasing its volume size? Or is there some reason you only mention temporary changes as being relevant here?
Yes it is valuable for that use case as well. My experience has been that with traditional RDBMSs you start seeing exponential disk utilization increases at around ~75% disk capacity for rotational media, but the growth towards that point on the utilization curve tends to be linear with respect to the amount of data stored, and this is driven by the number of new rows. Usually, provided you have some monitoring around it, you can adapt proactively prior to an outage.
I call out log structured merge trees and append only B trees in particular because people are often surprised when the write path blocks completely due to a compaction failure, and the size of a compacted SSTable or append only BTree is variable based on the mutation rate, not the number of newly inserted records, making it hard to predict.
## API Object

The following modifications will be made to the StatefulSetStatus API object.
Can you move the status changes below the spec changes? Seems more intuitive.
SGTM
// TemplateGeneration, if not nil, is the generation of the PodTemplate that was
// used to create Pods with ordinals in the sequence
// [0,CurrentReplicas).
TemplateGeneration *int64 `json:"templateGeneration,omitempty"`
CurrentTemplateGeneration to match CurrentReplicas?
SGTM
// TargetTemplateGeneration, if not nil, is the generation of the PodTemplate
// that was used to create Pods with ordinals in the sequence
// [Replicas - UpdatedReplicas, Replicas).
TargetTemplateGeneration *int64 `json:"targetTemplateGeneration,omitempty"`
UpdatedTemplateGeneration to match UpdatedReplicas? I don't feel strong about any of those names but we should at least be consistent with what we are going to pick.
SGTM
Phased roll outs can be used to roll out a configuration, image, or resource
update to some portion of the fleet maintained by the StatefulSet prior to
updating the entire fleet. It is useful to support linear, geometric, and
exponential roll out of an update. Users can modify the
What do you mean by "useful to support linear, geometric" etc.? Do you mean there should be some kind of support in the StatefulSet controller? I am curious to understand this more. I was thinking you would get linear, geometric by updating the ordinal number.
Correct. It means that the Partition strategy allows a higher level controller to achieve those ends.
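For illustration, a geometric roll out driven from outside the controller might look like this sketch; `setPartition` and `waitHealthy` are hypothetical helpers, not proposed API:

```go
import appsv1beta1 "k8s.io/api/apps/v1beta1"

// geometricRollout halves the partition after each healthy phase, doubling
// the updated portion of the fleet until every ordinal is updated.
// Assumes set.Spec.Replicas is non-nil.
func geometricRollout(set *appsv1beta1.StatefulSet) error {
	partition := *set.Spec.Replicas // nothing updated yet
	for partition > 0 {
		partition /= 2
		if err := setPartition(set, partition); err != nil { // hypothetical: patch .Spec.UpdateStrategy.Partition
			return err
		}
		if err := waitHealthy(set); err != nil { // hypothetical: block until updated Pods are Running and Ready
			return err
		}
	}
	return nil
}
```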
The command below will roll back the `web` StatefulSet to the previous revision in
its history. If a roll out is in progress, it will stop deploying the target
revision, and roll back to the current revision.
Can we state somewhere explicitly that we won't support RollbackTo for arbitrary revisions for StatefulSets?
That is scoped at the beginning of the document.
is implemented we may modify the behavior of StatefulSet update to only update,
rather than delete and create, Pods when the only mutated value is the container
image, and if resizable resource request/limits is implemented, we may extend
the above to allow for updates to Pod resources.
It states that above.
intermittent compaction as a form of garbage collection. Applications that use
log structured merge trees with size tiered compaction (e.g Cassandra) or append
only B(+/*) Trees (e.g Couchbase) can temporarily double their storage usage when
compacting their on disk storage. If there is insufficient space for compaction
typo
// OnDeleteStatefulSetStrategyType triggers the legacy behavior. Version
// tracking and ordered rolling restarts are disabled. Pods are recreated
// from the StatefulSetSpec when they are manually deleted.
OnDeleteStatefulSetStrategyType = "OnDelete"
If we have a reason to not support `Recreate`, call it out in this proposal.
`OnDelete` might be somewhat unfamiliar as well. We could potentially call it `Replacement` or `ReplacementUpdate` (since we are only recreating when a pod needs to be replaced).
RollingUpdate and OnDelete are for consistency with DaemonSet. These are already implemented. I don't see a strong reason to make them inconsistent here.
// all possible update strategies for the StatefulSet controller.
type StatefulSetUpdateStrategyType string

const (
Strategies should define their behavior on scale up/down during an update in the godoc
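A sketch of what that godoc could say; the scale up/down wording here is assumed, not settled behavior:

```go
const (
	// RollingUpdateStatefulSetStrategyType updates Pods in reverse ordinal
	// order. Pods created by a concurrent scale up are created from the
	// update revision; scale down deletes the highest ordinals first.
	RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"

	// OnDeleteStatefulSetStrategyType preserves the legacy behavior: no
	// version tracking, and Pods (including those created during scale up)
	// are recreated from the current Spec only when manually deleted.
	OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = "OnDelete"
)
```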
a StatefulSet, as proposed [here](https://github.com/kubernetes/community/pull/541),
the tenant application has no way to determine if it is being terminated due to
a scale down operation or due to an update.
Consider a BASE distributed storage application like Cassandra, where 2 TiB of
Newline here to start a new para
No significant concerns left, except some fiddling with naming. Since updates will be alpha in 1.7 we should clearly communicate that in Godoc during implementation and that we reserve the right to change the names.
This is LGTM. If anyone has remaining comments on this or hasn't reviewed the latest and wants to chime in, please do so by EOD May 18th. This looks like a workable step.
Type StatefulSetUpdateStrategyType
// Partition is used to communicate the ordinal at which to partition
// the StatefulSet when Type is PartitionStatefulSetStrategyType. This
// value must be set when Type is PartitionStatefulSetStrategyType,
My understanding was that by unsetting this we could allow the rollout to proceed like a normal Rolling rollout. Isn't that the case?
// UpdatedReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
// by UpdateRevision.
UpdatedReplicas int32 `json:"taretReplicas,omitempty"`
json tag needs to be updated
entry point commands or parameters, or configuration files.
- As the administrator of the logging and monitoring infrastructure for my
organization, in order to add logging and monitoring side cars, I want to patch
containers to add images.
Nit: s/"I want to patch containers to add images."/"I want to patch Pods to add containers."/
UpdateStrategy StatefulSetUpdateStrategy `json:"updateStrategy,omitempty"`

// RevisionHistoryLimit is the maximum number of revisions that will
// be maintained in the StatefulSet's revision history. The revision history
From the controller history proposal, this would be the number of non-live revisions (so the entire history including live revisions could exceed this limit, if I understand correctly).
Correct, the "revision history" as described by the RevisionHistoryLimit contains only non-live revisions.
CurrentRevision string `json:"currentRevision,omitempty"`

// UpdateRevision, if not empty, indicates the version of PodSpecTemplate,
// VolumeClaimsTemplate tuple used to generate Pods in the sequence
Nit: alignment is off on the comments for these fields.
// CurrentReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
// by CurrentRevision.
CurrentReplicas int32 `json:"currentReplicas,omitempty"`
The name here is kind of ambiguous -- one might interpret `CurrentReplicas` as the current number of replicas in the stateful set, instead of the number of replicas at the CurrentRevision. Maybe something like `ReplicasAtCurrentRevision` and `ReplicasAtUpdateRevision` would be clearer?
We have had multiple discussions over the naming and have landed on this for consistency with DaemonSet. I don't see a reason to change the naming yet again.
1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
the Pod should be consistent with version indicated by `Status.CurrentRevision`.
1. If the Pod's ordinal is in the sequence
`[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)`the Pod
Nit: missing space between ` and "the"
`RollingUpdateStatefulSetStrategyType` then the version of the Pod should be
as follows.
1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
the Pod should be consistent with version indicated by `Status.CurrentRevision`.
For a rolling update, isn't the target pod state for all pods in `[0,.Spec.Replicas)` the `UpdateRevision`?
It seems like "target pod state" is being overloaded here -- I interpret it as the goal state for the rolling update, but the description here seems to suggest that it is referring to the pod's point in time state.
The target state for Pods needs to be consistent with respect to the StatefulSet invariants. Target Pod state refers to the desired state of a Pod during an individual iteration of the control loop. Target state refers to the desired state the controller wishes to evolve the system to.
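A minimal sketch of the per-ordinal rule described above, using the status field names from this proposal:

```go
import appsv1beta1 "k8s.io/api/apps/v1beta1"

// targetRevision returns the revision a Pod at the given ordinal should
// conform to during a rolling update: ordinals at the tail of the sequence
// take the update revision, the rest stay at the current revision.
func targetRevision(ordinal int32, status appsv1beta1.StatefulSetStatus) string {
	if ordinal >= status.Replicas-status.UpdatedReplicas {
		return status.UpdateRevision
	}
	return status.CurrentRevision
}
```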
#### History Maintenance
In order to prevent the revision history of the StatefulSet from exceeding
memory or storage limits, the StatefulSet controller will periodically prune
its revision history so that no more that `.Spec.RevisionHisotryLimit` non-live
Nit: s/Hisotry/History/
`PartitionStatefulSetStrategyType`, the API Server should fail validation
if any of the following conditions are true.
1. `.Spec.UpdateStrategy.Partition` is nil.
1. `.Spec.UpdateStratgegy.Parition` is not nil, and
Nit: s/UpdateStratgegy.Parition/UpdateStrategy.Partition/
### Canaries
Users can create a canary using `kubectl apply`. The only difference between a
[rolling update](#rolling-out-an-update) and a canary is that the
`.Spec.UpdateStrategy.Type` is set to `ParitionedStatefulSetStrategyType` and
Nit: s/Paritioned/Partitioned/
1. If all Pods in the sequence `[0,.Spec.Replicas)` have been created, but if any
do not have a Ready Condition, the StatefulSet controller will wait for these
Pods to either become Ready, or to be completely deleted.
1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Ready Condition, and
Any plans for support of minReadySeconds?
Or maxUnavailable to allow more than one replica to be taken down at once?
A more general option than maxUnavailable - covers scaling and node drains too - is added in kubernetes/kubernetes#44899
1. The controller will increment the `.Status.ObservedGeneration` to communicate
the `.Generation` of the StatefulSet object that was observed.
1. The controller will set the `.Status.Replicas` to the current number of
Add a section somewhere and mention that we are breaking this field with this proposal - see kubernetes/kubernetes#42410
The meaning will be the same as for the rest of the controllers; it hasn't been so far.
I'm going to merge COB PDT unless there are any objections.
None from me.
One last issue I see in this proposal is the way the completion of an update is signaled. Specifically, the last step:
UpdatedReplicas means the replicas that have been updated to the latest revision, setting it to zero means that there are no updated replicas. By doing this to signal completion, we are basically having an API field that means different things in different contexts, i.e., the number of updated replicas during an update and it just signals completion during stable state. We don't do this anywhere else in our APIs and I feel it's an anti-pattern.
I would expect the completion of an update to be a simple equation:
We can have Conditions in a follow-up to communicate completion in a more human-friendly way.
Also, I see more reasons in having a single strategy that includes partitioning as an option. See 1. https://github.com/kubernetes/kubernetes/pull/46669/files#r119898826 and 2. how complex the validation ends up @kubernetes/sig-apps-api-reviews