
initial StatefulSet updates proposal #503

Merged May 27, 2017 (12 commits)
Conversation

@kow3ns (Member) commented Apr 3, 2017

@kubernetes/sig-apps-api-reviews
@kubernetes/sig-apps-feature-requests
@smarterclayton
@Kargakis
@erictune
@janetkuo
@enisoc
@foxish

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 3, 2017
network configuration of the application, I want to update Services and
container ports in a consistent way.
- As the administrator of a stateful application, when I scale my application
horizontally, I went associated PodDistruptionBudgets to be adjusted to
Contributor

typo

- StatefulSet update must support singleton StatefulSets. However, an update in
this case will cause a temporary outage. This is acceptable as a single
process application is, by definition, not highly available.
- Disruption in Kubernetes is controlled by PodDisruptionBudgets. As
Contributor

Disruption in Deployments and DaemonSets is controlled by a field in the Strategy of each controller: MaxUnavailable. There has been discussion (kubernetes/kubernetes#35318) about unifying that field with PDB but I don't think there is any consensus for this one yet. Based on kubernetes/kubernetes#35318 (comment) and kubernetes/kubernetes#35318 (comment) I think there is room for both.

Member Author

The only issue for StatefulSet is that with PDBs if we ever want to move faster than one Pod at a time we will have to create evictions instead of deleting Pods.
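For illustration, a minimal client-go sketch of "creating evictions instead of deleting Pods"; it uses the policy/v1 Eviction subresource (this thread predates it, when the equivalent lived in policy/v1beta1), and `evictPod` is a hypothetical helper, not controller code:

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPod asks the API server to evict a Pod through the eviction
// subresource. Unlike a plain delete, the request is rejected when it would
// violate a PodDisruptionBudget, which is what makes evictions the safe way
// to take down more than one Pod at a time.
func evictPod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return client.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name},
	})
}
```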

## Requirements
This design is based on the following requirements.
- Users must be able to update the containers of a StatefulSet's Pods.
- Updates to container commands, images, resources and configuration must be
Contributor

Does this mean that other parts of the PodTemplateSpec will be gated?

Member Author

I think that for a first pass we should support just the above and then (potentially) relax the allowable mutable portions of .Template. Right now you can update Replicas and Containers. I think that is good enough to support configuration, resources, and image updates. Granted, you can cause an update by adding some arbitrary annotation to a Container.

Contributor

Hrm - I'm having trouble coming up with a reason to gate template mutation, since we don't do it on any other "updatable" controller.

Member Author

The current behavior will fail mutation on validation if anything other than Replicas and Containers is updated. We can change it to open it up in a backward compatible way.
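For readers unfamiliar with that validation trick, a hedged sketch of the pattern (copy the new spec, restore the mutable fields from the old one, require deep equality); the function name and error text are illustrative, not the upstream code:

```go
package validation

import (
	"fmt"

	apps "k8s.io/api/apps/v1beta1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// validateStatefulSetSpecUpdate fails unless the only changed fields are
// Replicas and the Pod template's Containers.
func validateStatefulSetSpecUpdate(newSet, oldSet *apps.StatefulSet) error {
	restored := newSet.Spec.DeepCopy()
	// Copy the mutable fields back from the old spec...
	restored.Replicas = oldSet.Spec.Replicas
	restored.Template.Spec.Containers = oldSet.Spec.Template.Spec.Containers
	// ...so that any remaining difference is a forbidden mutation.
	if !apiequality.Semantic.DeepEqual(*restored, oldSet.Spec) {
		return fmt.Errorf("updates to statefulset spec for fields other than 'replicas' and 'containers' are forbidden")
	}
	return nil
}
```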

Member Author

The current behavior will fail mutation on validation if anything other than Replicas and Containers is updated. We can change it to open it up in a backward compatible way

What I mean is that allowing for more mutable fields cannot break someone's existing StatefulSets. However, once we make something mutable, we can't take it back without breaking backward compatibility.

Contributor

Why would we want to take it back? Eventually a StatefulSet should be updatable as any other controller.


If we allowed Volumes to be updated, what would that mean semantically? Is that an identity change?

Member Author

The changes we are considering now would be mutation to the VolumeClaimsTemplate with respect to resizing the storage. It would not be an identity change because we would not mutate the contents of the storage.

Contributor

Can we clarify here that the existing validation that is applied in the StatefulSet pod template will be lifted?

// CurrentTemplateRevision, if not nil, is the revision of the PodTemplate
// that was used to create Pods with ordinals in the sequence
// [0,CurrentRevisionReplicas).
TemplateRevision *int64 `json:"templateRevision,omitempty"`
Contributor

CurrentTemplateRevision

Contributor

Is this a usability issue with Deployments that we need to fix?

Contributor

Is this a usability issue with Deployments that we need to fix?

The fact that we don't output the revisions in the status, both the one we roll from and the one we roll to.

Contributor

Hrm - why wouldn't this be generation?

Member Author

@smarterclayton DaemonSet uses TemplateGeneration. I will update the proposal so that StatefulSet uses the same.
@Kargakis I would think that, for all controllers, we would want to express the full current and target state as part of the status, but that it's more important for StatefulSet than for Deployment.

// RevisionHistoryDepth is the maximum number of PodTemplates that will
// be maintained in the StatefulSet's revision history. It must be at
// least two.
RevisionHisotryDepth int32 `json:historyRevisionDepth,omitempty`
Contributor

RevisionHistoryLimit to be consistent with Deployments?

Member Author

Makes sense

// Template.
TemplateRevision *int64 `json:"templateRevision"`

// RevisionParition paritions the Pods in the StatefulSet by ordinal such
Contributor

typos

// all Pods with a greater or equal ordinal will be created from the
// PodTemplate that represents the target revision of the StatefulSet's
// revision history.
RevisionPartition *int32 `json:"revisionPartition,omitempty"`
Contributor

Can you explain why is this field needed? Is it for running canaries? We want something similar for the rest of the workload controllers but it's going to differ from this one.

Member Author

It's for canaries and staged roll outs. I think StatefulSet should be different from Deployments and ReplicaSets in this regard. Deployments and ReplicaSets are probabilistic in nature, while StatefulSets are ordered and deterministic. I think that specifying a percentage makes more sense for Deployments and ReplicaSets, while partitioning a StatefulSet based on ordinal makes the most sense. For a Deployment or ReplicaSet I want to say "give me x% at a target configuration". For a StatefulSet I want to say "update these members" (which in this case live in a contiguous partition).
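A minimal sketch of those partition semantics, matching the field described in this hunk; `templateForOrdinal` is a hypothetical helper:

```go
package main

// templateForOrdinal returns the revision the Pod at the given ordinal
// should be created from: ordinals below the partition stay on the current
// revision, ordinals at or above it move to the target revision.
func templateForOrdinal(ordinal, partition int32, current, target string) string {
	if ordinal < partition {
		return current
	}
	return target
}
```

A higher-level workflow then gets a linear roll out by decrementing the partition one ordinal at a time, and a geometric or exponential one by repeatedly halving it.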

Contributor

This feels like it is part of a strategy struct (like deployments). Can we imagine a different approach to update in the future? Custom rollout strategies? If so a strategy struct is a good hedge.

Member Author

That has precedent in DaemonSet, and it is a small level of effort for the benefit of the hedge.
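For reference, a hedged sketch of what a DaemonSet-style strategy struct could look like here; the field names are illustrative rather than the final API:

```go
package v1beta1

// StatefulSetUpdateStrategyType names an update strategy.
type StatefulSetUpdateStrategyType string

// StatefulSetUpdateStrategy follows the DaemonSet pattern: Type selects the
// behavior, and type-specific parameters hang off an optional sub-struct,
// leaving room for future custom rollout strategies.
type StatefulSetUpdateStrategy struct {
	Type          StatefulSetUpdateStrategyType     `json:"type,omitempty"`
	RollingUpdate *RollingUpdateStatefulSetStrategy `json:"rollingUpdate,omitempty"`
}

// RollingUpdateStatefulSetStrategy holds rolling-update parameters, such as
// the ordinal partition discussed above.
type RollingUpdateStatefulSetStrategy struct {
	Partition *int32 `json:"partition,omitempty"`
}
```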


```go
// StatefulSetPodTemplateLabel is the label applied to a PodTemplate to allow
// the StatefulSet controller to select the PodTemplates in its revision
```
Contributor

Why not rely on owner references?

Member Author

To select them. OwnerReference is needed as well to avoid revision history overlap.

Member

What if you apply the same labels to the PodTemplates that you do to your Pods, then you select them with your normal controller selector, and filter by ControllerRef to handle overlapping selectors?

Contributor

I was assuming @enisoc's suggestion. +1 on that

Member Author

SGTM
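A sketch of the scheme agreed on here, assuming PodTemplates carry the same labels as the Pods: list with the StatefulSet's own selector, then keep only templates whose ControllerRef points back at this StatefulSet:

```go
package main

import (
	apps "k8s.io/api/apps/v1beta1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownedTemplates filters PodTemplates listed with the StatefulSet's Pod
// selector down to those it actually controls, so overlapping selectors
// cannot mix revision histories.
func ownedTemplates(set *apps.StatefulSet, listed []*v1.PodTemplate) []*v1.PodTemplate {
	var owned []*v1.PodTemplate
	for _, tpl := range listed {
		if ref := metav1.GetControllerOf(tpl); ref != nil && ref.UID == set.UID {
			owned = append(owned, tpl)
		}
	}
	return owned
}
```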


### Target State
The declared target state of a StatefulSet requires that all Pods in the
StatefulSet conform to exactly one or two PodTemplates in the StatefulSet's
Contributor

Shouldn't we rely on the StatefulSet PodTemplateSpec for the desired state? Are PodTemplates going to be read-only?

Member Author

In essence we still will. The target revision of the StatefulSet's history will be semantically, deeply equivalent to its Spec.Template, with the exception that it will carry the revision-history annotation applied to the PodTemplate. That is the reason that the PodTemplate, and not the Spec.Template, will be used for creation.


When you say "conform to exactly one or two PodTemplates", do you mean two because some Pods can have the old template and some the new template?

Member Author

No. A Pod conforms to exactly one of two templates.


@kow3ns are you talking about the canary case?

1. The StatefulSet's `.Status.TemplateRevision` is equal to its
`.Status.TargetRevision`.
1. All Pods in the StatefulSet have been generated from the PodTemplate
labled with a `StatefulSetTemplateRevisionLabel` equal to its
Contributor

typo

When the StatefulSet controller creates a PodTemplate for a StatefulSet, it will
do the following.

1. The controller will set the PodTemplate's `.PodTemplateSpec` field to the
Contributor

This doesn't feel right. How is the controller creating the PodTemplate in the first place?

Member Author

The controller generates PodTemplates based on its current Template, when modified, in order to keep track of the modifications made to the Template. The idea is that for every revision to the StatefulSet's .Spec.Template we will generate a new PodTemplate that represents that particular revision and add it to the controller's revision history.
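An illustrative sketch of that generation step; the naming scheme and the apps/v1beta1 group/version are placeholders taken from this proposal's era, not a shipped implementation:

```go
package main

import (
	"fmt"

	apps "k8s.io/api/apps/v1beta1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newRevisionSnapshot copies the StatefulSet's current Spec.Template into a
// PodTemplate object recording one revision of its history.
func newRevisionSnapshot(set *apps.StatefulSet, revision int64) *v1.PodTemplate {
	isController := true
	return &v1.PodTemplate{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: set.Namespace,
			// Deterministic name: StatefulSet name plus revision number.
			Name: fmt.Sprintf("%s-%d", set.Name, revision),
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "apps/v1beta1",
				Kind:       "StatefulSet",
				Name:       set.Name,
				UID:        set.UID,
				Controller: &isController,
			}},
		},
		Template: *set.Spec.Template.DeepCopy(),
	}
}
```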

Contributor

The way this point is worded, it seems to me that the controller is mutating the StatefulSet.

Contributor

Nevermind, you are using this wording elsewhere. I can get used to it:)

Member Author

I will make this more clear

Member Author

I think that there should be at least a link between the Template Update and PodTemplate Creation sections to make it clear that a Template Update is what triggers PodTemplate Creation.

StatefulSet's `.Spec.Template` field.
1. The controller will create a ControllerRef object in the PodTemplate's
`.OwnerReferences` list to mediate selector overlapping.
1. The controller will label the PodTemplate with a
Contributor

Why the owner ref from the previous step is not enough and an additional label is needed?

Member Author

Are you suggesting that we just list all PodTemplates and then filter by OwnerRef?

Contributor

Yes, basically that's what @enisoc did for all the rest of the controllers.

Member Author (@kow3ns, Apr 3, 2017)

I have no objection to doing it that way. It seems more efficient to select based on a label to get only the subset that I care about, and then filter those based on ControllerRef. Is there a benefit to listing all PodTemplates? The only reason I can think of to select them all is to continue to consider PodTemplates that have had the label intentionally removed by the user, but if the user intentionally removes the label isn't she indicating by explicit action that the PodTemplate should be removed?

Contributor

In case a PodTemplate A-1 is not selectable by a StatefulSet A anymore but A-1 has an owner reference to A, the StatefulSet controller needs to remove that owner reference.
https://github.com/kubernetes/community/blob/31d62c70214d66ca2c21716c02b83d9087fdf3ff/contributors/design-proposals/controller-ref.md#orphaning

Member Author

The point above still applies. The only way this should happen is if the user explicitly mutates the label applied to the PodTemplate, or, if StatefulSet.Spec.Selector becomes mutable, by mutating the selector. Are we worried about orphans occurring due to unintentional user error? If so, let's not make the selector mutable.
Also, I was not planning on having controllers adopt PodTemplates in the same way they adopt orphan Pods. A controller taking ownership of an orphaned Pod makes sense; a controller taking ownership of another controller's revision history, imo, does not. I don't think we need to solve for the same constraints as we did for Pods.

object's revision history. In practice, these PodTemplates will be filtered out
prior to history maintenance.
1. If the PodTemplate's ControllerRef matches the StatefulSet, the
StatefulSet controller will delete the PodTemplate.
Contributor

Shouldn't this be handled by the GC?

Member Author (@kow3ns, Apr 3, 2017)

I wasn't sure if GC would work for PodTemplates, but it appears that it will, provided the OwnerReference is set. I will update to indicate that we will use GC. That is the direction I wanted to go eventually.


On second thought, PodTemplates is not really a resource on the API server, right? It's just an array/field inside the StatefulSet Spec. Not sure if GC will handle this @caesarxuchao?

Member

I have the same question @krmayankk asked, isn't PodTemplate a field?

OwnerReferences is a field in ObjectMeta, which is only included in each top-level API object.

Contributor

PodTemplates are top-level objects, but they are not used anywhere in Kubernetes today. PodTemplateSpec is the field that every controller has.

Member

Thanks. It's good to know. Then GC can handle PodTemplate.


// CurrentRevisionReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec indicated by CurrentTemplateRevision.
CurrentReplicas int32 `json:"currentReplicas,omitempty"`
Contributor

This may be confusing because once a StatefulSet is updated and a new PodTemplate is created, CurrentReplicas is really NeedToBeReplacedReplicas. I can't think of a better name now.

Member Author

I couldn't think of a better name either, but I'm very open to suggestion here. Also, do you have thoughts on the naming of PartitionRevision? I'd be happy to have a better name for this also.


// TargetRevisionReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec indicated by TargetTemplateRevision.
TargeReplicas int32 `json:"taretReplicas,omitempty"`
Contributor

UpdatedReplicas to stay consistent with what we have in Deployments?

Member Author

SGTM

v1.PodTemplateSpec `json:"template"`

// TemplateRevision is a monotonically increasing, 64 bit, integer used to
// indicate the version of the of the PodTemplateSpec. If nil, the
Contributor

typo

Contributor

Which component is going to increase this field?

Member Author

API Server

Contributor

Why isn't this a generation?

Member Author

Will update for consistency with DaemonSet.

process application is, by definition, not highly available.
- Disruption in Kubernetes is controlled by PodDisruptionBudgets. As
StatefulSet updates progress one Pod at a time, and only occur when all
other Pods have a Status of Running and a Ready Condition, they cannot
Contributor

In DaemonSet we check whether the Pod is Available (Ready + MinReadySeconds).

Member Author

Is that done primarily to tolerate infant mortality?

Contributor

You may want to read #478

Basically I am fine with not handling MinReadySeconds for StatefulSets since we are moving it in the PodSpec. Eventually, we should add the additional status field (StableReplicas).

Member Author

Yeah, stable Pods will be supported, but I didn't think adding that to StatefulSetStatus was relevant for this proposal. You already have an issue open for consistent Status objects for controllers, and I thought that work could be performed under that issue, or a StatefulSet-specific sub-issue.
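For context, a self-contained sketch of the DaemonSet-style availability check mentioned above (Ready, and Ready for at least MinReadySeconds); it approximates the upstream pod-availability helpers rather than quoting them:

```go
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// isPodAvailable reports whether the Pod has been Ready for at least
// minReadySeconds as of now; this is the "stable" notion discussed above.
func isPodAvailable(pod *v1.Pod, minReadySeconds int32, now time.Time) bool {
	minReady := time.Duration(minReadySeconds) * time.Second
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodReady && cond.Status == v1.ConditionTrue {
			return minReadySeconds == 0 ||
				cond.LastTransitionTime.Time.Add(minReady).Before(now)
		}
	}
	return false
}
```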

[maintenance of its revision history](#history-maintenance).

### StatefulSet Revision History
The StatefulSet controller will use labeled, versioned PodTemplates to keep a
Contributor (@lukaszo, Apr 3, 2017)

Is this a final decision? In DaemonSets we haven't decided yet how to store the history. cc @Kargakis @janetkuo @smarterclayton

Contributor

Yeah, we haven't yet. I think we need to get a quick list of options written up and discussed:

  1. PodTemplate (collision between use for different controller types)
  2. Specific type for each controller: DaemonSetHistory, StatefulSetHistory, etc (no reuse between controllers)
  3. Storing a limited subset of specs on the object (size limits)
  4. ???

Member Author

I'm obviously partial to 1. My primary reason is that controllers should not be special and require magic. If we get TPRs right then PodTemplates can be leveraged by arbitrary operators/controllers, and they, therefore, provide a primitive upon which others can build.

Contributor

So for pod template here's what we need to handle:

  1. Collisions across controllers by naming
  2. Idempotency of product versions (the same pod template in a controller object should map to exactly one PodTemplate across multiple upgrades)
  3. Human readability - a human should be able to find a PodTemplate
  4. Collisions across controllers by pod template spec - if two controllers deploy the exact same pod template (daemon set vs stateful set) the PodTemplate has two owners, not just one, so garbage collection won't work
  5. Future versions of controllers may involve renames of the controller resource name, so we would still have to be able to locate old PodTemplate revisions even if the resource name changes
  6. Immutability of PodTemplate - a template used in this fashion needs to prevent updates, or at least strongly discourage them (if they are mutable, we can't compare the pods against the pod template version).

Contributor

I think the real issue is 1 - the rest of the problems apply to ReplicaSets owned by Deployments too.

Contributor

For clarity, I am talking about the UID of the parent.

Member Author

@smarterclayton @janetkuo @Kargakis @lukaszo

Before we go forward with PodTemplate creation we should also discuss adding a History sub resource to the controllers (I believe that is something like option 3 above).

Having a per-Kind History sub resource mitigates history overlap across Kinds with the same .Name in the same namespace. My argument about PodTemplate being a primitive that is usable without controller-specific magic is valid (imo), but perhaps it's also a straw man.

Right now the category of controllers could loosely be defined like this.
A generative controller takes, as an argument, an Object that represents the declared target state. The Object must have the following properties

  1. A .Spec.Template that is a template for Objects of the generated Kind.
  2. A .Spec.Replicas that expresses the desired cardinality of the generated Kind.
  3. A Generation that indicates the version of the declared target state.
  4. A .Status.Replicas that indicates the number of created replicas.
  5. If the Objects of the created Kind are not immediately available upon creation, a .Status.ReadyReplicas to indicate the number of available created resources.
  6. A .Status.ObservedGeneration that indicates the version of the target state specification that the controller has observed.

I confirmed with @foxish that the plan to support third-party controllers and operators is to make these properties well supported. That is, we want them to create custom Kinds that contain these sub resources. Given that, adding a History sub resource might be the best fit.

We could say that controllers that implement revision history must have the following; a rough sketch of the shape follows the list.

  1. A History.Revision where the type of this object is an []PodTemplate or []PodTemplateSpec.
  2. A History.RevisionLimit that indicates the number of stored revisions.
  3. A Spec.TemplateGeneration that is incremented by the API server upon mutation of the .Spec.Template.
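As a strawman only, those three requirements shaped as Go types; nothing below is a shipped API:

```go
package main

import v1 "k8s.io/api/core/v1"

// History is a hypothetical per-Kind subresource carrying revision history.
type History struct {
	// Revision is the ordered list of stored Pod template revisions.
	Revision []v1.PodTemplateSpec `json:"revision,omitempty"`
	// RevisionLimit indicates the number of revisions to retain.
	RevisionLimit int32 `json:"revisionLimit,omitempty"`
}

// The owning object's Spec would additionally carry a TemplateGeneration
// that the API server increments on each mutation of .Spec.Template.
```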

Contributor

There has been some discussion before and I think there is general agreement that we want a history subresource. I am not sure whether we need to have this discussion in this proposal or in a separate one since it 1. affects all of the workload controllers and 2. is not required for having StatefulSet upgrades.

Member

I don't have a problem with putting uids in names of generated resources provided they are ONLY used for uniquification and NEVER used for discovery/searching.

Member

Prevention of mutation and deletion is also desired in other contexts (e.g., addons): kubernetes/kubernetes#10179

reconfiguration will be performed destructively.
- Stateful applications are likely to evolve wire protocols and storage formats
between versions. In most cases, when updating the application's Pod's
containers, it will not be safe to roll back or forward to an arbitrary
Contributor

This applies to deployments as well - maybe indicate it's more likely, but that we're doing exactly what deployments do here.

Member Author

SGTM


Does this mean that, unlike Deployments, where you can roll back to any revision you want, a StatefulSet can only be rolled back to the previous revision?

Member Author

It means you can only have one target version and one current version.

update to some portion of the fleet maintained by the StatefulSet prior to
updating the entire fleet. It is useful to support linear, geometric, and
exponential roll out of an update. Users can modify the
`.Spec.RevisionPartition` to allow the roll out to progress.
Contributor

This is probably generally useful as part of programming by a higher level workflow, although I question whether most users wouldn't just do this with two stateful sets (the fact that we can't do this with stateful sets today is another question).

Member Author

My thought is that if we have termination reason and local-storage-backed PVs (both future features) we probably need to support the ability to do staged roll outs in this way. Mainly this would be to support operators that manage this type of workflow using StatefulSet as their primitive.

updated node.
It should be noted that this is already an issue for Node cordon and Pod eviction
(due to drain or taints), and applications can use the same mitigation as they
would for these events for StatefulSet update.
Contributor

This is an often requested enhancement to the platform, and it really belongs to "termination reason".

Member Author

I just want to call this out here. I'm going to offer a proposal for termination reason to, hopefully, be released concurrently with StatefulSet updates. There is a new contributor to sig-node who has offered to own this in the 1.7 time frame.

implementation, there is little value to implementing in place image updates,
and configuration and resource request/limit updates are not possible.
When [termination reason](https://github.com/kubernetes/kubernetes/issues/1462) is implemented we may modify
the behavior of StatefulSet update to only update, rather than delete and
Contributor

There are other reasons you may want new pods - this sounds like something that would be a strategy option (and applies to daemonsets as well).

Member Author

Agreed. For instance, if you resize a container to increase its memory and you have a JVM-based application, you still (probably) want to restart it with new -Xmx and -Xms values.

Make reference to the fact that Deployment and StatefulSet both update from a specific revision to a specific revision.


1. The controller will label the PodTemplate with a
`StatefulSetPodTemplateLabel` set to the StatefulSet's `.Name` to allow for
selection of the PodTemplates that comprise the StatefulSet's revision history.
1. The controller will label the PodTemplate with a
Member

When you say "label the PodTemplate", do you mean adding it to PodTemplate.Metadata, or PodTemplate.Spec?

Member Author

  1. I like the approach of using a single selector.
  2. The selectable label is applied to the object and the revision is applied to the spec. The former is not included in the Pod and the latter is (see the sketch below).
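A small sketch of that placement; the label keys and values below are placeholders for the proposal's `StatefulSetPodTemplateLabel` and `StatefulSetTemplateRevisionLabel`:

```go
package main

import v1 "k8s.io/api/core/v1"

// Label keys from this proposal; the values are placeholders.
const (
	StatefulSetPodTemplateLabel      = "statefulset.kubernetes.io/pod-template"
	StatefulSetTemplateRevisionLabel = "statefulset.kubernetes.io/template-revision"
)

// labelRevision puts the selection label on the PodTemplate object itself
// and the revision label on the embedded spec; only the spec-level label is
// propagated to Pods generated from the template.
func labelRevision(tpl *v1.PodTemplate, setName, revision string) {
	if tpl.ObjectMeta.Labels == nil {
		tpl.ObjectMeta.Labels = map[string]string{}
	}
	tpl.ObjectMeta.Labels[StatefulSetPodTemplateLabel] = setName // selection only
	if tpl.Template.ObjectMeta.Labels == nil {
		tpl.Template.ObjectMeta.Labels = map[string]string{}
	}
	tpl.Template.ObjectMeta.Labels[StatefulSetTemplateRevisionLabel] = revision // included in Pods
}
```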

1. The controller will label the PodTemplate with a
`StatefulSetTemplateRevisionLabel` set to the StatefulSet's
`.Spec.TemplateRevision`.
1. The controller will set the Name of the PodTemplate to a concatenation of the
Member

Is there any reason not to include a generated string to prevent name collisions with PodTemplates created by other controllers in the future? For SS pods, the names had to be deterministic since they act as locks. Do you need that property for PodTemplates too?

Member Author (@kow3ns, Apr 4, 2017)

My reasoning is that if someone wanted to GET the PodTemplate for a StatefulSet at a particular revision, to view it or to use it as a base when creating a new spec, its name is deterministic and therefore the object is retrievable. You could also select all appropriately labeled templates, though. We need uniqueness for discrimination, not for locking, but determinism is nice for inspection imo.

prior to history maintenance.
1. If the PodTemplate's ControllerRef matches the StatefulSet, the
StatefulSet controller will orphan the PodTemplate by removing its ControllerRef,
and it will allow the PodTemplate to be deleted via garbage collection.
Member

The GC does not delete orphans. It only deletes things that have more than 0 OwnerReferences, all of which refer to objects that are deleted or deleting (have non-nil DeletionTimestamp).

If you want to delete something you own, you can just delete it directly without using the GC.

Contributor

Right, if you want to clean up older PodTemplates not conforming to RevisionHistoryLimit, you'll need to do it in the StatefulSet controller. Did I say something different on this topic yesterday?

Member Author

I changed it to indicate orphaning to trigger GC but the generic GC implementation just requires a delete.
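A minimal pruning sketch consistent with that: since GC only collects objects whose owners are gone, the controller itself deletes the oldest non-live revisions; `deleteTemplate` stands in for the client call:

```go
package main

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// pruneHistory deletes the oldest non-live revisions until at most
// historyLimit remain.
func pruneHistory(nonLive []*v1.PodTemplate, historyLimit int,
	deleteTemplate func(*v1.PodTemplate) error) error {
	// Oldest first, by creation timestamp.
	sort.Slice(nonLive, func(i, j int) bool {
		return nonLive[i].CreationTimestamp.Before(&nonLive[j].CreationTimestamp)
	})
	for i := 0; i+historyLimit < len(nonLive); i++ {
		if err := deleteTemplate(nonLive[i]); err != nil {
			return err
		}
	}
	return nil
}
```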

1. The controller will
[reconstruct the revision history](#history-reconstruction) of the
StatefulSet.
1. If the revision history of the StatefulSet contains a PodTemplate
Member

Why do you do this step instead of always creating a new PodTemplate? It seems weird that .Spec.TemplateRevision is not going to equal .Status.TargetTemplateRevision if this step triggers.

Member Author

We have to test for deep equality to do a rollback to a previous revision and avoid unintentional updates of Pods. For instance, if we generate a new PodTemplate and 90% of the fleet was generated from a PodTemplate that is semantically, deeply equivalent, we don't want to touch that portion of the fleet.
We could consider a more complicated approach where we update the label on the existing pod to conform to the equivalent revision. This might be difficult for users to reason about.
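A sketch of that equality test; `StatefulSetTemplateRevisionLabel` is the proposal's label key (placeholder value), and stripping it before the comparison is what lets a rollback land on an existing revision instead of minting a new one:

```go
package main

import (
	apps "k8s.io/api/apps/v1beta1"
	v1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

const StatefulSetTemplateRevisionLabel = "statefulset.kubernetes.io/template-revision" // placeholder key

// findEquivalentRevision returns an existing revision whose template is
// semantically equal to the StatefulSet's current Spec.Template (ignoring
// the revision label), or nil if a new PodTemplate must be created.
func findEquivalentRevision(set *apps.StatefulSet, history []*v1.PodTemplate) *v1.PodTemplate {
	for _, tpl := range history {
		candidate := tpl.Template.DeepCopy()
		delete(candidate.Labels, StatefulSetTemplateRevisionLabel)
		if apiequality.Semantic.DeepEqual(*candidate, set.Spec.Template) {
			return tpl
		}
	}
	return nil
}
```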

Users can use the following command to create the StatefulSet.

```shell
kubectl create -f web.yaml
```
Member

If you plan to use apply later, you should use apply for the initial creation as well: https://kubernetes.io/docs/user-guide/kubectl/v1.6/#apply

deployment failure, but rolling forward is sometimes the only practical solution
for stateful applications (e.g. a user has a minor configuration error but has
already modified the storage format for the application). Users can use
sequential `kubectl apply`'s to update the `.Status.TargetRevision` of a
Member

Users shouldn't modify anything in Status, right?

(due to drain or taints), and applications can use the same mitigation as they
would for these events for StatefulSet update.

### VolumeTemplatesSpec Updates
Member

Wouldn't this also be valuable for the simpler case where your data is growing and you want to vertically scale each stateful pod by increasing its volume size? Or is there some reason you only mention temporary changes as being relevant here?

Member Author

Yes, it is valuable for that use case as well. My experience has been that with traditional RDBMSs you start seeing exponential disk-utilization increases at around ~75% disk capacity for rotational media, but the growth towards that point on the utilization curve tends to be linear with respect to the amount of data stored, and this is driven by the number of new rows. Usually, provided you have some monitoring around it, you can adapt proactively prior to an outage.
I call out log-structured merge trees and append-only B-trees in particular because people are often surprised when the write path blocks completely due to a compaction failure, and the size of a compacted SSTable or append-only B-tree is variable based on the mutation rate, not the number of newly inserted records, making it hard to predict.


## API Object

The following modifications will be made to the StatefulSetStatus API object.
Contributor

Can you move the status changes below the spec changes? Seems more intuitive.

Member Author

SGTM

// TemplateGeneration, if not nil, is the generation of the PodTemplate that was
// used to create Pods with ordinals in the sequence
// [0,CurrentReplicas).
TemplateGeneration *int64 `json:"templateGeneration,omitempty"`
Contributor

CurrentTemplateGeneration to match CurrentReplicas?

Member Author

SGTM

// TargetTemplateGeneration, if not nil, is the generation of the PodTemplate
// that was used to create Pods with ordinals in the sequence
// [Replicas - UpdatedReplicas, Replicas).
TargetTemplateGeneration *int64 `json:"targetTemplateGeneration,omitempty"`
Contributor

UpdatedTemplateGeneration to match UpdatedReplicas? I don't feel strong about any of those names but we should at least be consistent with what we are going to pick.

Member Author

SGTM

Phased roll outs can be used to roll out a configuration, image, or resource
update to some portion of the fleet maintained by the StatefulSet prior to
updating the entire fleet. It is useful to support linear, geometric, and
exponential roll out of an update. Users can modify the


What do you mean by useful to support linear, geometric, etc.? Do you mean there should be some kind of support in the StatefulSet controller? I am curious to understand this more. I was thinking you would get linear or geometric roll outs by updating the ordinal number.

Member Author

Correct. It means that the Partition strategy allows a higher-level controller to achieve those ends.


The command below will roll back the `web` StatefulSet to the previous revision in
its history. If a roll out is in progress, it will stop deploying the target
revision, and roll back to the current revision.


Can we state somewhere explicitly that we won't support RollbackTo arbitrary revisions for StatefulSets?

Member Author

It scopes this at the beginning of the document

is implemented we may modify the behavior of StatefulSet update to only update,
rather than delete and create, Pods when the only mutated value is the container
image, and if resizable resource request/limits is implemented, we may extend
the above to allow for updates to Pod resources.


@Kargakis @kow3ns so for just image updates, we could do an in-place update of the Pod and not delete it, for StatefulSets and Deployments. The containers will have to be killed, though. Is this something we are considering in the future?

Member Author

It states that above.

intermittent compaction as a form of garbage collection. Applications that use
log structured merge trees with size tiered compaction (e.g Cassandra) or append
only B(+/*) Trees (e.g Couchbase) can temporarily double their storage usage when
compacting their on disk storage. If there is insufficient space for compaction
Contributor

typo

// OnDeleteStatefulSetStrategyType triggers the legacy behavior. Version
// tracking and ordered rolling restarts are disabled. Pods are recreated
// from the StatefulSetSpec when they are manually deleted.
OnDeleteStatefulSetStrategyType = "OnDelete"
Contributor

If we have a reason to not support Recreate, call it out in this proposal.

Contributor

OnDelete might be somewhat unfamiliar as well. We could potentially call it Replacement or ReplacementUpdate (since we are only recreating when a pod needs to be replaced).

Member Author

RollingUpdate and OnDelete are for consistency with DaemonSet. These are already implemented. I don't see a strong reason to make them inconsistent here.

// all possible update strategies for the StatefulSet controller.
type StatefulSetUpdateStrategyType string

const (
Contributor

Strategies should define their behavior on scale up/down during an update in the godoc

a StatefulSet, as proposed [here](https://github.com/kubernetes/community/pull/541),
the tenant application has no way to determine if it is being terminated due to
a scale down operation or due to an update.
Consider a BASE distributed storage application like Cassandra, where 2 TiB of
Contributor

Newline here to start a new para

@smarterclayton (Contributor)

No significant concerns left, except some fiddling with naming. Since updates will be alpha in 1.7 we should clearly communicate that in Godoc during implementation and that we reserve the right to change the names.

@smarterclayton (Contributor)

This is LGTM. If anyone has remaining comments on this or hasn't reviewed the latest and wants to chime in, please do so by EOD May 18th. This looks like a workable step.

Type StatefulSetUpdateStrategyType
// Partition is used to communicate the ordinal at which to partition
// the StatefulSet when Type is PartitionStatefulSetStrategyType. This
// value must be set when Type is PartitionStatefulSetStrategyType,
Contributor

My understanding was that by unsetting this we could allow the rollout to proceed like a normal rolling rollout. Isn't that the case?

// UpdatedReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
// by UpdateRevision.
UpdatedReplicas int32 `json:"taretReplicas,omitempty"`
Contributor

json tag needs to be updated

entry point commands or parameters, or configuration files.
- As the administrator of the logging and monitoring infrastructure for my
organization, in order to add logging and monitoring side cars, I want to patch
containers to add images.


Nit: s/"I want to patch containers to add images."/"I want to patch Pods to add containers."/

UpdateStrategy StatefulSetUpdateStrategy `json:"updateStrategy,omitempty"`

// RevisionHistoryLimit is the maximum number of revisions that will
// be maintained in the StatefulSet's revision history. The revision history


From the controller history proposal, this would be the number of non-live revisions (so the entire history including live revisions could exceed this limit, if I understand correctly).

Member Author

Correct, the "revision history" as described by the RevisionHistoryLimit contains only non-live revisions.

CurrentRevision string `json:"currentRevision,omitempty"`

// UpdateRevision, if not empty, indicates the version of PodSpecTemplate,
// VolumeClaimsTemplate tuple used to generate Pods in the sequence


Nit: alignment is off on the comments for these fields.

// CurrentReplicas is the number of Pods created by the StatefulSet
// controller from the PodTemplateSpec, VolumeClaimsTemplate tuple indicated
// by CurrentRevision.
CurrentReplicas int32 `json:"currentReplicas,omitempty"`


The name here is kind of ambiguous -- one might interpret CurrentReplicas as the current number of replicas in the stateful set, instead of the number of replicas at the CurrentRevision. Maybe something like ReplicasAtCurrentRevision and ReplicasAtUpdateRevision would be clearer?

Member Author

We have had multiple discussions over the naming and have landed on this for consistency with DaemonSet. I don't see a reason to change the naming yet again.

1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
the Pod should be consistent with version indicated by `Status.CurrentRevision`.
1. If the Pod's ordinal is in the sequence
`[.Status.Replicas - .Status.UpdatedReplicas, .Status.Replicas)`the Pod


Nit: missing space between ` and "the"

`RollingUpdateStatefulSetStrategyType` then the version of the Pod should be
as follows.
1. If the Pod's ordinal is in the sequence `[0,.Status.CurrentReplicas)`,
the Pod should be consistent with version indicated by `Status.CurrentRevision`.


For a rolling update, isn't the target pod state for all pods in [0,.Spec.Replicas) the UpdateRevision?


It seems like "target pod state" is being overloaded here -- I interpret it as the goal state for the rolling update, but the description here seems to suggest that it is referring to the pod's point in time state.

Member Author

The target state for Pods needs to be consistent with respect to the StatefulSet invariants. Target Pod state refers to the desired state of a Pod during an individual iteration of the control loop. Target state refers to the desired state to which the controller wishes to evolve the system.

#### History Maintenance
In order to prevent the revision history of the StatefulSet from exceeding
memory or storage limits, the StatefulSet controller will periodically prune
its revision history so that no more that `.Spec.RevisionHisotryLimit` non-live


Nit: s/Hisotry/History/

`PartitionStatefulSetStrategyType`, the API Server should fail validation
if any of the following conditions are true.
1. `.Spec.UpdateStrategy.Partition` is nil.
1. `.Spec.UpdateStratgegy.Parition` is not nil, and

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/UpdateStratgegy.Parition/UpdateStrategy.Partition/

### Canaries
Users can create a canary using `kubectl apply`. The only difference between a
[rolling update](#rolling-out-an-update) and a canary is that the
`.Spec.UpdateStrategy.Type` is set to `ParitionedStatefulSetStrategyType` and


Nit: s/Paritioned/Partitioned/

1. If all Pods in the sequence `[0,.Spec.Replicas)` have been created, but if any
do not have a Ready Condition, the StatefulSet controller will wait for these
Pods to either become Ready, or to be completely deleted.
1. If all Pods in the sequence `[0,.Spec.Replicas)` have a Ready Condition, and


Any plans for support of minReadySeconds?


Or maxUnavailable to allow more than one replica to be taken down at once?

Contributor (@0xmichalis, May 19, 2017)

A more general option than maxUnavailable - covers scaling and node drains too - is added in kubernetes/kubernetes#44899


1. The controller will increment the `.Status.ObservedGeneration` to communicate
the `.Generation` of the StatefulSet object that was observed.
1. The controller will set the `.Status.Replicas` to the current number of
Contributor

Add a section somewhere and mention that we are breaking this field with this proposal - see kubernetes/kubernetes#42410

@krmayankk (May 22, 2017)

@Kargakis @kow3ns is there a good reason for breaking this? Can we just stick to the same meaning as the rest of the workload controllers? I think there is already ambiguity in the meaning of these fields and we should NOT add to that.

Contributor

The meaning will be the same as for the rest of the controllers; it hasn't been so far.

@kow3ns (Member Author) commented May 26, 2017

I'm going to merge at COB PDT unless there are any objections.

@smarterclayton (Contributor)

None from me.

@smarterclayton merged commit 41346e4 into kubernetes:master May 27, 2017
@0xmichalis (Contributor)

One last issue I see in this proposal is the way the completion of an update is signaled. Specifically, the last step:

The controller will set `.Status.UpdatedReplicas` to 0.

UpdatedReplicas means the replicas that have been updated to the latest revision, so setting it to zero means that there are no updated replicas. By using this to signal completion, we are basically having an API field that means different things in different contexts: during an update it is the number of updated replicas, while at stable state it merely signals completion. We don't do this anywhere else in our APIs and I feel it's an anti-pattern.

@0xmichalis (Contributor) commented Jun 2, 2017

I would expect the completion of an update to be a simple equation:

sts.spec.replicas == sts.status.replicas &&
sts.spec.replicas == sts.status.updatedReplicas &&
sts.spec.replicas == sts.status.readyReplicas

We can have Conditions in a follow-up to communicate completion in a more human-friendly way.
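That equation as a predicate; a hedged sketch against the eventual apps/v1 field names, which this thread predates:

```go
package main

import appsv1 "k8s.io/api/apps/v1"

// updateComplete implements the completion test above: every desired
// replica exists, is ready, and is at the update revision.
func updateComplete(set *appsv1.StatefulSet) bool {
	want := int32(1)
	if set.Spec.Replicas != nil {
		want = *set.Spec.Replicas
	}
	return set.Status.Replicas == want &&
		set.Status.UpdatedReplicas == want &&
		set.Status.ReadyReplicas == want
}
```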

@0xmichalis (Contributor)

Also, I see more reasons for having a single strategy that includes partitioning as an option. See 1. https://github.com/kubernetes/kubernetes/pull/46669/files#r119898826 and 2. how complex the validation ends up being.

@kubernetes/sig-apps-api-reviews
