
Refine the Deployment proposal and move away from hashing #384

Closed
wants to merge 2 commits into from

Conversation

@0xmichalis (Contributor) commented Feb 19, 2017

Alternative proposal to #477

@kubernetes/sig-apps-api-reviews @kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 19, 2017
existing Pods and create the new ReplicaSet that will adopt any existing Pods and
create additional Pods as necessary.

**TODO:** Non-cascading deletion as part of rollback sounds brittle. Replacing
0xmichalis (author):

This is the real pain point of ending up with duplicate ReplicaSets (RSs that have the same PodTemplateSpec).

0xmichalis (author):

There are tradeoffs that need to be made:

  • We either continue using the current mechanism, which works like a charm for rollbacks, but Deployments are broken across cluster upgrades.
  • Or we switch to TemplateGeneration, which means stable names across cluster upgrades at the expense of more cumbersome rollbacks.

The first option seems broken by design and I can't think of a way to fix it (maybe hash the YAML/JSON representation of a PodTemplate, which seems more stable than the Go struct, but is it stable enough?); the second option can be made to work (we should be able to make handoff/renaming of a controller more robust).

0xmichalis (author), Feb 28, 2017:

I updated the proposal to roll back TemplateGeneration, so the comments above no longer stand.

Scenario:
Deployment with name=foo, revision=5, templateGeneration=3 is rolled back to revision=4

  1. The Deployment controller lists all ReplicaSets owned by foo and finds foo-2, tagged with revision=4, templateGeneration=2.
  2. Deployment foo is updated to use the template from foo-2, templateGeneration is rolled back to 2, and the revision is now 6. foo-3 is listed as an old ReplicaSet.
  3. Deployment foo is updated by the user to a net-new template, and templateGeneration is incremented to 3.
  4. The Deployment controller sees that a ReplicaSet named foo-3 already exists and checks whether foo and foo-3 have the same PodTemplateSpec. If yes, then fine. If not, it lists all existing ReplicaSets and decides the new templateGeneration by taking the highest one and adding 1 to it (similar to how we currently estimate the new revision; a sketch follows below). The new revision is 7.

Scenario:
Deployment with name=foo, revision=5, templateGeneration=1 is rolled back to revision=4

  1. The Deployment controller lists all ReplicaSets owned by foo and finds foo-4fg56d, tagged with revision=4 and no templateGeneration.
  2. Deployment foo is updated to use the template from foo-4fg56d, templateGeneration is incremented normally to 2, and foo-2 is created.
  3. If foo-4fg56d has running pods, we will simply kill them while creating new pods for foo-2 (we could potentially adopt them, but I don't think it's necessary).

ReplicaSet names are still tentative (I prefer something like foo-blabla rather than foo-2, but we can't use the existing scheme because of possible naming collisions with old ReplicaSets).
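A minimal sketch of the "highest templateGeneration + 1" estimation from step 4 of the first scenario; the types and field names are illustrative stand-ins, not the actual controller code.

```go
// Sketch: estimate the next templateGeneration by listing the ReplicaSets owned
// by the Deployment and taking the highest value + 1.
package main

import "fmt"

type replicaSet struct {
	name               string
	templateGeneration int64 // 0 for ReplicaSets created before the field existed
}

func nextTemplateGeneration(owned []replicaSet) int64 {
	var highest int64
	for _, rs := range owned {
		if rs.templateGeneration > highest {
			highest = rs.templateGeneration
		}
	}
	return highest + 1
}

func main() {
	owned := []replicaSet{{"foo-2", 2}, {"foo-3", 3}, {"foo-4fg56d", 0}}
	fmt.Println(nextTemplateGeneration(owned)) // 4
}
```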

append the TemplateGeneration to the Deployment name or we can compute the hash of
*deployment.name+deployment.templategeneration* and extend its size in different ways to
ensure that a new RS name will never collide with an old one.
Since owner references have been added in Pods since 1.4, we should be able to stop adding
0xmichalis (author):

Thinking about this more, now I think not using the extra label is dangerous because the Pods owned by a ReplicaSet will be claimable by all existing ReplicaSets under a Deployment in the event of a non-cascading deletion of the owner ReplicaSet.

@0xmichalis (author):

Thinking about how TemplateGeneration is used in DaemonSets, I have updated the proposal with what I think is an OK solution for using TemplateGeneration for Deployments and migrating away from hashing PodTemplateSpec.

Idea:

  • Make the rollback endpoint unset TemplateGeneration iff the revision we are rolling back to has a pod-template-generation.
  • Make the Deployment controller set the TemplateGeneration on a Deployment based on pod-template-generation of the matching ReplicaSet. If there is no matching ReplicaSet, list all, and set TemplateGeneration to the highest found + 1.
  • The only problem with this flow is what happens when a user rolls back from foo-6 to foo-5 and their next update is different from what's in foo-6. Should we rewrite foo-6 or advance foo to 7?

@janetkuo (Member) commented Mar 7, 2017

@Kargakis thanks for the proposal. It looks great, and I have some thoughts on this:

If I understand this proposal correctly, we want to replace template hash with templateGeneration, mainly for the following reasons:

  • Hash collisions (which cause replica set names and label/selector collisions)
  • The hash value changes across Kubernetes versions

Upgrading Kubernetes

When we upgrade from a previous Kubernetes version, we need to do something about the old Deployments without templateGeneration:

  1. First, update deployment.spec.templateGeneration=N
  2. Add label to new RS template.labels: templateGeneration=N
  3. Add label to new RS’s pods’ labels: templateGeneration=N
  4. Add label to new RS’s labels and selectors: templateGeneration=N
  5. Remove label from new RS’s labels and selectors: template-hash=xxx
  6. Remove label from new RS’s template.labels: template-hash=xxx
  7. Remove label from new RS’s pods’ labels: template-hash=xxx
  8. Do the same for the other RSes of the deployment, except that their templateGeneration != N and must be unique

We need to remove template-hash labels so that the hash collision won’t affect us.
N can be calculated from the total number of RSes, and each RS has a unique number from 1~N. We can sort RSes by timestamp.
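A minimal sketch of the per-object label swap in steps 2-7, operating on a plain label map so the same helper applies to RS labels, selectors, template labels, and pod labels; the label keys follow the ones discussed in this thread and the helper itself is purely illustrative.

```go
// Sketch: drop the old pod-template-hash label and add the generation label,
// as in steps 2-7 above.
package main

import (
	"fmt"
	"strconv"
)

func swapTemplateLabels(labels map[string]string, n int64) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if k == "pod-template-hash" { // old hash-based label, removed in steps 5-7
			continue
		}
		out[k] = v
	}
	out["pod-template-generation"] = strconv.FormatInt(n, 10) // new label, steps 2-4
	return out
}

func main() {
	old := map[string]string{"app": "foo", "pod-template-hash": "3764245945"}
	fmt.Println(swapTemplateLabels(old, 3)) // map[app:foo pod-template-generation:3]
}
```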

Adoption

  • If there are only bare pods to be adopted, we just adopt them blindly (based on labels, since we can’t compare the template against a pod spec) by creating a new RS with label, template label, and selector templateGeneration=deployment.spec.templateGeneration. Then label the pods with the same label. (Updated: Deployments don't deal with bare pods today, so this is out of scope for this proposal.)
  • If there is only one bare RS and some of its pods, we adopt them by doing steps 1~4 of the upgrading-Kubernetes case with N = 1
  • If there is more than one RS, we also do steps 1~4 of the upgrading-Kubernetes case

Rollback

As described in the proposal, every time we roll back, we’ll increase templateGeneration by 1, and we’ll need to update the new RS label/selector/template label, as well as its pods’ labels.

kubectl version skew

Since we’ll still use revisions for rollback, kubectl should continue to work.

Naming of RSes and pods

Having 2 naming styles is a bit ugly, but I like its simplicity.
If it's required to rename RSes that were created before templateGeneration was introduced, we can delete them non-cascadingly and then recreate them with a different name. However, we can’t rename their pods since that’ll cause unnecessary restarts.

Other things to consider

  • We should not allow users to update templateGeneration
  • Make sure that templateGeneration won’t collide (no two RSes from the same Deployment will have the same templateGeneration label value at the same time)
  • Will templateGeneration and revision cause user confusion?
  • This change will complicate deployment...

@0xmichalis (author):

@janetkuo thanks for the comment

If there are only bare pods to be adopted, we just adopt them blindly (based on labels, since we can’t compare template against pod spec) by creating a new RS with label, template label, and selector templateGeneration=deployment.spec.templateGeneration. Then label the pod with the same label.

Deployments so far don't deal with bare pods and I don't want to deal with this problem in this proposal (probably deserves a separate discussion).

The only problem with switching to TemplateGeneration is that we make adoption harder, because the adopted RS will need to match a TemplateGeneration number in order to separate its Pods from those of the other ReplicaSets owned by the Deployment. Owner references help in that regard but don't necessarily solve the issue[1], so I think we will always need the additional label. How about, instead of labeling with pod-template-generation, we switch to using the ReplicaSet name (a sketch follows below)? This could be something that each RS does on its own, so we stop doing the relabeling we have been doing in the Deployment controller and get separate, non-overlapping sets of labels for free. This probably overlaps with the selector generation proposal.

[1] If we relied just on owner references, orphaning would be a nightmare because all the existing history of a Deployment would be able to adopt any orphaned pods, and I am not sure this is desirable; hence we will always need the additional label.
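A minimal sketch of the "use the ReplicaSet name as the unique label" idea; the label key is a hypothetical placeholder, the point being that each RS stamps its own name into its selector and pod template so its pods cannot be claimed by sibling ReplicaSets of the same Deployment.

```go
// Sketch: derive a ReplicaSet's unique selector from its own name instead of a
// pod-template-generation label. The "replicaset-name" key is hypothetical.
package main

import "fmt"

func uniqueSelector(rsName string, userLabels map[string]string) map[string]string {
	sel := make(map[string]string, len(userLabels)+1)
	for k, v := range userLabels {
		sel[k] = v
	}
	sel["replicaset-name"] = rsName // hypothetical label key
	return sel
}

func main() {
	fmt.Println(uniqueSelector("foo-2", map[string]string{"app": "foo"}))
	// map[app:foo replicaset-name:foo-2]
}
```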

@0xmichalis 0xmichalis changed the title Refine the Deployment proposal and address the hashing issue Refine the Deployment proposal and move away from hashing Mar 23, 2017
@janetkuo (Member):

Thanks @Kargakis, I read both proposals (this one and #477). Using templateGeneration is more seamless from the user's perspective but complicates the system; on the other hand, switching the hashing algorithm seems to be much less complex, but users need to manually migrate to the new hash. I’m leaning a bit towards using the new hash, but I’m not sure if the new hash will bring us new problems. Do you have a preference between the two?

I have another idea, and would love to hear your thoughts. How about avoiding the hash collision by just appending some suffix upon collision? Do you think this would work?

Situation:

  1. The known problem is hash collisions. Hash value is used as ReplicaSet selector and name, so it has to be unique per template.
  2. We can’t avoid comparing templates of ReplicaSets.

Solution:

  1. Make hash value “unique” by appending a suffix on collision. The suffix can be:
    • The hash of timestamp
    • An incremental number
    • A random string
  2. That “unique” hash is used as ReplicaSet selector and name.
  3. We only need this when the controller tries to create a new ReplicaSet.

If this works, we don't need migration, and we don't need to add more complexity to the controller.

@0xmichalis (author):

We can’t avoid comparing templates of ReplicaSets.

This is the only source of truth for knowing the new RS for a deployment so I don't think we want to avoid it:)

I think I prefer using the new hashing algorithm because adoption is simpler. With templateGeneration we can still end up running an RS with a name other than the expected one because the RS may be adopted. The fact that the hash can change between releases wasn't a problem up to the point where we broke the wait of the controller.

Make hash value “unique” by appending a suffix on collision.

I am not sure this is any less complex to achieve (we now have to detect hash collisions programmatically) compared to staging a path for migrating to a stable hashing algorithm. The migration step is really temporary and we can force admins to run it between releases.

@janetkuo (Member):

We can’t avoid comparing templates of ReplicaSets.

This is the only source of truth for knowing the new RS for a deployment so I don't think we want to avoid it:)

Yeah, this is a statement of the current situation rather than a problem we want to solve.

// created with this selector, with a unique label `pod-template-hash`.
// If Selector is empty, it is defaulted to the labels present on the Pod template.
Selector map[string]string

// A counter that tracks the number of times an update happened in the PodTemplateSpec
// of the Deployment, similarly to how metadata.Generation tracks updates in the Spec
// of all first-class API objects. Will be able to be rolled back to a previous or a
Contributor:

One will be able to roll back ...

### TemplateGeneration

Hashing an API object such as PodTemplateSpec means that the resulting hash is subject to
change due to API changes in PodTemplateSpec between minor versions of Kubernetes when new
Contributor:

in PodTemplateSpec (including referenced objects)

Contributor:

Last time we had this problem it wasn't the PTS itself that changed, but rather a referenced object.

0xmichalis (author):

Referenced object? Every addition in the PTS changes the hash.

metadata.Generation, this field can be initialized by users in order to facilitate adoption
of existing ReplicaSets. This is similar to how it is already used by DaemonSets.

TemplateGeneration is not authoritative and only helps in constructing the name for the new
Contributor:

Shouldn't that be: "help in constructing new RS" ?

0xmichalis (author):

It'll be used only for the name.

ReplicaSet in case there is no other ReplicaSet that matches the Deployment. The Deployment
controller still decides what is the new ReplicaSet by comparing PodTemplateSpecs.
If no matching ReplicaSet is found, the controller will try to create a new ReplicaSet using
its current TemplateGeneration.
Contributor:

This paragraph is pretty confusing; it's talking about naming and resources at the same time. Here you're saying you're using TemplateGeneration as part of the RS name, but in the next paragraph you say it can't be used, because we cannot migrate old RSes to use it.

0xmichalis (author):

TemplateGeneration can't be used for migrating old ReplicaSets. This paragraph talks about what will happen to new ReplicaSets if we switch to TemplateGeneration. I will try to make it clearer.

the TemplateGeneration to the Deployment name or we can compute the hash of
*deployment.name+deployment.templategeneration*, extend its size deterministically in different
ways to ensure that a new RS name will never collide with an old one, and use that as the hash
that shall be appended in the Deployment name to produce the new ReplicaSet name.
Contributor:

Since we are introducing TemplateGeneration, which is meant to address problems when comparing whether the deployment currently running is the one in the Deployment spec, why not just use whatever name (e.g. the Unix time at the moment of creating that RS, similar to what we do with CronJobs creating Jobs)? Although I see value in using TG in the name, we are not relying on it, so why use it there in that case?

0xmichalis (author):

Actually the reason we want to switch to TemplateGeneration is stable names. There is no such problem as "comparing if the deployment currently running is the one in Deployment spec". That is already working properly.

For ex: consider the following case:
- User creates a deployment to rolling-update 10 pods with image:v1 to
Users can cancel a rollout by doing a non-cascading deletion of the Deployment
it before it is complete. Recreating the same Deployment will resume it.
Contributor:

it before it is ...

Users can cancel a rollout by doing a non-cascading deletion of the Deployment
it before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creates a Deployment to rolling-update 10 pods with image:v1 to
Contributor:

User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to image:v2.

notice that the second RC exists already which it can ramp up while ramping down
- User then deletes this Deployment while the old and new RSs are at 5 replicas each.
User will end up with 2 RSs with 5 replicas each.
User can then create the same Deployment again in which case, DeploymentController will
Contributor:

re-create

revision they want to rollback to or no revision which means that the Deployment
will be rolled back to its previous revision.

By hashing PodTemplateSpec, the Deployment controller can move the ReplicaSet a
Contributor:

That's not clear to me: who will hash the PodTemplateSpec, and when and how?

0xmichalis (author):

who: the Deployment controller
when: every time a Deployment is processed and there is no existing ReplicaSet matching the desired PodTemplateSpec
how: the controller gets the PodTemplateSpec out of the Deployment, hashes it, and uses the hash to create the new ReplicaSet.
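A minimal sketch of that hashing step. Note that it hashes a JSON encoding of the template (as floated earlier in this thread) rather than the Go struct the controller actually hashes, and the types and helper names are illustrative.

```go
// Sketch: hash the PodTemplateSpec and append the hash to the Deployment name
// to form the new ReplicaSet name. The podTemplateSpec type is a stand-in for
// the real API type.
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

type podTemplateSpec struct {
	Labels map[string]string `json:"labels"`
	Image  string            `json:"image"`
}

func newReplicaSetName(deploymentName string, template podTemplateSpec) (string, error) {
	data, err := json.Marshal(template)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%s-%d", deploymentName, h.Sum32()), nil
}

func main() {
	name, _ := newReplicaSetName("foo", podTemplateSpec{
		Labels: map[string]string{"app": "foo"},
		Image:  "image:v2",
	})
	fmt.Println(name) // e.g. foo-1817547447
}
```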

@soltysh (Contributor) commented Mar 28, 2017

I intentionally didn't comment on #477 since I think this is the better approach and it should be pursued.

@0xmichalis (author) commented Apr 5, 2017

@janetkuo I thought about this more and, after reviewing #503, I think we want to move forward with TemplateGeneration for Deployments too. I have thought it through and the pros outweigh the cons:

  1. No more hash collisions
  2. We don't need to stage a migration
  3. Consistent with the rest of the workload APIs

By following what @kow3ns is proposing in #503:

CurrentTemplateGeneration and UpdatedTemplateGeneration help us in:

  1. Stop using the revision annotation, which has been a PITA to maintain. I always wanted to move it into the API for validation. I think we want to map the annotation to UpdatedTemplateGeneration.
  2. Combined with CurrentTemplateGeneration, we can gate the rollout process more, which will simplify the implementation of ProgressDeadlineSeconds a lot (right now it's ugly).

We can and should use GenerationPartition from #503 as well, but in Deployments it can also be a percentage. We will need some math here for calculating rollout percentages (a sketch follows below), but it should be fun to write.
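A minimal sketch of the percentage math, assuming the partition expresses how many replicas stay on the old template and that we round up; both are assumptions, not decisions made in this thread.

```go
// Sketch: turn a percentage-based partition into a replica count.
package main

import (
	"fmt"
	"math"
)

func partitionedReplicas(desired int32, percent int) int32 {
	return int32(math.Ceil(float64(desired) * float64(percent) / 100.0))
}

func main() {
	fmt.Println(partitionedReplicas(10, 25)) // 3 replicas stay on the old template
}
```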

@0xmichalis (author):

We can and should use GenerationPartition from #503

For canary, custom, and A/B deployments

cc: @mfojtik @smarterclayton

@janetkuo (Member) commented Apr 7, 2017

stop using the revision annotation which has been a PITA to maintain.

@Kargakis Would you elaborate more on why it's a PITA?

Just thought about this issue more. I don't think the migration step is necessary, if we switch to the new / more stable hash function. After switching to the new hash, no unnecessary new RS will be created, since we'll look at current RS template before creating a new one and we don't compare hash value directly.

@0xmichalis (author):

@Kargakis Would you elaborate more on why it's a PITA?

See fixes like kubernetes/kubernetes#42535 or kubernetes/kubernetes#42869

Just thought about this issue more. I don't think the migration step is necessary, if we switch to the new / more stable hash function. After switching to the new hash, no unnecessary new RS will be created, since we'll look at current RS template before creating a new one and we don't compare hash value directly.

What if a hash produced with the new hashing algo collides with a hash produced with the old hashing algo?

@janetkuo (Member) commented Apr 7, 2017

What if a hash produced with the new hashing algo collides with a hash produced with the old hashing algo?

True, but can we guarantee that the new hashing algo will never collide? If not, we can uniquify the hash value when a collision happens. It should also be idempotent so that the Deployment controller won’t create more new RSes than required.

So we can do this by:

  1. First, switch to the new hash algo (without migration), and then
  2. Add a new field to Deployment: uniquifier (tentative name, perhaps under status)
    • This value will only increase
    • This value will be updated when there’s a hash collision
  3. When the Deployment template is updated (either rollback or rollout):
    • The Deployment controller first searches for current RSes and compares their templates with the Deployment’s (current behavior)
    • If a match is found, that RS is the new RS (current behavior)
    • If no match is found, create a new RS with hash{template+uniquifier} (use this hash value for the RS name, label selectors, and pod labels)
      • If the creation fails because the RS already exists, compare the existing RS template with the Deployment template to determine whether it’s a collision or the new RS the controller just created (but hasn’t observed yet)
      • If the new RS is already created, no op (mission completed)
      • If it’s a collision, increase the uniquifier by 1 and try again in the next loop
      • If the creation fails for any other reason, the controller will retry in the next loop, and the hash value will be idempotent because we use the same template and the same uniquifier

Please let me know what you think about this.
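A minimal sketch of the create-and-retry loop described in the steps above; the client interface, the error value, and the uniquifier plumbing are illustrative stand-ins, not the eventual implementation.

```go
// Sketch of the collision-handling loop: try to create the new RS under
// hash{template+uniquifier}, and on a name clash decide whether it is the RS we
// just created or a genuine collision that requires bumping the uniquifier.
package deployment

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("replicaset already exists")

type replicaSet struct {
	Name     string
	Template string // stand-in for PodTemplateSpec
}

type client interface {
	Create(rs replicaSet) error
	Get(name string) (replicaSet, error)
}

// hashWithUniquifier folds the uniquifier into the template hash so a bumped
// uniquifier yields a new, deterministic name for the same template.
func hashWithUniquifier(template string, uniquifier int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%d", template, uniquifier)))
	return fmt.Sprintf("%x", sum[:4])
}

// ensureNewReplicaSet tries to create the new RS and reports whether the
// uniquifier must be bumped because of a genuine hash collision.
func ensureNewReplicaSet(c client, deployName, template string, uniquifier int64) (bumpUniquifier bool, err error) {
	name := fmt.Sprintf("%s-%s", deployName, hashWithUniquifier(template, uniquifier))
	err = c.Create(replicaSet{Name: name, Template: template})
	if err == nil {
		return false, nil // created: mission completed
	}
	if !errors.Is(err, errAlreadyExists) {
		return false, err // other failure: retry next sync with the same hash
	}
	existing, getErr := c.Get(name)
	if getErr != nil {
		return false, getErr
	}
	if existing.Template == template {
		return false, nil // the RS we just created but had not observed yet: no op
	}
	return true, nil // same name, different template: a real collision
}
```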

@0xmichalis (author):

@janetkuo I like the idea, but the uniquifier now changes the generated names every time a hash collision is detected, which means that it also increases the probability of hash collisions. I think it is very hard to programmatically know when a hash collision has happened without making the underlying ReplicaSets read-only; otherwise users may change them in subtle ways that are impossible for us to detect. Read-only children is a problem of its own, though, and I am going to open a separate issue for discussion.

@0xmichalis (author):

Read-only children issue: kubernetes/kubernetes#44237

this label. This value will be unique for all RCs, since PodTemplateSpec should be unique.
- If the RCs and pods dont already have this label and selector:
- We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to
2. The new RS can have the same selector as the old RS and hence we need to add a unique label
0xmichalis (author):

@soltysh ptal

Contributor:

👍

@0xmichalis (author):

Updated the proposal with my latest thoughts on selector generation for ReplicaSets and rollbacks.


Users can cancel a rollout by doing a non-cascading deletion of the Deployment
before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to
Contributor:

s/creats/creates

@0xmichalis (author):

Closing in favor of #477

@0xmichalis 0xmichalis closed this Apr 20, 2017
@0xmichalis 0xmichalis deleted the update-deployment-proposal branch April 20, 2017 16:20