
Refine the Deployment proposal and move away from hashing #384

Closed
wants to merge 2 commits into from

Conversation

@0xmichalis (Contributor) commented Feb 19, 2017

Alternative proposal to #477

@kubernetes/sig-apps-api-reviews @kubernetes/api-reviewers

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 19, 2017
existing Pods and create the new ReplicaSet that will adopt any existing Pods and
create additional Pods as necessary.

**TODO:** Non-cascading deletion as part of rollback sounds brittle. Replacing
0xmichalis (author):

This is the real pain point of ending up with duplicate ReplicaSets (RSs that have the same PodTemplateSpec).

0xmichalis (author):

There are tradeoffs that need to be made:

  • We either continue using the current mechanism, which works like a charm for rollbacks, but Deployments are broken across cluster upgrades.
  • Or we switch to TemplateGeneration, which means stable names across cluster upgrades at the expense of more cumbersome rollbacks.

The first option seems broken by design and I can't think of a way to fix it (maybe hash the YAML/JSON representation of a PodTemplate, which seems more stable than the Go struct, but is it stable enough?); the second option can be made to work (we should be able to make handoff/renaming of a controller more robust).

0xmichalis (author), Feb 28, 2017:

I updated the proposal to roll back TemplateGeneration, so the comments above no longer stand.

Scenario:
Deployment with name=foo, revision=5, templateGeneration=3 is rolled back to revision=4

  1. The Deployment controller lists all ReplicaSets owned by foo and finds foo-2, tagged with revision=4, templateGeneration=2.
  2. Deployment foo is updated to use the template from foo-2, templateGeneration is rolled back to 2, and the revision is now 6. foo-3 is listed as an old ReplicaSet.
  3. Deployment foo is updated by the user to a net-new template, and templateGeneration is incremented to 3.
  4. The Deployment controller sees that a ReplicaSet named foo-3 already exists and checks whether foo and foo-3 have the same PodTemplateSpec. If yes, then fine. If not, it lists all existing ReplicaSets and decides the new templateGeneration by taking the highest one and adding 1 to it (similar to how we currently estimate the new revision; a sketch follows below). The new revision is 7.

Scenario:
Deployment with name=foo, revision=5, templateGeneration=1 is rolled back to revision=4

  1. The Deployment controller lists all ReplicaSets owned by foo and finds foo-4fg56d, tagged with revision=4 and no templateGeneration.
  2. Deployment foo is updated to use the template from foo-4fg56d, templateGeneration is incremented normally to 2, and foo-2 is created.
  3. If foo-4fg56d has running pods, we will simply kill them while creating new pods for foo-2 (we could potentially adopt them, but I don't think it's necessary).

ReplicaSet names are still tentative (I prefer something like foo-blabla rather than foo-2, but we can't use the existing scheme because of possible naming collisions with old ReplicaSets).
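A minimal sketch of the "highest templateGeneration + 1" estimation from step 4 of the first scenario; the types and field names are illustrative stand-ins, not the actual controller code.

```go
// Sketch: estimate the next templateGeneration by listing the ReplicaSets owned
// by the Deployment and taking the highest value + 1.
package main

import "fmt"

type replicaSet struct {
	name               string
	templateGeneration int64 // 0 for ReplicaSets created before the field existed
}

func nextTemplateGeneration(owned []replicaSet) int64 {
	var highest int64
	for _, rs := range owned {
		if rs.templateGeneration > highest {
			highest = rs.templateGeneration
		}
	}
	return highest + 1
}

func main() {
	owned := []replicaSet{{"foo-2", 2}, {"foo-3", 3}, {"foo-4fg56d", 0}}
	fmt.Println(nextTemplateGeneration(owned)) // 4
}
```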

append the TemplateGeneration to the Deployment name or we can compute the hash of
*deployment.name+deployment.templategeneration* and extend its size in different ways to
ensure that a new RS name will never collide with an old one.
Since owner references have been added in Pods since 1.4, we should be able to stop adding
0xmichalis (author):

Thinking about this more, now I think not using the extra label is dangerous because the Pods owned by a ReplicaSet will be claimable by all existing ReplicaSets under a Deployment in the event of a non-cascading deletion of the owner ReplicaSet.

@0xmichalis (author):

Thinking about how TemplateGeneration is used in DaemonSets, I have updated the proposal with what I think is an OK solution for using TemplateGeneration for Deployments and migrating away from hashing PodTemplateSpec.

Idea:

  • Make the rollback endpoint unset TemplateGeneration iff the revision we are rolling back to has a pod-template-generation.
  • Make the Deployment controller set the TemplateGeneration on a Deployment based on pod-template-generation of the matching ReplicaSet. If there is no matching ReplicaSet, list all, and set TemplateGeneration to the highest found + 1.
  • The only problem with this flow is what happens when a user rolls back from foo-6 to foo-5 and their next update is different from what's in foo-6. Should we rewrite foo-6 or advance foo to 7?

@janetkuo (Member) commented Mar 7, 2017

@Kargakis thanks for the proposal. It looks great, and I have some thoughts on this:

If I understand this proposal correctly, we want to replace template hash with templateGeneration, mainly for the following reasons:

  • Hash collisions (which cause replica set names and label/selector collisions)
  • The hash value changes across Kubernetes versions

Upgrading Kubernetes

When we upgrade from a previous Kubernetes version, we need to do something about the old Deployments without templateGeneration:

  1. First, update deployment.spec.templateGeneration=N
  2. Add label to new RS template.labels: templateGeneration=N
  3. Add label to new RS’s pods’ labels: templateGeneration=N
  4. Add label to new RS’s labels and selectors: templateGeneration=N
  5. Remove label from new RS’s labels and selectors: template-hash=xxx
  6. Remove label from new RS’s template.labels: template-hash=xxx
  7. Remove label from new RS’s pods’ labels: template-hash=xxx
  8. Do the same for the other RSes of the deployment, except that their templateGeneration != N and must be unique

We need to remove template-hash labels so that the hash collision won’t affect us.
N can be calculated from the total number of RSes, and each RS has a unique number from 1~N. We can sort RSes by timestamp.
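A minimal sketch of the per-object label swap in steps 2-7, operating on a plain label map so the same helper applies to RS labels, selectors, template labels, and pod labels; the label keys follow the ones discussed in this thread and the helper itself is purely illustrative.

```go
// Sketch: drop the old pod-template-hash label and add the generation label,
// as in steps 2-7 above.
package main

import (
	"fmt"
	"strconv"
)

func swapTemplateLabels(labels map[string]string, n int64) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if k == "pod-template-hash" { // old hash-based label, removed in steps 5-7
			continue
		}
		out[k] = v
	}
	out["pod-template-generation"] = strconv.FormatInt(n, 10) // new label, steps 2-4
	return out
}

func main() {
	old := map[string]string{"app": "foo", "pod-template-hash": "3764245945"}
	fmt.Println(swapTemplateLabels(old, 3)) // map[app:foo pod-template-generation:3]
}
```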

Adoption

  • If there are only bare pods to be adopted, we just adopt them blindly (based on labels, since we can’t compare the template against a pod spec) by creating a new RS with label, template label, and selector templateGeneration=deployment.spec.templateGeneration. Then label the pods with the same label. (Updated: Deployments don't deal with bare pods today, so this is out of scope for this proposal.)
  • If there is only one bare RS and some of its pods, we adopt them by doing steps 1~4 of the upgrading-Kubernetes case with N = 1
  • If there is more than one RS, we also do steps 1~4 of the upgrading-Kubernetes case

Rollback

As described in the proposal, every time we roll back, we’ll increase templateGeneration by 1, and we’ll need to update the new RS label/selector/template label, as well as its pods’ labels.

kubectl version skew

Since we’ll still use revisions for rollback, kubectl should continue to work.

Naming of RSes and pods

Having 2 naming styles is a bit ugly, but I like its simplicity.
If it's required to rename RSes that were created before templateGeneration was introduced, we can delete them non-cascadingly and then recreate them with a different name. However, we can’t rename their pods since that’ll cause unnecessary restarts.

Other things to consider

  • We should not allow users to update templateGeneration
  • Make sure that templateGeneration won’t collide (no two RSes from the same Deployment will have the same templateGeneration label value at the same time)
  • Will templateGeneration and revision cause user confusion?
  • This change will complicate deployment...

@0xmichalis (author):

@janetkuo thanks for the comment

If there are only bare pods to be adopted, we just adopt them blindly (based on labels, since we can’t compare template against pod spec) by creating a new RS with label, template label, and selector templateGeneration=deployment.spec.templateGeneration. Then label the pod with the same label.

Deployments so far don't deal with bare pods and I don't want to deal with this problem in this proposal (probably deserves a separate discussion).

The only problem with switching to TemplateGeneration is that we make adoption harder, because the adopted RS will need to match a TemplateGeneration number in order to separate its Pods from those of the other ReplicaSets owned by the Deployment. Owner references help in that regard but don't necessarily solve the issue[1], so I think we will always need the additional label. How about, instead of labeling with pod-template-generation, we switch to using the ReplicaSet name (a sketch follows below)? This could be something that each RS does on its own, so we stop doing the relabeling we have been doing in the Deployment controller and get separate, non-overlapping sets of labels for free. This probably overlaps with the selector generation proposal.

[1] If we relied just on owner references, orphaning would be a nightmare because all the existing history of a Deployment would be able to adopt any orphaned pods, and I am not sure this is desirable; hence we will always need the additional label.
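A minimal sketch of the "use the ReplicaSet name as the unique label" idea; the label key is a hypothetical placeholder, the point being that each RS stamps its own name into its selector and pod template so its pods cannot be claimed by sibling ReplicaSets of the same Deployment.

```go
// Sketch: derive a ReplicaSet's unique selector from its own name instead of a
// pod-template-generation label. The "replicaset-name" key is hypothetical.
package main

import "fmt"

func uniqueSelector(rsName string, userLabels map[string]string) map[string]string {
	sel := make(map[string]string, len(userLabels)+1)
	for k, v := range userLabels {
		sel[k] = v
	}
	sel["replicaset-name"] = rsName // hypothetical label key
	return sel
}

func main() {
	fmt.Println(uniqueSelector("foo-2", map[string]string{"app": "foo"}))
	// map[app:foo replicaset-name:foo-2]
}
```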

@0xmichalis 0xmichalis changed the title Refine the Deployment proposal and address the hashing issue Refine the Deployment proposal and move away from hashing Mar 23, 2017
@janetkuo (Member):

Thanks @Kargakis, I read both proposals (this one and #477). Using templateGeneration is more seamless from the user's perspective but complicates the system; on the other hand, switching the hashing algorithm seems to be much less complex, but users need to manually migrate to the new hash. I’m leaning a bit towards using the new hash, but I’m not sure if the new hash will bring us new problems. Do you have a preference between the two?

I have another idea, and would love to hear your thoughts. How about avoiding the hash collision by just appending some suffix upon collision? Do you think this would work?

Situation:

  1. The known problem is hash collisions. Hash value is used as ReplicaSet selector and name, so it has to be unique per template.
  2. We can’t avoid comparing templates of ReplicaSets.

Solution:

  1. Make hash value “unique” by appending a suffix on collision. The suffix can be:
    • The hash of timestamp
    • An incremental number
    • A random string
  2. That “unique” hash is used as ReplicaSet selector and name.
  3. We only need this when the controller tries to create a new ReplicaSet.

If this works, we don't need migration, and we don't need to add more complexity to the controller.

@0xmichalis (author):

We can’t avoid comparing templates of ReplicaSets.

This is the only source of truth for knowing the new RS for a deployment so I don't think we want to avoid it:)

I think I prefer using the new hashing algorithm because adoption is simpler. With templateGeneration we can still end up running an RS with a name other than the expected one because the RS may be adopted. The fact that the hash can change between releases wasn't a problem up to the point where we broke the wait of the controller.

Make hash value “unique” by appending a suffix on collision.

I am not sure this is any less complex to achieve (we now have to detect hash collisions programmatically) compared to staging a path for migrating to a stable hashing algorithm. The migration step is really temporary and we can force admins to run it between releases.

@janetkuo (Member):

We can’t avoid comparing templates of ReplicaSets.

This is the only source of truth for knowing the new RS for a deployment so I don't think we want to avoid it:)

Yeah, this is a statement of the current situation rather than a problem we want to solve.

// created with this selector, with a unique label `pod-template-hash`.
// If Selector is empty, it is defaulted to the labels present on the Pod template.
Selector map[string]string

// A counter that tracks the number of times an update happened in the PodTemplateSpec
// of the Deployment, similarly to how metadata.Generation tracks updates in the Spec
// of all first-class API objects. Will be able to be rolled back to a previous or a
Contributor:

One will be able to roll back ...

### TemplateGeneration

Hashing an API object such as PodTemplateSpec means that the resulting hash is subject to
change due to API changes in PodTemplateSpec between minor versions of Kubernetes when new
Contributor:

in PodTemplateSpec (including referenced objects)

Contributor:

Last time we had this problem it wasn't the PTS itself that changed, but rather a referenced object.

0xmichalis (author):

Referenced object? Every addition in the PTS changes the hash.

metadata.Generation, this field can be initialized by users in order to facilitate adoption
of existing ReplicaSets. This is similar to how it is already used by DaemonSets.

TemplateGeneration is not authoritative and only helps in constructing the name for the new
Contributor:

Shouldn't that be: "help in constructing new RS" ?

0xmichalis (author):

It'll be used only for the name.

ReplicaSet in case there is no other ReplicaSet that matches the Deployment. The Deployment
controller still decides what is the new ReplicaSet by comparing PodTemplateSpecs.
If no matching ReplicaSet is found, the controller will try to create a new ReplicaSet using
its current TemplateGeneration.
Contributor:

This paragraph is pretty confusing; it's talking about naming and resources at the same time. Here you're saying you're using TemplateGeneration as part of the RS name, but in the next paragraph you say it can't be used, because we cannot migrate old RSes to use it.

0xmichalis (author):

TemplateGeneration can't be used for migrating old ReplicaSets. This paragraph talks about what will happen to new ReplicaSets if we switch to TemplateGeneration. I will try to make it clearer.

the TemplateGeneration to the Deployment name or we can compute the hash of
*deployment.name+deployment.templategeneration*, extend its size deterministically in different
ways to ensure that a new RS name will never collide with an old one, and use that as the hash
that shall be appended in the Deployment name to produce the new ReplicaSet name.
Contributor:

Since we are introducing TemplateGeneration, which is meant to address problems when comparing whether the deployment currently running is the one in the Deployment spec, why not just use whatever name (e.g. the Unix time at the moment of creating that RS, similar to what we do with CronJobs creating Jobs)? Although I see value in using TG in the name, we are not relying on it, so why use it there in that case?

0xmichalis (author):

Actually the reason we want to switch to TemplateGeneration is stable names. There is no such problem as "comparing if the deployment currently running is the one in Deployment spec". That is already working properly.

For ex: consider the following case:
- User creates a deployment to rolling-update 10 pods with image:v1 to
Users can cancel a rollout by doing a non-cascading deletion of the Deployment
it before it is complete. Recreating the same Deployment will resume it.
Contributor:

it before it is ...

Users can cancel a rollout by doing a non-cascading deletion of the Deployment
it before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creates a Deployment to rolling-update 10 pods with image:v1 to
Contributor:

User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to image:v2.

notice that the second RC exists already which it can ramp up while ramping down
- User then deletes this Deployment while the old and new RSs are at 5 replicas each.
User will end up with 2 RSs with 5 replicas each.
User can then create the same Deployment again in which case, DeploymentController will
Contributor:

re-create

revision they want to rollback to or no revision which means that the Deployment
will be rolled back to its previous revision.

By hashing PodTemplateSpec, the Deployment controller can move the ReplicaSet a
Contributor:

That's not clear to me: who will hash the PodTemplateSpec, and when and how?

0xmichalis (author):

who: the Deployment controller
when: every time a Deployment is processed and there is no existing ReplicaSet matching the desired PodTemplateSpec
how: the controller gets the PodTemplateSpec out of the Deployment, hashes it, and uses the hash to create the new ReplicaSet.
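A minimal sketch of that hashing step. Note that it hashes a JSON encoding of the template (as floated earlier in this thread) rather than the Go struct the controller actually hashes, and the types and helper names are illustrative.

```go
// Sketch: hash the PodTemplateSpec and append the hash to the Deployment name
// to form the new ReplicaSet name. The podTemplateSpec type is a stand-in for
// the real API type.
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

type podTemplateSpec struct {
	Labels map[string]string `json:"labels"`
	Image  string            `json:"image"`
}

func newReplicaSetName(deploymentName string, template podTemplateSpec) (string, error) {
	data, err := json.Marshal(template)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%s-%d", deploymentName, h.Sum32()), nil
}

func main() {
	name, _ := newReplicaSetName("foo", podTemplateSpec{
		Labels: map[string]string{"app": "foo"},
		Image:  "image:v2",
	})
	fmt.Println(name) // e.g. foo-1817547447
}
```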

@soltysh (Contributor) commented Mar 28, 2017

I intentionally didn't comment on #477 since I think this is the better approach and it should be pursued.

@0xmichalis (author) commented Apr 5, 2017

@janetkuo I thought about this more and, after reviewing #503, I think we want to move forward with TemplateGeneration for Deployments too. I have thought it through and the pros outweigh the cons:

  1. No more hash collisions
  2. We don't need to stage a migration
  3. Consistent with the rest of the workload APIs

By following what @kow3ns is proposing in #503:

CurrentTemplateGeneration and UpdatedTemplateGeneration help us in:

  1. Stop using the revision annotation, which has been a PITA to maintain. I always wanted to move it into the API for validation. I think we want to map the annotation to UpdatedTemplateGeneration.
  2. Combined with CurrentTemplateGeneration, we can gate the rollout process more, which will simplify the implementation of ProgressDeadlineSeconds a lot (right now it's ugly).

We can and should use GenerationPartition from #503 as well, but in Deployments it can also be a percentage. We will need some math here for calculating rollout percentages (a sketch follows below), but it should be fun to write.
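A minimal sketch of the percentage math, assuming the partition expresses how many replicas stay on the old template and that we round up; both are assumptions, not decisions made in this thread.

```go
// Sketch: turn a percentage-based partition into a replica count.
package main

import (
	"fmt"
	"math"
)

func partitionedReplicas(desired int32, percent int) int32 {
	return int32(math.Ceil(float64(desired) * float64(percent) / 100.0))
}

func main() {
	fmt.Println(partitionedReplicas(10, 25)) // 3 replicas stay on the old template
}
```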

@0xmichalis (author):

We can and should use GenerationPartition from #503

For canary, custom, and A/B deployments

cc: @mfojtik @smarterclayton

@janetkuo (Member) commented Apr 7, 2017

stop using the revision annotation which has been a PITA to maintain.

@Kargakis Would you elaborate more on why it's a PITA?

Just thought about this issue more. I don't think the migration step is necessary, if we switch to the new / more stable hash function. After switching to the new hash, no unnecessary new RS will be created, since we'll look at current RS template before creating a new one and we don't compare hash value directly.

@0xmichalis (author):

@Kargakis Would you elaborate more on why it's a PITA?

See fixes like kubernetes/kubernetes#42535 or kubernetes/kubernetes#42869

Just thought about this issue more. I don't think the migration step is necessary, if we switch to the new / more stable hash function. After switching to the new hash, no unnecessary new RS will be created, since we'll look at current RS template before creating a new one and we don't compare hash value directly.

What if a hash produced with the new hashing algo collides with a hash produced with the old hashing algo?

@janetkuo (Member) commented Apr 7, 2017

What if a hash produced with the new hashing algo collides with a hash produced with the old hashing algo?

True, but can we guarantee that the new hashing algo will never collide? If not, we can uniquify the hash value when a collision happens. It should also be idempotent so that the Deployment controller won’t create more new RSes than required.

So we can do this by:

  1. First, switch to the new hash algo (without migration), and then
  2. Add a new field to Deployment: uniquifier (tentative name, perhaps under status)
    • This value will only increase
    • This value will be updated when there’s a hash collision
  3. When the Deployment template is updated (either rollback or rollout):
    • The Deployment controller first searches for current RSes and compares their templates with the Deployment’s (current behavior)
    • If a match is found, that RS is the new RS (current behavior)
    • If no match is found, create a new RS with hash{template+uniquifier} (use this hash value for the RS name, label selectors, and pod labels)
      • If the creation fails because the RS already exists, compare the existing RS template with the Deployment template to determine whether it’s a collision or the new RS the controller just created (but hasn’t observed yet)
      • If the new RS is already created, no op (mission completed)
      • If it’s a collision, increase the uniquifier by 1 and try again in the next loop
      • If the creation fails for any other reason, the controller will retry in the next loop, and the hash value will be idempotent because we use the same template and the same uniquifier

Please let me know what you think about this.
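A minimal sketch of the create-and-retry loop described in the steps above; the client interface, the error value, and the uniquifier plumbing are illustrative stand-ins, not the eventual implementation.

```go
// Sketch of the collision-handling loop: try to create the new RS under
// hash{template+uniquifier}, and on a name clash decide whether it is the RS we
// just created or a genuine collision that requires bumping the uniquifier.
package deployment

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("replicaset already exists")

type replicaSet struct {
	Name     string
	Template string // stand-in for PodTemplateSpec
}

type client interface {
	Create(rs replicaSet) error
	Get(name string) (replicaSet, error)
}

// hashWithUniquifier folds the uniquifier into the template hash so a bumped
// uniquifier yields a new, deterministic name for the same template.
func hashWithUniquifier(template string, uniquifier int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%d", template, uniquifier)))
	return fmt.Sprintf("%x", sum[:4])
}

// ensureNewReplicaSet tries to create the new RS and reports whether the
// uniquifier must be bumped because of a genuine hash collision.
func ensureNewReplicaSet(c client, deployName, template string, uniquifier int64) (bumpUniquifier bool, err error) {
	name := fmt.Sprintf("%s-%s", deployName, hashWithUniquifier(template, uniquifier))
	err = c.Create(replicaSet{Name: name, Template: template})
	if err == nil {
		return false, nil // created: mission completed
	}
	if !errors.Is(err, errAlreadyExists) {
		return false, err // other failure: retry next sync with the same hash
	}
	existing, getErr := c.Get(name)
	if getErr != nil {
		return false, getErr
	}
	if existing.Template == template {
		return false, nil // the RS we just created but had not observed yet: no op
	}
	return true, nil // same name, different template: a real collision
}
```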

@0xmichalis (author):

@janetkuo I like the idea, but the uniquifier now changes the generated names every time a hash collision is detected, which means that it also increases the probability of hash collisions. I think it is very hard to programmatically know when a hash collision has happened without making the underlying ReplicaSets read-only; otherwise users may change them in subtle ways that are impossible for us to detect. Read-only children is a problem of its own, though, and I am going to open a separate issue for discussion.

@0xmichalis (author):

Read-only children issue: kubernetes/kubernetes#44237

this label. This value will be unique for all RCs, since PodTemplateSpec should be unique.
- If the RCs and pods dont already have this label and selector:
- We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to
2. The new RS can have the same selector as the old RS and hence we need to add a unique label
0xmichalis (author):

@soltysh ptal

Contributor:

👍

@0xmichalis (author):

Updated the proposal with my latest thoughts on selector generation for ReplicaSets and rollbacks.


Users can cancel a rollout by doing a non-cascading deletion of the Deployment
before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to
Contributor:

s/creats/creates

@0xmichalis (author):

Closing in favor of #477

@0xmichalis 0xmichalis closed this Apr 20, 2017
@0xmichalis 0xmichalis deleted the update-deployment-proposal branch April 20, 2017 16:20