Refine the Deployment proposal and move away from hashing #384
Conversation
existing Pods and create the new ReplicaSet that will adopt any existing Pods and
create additional Pods as necessary.

**TODO:** Non-cascading deletion as part of rollback sounds brittle. Replacing
This is the real pain point of ending up with duplicate ReplicaSets (RSs that have the same PodTemplateSpec).
There are tradeoffs that need to be made:
- we either continue using the current mechanism, which works like a charm for rollbacks, but leaves Deployments broken across cluster upgrades.
- or we switch to using TemplateGeneration, which means stable names across cluster upgrades at the expense of more cumbersome rollbacks.
The first option seems broken by design and I can't think of a way to fix it (maybe hash the YAML/JSON representation of a PodTemplate - that seems more stable than the Go struct, but is it stable enough?). The second option can be worked on (we should be able to make handoff/renaming of a controller more robust).
I updated the proposal to roll back TemplateGeneration, so the comments above no longer stand.
Scenario:
Deployment with name=foo, revision=5, templateGeneration=3 is rolled back to revision=4
- Deployment controller lists all ReplicaSets owned by foo, finds foo-2 tagged with revision=4, templateGeneration=2
- Deployment foo is updated to use template from foo-2, templateGeneration rolled back to 2, revision now is 6. foo-3 is listed as an old replica set.
- Deployment foo is updated by the user to a net-new template, templateGeneration is incremented to 3.
- Deployment controller sees that a replica set with foo-3 name already exists, checks to see if foo and foo-3 have the same PodTemplateSpec. If yes, then fine. If not, list all existing replica sets and decide what is the new templateGeneration by taking the highest one and +1 to it (similar to how we currently estimate the new revision). New revision is 7.
Scenario:
Deployment with name=foo, revision=5, templateGeneration=1 is rolled back to revision=4
- Deployment controller lists all ReplicaSets owned by foo, finds foo-4fg56d tagged with revision=4, no templateGeneration
- Deployment foo is updated to use the template from foo-4fg56d, templateGeneration is incremented normally to 2, foo-2 is created.
- If foo-4fg56d has running pods, we will simply kill them while creating new pods for foo-2 (we could potentially adopt them but I don't think it's necessary).
ReplicaSet names are still tentative (I prefer something like foo-blabla rather than foo-2, but we can't use the existing scheme because of possible naming collisions with old ReplicaSets).
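A minimal sketch of the collision handling described in the scenarios above, assuming the controller falls back to "highest existing templateGeneration + 1" when the name it wants is taken by a ReplicaSet with a different template (mirroring how the new revision is estimated today). The types and fields here are illustrative, not the real API.

```go
// Hypothetical sketch: decide which templateGeneration the new ReplicaSet
// should use, given the ReplicaSets already owned by the Deployment.
package sketch

type replicaSet struct {
	Name               string
	TemplateGeneration int64
	TemplateKey        string // stands in for comparing PodTemplateSpecs
}

func nextTemplateGeneration(current int64, wantKey string, owned []replicaSet) int64 {
	maxGen := current
	for _, rs := range owned {
		if rs.TemplateGeneration > maxGen {
			maxGen = rs.TemplateGeneration
		}
		if rs.TemplateGeneration == current && rs.TemplateKey == wantKey {
			return current // the "colliding" RS actually matches; reuse it
		}
	}
	for _, rs := range owned {
		if rs.TemplateGeneration == current {
			return maxGen + 1 // name taken by a different template: bump past the max
		}
	}
	return current // no collision
}
```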
append the TemplateGeneration to the Deployment name or we can compute the hash of
*deployment.name+deployment.templategeneration* and extend its size in different ways to
ensure that a new RS name will never collide with an old one.
Since owner references have been added in Pods since 1.4, we should be able to stop adding
Thinking about this more, I now think not using the extra label is dangerous: in the event of a non-cascading deletion of the owner ReplicaSet, the Pods it owned would be claimable by all of the existing ReplicaSets under the Deployment.
Thinking about how TemplateGeneration is used in DaemonSets, I have updated the proposal with what I think is an acceptable solution for using TemplateGeneration for Deployments and migrating away from hashing PodTemplateSpec.
Idea:
@Kargakis thanks for the proposal. It looks great, and I have some thoughts on this. If I understand this proposal correctly, we want to replace the template hash with templateGeneration.
**Upgrading Kubernetes**
When we upgrade from a previous Kubernetes version, we need to do something about the old Deployments without templateGeneration:
- We need to remove template-hash labels so that the hash collision won't affect us.

**Adoption**

**Rollback**
As described in the proposal, every time we roll back, we'll increase templateGeneration by 1, and we'll need to update the new RS label/selector/template label, as well as its pods' labels.

**kubectl version skew**
Since we'll still use revisions for rollback, kubectl should continue to work.

**Naming of RSes and pods**
Having 2 naming styles is a bit ugly, but I like its simplicity.

**Other things to consider**
@janetkuo thanks for the comment
Deployments so far don't deal with bare pods and I don't want to deal with that problem in this proposal (it probably deserves a separate discussion). The only problem with switching to TemplateGeneration is that we make adoption harder, because the adopted RS will need to match a TemplateGeneration number in order to separate its Pods from the other ReplicaSets owned by the Deployment. Owner references help in that regard but don't necessarily solve the issue [1], so I think we will always need the additional label. How about instead of labeling by using

[1] If we relied just on owner references, orphaning would be a nightmare, because all of a Deployment's existing history would be able to adopt any orphaned pods, and I am not sure that is desirable; hence we will always need the additional label.
Thanks @Kargakis, I read both proposals (this and #477). Using

I have another idea, and would love to hear your thoughts. How about avoiding the hash collision by just appending some suffix upon collision? Do you think this would work?

Situation:
Solution:
If this works, we don't need migration, and we don't need to add more complexity to the controller.
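A rough sketch of the "append a suffix upon collision" idea above: if a ReplicaSet with the computed name already exists but its template differs from the Deployment's, derive an alternative name instead of reusing it. The `exists` callback is a hypothetical stand-in for looking the ReplicaSet up and comparing templates.

```go
// Sketch only: resolve a ReplicaSet name, appending a suffix when the
// hash-based name collides with an RS that has a different pod template.
package sketch

import (
	"fmt"
	"strconv"
)

func newRSName(deploymentName, hash string, exists func(name string) (found bool, sameTemplate bool)) string {
	name := fmt.Sprintf("%s-%s", deploymentName, hash)
	for i := 1; ; i++ {
		found, same := exists(name)
		if !found || same {
			return name // free name, or an RS we can simply reuse/adopt
		}
		// Collision with a different template: append a deterministic suffix.
		name = fmt.Sprintf("%s-%s-%s", deploymentName, hash, strconv.Itoa(i))
	}
}
```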
This is the only source of truth for knowing the new RS for a Deployment, so I don't think we want to avoid it :) I think I prefer using the new hashing algorithm because adoption is simpler. With templateGeneration we can still end up running an RS with a name other than the expected one, because the RS may be adopted. The fact that the hash can change between releases wasn't a problem up to the point we broke the wait of the controller.

I am not sure this is any less complex to achieve (we now have to programmatically detect hash collisions) vs. staging a path for migrating to a stable hashing algorithm. The migration step is really temporary and we can force admins to run it between releases.
Yeah, this is a statement of the current situation rather than a problem we want to solve.
// created with this selector, with a unique label `pod-template-hash`.
// If Selector is empty, it is defaulted to the labels present on the Pod template.
Selector map[string]string

// A counter that tracks the number of times an update happened in the PodTemplateSpec
// of the Deployment, similarly to how metadata.Generation tracks updates in the Spec
// of all first-class API objects. Will be able to be rolled back to a previous or a
One will be able to roll back ...
### TemplateGeneration

Hashing an API object such as PodTemplateSpec means that the resulting hash is subject to
change due to API changes in PodTemplateSpec between minor versions of Kubernetes when new
in PodTemplateSpec (including referenced objects)
Last time we had the problem it wasn't with the PTS being changed, but rather the referenced object that has changed.
Referenced object? Every addition in the PTS changes the hash.
metadata.Generation, this field can be initialized by users in order to facilitate adoption
of existing ReplicaSets. This is similar to how it is already used by DaemonSets.

TemplateGeneration is not authoritative and only helps in constructing the name for the new
Shouldn't that be: "help in constructing new RS" ?
It'll be used only for the name.
ReplicaSet in case there is no other ReplicaSet that matches the Deployment. The Deployment
controller still decides what is the new ReplicaSet by comparing PodTemplateSpecs.
If no matching ReplicaSet is found, the controller will try to create a new ReplicaSet using
its current TemplateGeneration.
This paragraph is pretty confusing; it's talking about naming and resources at the same time. Here, you're saying you're using TemplateGeneration as part of the RS name, but in the next paragraph you say it can't be used, because we cannot migrate old RSs to use it.
TemplateGeneration can't be used for migrating old ReplicaSets. This paragraph talks about what will happen to new ReplicaSets if we switch to TemplateGeneration. I will try to make it clearer.
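A minimal sketch of the "decide what the new ReplicaSet is by comparing PodTemplateSpecs" step quoted above: compare the Deployment's template with each owned ReplicaSet's template while ignoring the controller-injected pod-template-hash label. The `podTemplate` type is a stand-in for the real API object, not the actual controller code.

```go
// Sketch only: find the ReplicaSet whose pod template matches the Deployment's.
package sketch

import "reflect"

type podTemplate struct {
	Labels map[string]string
	Spec   string // stand-in for the full pod spec
}

const podTemplateHashLabel = "pod-template-hash"

func equalIgnoringHash(a, b podTemplate) bool {
	stripped := func(t podTemplate) podTemplate {
		labels := map[string]string{}
		for k, v := range t.Labels {
			if k != podTemplateHashLabel {
				labels[k] = v
			}
		}
		t.Labels = labels
		return t
	}
	return reflect.DeepEqual(stripped(a), stripped(b))
}

// findNewRS returns the index of the matching ReplicaSet template, or -1 if
// none matches and a new ReplicaSet must be created.
func findNewRS(deployTemplate podTemplate, rsTemplates []podTemplate) int {
	for i, t := range rsTemplates {
		if equalIgnoringHash(deployTemplate, t) {
			return i
		}
	}
	return -1
}
```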
the TemplateGeneration to the Deployment name or we can compute the hash of
*deployment.name+deployment.templategeneration*, extend its size deterministically in different
ways to ensure that a new RS name will never collide with an old one, and use that as the hash
that shall be appended in the Deployment name to produce the new ReplicaSet name.
Since we are introducing TemplateGeneration, and that is meant to address problems when comparing whether the deployment currently running is the one in the Deployment spec, why not just use whatever name (i.e. the unix time at the moment of creating that RS, similarly to what we do with CronJobs creating Jobs)? Although I see value in using TG in the name, we are not relying on it, so why use it in that case?
Actually the reason we want to switch to TemplateGeneration is stable names. There is no such problem as "comparing if the deployment currently running is the one in Deployment spec". That is already working properly.
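An illustrative sketch of the two naming options quoted above: either append the TemplateGeneration directly (maximally stable and readable), or hash deployment.name+templateGeneration and pad the result to a fixed width longer than the old hash suffix so a new-style name can never collide with an old PodTemplateSpec-hash-based name. The exact encoding below is an assumption, not the proposal's final choice.

```go
// Sketch only: two ways to derive a stable ReplicaSet name from the
// Deployment name and its TemplateGeneration.
package sketch

import (
	"fmt"
	"hash/fnv"
)

// simpleName appends the TemplateGeneration directly: stable across upgrades.
func simpleName(deploymentName string, templateGeneration int64) string {
	return fmt.Sprintf("%s-%d", deploymentName, templateGeneration)
}

// hashedName hashes name+templateGeneration and uses a fixed 16-hex-digit
// suffix, wider than the old hash suffix, so the two schemes cannot collide.
func hashedName(deploymentName string, templateGeneration int64) string {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s-%d", deploymentName, templateGeneration)
	return fmt.Sprintf("%s-%016x", deploymentName, h.Sum64())
}
```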
For ex: consider the following case:
- User creates a deployment to rolling-update 10 pods with image:v1 to
Users can cancel a rollout by doing a non-cascading deletion of the Deployment
it before it is complete. Recreating the same Deployment will resume it.
it before it is ...
Users can cancel a rollout by doing a non-cascading deletion of the Deployment
it before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creates a Deployment to rolling-update 10 pods with image:v1 to
User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to image:v2.
notice that the second RC exists already which it can ramp up while ramping down
- User then deletes this Deployment while the old and new RSs are at 5 replicas each.
  User will end up with 2 RSs with 5 replicas each.
  User can then create the same Deployment again in which case, DeploymentController will
re-create
revision they want to rollback to or no revision which means that the Deployment
will be rolled back to its previous revision.

By hashing PodTemplateSpec, the Deployment controller can move the ReplicaSet a
That's not clear to me, who/when and how will hash PodTemplateSpec?
who: the Deployment controller
when: every time a Deployment is processed and there is no existing ReplicaSet matching the desired PodTemplateSpec
how: the controller gets the PodTemplateSpec out of the Deployment, hashes it, and uses the hash to create the new ReplicaSet.
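A simplified sketch of the "who/when/how" just described: when no owned ReplicaSet matches the Deployment's PodTemplateSpec, hash the template and use the hash both as the pod-template-hash label value and as the suffix of the new ReplicaSet name. Note the real controller hashes the Go struct via its own helpers, which is exactly why the value can change between releases; JSON is used here only as a stand-in serialization, so this is an illustration rather than the controller's actual code.

```go
// Sketch only: derive a pod-template-hash and the new ReplicaSet name.
package sketch

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

func podTemplateHash(template interface{}) (string, error) {
	data, err := json.Marshal(template) // stand-in for hashing the Go struct
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%d", h.Sum32()), nil
}

func newReplicaSetName(deploymentName, hash string) string {
	return fmt.Sprintf("%s-%s", deploymentName, hash)
}
```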
I intentionally didn't comment on #477 since I think this is the better approach and it should be pursued.
@janetkuo I thought about this more and after reviewing #503 I think we want to move forward with TemplateGeneration for Deployments too. I thought this out and the pros outweigh the cons:
By following what @kow3ns is proposing in #503: CurrentTemplateGeneration and UpdatedTemplateGeneration help us in:
We can and should use GenerationPartition from #503 as well, but in Deployments it can also be a percentage. We will need some math here for calculating rollout percentages but it should be fun to write.
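A hypothetical sketch of the "percentage partition" math mentioned above: translate a desired rollout percentage into the number of replicas that should be moved to the updated template. GenerationPartition itself comes from #503; the rounding behavior here is an assumption for illustration.

```go
// Sketch only: convert a rollout percentage into a replica count.
package sketch

// replicasToUpdate returns how many of the desired replicas should run the
// updated template for a given rollout percentage, rounding up so a non-zero
// percentage always updates at least one replica.
func replicasToUpdate(desiredReplicas int32, percent int32) int32 {
	if percent <= 0 || desiredReplicas <= 0 {
		return 0
	}
	if percent >= 100 {
		return desiredReplicas
	}
	return (desiredReplicas*percent + 99) / 100
}
```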
For canary, custom, and A/B deployments
@Kargakis Would you elaborate more on why it's a PITA? I just thought about this issue more. I don't think the migration step is necessary if we switch to the new / more stable hash function. After switching to the new hash, no unnecessary new RS will be created, since we'll look at the current RS template before creating a new one and we don't compare hash values directly.
See fixes like kubernetes/kubernetes#42535 or kubernetes/kubernetes#42869
What if a hash produced with the new hashing algo collides with a hash produced with the old hashing algo?
True, but can we guarantee that the new hashing algo will never collide? If not, we can uniquify the hash value when a collision happens. It should also be idempotent so that the Deployment controller won’t create more new RS than required. So we can do this by:
Please let me know what you think about this.
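A sketch of one way to make the "uniquify the hash value when a collision happens" idea above idempotent: fold a per-Deployment collision counter into the hash input, so re-processing the Deployment yields the same uniquified value instead of minting a new RS each time. The collisionCount parameter is an assumption for illustration, not a field defined in this proposal.

```go
// Sketch only: deterministic, idempotent uniquification of the template hash.
package sketch

import (
	"fmt"
	"hash/fnv"
)

func uniquifiedHash(templateJSON []byte, collisionCount int32) string {
	h := fnv.New32a()
	h.Write(templateJSON)
	if collisionCount > 0 {
		// Mixing in the counter changes the hash deterministically on collision.
		fmt.Fprintf(h, "-%d", collisionCount)
	}
	return fmt.Sprintf("%d", h.Sum32())
}
```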
@janetkuo I like the idea but the
Read-only children issue: kubernetes/kubernetes#44237
this label. This value will be unique for all RCs, since PodTemplateSpec should be unique.
- If the RCs and pods dont already have this label and selector:
- We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to
2. The new RS can have the same selector as the old RS and hence we need to add a unique label
@soltysh ptal
👍
Updated the proposal with my latest thoughts on selector generation for ReplicaSets and rollbacks.
this label. This value will be unique for all RCs, since PodTemplateSpec should be unique.
- If the RCs and pods dont already have this label and selector:
- We will first add this to RC.PodTemplateSpec.Metadata.Labels for all RCs to
2. The new RS can have the same selector as the old RS and hence we need to add a unique label
👍
Users can cancel a rollout by doing a non-cascading deletion of the Deployment
before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to
s/creats/creates
Closing in favor of #477
Alternative proposal to #477
@kubernetes/sig-apps-api-reviews @kubernetes/api-reviewers