kubelet's calculation of whether a container has changed can cause cluster-wide outages #63814
Comments
As @lavalamp suggested in #23104, instead of hashing the whole container spec, can we maintain a list of container fields per release that should be taken into consideration when hashing, and identify the hash values with a release prefix?
Imagine the kubelet has been upgraded to release2 (which adds a new container field). If the users then update a pod or its deployment, the pod is expected to be recreated, so the fields of the new release begin to take effect on its containers. The only downside I can think of is that we might need to maintain the field list manually (e.g. deprecating the fields of old releases). |
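A minimal Go sketch of this per-release field-list idea, with hypothetical types and function names (this is not what the kubelet actually does): each release defines which container fields feed the hash, and the stored hash is prefixed with the release that produced it, so hashes from different releases can be told apart.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// container stands in for the real v1.Container API type.
type container struct {
	Name                   string
	Image                  string
	TerminationMessagePath string
}

// fieldsByRelease lists, per release, which container fields feed the hash.
var fieldsByRelease = map[string]func(c container) string{
	"release1": func(c container) string { return c.Name + "/" + c.Image },
	"release2": func(c container) string {
		return c.Name + "/" + c.Image + "/" + c.TerminationMessagePath
	},
}

// hashForRelease returns the hash tagged with the release whose field list
// produced it, e.g. "release2:1234567", so an upgraded kubelet can still
// recognize and recompute hashes written by an older release.
func hashForRelease(release string, c container) string {
	h := fnv.New32a()
	fmt.Fprint(h, fieldsByRelease[release](c))
	return fmt.Sprintf("%s:%d", release, h.Sum32())
}

func main() {
	c := container{Name: "app", Image: "nginx:1.19"}
	fmt.Println(hashForRelease("release1", c))
	fmt.Println(hashForRelease("release2", c))
}
```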
That means we have to hard-code this, and I don't like that approach. As I mentioned in #57741, how about adding a new label, like |
We don't have field-level versioning for the Kubernetes API today, so it won't help (the kubelet itself has no idea in which version a field was introduced or changed). If we have to go with the field-label approach, adding a version indicator might help:

```go
type Container struct {
	foo string `containerSerializedSpecLabel:"since v1.9"`
	bar string `containerSerializedSpecLabel:"since v1.10"`
}
```

Then, to be honest, I don't see many benefits over having a container fields list const somewhere in pkg/kubelet (e.g. in |
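For illustration, a minimal sketch of how such a tag could drive the hash via reflection; the tag value format, the toy version parser, and the function names are hypothetical, not a proposed API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"reflect"
)

// Container mirrors the tagged struct sketched above (tags hold the version
// in which each field was introduced).
type Container struct {
	Foo string `containerSerializedSpecLabel:"v1.9"`
	Bar string `containerSerializedSpecLabel:"v1.10"`
}

// minorOf extracts N from "v1.N"; a toy parser that assumes major version 1.
func minorOf(v string) int {
	var major, minor int
	fmt.Sscanf(v, "v%d.%d", &major, &minor)
	return minor
}

// hashFieldsUpTo hashes only fields introduced at or before maxVersion, so a
// v1.9 kubelet ignores fields added in v1.10 and its hashes stay stable.
func hashFieldsUpTo(c Container, maxVersion string) uint32 {
	h := fnv.New32a()
	v := reflect.ValueOf(c)
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		since := t.Field(i).Tag.Get("containerSerializedSpecLabel")
		if minorOf(since) <= minorOf(maxVersion) {
			fmt.Fprintf(h, "%s=%v;", t.Field(i).Name, v.Field(i).Interface())
		}
	}
	return h.Sum32()
}

func main() {
	c := Container{Foo: "x", Bar: "y"}
	fmt.Println(hashFieldsUpTo(c, "v1.9"))  // hashes Foo only
	fmt.Println(hashFieldsUpTo(c, "v1.10")) // hashes Foo and Bar
}
```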
Another problem with comparing hashes across versions is changes to the hash function. Similar issues have come up in workload controllers, where they are used for collision avoidance only, for these types of reasons. Persisted hashes shouldn't be used as the sole determination of equality, since there are multiple reasons why they may change across releases. If a hash comparison fails, one needs to fall back to actually comparing properties that matter. |
You can solve it with hashes, but you need to hash whitelisted fields (NOT the encoded object), store every iteration of the hash function, and understand a vector of hashes (i.e., one label per iteration of the hash function) instead of a single hash. |
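A rough sketch of that "vector of hashes" idea, with made-up names: every shipped iteration of the hash function is kept, the iteration label is stored alongside the container, and the comparison falls back to a field-by-field check when the label is unknown.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashedFields is the whitelisted subset of the container spec that gets hashed.
type hashedFields struct {
	Name  string
	Image string
}

// hashIterations keeps every version of the hash function ever shipped, keyed by
// an iteration label that is stored alongside the running container.
var hashIterations = map[string]func(f hashedFields) string{
	"v1": func(f hashedFields) string {
		h := fnv.New32a()
		fmt.Fprint(h, f.Name, f.Image)
		return fmt.Sprint(h.Sum32())
	},
	"v2": func(f hashedFields) string {
		h := fnv.New64a()
		fmt.Fprint(h, f.Name, "/", f.Image)
		return fmt.Sprint(h.Sum64())
	},
}

// unchanged reports whether the running container still matches the desired
// fields, comparing with the hash iteration recorded when it was created.
func unchanged(storedIteration, storedHash string, desired hashedFields) bool {
	hashFn, ok := hashIterations[storedIteration]
	if !ok {
		// Unknown iteration: a real implementation should fall back to
		// comparing the actual fields instead of assuming a change.
		return false
	}
	return hashFn(desired) == storedHash
}

func main() {
	f := hashedFields{Name: "app", Image: "nginx:1.19"}
	stored := hashIterations["v1"](f)       // hash recorded by an older kubelet
	fmt.Println(unchanged("v1", stored, f)) // true: same fields, same iteration
}
```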
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
To fix this issue, I suppose PR #57741 helps by omitting nil/empty fields from now on, even if new fields are added in the future. It will prevent containers from being restarted continuously. We need to make a decision to close this kind of issue, which keeps recurring on upgrades such as 1.8.x -> 1.9.x, 1.11.x -> 1.12.x, and so on. |
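A toy illustration of the "omit nil/empty fields" idea described above (types and field names are made up, not the actual change in #57741): if unset fields are left out of the bytes being hashed, merely adding a new field to the API does not change existing hashes; the hash only changes once the field is actually set.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type containerV1 struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

type containerV2 struct {
	Name     string `json:"name"`
	Image    string `json:"image"`
	NewField string `json:"newField,omitempty"` // omitted from JSON while empty
}

func main() {
	oldBytes, _ := json.Marshal(containerV1{Name: "app", Image: "nginx:1.19"})
	newBytes, _ := json.Marshal(containerV2{Name: "app", Image: "nginx:1.19"})
	// Both versions serialize to the same bytes, so a hash over them would not
	// change just because the new field exists; it changes only when it is set.
	fmt.Println(string(oldBytes) == string(newBytes)) // true
}
```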
Is this a proper time to reconsider this issue and find a one-time fix? We still struggle with container restarts when trying to upgrade our 1.15.x cluster; backporting #57741 won't help since it would itself introduce a hash change.
cc @thockin, who may help with some input. |
Changing the kubelet mechanism on a minor version boundary is acceptable, because pods are expected to be drained on kubelet minor version upgrade. |
Updated the title and description to clarify the impact of the current logic and why it should be changed; xref discussion in #95587 (comment) |
my naive recommendation would be to limit the container hash to specific fields that are known to be immutable or are expected to be able to change (maybe just name and image?) |
✅ 💯 If the kubelet doesn't know how to react to a field change, that field must not be included in the hash. In-place upgrades add some restrictions:
(the above goes for controllers that perform this operation, too)
And resource limits, if those would require a restart. |
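A hedged sketch of the whitelist approach being discussed here, using made-up names: one hash over the fields whose change requires recreating the container (just name and image in this sketch), and a separate hash over resources, which the kubelet might reconcile without treating the whole container as different.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// containerSpec is a stand-in for the fields of v1.Container that matter here.
type containerSpec struct {
	Name   string
	Image  string
	CPU    string // stand-ins for resource requests/limits
	Memory string
}

// restartHash covers only fields whose change requires recreating the container.
func restartHash(c containerSpec) uint64 {
	h := fnv.New64a()
	fmt.Fprint(h, c.Name, "/", c.Image)
	return h.Sum64()
}

// resourceHash covers resources separately, so a resource-only change can be
// handled in place (or with a restart if the runtime requires one) without the
// whole container being considered "different".
func resourceHash(c containerSpec) uint64 {
	h := fnv.New64a()
	fmt.Fprint(h, c.CPU, "/", c.Memory)
	return h.Sum64()
}

func main() {
	before := containerSpec{Name: "app", Image: "nginx:1.19", CPU: "100m", Memory: "128Mi"}
	after := before
	after.Memory = "256Mi"
	fmt.Println(restartHash(before) == restartHash(after))   // true: no restart needed
	fmt.Println(resourceHash(before) == resourceHash(after)) // false: resources changed
}
```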
Specifically for kubelets, if the change is made on a minor version boundary, I don't think we need those protections, given sig-node's position that in-place (minor version) kubelet upgrades without draining are not supported.
Yes, for controllers, which have to deal with persisted ControllerRevisions and annotations on API objects made by earlier versions of the controller, they definitely have to consider the approach taken by previous versions of the controller. I plan to spawn similar issues to track resolving this for controllers. |
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted |
It looks like the in-place resource change expanded the surface area of this issue in f2bd94a by adding a second hash that is now tracked, HashWithoutResources. I had hoped handling mutable resources would prompt fixing this issue :-/ |
cc @smarterclayton @vinaykul for visibility to the problems with detecting container changes using hashes of API content |
Revisiting this as we look at more and more Pod fields which cannot add default values. This forces us to treat This is not unique to the kubelet, it's just particularly acute there. We could convert this to a list of specific fields to hash, but that still feels brittle. Shouldn't the API server have a way to help with this? Either
/sig api-machinery |
The kubelet has to do stuff to set up containers correctly; it has to know which fields could change and how much it has to do to correctly handle the changes. Hashing the immutable fields that assert identity (name, namespace, uid, containerName) plus the effective values of the fields whose mutations it handles seems correct to me. |
Sure, but it's just another place that needs to be touched if any other field becomes mutable, which could be forgotten. I'm not saying not to fix this with urgency, just that I want to consider paths towards less open-coded lists of things to hash. |
After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue. |
@HirazawaUi: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@HirazawaUi If you have bandwidth, please run a manual upgrade test if this is not covered in CI somewhere. This change is important but also feels a bit risky. |
I'll try to do this in my spare time :) |
Updated 2020-11-09 by @liggitt:
In 1.16+, the kubelet hashes the serialized container spec:
kubernetes/pkg/apis/core/v1/defaults_test.go (lines 165 to 167 in 5f1e5ca)
kubernetes/pkg/kubelet/container/helpers.go (lines 98 to 108 in 5f1e5ca)
This mechanism is used by the kubelet in `containerChanged` and `getStableKey`, which means an API server upgrade that starts defaulting a container field graduating from alpha will instantly modify the computed hash and trigger a container restart. This results in a simultaneous restart of all containers across all pods of a ReplicaSet and can cause outages.
The kubelet should limit the fields it considers in the container hash to the ones it expects to be able to change in ways that would affect the running container (currently just `image`, I think). The pod uid is already included in `getStableKey`, so a delete/recreate of the entire pod would already compute a different key.
Changing this mechanism on a minor version boundary (to improve the interactions with the API server changes in 1) is acceptable, because pods are expected to be drained as part of that upgrade (xref kubernetes/website#12326 (comment))
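A toy demonstration of the failure mode described above, using made-up types rather than the real API machinery: hashing the full serialized spec means a newly added, defaulted field changes the hash even though nothing the user specified changed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

type oldContainer struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

type newContainer struct {
	Name     string `json:"name"`
	Image    string `json:"image"`
	NewField string `json:"newField"` // new field, now defaulted by the API server
}

// hashSerialized hashes the full JSON encoding, mimicking a hash over the
// serialized container spec.
func hashSerialized(v interface{}) uint64 {
	b, _ := json.Marshal(v)
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

func main() {
	before := hashSerialized(oldContainer{Name: "app", Image: "nginx:1.19"})
	after := hashSerialized(newContainer{Name: "app", Image: "nginx:1.19", NewField: "default"})
	// The hashes differ, so the kubelet would consider the container "changed"
	// and restart it, even though name and image are identical.
	fmt.Println(before == after) // false
}
```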
Original description:
The kubelet hashes the container spec and stores the hash with the container so that it can detect whether the container still meets the current spec. If the hashes do not match, the kubelet kills the container and creates a new one. The most notable side effect of this mechanism is that if a new field is added to the container spec in Kubernetes 1.x, and the kubelet is upgraded from 1.x-1 to 1.x, all containers created by the 1.x-1 kubelet will be killed.
This issue has come up quite a few times before (see #53644), and was discussed extensively in #23104. It was punted a few times because 1) in-place upgrade of the kubelet was not officially supported, and 2) upgrading docker used to require restarting all the containers too. Now that the upgrade is managed by the installers (#6099 (comment)) and docker supports live-restore, we should consider fixing this in the kubelet.
There are related discussions/suggestions the #57741 PR already. Creating this issue as a centralized place for discussions and suggestions.
/cc @kubernetes/sig-node-feature-requests
/cc a couple of folks who have expressed interests in this issue: @dixudx @kevin-wangzefeng @wackxu