kubelet's calculation of whether a container has changed can cause cluster-wide outages #63814

Closed
yujuhong opened this issue May 14, 2018 · 41 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@yujuhong
Contributor

yujuhong commented May 14, 2018

Updated 2020-11-09 by @liggitt:

In 1.16+, the kubelet hashes the serialized container spec:

  1. API server sends data
    • Defaults for added fields will modify this data
    • If an alpha field that is not defaulted graduates to adding a default in beta, an API server upgrade from alpha->beta will modify this data
    • // TestPodDefaults detects changes to defaults within PodSpec.
      // Defaulting changes within this type (*especially* within containers) can cause kubelets to restart containers on API server update.
      func TestPodDefaults(t *testing.T) {
  2. Kubelet decodes that data into its v1.Container struct
    • If the kubelet is older than the API server, this means it will drop fields added in newer API server versions
  3. Kubelet encodes that v1.Container struct to JSON and hashes that data
    • // HashContainer returns the hash of the container. It is used to compare
      // the running container with its desired spec.
      // Note: remember to update hashValues in container_hash_test.go as well.
      func HashContainer(container *v1.Container) uint64 {
          hash := fnv.New32a()
          // Omit nil or empty field when calculating hash value
          // Please see https://github.com/kubernetes/kubernetes/issues/53644
          containerJSON, _ := json.Marshal(container)
          hashutil.DeepHashObject(hash, containerJSON)
          return uint64(hash.Sum32())
      }

This mechanism is used by the kubelet in containerChanged and getStableKey, which means an API server upgrade that starts defaulting a container field (for example, when the field graduates from alpha) instantly modifies the computed hash and triggers a container restart. This results in a simultaneous restart of all containers across all pods of a ReplicaSet and can cause outages.

The kubelet should limit the fields it considers in the container hash to the ones it expects to be able to change in ways that would affect the running container (currently just image, I think). The pod uid is already included in getStableKey so a delete/recreate of the entire pod would already compute a different key.
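
For illustration, a minimal sketch (not the actual kubelet code) of a hash restricted to the fields the kubelet reacts to; the field choice here (name and image) and the helper name are hypothetical:

package kubeletsketch

import (
    "encoding/json"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// hashRestrictedContainerFields hashes only an explicit subset of the container
// spec, so new defaulted fields elsewhere in v1.Container cannot change the result.
func hashRestrictedContainerFields(container *v1.Container) uint64 {
    restricted := struct {
        Name  string `json:"name"`
        Image string `json:"image"`
    }{
        Name:  container.Name,
        Image: container.Image,
    }
    hash := fnv.New32a()
    data, _ := json.Marshal(restricted)
    hash.Write(data)
    return uint64(hash.Sum32())
}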

Changing this mechanism on a minor version boundary (to improve the interaction with the API server changes in step 1) is acceptable, because pods are expected to be drained as part of that upgrade (xref kubernetes/website#12326 (comment))


Original description:

The kubelet hashes the container spec and stores the hash with the container so that it can detect whether the container still meets the current spec. If the hashes do not match, the kubelet kills the container and creates a new one. The most notable side effect of this mechanism is that if a new field is added to the container spec in Kubernetes version 1.x, and the kubelet is upgraded from 1.x-1 to 1.x, all containers created by the 1.x-1 kubelet will be killed.
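
A compact sketch of that flow, assuming the hash is persisted as a runtime container label (the label key shown is illustrative) and recomputed at sync time:

package kubeletsketch

import (
    "strconv"

    v1 "k8s.io/api/core/v1"
)

// Illustrative label key; the kubelet stores the hash alongside the container it creates.
const containerHashLabel = "io.kubernetes.container.hash"

// labelsForNewContainer records the hash of the spec the container is created from.
func labelsForNewContainer(c *v1.Container, hashFn func(*v1.Container) uint64) map[string]string {
    return map[string]string{
        containerHashLabel: strconv.FormatUint(hashFn(c), 16),
    }
}

// specMatches recomputes the hash of the desired spec and compares it with the
// stored value; any mismatch makes the kubelet kill and recreate the container.
func specMatches(c *v1.Container, labels map[string]string, hashFn func(*v1.Container) uint64) bool {
    return labels[containerHashLabel] == strconv.FormatUint(hashFn(c), 16)
}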

This issue has come up quite a few times before (see #53644), and was discussed extensively in #23104. It was punted a few times because 1) in-place upgrade of the kubelet was not officially supported, and 2) upgrading docker used to require restarting all the containers too. Now that upgrades are managed by the installers (#6099 (comment)) and docker supports live-restore, we should consider fixing this in the kubelet.

There are already related discussions/suggestions in the #57741 PR. Creating this issue as a centralized place for discussions and suggestions.

/cc @kubernetes/sig-node-feature-requests
/cc a couple of folks who have expressed interests in this issue: @dixudx @kevin-wangzefeng @wackxu

@yujuhong added the sig/node and kind/feature labels on May 14, 2018
@lichuqiang
Contributor

lichuqiang commented May 15, 2018

As @lavalamp suggested in #23104, instead of hashing the whole container spec, can we maintain a list of container fields for each release that should be taken into account during hashing, and identify the hash values with a release prefix?
The field list might look like:

release1: {field1, field2...fieldN}
release2: {field1, field2...fieldN, fieldN+1}

Now imagine the kubelet has been upgraded to release2 (new container field fieldN+1 added): when walking through the pods, if it finds a container whose hash carries a "release1" prefix, it knows the container should be hashed with the release1 field set, and thus the resulting hash will not change.

And if the user updates a pod or its deployment, the pod is expected to be recreated, so the fields of the new release begin to take effect on its containers.

The only downside I can think of is that we might need to maintain the field lists manually (e.g., deprecating fields from old releases).
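
A rough sketch of the idea, with hypothetical field lists and helper names (the JSON keys match v1.Container; "newFieldN1" stands in for the field added in release2):

package kubeletsketch

import (
    "encoding/json"
    "fmt"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// Hand-maintained per-release field lists; keeping these up to date is the downside noted above.
var hashedFieldsByRelease = map[string][]string{
    "release1": {"name", "image"},
    "release2": {"name", "image", "newFieldN1"},
}

// versionedHash hashes only the fields listed for the given release and prefixes the
// result with that release, so an upgraded kubelet can still reproduce old hashes.
func versionedHash(release string, container *v1.Container) (string, error) {
    raw, err := json.Marshal(container)
    if err != nil {
        return "", err
    }
    var all map[string]interface{}
    if err := json.Unmarshal(raw, &all); err != nil {
        return "", err
    }
    subset := map[string]interface{}{}
    for _, field := range hashedFieldsByRelease[release] {
        if value, ok := all[field]; ok {
            subset[field] = value
        }
    }
    data, err := json.Marshal(subset) // map keys are marshaled in sorted order, so this is stable
    if err != nil {
        return "", err
    }
    h := fnv.New32a()
    h.Write(data)
    return fmt.Sprintf("%s-%08x", release, h.Sum32()), nil
}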

@dixudx
Member

dixudx commented May 15, 2018

can we maintain a list of container fields of different releases that should be taken into consideration during hash

That means we have to hard-code this. I don't like this approach.

As I mentioned in #57741, how about adding a new label, like containerSerializedSpecLabel, with backwards compatibility supported as well?

@kevin-wangzefeng
Member

As I mentioned in #57741, how about adding a new label, like containerSerializedSpecLabel, with backwards compatibility supported as well.

We don't have field-level versioning for the Kubernetes API today, so it won't help (the kubelet itself has no idea in which version a field was introduced or changed).

If we have to go with the field label approach, adding a version indicator might help:
For example

type Container struct {
    // Exported so the fields are serialized; the tag records the release that introduced each field.
    Foo string `containerSerializedSpecLabel:"since v1.9"`
    Bar string `containerSerializedSpecLabel:"since v1.10"`
}

Then, Foo will be included when checking a container hash like "v1.9-xxxxxxxx",
and both Foo and Bar will be included when checking a container hash like "v1.10-xxxxxxxx".

TBO, I don't see many benefits over having a container field list constant somewhere in pkg/kubelet (e.g., in pkg/kubelet/container/helpers.go).
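
For completeness, a rough sketch of how such a tag could be consumed via reflection; the release ordering table and function name are assumptions:

package kubeletsketch

import (
    "reflect"
    "strings"
)

// Known releases in order; fields introduced later than the target release are excluded.
var releaseIndex = map[string]int{"v1.9": 0, "v1.10": 1, "v1.11": 2}

// fieldsForRelease returns the struct fields whose `containerSerializedSpecLabel:"since vX.Y"`
// tag is at or before the given release; untagged fields are always included in this sketch.
func fieldsForRelease(spec interface{}, release string) []string {
    t := reflect.TypeOf(spec)
    if t.Kind() == reflect.Ptr {
        t = t.Elem()
    }
    var fields []string
    for i := 0; i < t.NumField(); i++ {
        tag := t.Field(i).Tag.Get("containerSerializedSpecLabel")
        since := strings.TrimPrefix(tag, "since ")
        if tag == "" || releaseIndex[since] <= releaseIndex[release] {
            fields = append(fields, t.Field(i).Name)
        }
    }
    return fields
}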

@bgrant0607
Member

Another problem with comparing hashes across versions is changes to the hash function.

Similar issues have come up in workload controllers, where they are used for collision avoidance only, for these types of reasons.

Persisted hashes shouldn't be used as the sole determination of equality, since there are multiple reasons why they may change across releases.

If a hash comparison fails, one needs to fall back to actually comparing properties that matter.
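
A sketch of that fallback, assuming the relevant properties of the running container can be recovered (types and names here are illustrative):

package kubeletsketch

import v1 "k8s.io/api/core/v1"

// observedContainer holds what was recorded for the running container; how it is
// recovered (runtime labels, checkpoints, status) is out of scope for this sketch.
type observedContainer struct {
    Hash  uint64
    Image string
}

// needsRestart treats the persisted hash only as a fast path: on a mismatch it falls
// back to comparing the properties that actually matter before killing anything.
func needsRestart(desired *v1.Container, observed observedContainer, hashFn func(*v1.Container) uint64) bool {
    if hashFn(desired) == observed.Hash {
        return false
    }
    // The hashes differ, but that could be a hash-function or defaulting change,
    // so compare the fields the kubelet acts on.
    return desired.Image != observed.Image
}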

cc @janetkuo @kow3ns

@lavalamp
Member

lavalamp commented May 29, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 27, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 26, 2018
@yujuhong added the lifecycle/frozen label and removed the lifecycle/rotten label on Sep 26, 2018
@wgliang
Contributor

wgliang commented Dec 24, 2018

/remove-lifecycle stale

@cuericlee
Contributor

In order to fix this issue, I suppose PR #57741 helps by omitting nil/empty fields from now on, even when new fields are added in the future. It will prevent containers from being restarted repeatedly. We need to make a decision to close this kind of issue, since it recurs on upgrades like 1.8.x -> 1.9.x, 1.11.x -> 1.12.x, and so on.
Omitting nil/empty fields does not help migrations immediately, but if it goes into 1.16.x from now on, it will help on future upgrades from 1.16.x -> 1.17.x and so on.
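
A small, runnable illustration of why omitting nil/empty fields helps across an upgrade (hypothetical structs, not the real API types); note that this only holds while the new field stays nil, which is exactly the defaulting limitation raised in the next comment:

package main

import (
    "encoding/json"
    "fmt"
)

type specOld struct {
    Image string `json:"image"`
}

type specNew struct {
    Image    string  `json:"image"`
    NewField *string `json:"newField,omitempty"` // added in a later release; nil for existing containers
}

func main() {
    oldJSON, _ := json.Marshal(specOld{Image: "nginx:1.25"})
    newJSON, _ := json.Marshal(specNew{Image: "nginx:1.25"}) // NewField is nil, so it is omitted
    fmt.Println(string(oldJSON) == string(newJSON))          // true: identical bytes, identical hash
}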

@Crazykev
Contributor

Is this a proper time to reconsider this issue and find a one-time fix?

We still struggle with container restarts when trying to upgrade our 1.15.x cluster; backporting #57741 won't help, since it would itself bring a hash change.
And #57741 can't deal with newly added fields that have default values (and maybe with deleting deprecated fields in the future). The nil defaults proposed in kubernetes/community#4571, combined with #57741, may fix this issue forever, but it seems that discussion moved to another approach (use generation or server-side apply), which doesn't apply in this situation.

The containerSerializedSpecLabel approach @liggitt @dixudx proposed in #57741 (comment) makes sense to me. Backwards compatibility can be achieved by serializing the current container spec and deserializing it into the legacy container spec that we previously stored. This would guarantee a smooth upgrade.

cc @thockin who may help with some input.

@liggitt
Member

liggitt commented Nov 9, 2020

Changing the kubelet mechanism on a minor version boundary is acceptable, because pods are expected to be drained on kubelet minor version upgrade.

@liggitt added the priority/important-soon label on Nov 9, 2020
@liggitt changed the title from "Re-evaluate how kubelet determines whether a container meets the given spec" to "kubelet's calculation of whether a container has changed can cause global outages" on Jan 14, 2021
@liggitt added the kind/bug and priority/critical-urgent labels and removed the kind/feature and priority/important-soon labels on Jan 14, 2021
@liggitt
Member

liggitt commented Jan 14, 2021

Updated the title and description to clarify the impact of the current logic and why it should be changed

xref discussion in #95587 (comment)

@liggitt
Member

liggitt commented Jan 14, 2021

my naive recommendation would be to limit the container hash to specific fields that are known to be immutable or are expected to be able to change (maybe just name and image?)

@lavalamp
Member

limit the container hash to specific fields that are known to be immutable or are expected to be able to change

✅ 💯

If kubelet doesn't know how to react to a field change, it must not be included in the hash.

In-place upgrades add some restrictions:

  • The hash must be versioned so that a future kubelet which knows how to react to additional field changes doesn't do an unnecessary restart. (it has to be able to reproduce the hashes of the old kubelet.)
  • A new hash version must not be used until there's no chance of rolling back to an old kubelet version which doesn't understand it.

(the above goes for controllers that perform this operation, too)

maybe just name and image

And resource limits, if those would require a restart.
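
A sketch of those two restrictions, with hypothetical scheme names and field sets:

package kubeletsketch

import (
    "encoding/json"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// storedHash is what the kubelet would persist with the container: the value plus
// the version of the scheme that produced it.
type storedHash struct {
    Scheme string
    Value  uint64
}

func hashOf(fields interface{}) uint64 {
    h := fnv.New32a()
    data, _ := json.Marshal(fields)
    h.Write(data)
    return uint64(h.Sum32())
}

// Hash schemes keyed by version. "v2" adds resources and must only become the default
// once rollback to a kubelet that only knows "v1" is no longer possible.
var hashSchemes = map[string]func(*v1.Container) uint64{
    "v1": func(c *v1.Container) uint64 {
        return hashOf(struct{ Name, Image string }{c.Name, c.Image})
    },
    "v2": func(c *v1.Container) uint64 {
        return hashOf(struct {
            Name, Image string
            Resources   v1.ResourceRequirements
        }{c.Name, c.Image, c.Resources})
    },
}

// changed recomputes the hash with the same scheme that produced the stored value, so a
// newer kubelet does not restart containers just because it learned a new scheme.
func changed(desired *v1.Container, stored storedHash) bool {
    hashFn, ok := hashSchemes[stored.Scheme]
    if !ok {
        return true // unknown scheme: conservatively treat the container as changed
    }
    return hashFn(desired) != stored.Value
}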

@liggitt
Member

liggitt commented Jan 14, 2021

In-place upgrades add some restrictions:

  • The hash must be versioned so that a future kubelet which knows how to react to additional field changes doesn't do an unnecessary restart. (it has to be able to reproduce the hashes of the old kubelet.)
  • A new hash version must not be used until there's no chance of rolling back to an old kubelet version which doesn't understand it.

Specifically for kubelets, if the change is made on a minor version boundary, I don't think we need those protections, given sig-node's position that in-place (minor version) kubelet upgrades without draining are not supported.

(the above goes for controllers that perform this operation, too)

Yes, controllers, which have to deal with persisted ControllerRevisions and annotations on API objects made by earlier versions of the controller, definitely have to consider the approach taken by those previous versions. I plan to spawn similar issues to track resolving this for controllers.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot added the needs-triage label and removed the triage/accepted label on Mar 3, 2023
@liggitt
Member

liggitt commented Mar 17, 2023

it looks like the in-place resource change expanded the surface area of this issue in f2bd94a by adding a second hash that is now tracked, HashWithoutResources

I had hoped handling mutable resources would prompt fixing this issue :-/

@liggitt
Member

liggitt commented Mar 17, 2023

cc @smarterclayton @vinaykul for visibility to the problems with detecting container changes using hashes of API content

@thockin
Member

thockin commented Mar 7, 2024

Revisiting this as we look at more and more Pod fields which cannot add default values. This forces us to treat nil as the default value, which means all code that consumes these fields has to know about this. It's frankly awful.

This is not unique to kubelet, it's just particularly acute.

We could convert this to a list of specific fields to hash, but that still feels brittle. Shouldn't the API server have a way to help with this? Either managedFields should have a way to say "this was defaulted", or we should have a way to request the object without defaulted fields, or we should have a way to know what the default values are and strip them out, or something.
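
For example, with no API-served default, every consumer of such a field ends up carrying a helper along these lines (illustrative):

package kubeletsketch

// With no default applied by the API server, an optional field arrives as a nil pointer,
// and every consumer has to re-implement the "nil means default" rule locally.
func effectiveValue(field *bool, defaultValue bool) bool {
    if field == nil {
        return defaultValue
    }
    return *field
}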

@jpbetz @deads2k

/sig api-machinery
/triage accepted

@k8s-ci-robot added the sig/api-machinery and triage/accepted labels and removed the needs-triage label on Mar 7, 2024
@liggitt
Member

liggitt commented Mar 7, 2024

We could convert this to a list of specific fields to hash, but that still feels brittle

The kubelet has to do work to set up containers correctly; it has to know which fields could change and how much it has to do to correctly handle the changes. Hashing the immutable fields that assert identity (name, namespace, uid, containerName) and the effective values of the fields whose mutations it handles seems correct to me.
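
A sketch of that shape; the field choice is illustrative, and "effective" resources means whatever the kubelet has actually applied (e.g. after an in-place resize):

package kubeletsketch

import (
    "encoding/json"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// identityAndHandledFields is the immutable identity of the container plus the effective
// values of the fields whose mutations the kubelet knows how to handle.
type identityAndHandledFields struct {
    Namespace     string
    PodName       string
    PodUID        string
    ContainerName string
    Image         string
    Resources     v1.ResourceRequirements // effective (applied) values, not necessarily the spec's
}

func hashIdentityAndHandledFields(f identityAndHandledFields) uint64 {
    h := fnv.New32a()
    data, _ := json.Marshal(f)
    h.Write(data)
    return uint64(h.Sum32())
}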

@thockin
Member

thockin commented Mar 7, 2024

Sure, but it's just another place that needs to be touched if any other field becomes mutable, which could be forgotten. I'm not saying not to fix this with urgency, just that I want to consider paths towards less open-coded lists of things to hash.

@HirazawaUi
Contributor

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue.
/close

@k8s-ci-robot
Contributor

@HirazawaUi: Closing this issue.

In response to this:

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vinaykul
Member

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue. /close

@HirazawaUi If you have bandwidth, please run a manual upgrade test if this is not covered in CI somewhere. This change is important but also feels a bit risky.

@HirazawaUi
Contributor

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue. /close

@HirazawaUi If you have bandwidth, please run a manual upgrade test if this is not covered in CI somewhere. This change is important but also feels a bit risky.

I'll try to do this in my spare time :)
