kubelet's calculation of whether a container has changed can cause cluster-wide outages #63814

Closed
yujuhong opened this issue May 14, 2018 · 41 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@yujuhong
Contributor

yujuhong commented May 14, 2018

Updated 2020-11-09 by @liggitt:

In 1.16+, the kubelet hashes the serialized container spec:

  1. API server sends data
    • Defaults for added fields will modify this data
    • If an alpha field that is not defaulted graduates to adding a default in beta, an API server upgrade from alpha->beta will modify this data
    • // TestPodDefaults detects changes to defaults within PodSpec.
      // Defaulting changes within this type (*especially* within containers) can cause kubelets to restart containers on API server update.
      func TestPodDefaults(t *testing.T) {
  2. Kubelet decodes that data into its v1.Container struct
    • If the kubelet is older than the API server, this means it will drop fields added in newer API server versions
  3. Kubelet encodes that v1.Container struct to JSON and hashes that data
    • // HashContainer returns the hash of the container. It is used to compare
      // the running container with its desired spec.
      // Note: remember to update hashValues in container_hash_test.go as well.
      func HashContainer(container *v1.Container) uint64 {
          hash := fnv.New32a()
          // Omit nil or empty field when calculating hash value
          // Please see https://github.com/kubernetes/kubernetes/issues/53644
          containerJSON, _ := json.Marshal(container)
          hashutil.DeepHashObject(hash, containerJSON)
          return uint64(hash.Sum32())
      }

This mechanism is used by the kubelet in containerChanged and getStableKey, which means an API server upgrade that starts defaulting a container field (for example, when the field graduates from alpha) instantly modifies the computed hash and triggers a container restart. This results in a simultaneous restart of all containers across all pods of a ReplicaSet and can cause outages.

The kubelet should limit the fields it considers in the container hash to the ones it expects to be able to change in ways that would affect the running container (currently just image, I think). The pod uid is already included in getStableKey so a delete/recreate of the entire pod would already compute a different key.
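
For illustration, a minimal sketch (not the actual kubelet code) of a hash restricted to the fields the kubelet reacts to; the field choice here (name and image) and the helper name are hypothetical:

package kubeletsketch

import (
    "encoding/json"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// hashRestrictedContainerFields hashes only an explicit subset of the container
// spec, so new defaulted fields elsewhere in v1.Container cannot change the result.
func hashRestrictedContainerFields(container *v1.Container) uint64 {
    restricted := struct {
        Name  string `json:"name"`
        Image string `json:"image"`
    }{
        Name:  container.Name,
        Image: container.Image,
    }
    hash := fnv.New32a()
    data, _ := json.Marshal(restricted)
    hash.Write(data)
    return uint64(hash.Sum32())
}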

Changing this mechanism on a minor version boundary (to improve the interaction with the API server changes in step 1) is acceptable, because pods are expected to be drained as part of that upgrade (xref kubernetes/website#12326 (comment))


Original description:

The kubelet hashes the container spec and stores the hash with the container so that it can detect whether the container still meets the current spec. If the hashes do not match, the kubelet kills the container and creates a new one. The most notable side effect of this mechanism is that if a new field is added to the container spec in Kubernetes version 1.x, and the kubelet is upgraded from 1.x-1 to 1.x, all containers created by the 1.x-1 kubelet will be killed.
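
A compact sketch of that flow, assuming the hash is persisted as a runtime container label (the label key shown is illustrative) and recomputed at sync time:

package kubeletsketch

import (
    "strconv"

    v1 "k8s.io/api/core/v1"
)

// Illustrative label key; the kubelet stores the hash alongside the container it creates.
const containerHashLabel = "io.kubernetes.container.hash"

// labelsForNewContainer records the hash of the spec the container is created from.
func labelsForNewContainer(c *v1.Container, hashFn func(*v1.Container) uint64) map[string]string {
    return map[string]string{
        containerHashLabel: strconv.FormatUint(hashFn(c), 16),
    }
}

// specMatches recomputes the hash of the desired spec and compares it with the
// stored value; any mismatch makes the kubelet kill and recreate the container.
func specMatches(c *v1.Container, labels map[string]string, hashFn func(*v1.Container) uint64) bool {
    return labels[containerHashLabel] == strconv.FormatUint(hashFn(c), 16)
}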

This issue has come up quite a few times before (see #53644), and was discussed extensively in #23104. It was punted a few times because 1) in-place upgrade of the kubelet was not officially supported, and 2) upgrading docker used to require restarting all the containers too. Now that upgrades are managed by the installers (#6099 (comment)) and docker supports live-restore, we should consider fixing this in the kubelet.

There are already related discussions/suggestions in the #57741 PR. Creating this issue as a centralized place for discussions and suggestions.

/cc @kubernetes/sig-node-feature-requests
/cc a couple of folks who have expressed interests in this issue: @dixudx @kevin-wangzefeng @wackxu

@yujuhong added the sig/node and kind/feature labels on May 14, 2018
@lichuqiang
Contributor

lichuqiang commented May 15, 2018

As @lavalamp suggested in #23104, instead of hashing the whole container spec, can we maintain a list of container fields for each release that should be taken into account during hashing, and identify the hash values with a release prefix?
The field list might look like:

release1: {field1, field2...fieldN}
release2: {field1, field2...fieldN, fieldN+1}

Now imagine the kubelet has been upgraded to release2 (new container field fieldN+1 added): when walking through the pods, if it finds a container whose hash carries a "release1" prefix, it knows the container should be hashed with the release1 field set, and thus the resulting hash will not change.

And if the user updates a pod or its deployment, the pod is expected to be recreated, so the fields of the new release begin to take effect on its containers.

The only downside I can think of is that we might need to maintain the field lists manually (e.g., deprecating fields from old releases).
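
A rough sketch of the idea, with hypothetical field lists and helper names (the JSON keys match v1.Container; "newFieldN1" stands in for the field added in release2):

package kubeletsketch

import (
    "encoding/json"
    "fmt"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// Hand-maintained per-release field lists; keeping these up to date is the downside noted above.
var hashedFieldsByRelease = map[string][]string{
    "release1": {"name", "image"},
    "release2": {"name", "image", "newFieldN1"},
}

// versionedHash hashes only the fields listed for the given release and prefixes the
// result with that release, so an upgraded kubelet can still reproduce old hashes.
func versionedHash(release string, container *v1.Container) (string, error) {
    raw, err := json.Marshal(container)
    if err != nil {
        return "", err
    }
    var all map[string]interface{}
    if err := json.Unmarshal(raw, &all); err != nil {
        return "", err
    }
    subset := map[string]interface{}{}
    for _, field := range hashedFieldsByRelease[release] {
        if value, ok := all[field]; ok {
            subset[field] = value
        }
    }
    data, err := json.Marshal(subset) // map keys are marshaled in sorted order, so this is stable
    if err != nil {
        return "", err
    }
    h := fnv.New32a()
    h.Write(data)
    return fmt.Sprintf("%s-%08x", release, h.Sum32()), nil
}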

@dixudx
Member

dixudx commented May 15, 2018

can we maintain a list of container fields of different releases that should be taken into consideration during hash

That means we have to hard-code this. I don't like this approach.

As I mentioned in #57741, how about adding a new label, like containerSerializedSpecLabel, with backwards compatibility supported as well?

@kevin-wangzefeng
Member

As I mentioned in #57741, how about adding a new label, like containerSerializedSpecLabel, with backwards compatibility supported as well.

We don't have field-level versioning for the Kubernetes API today, so it won't help (the kubelet itself has no idea in which version a field was introduced or changed).

If we have to go with the field label approach, adding a version indicator might help:
For example

type Container struct {
    // Exported so the fields are serialized; the tag records the release that introduced each field.
    Foo string `containerSerializedSpecLabel:"since v1.9"`
    Bar string `containerSerializedSpecLabel:"since v1.10"`
}

Then, Foo will be included when checking a container hash like "v1.9-xxxxxxxx",
and both Foo and Bar will be included when checking a container hash like "v1.10-xxxxxxxx".

TBO, I don't see many benefits over having a container field list constant somewhere in pkg/kubelet (e.g., in pkg/kubelet/container/helpers.go).
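
For completeness, a rough sketch of how such a tag could be consumed via reflection; the release ordering table and function name are assumptions:

package kubeletsketch

import (
    "reflect"
    "strings"
)

// Known releases in order; fields introduced later than the target release are excluded.
var releaseIndex = map[string]int{"v1.9": 0, "v1.10": 1, "v1.11": 2}

// fieldsForRelease returns the struct fields whose `containerSerializedSpecLabel:"since vX.Y"`
// tag is at or before the given release; untagged fields are always included in this sketch.
func fieldsForRelease(spec interface{}, release string) []string {
    t := reflect.TypeOf(spec)
    if t.Kind() == reflect.Ptr {
        t = t.Elem()
    }
    var fields []string
    for i := 0; i < t.NumField(); i++ {
        tag := t.Field(i).Tag.Get("containerSerializedSpecLabel")
        since := strings.TrimPrefix(tag, "since ")
        if tag == "" || releaseIndex[since] <= releaseIndex[release] {
            fields = append(fields, t.Field(i).Name)
        }
    }
    return fields
}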

@bgrant0607
Member

Another problem with comparing hashes across versions is changes to the hash function.

Similar issues have come up in workload controllers, where they are used for collision avoidance only, for these types of reasons.

Persisted hashes shouldn't be used as the sole determination of equality, since there are multiple reasons why they may change across releases.

If a hash comparison fails, one needs to fall back to actually comparing properties that matter.
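
A sketch of that fallback, assuming the relevant properties of the running container can be recovered (types and names here are illustrative):

package kubeletsketch

import v1 "k8s.io/api/core/v1"

// observedContainer holds what was recorded for the running container; how it is
// recovered (runtime labels, checkpoints, status) is out of scope for this sketch.
type observedContainer struct {
    Hash  uint64
    Image string
}

// needsRestart treats the persisted hash only as a fast path: on a mismatch it falls
// back to comparing the properties that actually matter before killing anything.
func needsRestart(desired *v1.Container, observed observedContainer, hashFn func(*v1.Container) uint64) bool {
    if hashFn(desired) == observed.Hash {
        return false
    }
    // The hashes differ, but that could be a hash-function or defaulting change,
    // so compare the fields the kubelet acts on.
    return desired.Image != observed.Image
}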

cc @janetkuo @kow3ns

@lavalamp
Member

lavalamp commented May 29, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 27, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 26, 2018
@yujuhong added the lifecycle/frozen label and removed the lifecycle/rotten label on Sep 26, 2018
@wgliang
Contributor

wgliang commented Dec 24, 2018

/remove-lifecycle stale

@cuericlee
Contributor

In order to fix this issue, I suppose PR #57741 helps by omitting nil/empty fields from now on, even when new fields are added in the future. It will prevent containers from being restarted repeatedly. We need to make a decision to close this kind of issue, since it recurs on upgrades like 1.8.x -> 1.9.x, 1.11.x -> 1.12.x, and so on.
Omitting nil/empty fields does not help migrations immediately, but if it goes into 1.16.x from now on, it will help on future upgrades from 1.16.x -> 1.17.x and so on.
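
A small, runnable illustration of why omitting nil/empty fields helps across an upgrade (hypothetical structs, not the real API types); note that this only holds while the new field stays nil, which is exactly the defaulting limitation raised in the next comment:

package main

import (
    "encoding/json"
    "fmt"
)

type specOld struct {
    Image string `json:"image"`
}

type specNew struct {
    Image    string  `json:"image"`
    NewField *string `json:"newField,omitempty"` // added in a later release; nil for existing containers
}

func main() {
    oldJSON, _ := json.Marshal(specOld{Image: "nginx:1.25"})
    newJSON, _ := json.Marshal(specNew{Image: "nginx:1.25"}) // NewField is nil, so it is omitted
    fmt.Println(string(oldJSON) == string(newJSON))          // true: identical bytes, identical hash
}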

@Crazykev
Contributor

Is this a proper time to reconsider this issue and find a one-time fix?

We still struggle with container restarts when trying to upgrade our 1.15.x cluster; backporting #57741 won't help, since it would itself bring a hash change.
And #57741 can't deal with newly added fields that have default values (and maybe with deleting deprecated fields in the future). The nil defaults proposed in kubernetes/community#4571, combined with #57741, may fix this issue forever, but it seems that discussion moved to another approach (use generation or server-side apply), which doesn't apply in this situation.

The containerSerializedSpecLabel approach @liggitt @dixudx proposed in #57741 (comment) makes sense to me. Backwards compatibility can be achieved by serializing the current container spec and deserializing it into the legacy container spec that we previously stored. This would guarantee a smooth upgrade.

cc @thockin who may help with some input.

@liggitt
Member

liggitt commented Nov 9, 2020

Changing the kubelet mechanism on a minor version boundary is acceptable, because pods are expected to be drained on kubelet minor version upgrade.

@liggitt added the priority/important-soon label on Nov 9, 2020
@liggitt changed the title from "Re-evaluate how kubelet determines whether a container meets the given spec" to "kubelet's calculation of whether a container has changed can cause global outages" on Jan 14, 2021
@liggitt added the kind/bug and priority/critical-urgent labels and removed the kind/feature and priority/important-soon labels on Jan 14, 2021
@liggitt
Member

liggitt commented Jan 14, 2021

Updated the title and description to clarify the impact of the current logic and why it should be changed

xref discussion in #95587 (comment)

@liggitt
Member

liggitt commented Jan 14, 2021

my naive recommendation would be to limit the container hash to specific fields that are known to be immutable or are expected to be able to change (maybe just name and image?)

@lavalamp
Member

limit the container hash to specific fields that are known to be immutable or are expected to be able to change

✅ 💯

If kubelet doesn't know how to react to a field change, it must not be included in the hash.

In-place upgrades add some restrictions:

  • The hash must be versioned so that a future kubelet which knows how to react to additional field changes doesn't do an unnecessary restart. (it has to be able to reproduce the hashes of the old kubelet.)
  • A new hash version must not be used until there's no chance of rolling back to an old kubelet version which doesn't understand it.

(the above goes for controllers that perform this operation, too)

maybe just name and image

And resource limits, if those would require a restart.
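
A sketch of those two restrictions, with hypothetical scheme names and field sets:

package kubeletsketch

import (
    "encoding/json"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// storedHash is what the kubelet would persist with the container: the value plus
// the version of the scheme that produced it.
type storedHash struct {
    Scheme string
    Value  uint64
}

func hashOf(fields interface{}) uint64 {
    h := fnv.New32a()
    data, _ := json.Marshal(fields)
    h.Write(data)
    return uint64(h.Sum32())
}

// Hash schemes keyed by version. "v2" adds resources and must only become the default
// once rollback to a kubelet that only knows "v1" is no longer possible.
var hashSchemes = map[string]func(*v1.Container) uint64{
    "v1": func(c *v1.Container) uint64 {
        return hashOf(struct{ Name, Image string }{c.Name, c.Image})
    },
    "v2": func(c *v1.Container) uint64 {
        return hashOf(struct {
            Name, Image string
            Resources   v1.ResourceRequirements
        }{c.Name, c.Image, c.Resources})
    },
}

// changed recomputes the hash with the same scheme that produced the stored value, so a
// newer kubelet does not restart containers just because it learned a new scheme.
func changed(desired *v1.Container, stored storedHash) bool {
    hashFn, ok := hashSchemes[stored.Scheme]
    if !ok {
        return true // unknown scheme: conservatively treat the container as changed
    }
    return hashFn(desired) != stored.Value
}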

@liggitt
Member

liggitt commented Jan 14, 2021

In-place upgrades add some restrictions:

  • The hash must be versioned so that a future kubelet which knows how to react to additional field changes doesn't do an unnecessary restart. (it has to be able to reproduce the hashes of the old kubelet.)
  • A new hash version must not be used until there's no chance of rolling back to an old kubelet version which doesn't understand it.

Specifically for kubelets, if the change is made on a minor version boundary, I don't think we need those protections, given sig-node's position that in-place (minor version) kubelet upgrades without draining are not supported.

(the above goes for controllers that perform this operation, too)

Yes, controllers, which have to deal with persisted ControllerRevisions and annotations on API objects made by earlier versions of the controller, definitely have to consider the approach taken by those previous versions. I plan to spawn similar issues to track resolving this for controllers.

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot added the needs-triage label and removed the triage/accepted label on Mar 3, 2023
@liggitt
Member

liggitt commented Mar 17, 2023

it looks like the in-place resource change expanded the surface area of this issue in f2bd94a by adding a second hash that is now tracked, HashWithoutResources

I had hoped handling mutable resources would prompt fixing this issue :-/

@liggitt
Member

liggitt commented Mar 17, 2023

cc @smarterclayton @vinaykul for visibility to the problems with detecting container changes using hashes of API content

@thockin
Member

thockin commented Mar 7, 2024

Revisiting this as we look at more and more Pod fields which cannot add default values. This forces us to treat nil as the default value, which means all code that consumes these fields has to know about this. It's frankly awful.

This is not unique to kubelet, it's just particularly acute.

We could convert this to a list of specific fields to hash, but that still feels brittle. Shouldn't the API server have a way to help with this? Either managedFields should have a way to say "this was defaulted", or we should have a way to request the object without defaulted fields, or we should have a way to know what the default values are and strip them out, or something.
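
For example, with no API-served default, every consumer of such a field ends up carrying a helper along these lines (illustrative):

package kubeletsketch

// With no default applied by the API server, an optional field arrives as a nil pointer,
// and every consumer has to re-implement the "nil means default" rule locally.
func effectiveValue(field *bool, defaultValue bool) bool {
    if field == nil {
        return defaultValue
    }
    return *field
}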

@jpbetz @deads2k

/sig api-machinery
/triage accepted

@k8s-ci-robot added the sig/api-machinery and triage/accepted labels and removed the needs-triage label on Mar 7, 2024
@liggitt
Member

liggitt commented Mar 7, 2024

We could convert this to a list of specific fields to hash, but that still feels brittle

The kubelet has to do work to set up containers correctly; it has to know which fields could change and how much it has to do to correctly handle the changes. Hashing the immutable fields that assert identity (name, namespace, uid, containerName) and the effective values of the fields whose mutations it handles seems correct to me.
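
A sketch of that shape; the field choice is illustrative, and "effective" resources means whatever the kubelet has actually applied (e.g. after an in-place resize):

package kubeletsketch

import (
    "encoding/json"
    "hash/fnv"

    v1 "k8s.io/api/core/v1"
)

// identityAndHandledFields is the immutable identity of the container plus the effective
// values of the fields whose mutations the kubelet knows how to handle.
type identityAndHandledFields struct {
    Namespace     string
    PodName       string
    PodUID        string
    ContainerName string
    Image         string
    Resources     v1.ResourceRequirements // effective (applied) values, not necessarily the spec's
}

func hashIdentityAndHandledFields(f identityAndHandledFields) uint64 {
    h := fnv.New32a()
    data, _ := json.Marshal(f)
    h.Write(data)
    return uint64(h.Sum32())
}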

@thockin
Member

thockin commented Mar 7, 2024

Sure, but it's just another place that needs to be touched if any other field becomes mutable, which could be forgotten. I'm not saying not to fix this with urgency, just that I want to consider paths towards less open-coded lists of things to hash.

@HirazawaUi
Contributor

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue.
/close

@k8s-ci-robot
Contributor

@HirazawaUi: Closing this issue.

In response to this:

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vinaykul
Member

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue. /close

@HirazawaUi If you have bandwidth, please run a manual upgrade test if this is not covered in CI somewhere. This change is important but also feels a bit risky.

@HirazawaUi
Contributor

After PR #124220 is merged, I think this issue should be closed. If there are still any problems, we can reopen this issue. /close

@HirazawaUi If you have bandwidth, please run a manual upgrade test if this is not covered in CI somewhere. This change is important but also feels a bit risky.

I'll try to do this in my spare time :)
