
From 1.11, an extended resource reported by a device plugin can be left on a node after node upgrade even though its device plugin never registers back #64632

Closed
@jiayingz

Description

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug
What happened:

In 1.11, #61877 removed externalID from Node.Spec. Kubelet previously used this field to detect node recreation in certain environments during node upgrade. This changed node upgrade behavior in those environments: in tryRegisterWithAPIServer(node *v1.Node) in pkg/kubelet/kubelet_node_status.go, without the externalID check, kubelet now treats a node recreated by an upgrade as a previously registered node, and therefore fetches the old node state from the API server instead of regenerating it from local node state.
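The removed check can be sketched roughly as follows. This is a simplified illustration, not the actual kubelet code; isNodeRecreated is a hypothetical helper standing in for the externalID comparison that tryRegisterWithAPIServer used to perform:

```go
package main

import "fmt"

// isNodeRecreated is a hypothetical helper illustrating the check that
// #61877 removed: before 1.11, kubelet compared the externalID on the
// existing Node object with the locally generated one and, on mismatch,
// deleted the stale Node object and registered the node from scratch.
func isNodeRecreated(existingExternalID, localExternalID string) bool {
	return existingExternalID != localExternalID
}

func main() {
	// After an upgrade that recreates the VM, the cloud provider reports a
	// new externalID, so the stale Node object would be replaced.
	fmt.Println(isNodeRecreated("instance-123", "instance-456")) // true

	// Without the check (1.11+), the recreated node is always treated as a
	// previously registered node, so stale status is inherited.
	fmt.Println(isNodeRecreated("instance-123", "instance-123")) // false
}
```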

For an extended resource reported by a device plugin, this causes a problem: after node upgrade/recreation, the device plugin DaemonSet pod needs to be restarted by kubelet, finish its setup, and then report to kubelet that its resource is available on the node. However, the node status capacity and allocatable fetched from the API server still contain the old, pre-upgrade state. After kubelet syncs with the API server, the node status fields are overwritten, and previously reported extended resources appear in node status capacity/allocatable even though the node is not yet ready for pods to consume those resources. As a result, kubelet may start a pod requesting such a resource without the container runtime setup it needs to learn from the device plugin.

Note that, to cope with the externalID removal, cluster/gce/upgrade.sh has been updated (#63506) to explicitly uncordon the node after checking that it has restarted and become ready. This doesn't help the device plugin case, however, because node readiness doesn't depend on the readiness of individual extended resources.

This issue is to explore how we can fix this problem on head and 1.11.

In particular, I wonder whether we should switch to a model where kubelet is the only source of truth for extended resource capacity/allocatable in node status, i.e., stop supporting manual updates to node status capacity/allocatable fields as documented in https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/. This would allow us to regenerate node status capacity/allocatable on every node status update, which I think is much simpler and more robust. As I have heard, some folks use the manual mechanism to do simple resource exporting/accounting through a central controller; I wonder whether people are open to switching to the device plugin model in those cases. Even though those extended resources may not require special container setup, having kubelet fully own its node status seems a more secure model.
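The proposed model could look roughly like the sketch below. This is an illustration of the idea only, under the assumption that kubelet tracks device-plugin-reported resources locally; regenerateExtendedResources is a hypothetical function, not an existing kubelet API:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// regenerateExtendedResources sketches the "kubelet owns extended resources"
// model: on every node status update, any extended resource present in the
// apiserver copy but not currently reported by a registered device plugin is
// dropped, instead of being inherited as stale pre-upgrade state.
func regenerateExtendedResources(apiServerCapacity, pluginReported map[string]int64) map[string]int64 {
	out := map[string]int64{}
	for name, qty := range apiServerCapacity {
		// Extended resource names are domain-qualified (e.g. nvidia.com/gpu);
		// standard resources (cpu, memory, ...) are kept as-is.
		if !strings.Contains(name, "/") {
			out[name] = qty
		}
		// Extended resources only survive via pluginReported below.
	}
	for name, qty := range pluginReported {
		out[name] = qty
	}
	return out
}

func main() {
	stale := map[string]int64{"cpu": 4, "nvidia.com/gpu": 2} // pre-upgrade state from the apiserver
	reported := map[string]int64{}                           // plugin has not re-registered after upgrade
	fresh := regenerateExtendedResources(stale, reported)

	keys := make([]string, 0, len(fresh))
	for k := range fresh {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys) // [cpu] -- stale GPU capacity is dropped until the plugin re-registers
}
```

With this scheme, the stale `nvidia.com/gpu` capacity disappears from node status until the device plugin re-registers and reports it again, so the scheduler cannot place a GPU pod before the runtime setup is ready.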

FYI, here is the related OSS issue where we first introduced the extended resource concept: #50473.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
Running the GPU upgrade test from #63631 with --upgrade-target set to any 1.11+ version reproduces the issue.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Metadata


Labels

kind/bug: Categorizes issue or PR as related to a bug.
sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
