From 1.11, an extended resource reported by a device plugin can be left on a node after a node upgrade even though its device plugin never re-registers #64632
Description
Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug
What happened:
In 1.11, #61877 removed externalID from Node.Spec, which the kubelet used to detect node recreation in certain environments during node upgrade. This changed node upgrade behavior in those environments: in tryRegisterWithAPIServer(node *v1.Node) in pkg/kubelet/kubelet_node_status.go, without the externalID check, we now treat the recreated node after an upgrade as a previously registered node, and thus take the old node state from the API server instead of regenerating it from node-local state.
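For context, here is a minimal Go sketch of the flow described above. It is not the actual kubelet source: `reconcileNode` is a hypothetical stand-in for the kubelet's reconciliation helpers (the real logic lives in pkg/kubelet/kubelet_node_status.go), and the Create/Get calls use the 1.11-era client-go signatures.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// tryRegisterWithAPIServer, simplified: create the Node; if it already
// exists, adopt it as a previously registered node.
func tryRegisterWithAPIServer(client clientset.Interface, node *v1.Node) bool {
	_, err := client.CoreV1().Nodes().Create(node)
	if err == nil {
		return true // genuinely new node; status is built from local state
	}
	if !apierrors.IsAlreadyExists(err) {
		return false
	}
	existing, err := client.CoreV1().Nodes().Get(node.Name, metav1.GetOptions{})
	if err != nil {
		return false
	}
	// Before #61877, the kubelet compared existing.Spec.ExternalID against
	// the locally known externalID here and deleted/recreated the Node on a
	// mismatch. Without that check, a recreated node falls through to
	// reconciliation and inherits the old status stored in the API server.
	return reconcileNode(existing, node)
}

// reconcileNode is a hypothetical stand-in; the point is that the existing
// object's status (including stale extended resources) survives until the
// next status sync.
func reconcileNode(existing, local *v1.Node) bool {
	return existing != nil && local != nil
}
```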
For an extended resource reported by a device plugin, this causes a problem: after node upgrade/recreation, the device plugin DaemonSet pod needs to be restarted by the kubelet, finish certain setup, and then report to the kubelet that its resource is available on the node. However, the node status capacity and allocatable from the API server still contain the old state from before the upgrade. After the kubelet syncs with the API server, the node status fields get overwritten, and previously reported extended resources reappear in the node status capacity/allocatable even though the node is not ready for pods to consume them. As a result, the kubelet may start a pod requesting such a resource without the container runtime setup it needs to learn from the device plugin.
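To make the failure mode concrete, here is a hedged, self-contained Go illustration. The resource name `example.com/gpu` and the variable names are illustrative only, not the kubelet's actual data flow:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	gpu := v1.ResourceName("example.com/gpu")

	// Status stored on the API server from before the upgrade.
	apiServerCapacity := v1.ResourceList{
		v1.ResourceCPU: resource.MustParse("4"),
		gpu:            resource.MustParse("2"),
	}

	// Local state right after the kubelet restarts: the device plugin has
	// not re-registered yet, so no extended resources are known locally.
	localCapacity := v1.ResourceList{
		v1.ResourceCPU: resource.MustParse("4"),
	}

	// The bug: the kubelet adopts the API server copy instead of
	// regenerating capacity from local state.
	adopted := apiServerCapacity

	q := adopted[gpu]
	fmt.Println(q.String()) // "2", even though nothing on the node backs it
	_, known := localCapacity[gpu]
	fmt.Println(known) // false: the device plugin has not reported yet
}
```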
Note that to cope with the externalID removal, cluster/gce/upgrade.sh was updated (#63506) to explicitly uncordon the node after checking that it has restarted and become ready. This doesn't help the device plugin case, however, because node readiness doesn't depend on the readiness of individual extended resources.
This issue is to explore how we can fix this problem at head and in 1.11.
In particular, I wonder whether we should switch to a model in which the kubelet is the only source of truth for extended resource capacity/allocatable in node status, i.e., stop supporting manual updates to the node status capacity/allocatable fields as documented in https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/. This would allow us to regenerate node status capacity/allocatable on every node status update, which I think is much simpler and more robust. As I've heard, some folks have been using the manual mechanism to do simple resource exporting/accounting through a central controller; I wonder whether people are open to switching to the device plugin model in those cases. Even though those extended resources may not require special container setup, it seems a more secure model to have the kubelet fully own its node status.
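A minimal sketch of what that could look like on the node status update path, assuming a hypothetical `regenerateExtendedResources` helper (`isExtendedResource` is deliberately simplified; the real predicate lives in the core/v1 helper package, and the `active` map stands in for the device plugin manager's view of healthy resources):

```go
package sketch

import (
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// isExtendedResource is a simplified predicate for this sketch: any
// namespaced resource name that is not kubernetes.io-prefixed.
func isExtendedResource(name v1.ResourceName) bool {
	return strings.Contains(string(name), "/") &&
		!strings.HasPrefix(string(name), "kubernetes.io/")
}

// regenerateExtendedResources drops every extended resource from the node
// status and re-adds only those that an active device plugin currently
// reports, so stale pre-upgrade entries cannot survive a kubelet restart.
func regenerateExtendedResources(node *v1.Node, active map[v1.ResourceName]resource.Quantity) {
	if node.Status.Capacity == nil {
		node.Status.Capacity = v1.ResourceList{}
	}
	if node.Status.Allocatable == nil {
		node.Status.Allocatable = v1.ResourceList{}
	}
	for name := range node.Status.Capacity {
		if isExtendedResource(name) {
			delete(node.Status.Capacity, name)
			delete(node.Status.Allocatable, name)
		}
	}
	for name, qty := range active {
		node.Status.Capacity[name] = qty
		node.Status.Allocatable[name] = qty
	}
}
```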
FYI, here is the related OSS issue: #50473, where we first introduced the extended resource concept.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Running the GPU upgrade test from #63631 with --upgrade-target set to any 1.11+ version will hit the issue.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others: