kubelet could reject pods with NodeAffinity error due to incomplete informer cache on the node object #129463
Description
What happened?
On our end (AKS), we observed a case where kubelet can reject pods with NodeAffinity status if the node watch call gets closed for an unknown reason during kubelet startup.
An illustrative timeline (using Tx to indicate time):
- T1: kubelet starts and spawns, in parallel, the process for registering the node (creating the node object) and the node informer for syncing the node object
- T2: the node informer successfully lists an empty node list from kube-apiserver and proceeds to the watch call:
kubernetes/staging/src/k8s.io/client-go/tools/cache/reflector.go
Lines 401 to 409 in 6746df7
- T3: the watch call stops due to an unknown issue, for instance the network connection being closed. In this case, the watch call exits with an error like:
very short watch: k8s.io/client-go/informers/factory.go:160: Unexpected watch close - watch lasted less than a second and no items received
The node informer then backs off and retries with a new list-watch cycle:
kubernetes/staging/src/k8s.io/client-go/tools/cache/reflector.go
Lines 336 to 340 in 6746df7
- T4: kubelet successfully creates the node object on the kube-apiserver side. kube-scheduler kicks in and assigns pods to the node
- T5: because the list call at T2 succeeded with empty results, the node informer has already been marked as synced:
kubernetes/pkg/kubelet/kubelet.go
Lines 442 to 444 in 6746df7
kubernetes/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go
Lines 308 to 316 in 6746df7
Kubelet therefore starts syncing pods and attempts to run them. If these pods carry node affinity settings, the kubelet node admission check fails them with "NodeAffinity" status because the node informer cache is empty: kubelet falls back to a minimal node object, without the expected node labels, when performing the admission check:
kubernetes/pkg/kubelet/kubelet_getters.go
Lines 291 to 298 in 6746df7
- T6: the node informer retry kicks in and this time successfully syncs the node object from the apiserver. Retried pods scheduled onto this node now pass admission, since the expected node labels are present (a minimal sketch of this race follows below)
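To make the race concrete, here is a minimal, self-contained sketch of the same sequence using a fake clientset. The watch reactor, node name, and labels are illustrative assumptions, not kubelet code; the point is that `HasSynced` flips to true after the empty list even though the watch dies immediately, so the lister stays empty while the node is created directly through the client:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// T1/T2: no nodes exist yet when the informer performs its first LIST.
	client := fake.NewSimpleClientset()

	// T3: every watch the reflector opens on nodes is closed immediately,
	// reproducing the "very short watch" condition.
	client.PrependWatchReactor("nodes", func(action k8stesting.Action) (bool, watch.Interface, error) {
		w := watch.NewFake()
		w.Stop()
		return true, w, nil
	})

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes()
	nodeLister := nodeInformer.Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// HasSynced becomes true after the empty LIST, despite the dead watch.
	cache.WaitForCacheSync(stop, nodeInformer.Informer().HasSynced)

	// T4: the node is registered directly against the (fake) apiserver,
	// bypassing the informer cache, similar to kubelet's node registration.
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:   "node-1",
		Labels: map[string]string{"topology.kubernetes.io/zone": "zone-a"},
	}}
	if _, err := client.CoreV1().Nodes().Create(context.TODO(), node, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// T5: the informer-backed lister is still empty, so an admission check
	// based on it would not see the node labels and NodeAffinity would fail.
	if _, err := nodeLister.Get("node-1"); err != nil {
		fmt.Println("lister is stale right after registration:", err)
	}

	// T6 only happens once the reflector's back-off expires and a fresh LIST
	// repopulates the cache (timing is not deterministic in this sketch).
}
```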
In our reproduction, we captured the following logs:
T1: kubelet[3538]: I1225 01:02:59.216231 3538 kubelet.go:2346] "Starting kubelet main sync loop"
T2: kubelet[3538]: I1225 01:04:04.030571 3538 reflector.go:359] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:160
T3: kubelet[3538]: W1225 01:04:04.170608 3538 reflector.go:470] k8s.io/client-go/informers/factory.go:160: watch of *v1.Node ended with: very short watch: k8s.io/client-go/informers/factory.go:160: Unexpected watch close - watch lasted less than a second and no items received
T4: kubelet[3538]: I1225 01:04:08.309280 3538 kubelet_node_status.go:76] "Successfully registered node" node="<node-name>"
T5 (1): kubelet[3538]: I1225 01:04:29.348027 3538 predicate.go:164] "Predicate failed on Pod" pod="<namespace>/<pod>" err="Predicate NodeAffinity failed"
T5 (2): kubelet[3538]: E1225 01:04:29.351153 3538 kubelet_node_status.go:462] "Error getting the current node from lister" err="node \"<node-name>\" not found"
T6: kubelet[3538]: I1225 01:04:56.591752 3538 reflector.go:359] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:160
In the normal case, the retry back-off (between T5 and T6) should be short and kubelet should not leave many pods failed. However, in rare scenarios the node list-watch may already have been failing for other reasons before its first successful call (waiting for TLS bootstrapping, for example), which grows the back-off (capped at 30s). In theory the T5-T6 gap can then be as long as 30s plus jitter, which matches the log timestamps we observed above.
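For reference, here is a toy illustration of how exponential back-off saturates at a 30s cap. The parameters below are assumptions for illustration only; the reflector's actual back-off manager and jitter live in client-go and vary by version:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Illustrative back-off shape only: start small, double on each failure,
	// add jitter, and never exceed 30s between retries.
	b := wait.Backoff{
		Duration: 800 * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
		Steps:    10,
		Cap:      30 * time.Second,
	}
	for i := 1; i <= 8; i++ {
		fmt.Printf("retry %d after %v\n", i, b.Step())
	}
}
```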
What did you expect to happen?
Kubelet should keep retrying the node list-watch until the node object has actually been populated in the node informer cache before relying on it for admission. Alternatively, kubelet should invalidate the node informer cache after successfully registering the node, so that the correct version of the node object is kept in memory.
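As a rough sketch of the first option (not a proposed patch), kubelet could wait until its own node object actually shows up in the informer-backed lister before trusting the cache for admission; waitForOwnNodeInCache and the package name below are hypothetical:

```go
package nodeadmission

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// waitForOwnNodeInCache polls the informer-backed lister until this node's
// object appears, instead of trusting HasSynced() of an empty LIST.
func waitForOwnNodeInCache(ctx context.Context, lister listersv1.NodeLister, nodeName string) error {
	return wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			if _, err := lister.Get(nodeName); err != nil {
				if apierrors.IsNotFound(err) {
					return false, nil // cache still stale; keep waiting
				}
				return false, err
			}
			return true, nil
		})
}
```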
How can we reproduce it (as minimally and precisely as possible)?
Simulate the watch call being closed during kubelet startup, as described in the timeline above.
Anything else we need to know?
Related issues/PRs:
- pkg/kubelet: improve the node informer sync check #99336
- kubelet waits for node lister to sync at least once #94087
Kubernetes version
1.30 - but we believe this behavior is reproducible in all versions.
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)