
kubelet could reject pods with NodeAffinity error due to incomplete informer cache on the node object #129463

@bcho

Description

What happened?

On our end (AKS), we observed a case where kubelet can reject pods with a NodeAffinity status if the node watch call is closed for an unknown reason during kubelet startup.

The sequence of events is illustrated below (using Tx to denote points in time):

  • T1: kubelet starts and spawns, in parallel, the process for registering the node (creating the node object) and the node informer for syncing the node object

  • T2: the node informer successfully lists an empty node list from kube-apiserver and proceeds to the watch call:

    if fallbackToList {
        err = r.list(ctx)
        if err != nil {
            return err
        }
    }
    logger.V(2).Info("Caches populated", "type", r.typeDescription, "reflector", r.name)
    return r.watchWithResync(ctx, w)

  • T3: the watch call stops due to an unknown issue, for instance the network connection being closed. In this case, the watch call exits with an error like:

    very short watch: k8s.io/client-go/informers/factory.go:160: Unexpected watch close - watch lasted less than a second and no items received

    The node informer then backs off and retries with a fresh list-watch cycle:

    wait.BackoffUntil(func() {
        if err := r.ListAndWatchWithContext(ctx); err != nil {
            r.watchErrorHandler(ctx, r, err)
        }
    }, r.backoffManager, true, ctx.Done())

  • T4: kubelet successfully creates the node object on the kube-apiserver side. kube-scheduler kicks in and assigns pods to the node

  • T5: because the list call at T2 succeeded with empty results, the node informer has already been marked as synced:

nodeHasSynced = func() bool {
    return kubeInformers.Core().V1().Nodes().Informer().HasSynced()
}

func (f *DeltaFIFO) HasSynced() bool {
    f.lock.Lock()
    defer f.lock.Unlock()
    return f.hasSynced_locked()
}

func (f *DeltaFIFO) hasSynced_locked() bool {
    return f.populated && f.initialPopulationCount == 0
}

In this case, kubelet starts syncing pods and attempts to run them. If these pods carry node affinity settings, the kubelet node admission check fails them with a "NodeAffinity" status because the node informer cache is empty: kubelet falls back to a minimal node object without the expected node labels when performing the admission check (see the sketch right after this timeline):
func (kl *Kubelet) getNodeAnyWay() (*v1.Node, error) {
    if kl.kubeClient != nil {
        if n, err := kl.nodeLister.Get(string(kl.nodeName)); err == nil {
            return n, nil
        }
    }
    return kl.initialNode(context.TODO())
}

  • T6: the node informer retry kicks in and this time successfully syncs the node object from kube-apiserver. Pods retried against this node now pass admission because the expected node labels are present
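
To make the T2-T5 window concrete, here is a minimal, self-contained sketch using client-go's fake clientset (not the real kubelet; the node name "my-node" is illustrative). It shows that HasSynced() reports true after the initial empty LIST, even though the node object is not yet in the lister cache:

    package main

    import (
        "context"
        "fmt"
        "time"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes/fake"
        "k8s.io/client-go/tools/cache"
    )

    func main() {
        // The apiserver has no Node objects yet (kubelet has not registered
        // the node at this point in the timeline).
        client := fake.NewSimpleClientset()

        factory := informers.NewSharedInformerFactory(client, 0)
        nodeInformer := factory.Core().V1().Nodes()
        nodeLister := nodeInformer.Lister()

        stopCh := make(chan struct{})
        defer close(stopCh)
        factory.Start(stopCh)
        cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced)

        // HasSynced is true because the initial (empty) LIST completed ...
        fmt.Println("HasSynced:", nodeInformer.Informer().HasSynced()) // true

        // ... but the node is not in the cache, which is the state
        // getNodeAnyWay hits at T5 before falling back to initialNode.
        _, err := nodeLister.Get("my-node")
        fmt.Println("lister Get before registration:", err) // not found

        // T4/T6: the node gets registered and the watch eventually delivers it.
        node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "my-node"}}
        if _, err := client.CoreV1().Nodes().Create(context.TODO(), node, metav1.CreateOptions{}); err != nil {
            panic(err)
        }
        time.Sleep(100 * time.Millisecond) // allow the watch event to propagate in this toy example
        n, err := nodeLister.Get("my-node")
        fmt.Println("lister Get after registration:", n != nil, err)
    }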

In our reproduction, we captured the following logs:

T1: kubelet[3538]: I1225 01:02:59.216231    3538 kubelet.go:2346] "Starting kubelet main sync loop"
T2: kubelet[3538]: I1225 01:04:04.030571    3538 reflector.go:359] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:160
T3: kubelet[3538]: W1225 01:04:04.170608    3538 reflector.go:470] k8s.io/client-go/informers/factory.go:160: watch of *v1.Node ended with: very short watch: k8s.io/client-go/informers/factory.go:160: Unexpected watch close - watch lasted less than a second and no items received
T4: kubelet[3538]: I1225 01:04:08.309280    3538 kubelet_node_status.go:76] "Successfully registered node" node="<node-name>"
T5 (1): kubelet[3538]: I1225 01:04:29.348027    3538 predicate.go:164] "Predicate failed on Pod" pod="<namespace>/<pod>" err="Predicate NodeAffinity failed"
T5 (2): kubelet[3538]: E1225 01:04:29.351153    3538 kubelet_node_status.go:462] "Error getting the current node from lister" err="node \"<node-name>\" not found"
T6: kubelet[3538]: I1225 01:04:56.591752    3538 reflector.go:359] Caches populated for *v1.Node from k8s.io/client-go/informers/factory.go:160

In the normal case, the retry back-off (the gap between T5 and T6) should be short, and kubelet should not leave many pods failed. However, in rare scenarios the node list-watch may already have been failing for other reasons before its first successful call (while waiting for TLS bootstrapping, for example), which grows the back-off towards its 30s cap. In theory the T5-to-T6 gap can then be as long as 30s plus jitter, which aligns with the log timestamps observed above.
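
For intuition on how the gap can grow to roughly 30s, here is a rough sketch of an exponential back-off with a 30s cap. The parameters below only approximate client-go's reflector back-off (assumed ~800ms initial delay, factor 2, 30s cap; jitter omitted) and are not values copied from the source:

    package main

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Assumed, reflector-like back-off parameters.
        b := wait.Backoff{
            Duration: 800 * time.Millisecond,
            Factor:   2.0,
            Steps:    10,
            Cap:      30 * time.Second,
        }
        for attempt := 1; attempt <= 8; attempt++ {
            // Step() returns the next delay and advances the back-off state.
            fmt.Printf("failure %d -> wait %v before retrying\n", attempt, b.Step())
        }
        // The delay grows 800ms, 1.6s, 3.2s, ... and saturates at the 30s cap,
        // consistent with the ~27s T5-to-T6 gap in the logs (plus jitter).
    }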

What did you expect to happen?

Kubelet should keep backing off the node list-watch calls until the node object has been populated in the node informer cache. Alternatively, kubelet could invalidate the node informer cache after it successfully registers the node, so that the in-memory copy of the node object is correct.
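
As a rough illustration of the first suggestion (not an actual kubelet patch; the function name waitForOwnNode and its parameters are made up for this sketch), the readiness gate could wait for the registered node to actually appear in the lister cache instead of trusting HasSynced() alone:

    package nodesync

    import (
        "context"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        "k8s.io/apimachinery/pkg/util/wait"
        corelisters "k8s.io/client-go/listers/core/v1"
    )

    // waitForOwnNode polls the node informer cache until this kubelet's node
    // object is actually present, rather than returning as soon as the initial
    // (possibly empty) LIST has completed.
    func waitForOwnNode(ctx context.Context, nodeLister corelisters.NodeLister, nodeName string) error {
        return wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 2*time.Minute, true,
            func(ctx context.Context) (bool, error) {
                _, err := nodeLister.Get(nodeName)
                if err == nil {
                    return true, nil // node object is in the cache; safe to admit pods
                }
                if apierrors.IsNotFound(err) {
                    return false, nil // cache "synced" but empty; keep waiting
                }
                return false, err
            })
    }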

How can we reproduce it (as minimally and precisely as possible)?

Simulate a watch-close during kubelet startup as described above; one possible way to do this against a fake clientset is sketched below.
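
One way to simulate the short-lived watch without touching the network is a watch reactor on a fake clientset that closes the very first node watch immediately. This is a hedged sketch against client-go's fake/testing machinery, not a change to kubelet:

    package main

    import (
        "fmt"
        "sync/atomic"
        "time"

        "k8s.io/apimachinery/pkg/watch"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes/fake"
        clienttesting "k8s.io/client-go/testing"
        "k8s.io/client-go/tools/cache"
    )

    func main() {
        client := fake.NewSimpleClientset()

        var watchCalls int32
        client.PrependWatchReactor("nodes", func(action clienttesting.Action) (bool, watch.Interface, error) {
            if atomic.AddInt32(&watchCalls, 1) == 1 {
                // First watch: hand back a watcher and stop it right away to mimic
                // "Unexpected watch close - watch lasted less than a second".
                w := watch.NewFake()
                go w.Stop()
                return true, w, nil
            }
            // Later watches fall through to the default fake watch reactor.
            return false, nil, nil
        })

        factory := informers.NewSharedInformerFactory(client, 0)
        nodeInformer := factory.Core().V1().Nodes().Informer()

        stopCh := make(chan struct{})
        defer close(stopCh)
        factory.Start(stopCh)
        cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced)

        // The informer is "synced" even though its first watch died immediately;
        // the reflector backs off and re-lists/re-watches in the background.
        fmt.Println("HasSynced after short watch:", nodeInformer.HasSynced())
        time.Sleep(200 * time.Millisecond)
    }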

Anything else we need to know?

Related issues/PRs:

Kubernetes version

1.30, but we believe this behavior is reproducible in all versions.

Cloud provider

AKS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Labels: kind/bug, priority/backlog, sig/node, triage/accepted
