Running pods with devices are terminated if kubelet is restarted #118559

Closed
@vasiliy-ul

Description

What happened?

In the KubeVirt project, we now see a regression when running on Kubernetes 1.25.10, 1.26.5, and 1.27.2. If kubelet is restarted on a node, all existing running workloads that use devices are terminated with UnexpectedAdmissionError:

Warning  UnexpectedAdmissionError  45s   kubelet            Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/kvm, which is unexpected
Normal   Killing                   42s   kubelet            Stopping container compute
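
To check whether other pods were hit by the same failure, one option is to count event lines that carry this reason. This is a minimal sketch, not part of the original report: the function name count_admission_errors is hypothetical, and it only matches the reason string shown in the events above.

```shell
#!/bin/sh
# Hypothetical helper: count lines on stdin that mention the
# UnexpectedAdmissionError reason shown in the events above.
# grep -c prints 0 and exits non-zero when nothing matches,
# so "|| true" keeps the function's exit status at 0.
count_admission_errors() {
    grep -c 'UnexpectedAdmissionError' || true
}
```

Assuming kubectl access to the cluster, it could be fed with `kubectl get events -A | count_admission_errors` or with the output of `kubectl describe pod` for a single pod.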

KubeVirt runs virtual machines inside pods and uses a device plugin to advertise e.g. /dev/kvm on the nodes.

Presumably, this PR changed the behavior: #116376
Original issue: #109595

What did you expect to happen?

A potential restart of kubelet should not interrupt the running workloads.

How can we reproduce it (as minimally and precisely as possible)?

With KubeVirt:

  • run a KubeVirt VM
  • pkill kubelet
  • observe that the workload pod gets terminated

Or with https://github.com/k8stopologyawareschedwg/sample-device-plugin:

  • make deploy
  • make test-both
  • pkill kubelet
  • observe that the pod gets restarted
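
The "observe" step in either reproduction can be made mechanical by comparing a snapshot of the pod taken before and after the kubelet restart. The sketch below rests on assumptions that are not in the report: kubectl is configured for the cluster, and the function names pod_snapshot and report_result are hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether a pod survived a kubelet restart.
# Assumption: kubectl is configured and the pod name and namespace
# passed in refer to a running pod that uses a device.

# Print "<uid> <phase> <restartCount of the first container>" for a pod.
# Usage: pod_snapshot POD NAMESPACE
pod_snapshot() {
    kubectl get pod "$1" -n "$2" \
        -o jsonpath='{.metadata.uid} {.status.phase} {.status.containerStatuses[0].restartCount}'
}

# Compare two snapshots and print "survived" if they are identical,
# otherwise "interrupted" (new UID, changed phase, or a container restart).
# Usage: report_result BEFORE AFTER
report_result() {
    if [ "$1" = "$2" ]; then
        echo "survived"
    else
        echo "interrupted"
    fi
}

# Intended flow (run manually; POD and NAMESPACE are placeholders):
#   before=$(pod_snapshot "$POD" "$NAMESPACE")
#   ... on the node: pkill kubelet, then wait for kubelet to come back ...
#   after=$(pod_snapshot "$POD" "$NAMESPACE")
#   report_result "$before" "$after"
```

On an affected version the expectation is "interrupted". If the pod is recreated under a new name, pod_snapshot will fail to find the old pod, which is itself a sign that the workload did not survive.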

Anything else we need to know?

No response

Kubernetes version

This affects the 1.25.x, 1.26.x and 1.27.x branches.

1.25.10 | 1.26.5 | 1.27.2

Cloud provider

N/A

Metadata

    Labels

    • kind/bug: Categorizes issue or PR as related to a bug.
    • priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
    • sig/node: Categorizes an issue or PR as relevant to SIG Node.
    • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
