
node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376

Merged: 7 commits merged into kubernetes:master on May 2, 2023

Conversation

@swatisehgal (Contributor) commented Mar 8, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

This is an attempt to merge the changes proposed in #114640. That PR was reverted due to serial test failures; this PR updates the device plugin and device manager e2e tests to avoid those failures.

The core changes were squashed into a single commit, dc1a592632 "node: device-mgr: Handle recovery by checking if healthy devices exist", which was already approved in #114640 (0aa6a57, a799ffb and 5b2a3db).

The other commits are related to the device plugin and device manager e2e tests. The PR adds a number of test cases for more comprehensive coverage, but the main focus is to recreate the scenario where, on kubelet restart, an application pod requesting devices comes up before the device plugin pod re-registers itself, resulting in a pod admission error (a rough outline of that scenario follows below).
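For illustration only, here is a Ginkgo-style outline of that scenario. Every helper name in it (restartKubelet, createDevicePod, waitForPodFailedAdmission, reRegisterSampleDevicePlugin) is a placeholder, not one of the helpers actually used by the tests in this PR:

```go
// Hypothetical outline of the scenario the new e2e tests recreate.
// All helpers below are placeholders and would need to be provided by the suite.
var _ = ginkgo.It("fails admission when devices are checkpointed but the plugin has not re-registered", func() {
	ginkgo.By("restarting kubelet so the device manager state is restored from the checkpoint file")
	restartKubelet()

	ginkgo.By("creating an application pod requesting devices before the plugin re-registers")
	pod := createDevicePod()

	ginkgo.By("expecting the pod to fail with UnexpectedAdmissionError")
	waitForPodFailedAdmission(pod, "UnexpectedAdmissionError")

	ginkgo.By("re-registering the sample device plugin so that capacity is restored and a replacement pod can run")
	reRegisterSampleDevicePlugin()
})
```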

Rationale of the PR as captured in the previous PR:

In case of a node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file and then setting healthyDevices/unhealthyDevices to their zero values. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately.

During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated.

We also need to move this check above the `needed == 0` early return, where `needed` is `required` minus the devices already allocated to the container (obtained from the checkpoint file), because even in cases where no additional devices have to be allocated (they were pre-allocated), we still need to make sure that the previously allocated devices are healthy.

For more details refer to the comment here: #109595 (comment).
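To make the ordering concrete, here is a minimal, self-contained Go sketch of the checks described above. It is not the actual device manager code: checkAllocatable and its parameters are illustrative stand-ins for the device manager's registration state, its healthyDevices set, and the checkpointed per-container allocations.

```go
package main

import "fmt"

// checkAllocatable sketches the order of checks described above.
//   registered: whether a device plugin has (re-)registered this resource
//   healthy:    healthy device IDs currently reported for the resource
//   allocated:  device IDs already assigned to the container, restored from the checkpoint file
//   needed:     additional devices still to be allocated
func checkAllocatable(resource string, registered bool, healthy, allocated map[string]bool, needed int) error {
	// 1. The resource must have been registered by a device plugin.
	if !registered {
		return fmt.Errorf("cannot allocate unregistered device %s", resource)
	}
	// 2. At least one healthy device must be present on the node.
	if len(healthy) == 0 {
		return fmt.Errorf("no healthy devices present; cannot allocate %s", resource)
	}
	// 3. Every previously allocated (checkpointed) device must still be healthy.
	//    These checks run before the needed == 0 early return, so they also cover
	//    containers whose devices were fully pre-allocated before a restart.
	for d := range allocated {
		if !healthy[d] {
			return fmt.Errorf("previously allocated device %q of %s is no longer healthy", d, resource)
		}
	}
	if needed == 0 {
		// Nothing more to allocate; all pre-allocated devices are healthy.
		return nil
	}
	// ... proceed with normal allocation of the remaining `needed` devices ...
	return nil
}
```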

Which issue(s) this PR fixes:

Fixes #109595 and (possibly) #107928 (TODO: need to check the latter)

Special notes for your reviewer:

NOTE:

  1. This PR depends on node: e2e device plugin test improvements #117057 for device plugin e2e test refactoring work and builds on top of that.

  2. This PR depends on sample device plugin changes proposed in: node: device-mgr: sample device plugin: Add support to control registration process #115107.

Those changes are included here (first two commits) as they are needed for e2e test implementation.

NOTE: On node reboot, when the application pod requesting devices starts before the device plugin pod providing those devices (as described in #109595 and #107928), we exit early because pre-allocated devices exist, without checking whether the devices have been registered and are healthy. This PR adds both of those checks.

Because of this change, in cases where application pods are started before the device plugin pod (say, after a node reboot), the devices are not yet healthy and the pod fails with an UnexpectedAdmissionError. If the pod is part of a Deployment, a replacement pod is created, but it stays in the Pending state until it can be scheduled to the node. It goes Running once the device plugin pod starts up, re-registers, and the node capacity is updated.

The pod which fails at admission time continues to exist on the node and needs to be removed manually:

kubectl delete pods --field-selector status.phase=Failed
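To review which pods are affected before deleting them, the same field selector works with kubectl get (add -A to cover all namespaces):

kubectl get pods --field-selector status.phase=Failed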

Does this PR introduce a user-facing change?

During device plugin allocation, resources requested by the pod can only be allocated if the device plugin has registered itself with kubelet AND healthy devices are present on the node. If these conditions are not satisfied, the pod will fail with an `UnexpectedAdmissionError`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 8, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 8, 2023
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@swatisehgal force-pushed the device-mgr-recovery-wip branch from 95acf5c to 919bbe8 on March 9, 2023 19:05
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 9, 2023
@swatisehgal changed the title from "[WIP] node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2" to "node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2" on Mar 9, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 9, 2023
@swatisehgal force-pushed the device-mgr-recovery-wip branch from 919bbe8 to 54d7083 on March 9, 2023 19:32
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@swatisehgal force-pushed the device-mgr-recovery-wip branch from 54d7083 to cd9bfcb on March 9, 2023 21:41
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@swatisehgal (Contributor, Author)

/test pull-kubernetes-unit

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, swatisehgal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2023
@k8s-ci-robot k8s-ci-robot merged commit 1241ddc into kubernetes:master May 2, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 2, 2023
@swatisehgal swatisehgal deleted the device-mgr-recovery-wip branch May 3, 2023 12:38
MarSik added a commit to MarSik/kubernetes that referenced this pull request Jun 7, 2023
.. if the checkpoint file indicates there should be running
device plugins around.

This could have been a "hot" kubelet restart that checks that all running pods have their devices ready and evicts them if that is not the case.

However, the device plugins need a bit of time to properly re-register after losing the connection to kubelet.

The behavior is a side effect of a more stringent device health checking from
kubernetes#116376
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubelet
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/bug: Categorizes issue or PR as related to a bug.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

device manager can incorrectly recover devices on node restart
7 participants