
node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376

Merged: 7 commits merged into kubernetes:master on May 2, 2023

Conversation

@swatisehgal (Contributor) commented Mar 8, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

This is an attempt to merge the changes proposed in #114640. That PR was reverted due to serial test failures; this PR updates the device plugin and device manager e2e tests to avoid those failures.

The core changes were squashed into a single commit, dc1a592632 "node: device-mgr: Handle recovery by checking if healthy devices exist", which was already approved in #114640 (0aa6a57, a799ffb and 5b2a3db).

The other commits are related to the device plugin and device manager e2e tests. The PR adds a number of test cases for more comprehensive coverage, but the main focus is to recreate the scenario where, on kubelet restart, an application pod requesting devices comes up before the device plugin pod re-registers itself, resulting in a pod admission error (a rough outline of that scenario follows below).
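For illustration only, here is a Ginkgo-style outline of that scenario. Every helper name in it (restartKubelet, createDevicePod, waitForPodFailedAdmission, reRegisterSampleDevicePlugin) is a placeholder, not one of the helpers actually used by the tests in this PR:

```go
// Hypothetical outline of the scenario the new e2e tests recreate.
// All helpers below are placeholders and would need to be provided by the suite.
var _ = ginkgo.It("fails admission when devices are checkpointed but the plugin has not re-registered", func() {
	ginkgo.By("restarting kubelet so the device manager state is restored from the checkpoint file")
	restartKubelet()

	ginkgo.By("creating an application pod requesting devices before the plugin re-registers")
	pod := createDevicePod()

	ginkgo.By("expecting the pod to fail with UnexpectedAdmissionError")
	waitForPodFailedAdmission(pod, "UnexpectedAdmissionError")

	ginkgo.By("re-registering the sample device plugin so that capacity is restored and a replacement pod can run")
	reRegisterSampleDevicePlugin()
})
```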

Rationale of the PR as captured in the previous PR:

In case of a node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file and then setting healthyDevices/unhealthyDevices to their zero values. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately.

During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated.

We also need to move this check above the `needed == 0` early return, where `needed` is `required` minus the devices already allocated to the container (obtained from the checkpoint file), because even in cases where no additional devices have to be allocated (they were pre-allocated), we still need to make sure that the previously allocated devices are healthy.

For more details refer to the comment here: #109595 (comment).
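To make the ordering concrete, here is a minimal, self-contained Go sketch of the checks described above. It is not the actual device manager code: checkAllocatable and its parameters are illustrative stand-ins for the device manager's registration state, its healthyDevices set, and the checkpointed per-container allocations.

```go
package main

import "fmt"

// checkAllocatable sketches the order of checks described above.
//   registered: whether a device plugin has (re-)registered this resource
//   healthy:    healthy device IDs currently reported for the resource
//   allocated:  device IDs already assigned to the container, restored from the checkpoint file
//   needed:     additional devices still to be allocated
func checkAllocatable(resource string, registered bool, healthy, allocated map[string]bool, needed int) error {
	// 1. The resource must have been registered by a device plugin.
	if !registered {
		return fmt.Errorf("cannot allocate unregistered device %s", resource)
	}
	// 2. At least one healthy device must be present on the node.
	if len(healthy) == 0 {
		return fmt.Errorf("no healthy devices present; cannot allocate %s", resource)
	}
	// 3. Every previously allocated (checkpointed) device must still be healthy.
	//    These checks run before the needed == 0 early return, so they also cover
	//    containers whose devices were fully pre-allocated before a restart.
	for d := range allocated {
		if !healthy[d] {
			return fmt.Errorf("previously allocated device %q of %s is no longer healthy", d, resource)
		}
	}
	if needed == 0 {
		// Nothing more to allocate; all pre-allocated devices are healthy.
		return nil
	}
	// ... proceed with normal allocation of the remaining `needed` devices ...
	return nil
}
```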

Which issue(s) this PR fixes:

Fixes #109595 and (possibly) #107928 (TODO: need to check the latter)

Special notes for your reviewer:

NOTE:

  1. This PR depends on node: e2e device plugin test improvements #117057 for device plugin e2e test refactoring work and builds on top of that.

  2. This PR depends on sample device plugin changes proposed in: node: device-mgr: sample device plugin: Add support to control registration process #115107.

Those changes are included here (first two commits) as they are needed for e2e test implementation.

NOTE: On node reboot, when the application pod requesting devices starts before the device plugin pod providing those devices (as described in #109595 and #107928), we exit early because pre-allocated devices exist, without checking whether the devices have been registered and are healthy. This PR adds both of those checks.

Because of this change, in cases where application pods are started before the device plugin pod (say, after a node reboot), the devices are not yet healthy and the pod fails with an UnexpectedAdmissionError. If the pod is part of a Deployment, a replacement pod is created, but it stays in the Pending state until it can be scheduled to the node. It goes Running once the device plugin pod starts up, re-registers, and the node capacity is updated.

The pod which fails at admission time continues to exist on the node and needs to be removed manually:

kubectl delete pods --field-selector status.phase=Failed
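To review which pods are affected before deleting them, the same field selector works with kubectl get (add -A to cover all namespaces):

kubectl get pods --field-selector status.phase=Failed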

Does this PR introduce a user-facing change?

During device plugin allocation, resources requested by the pod can only be allocated if the device plugin has registered itself with kubelet AND healthy devices are present on the node. If these conditions are not satisfied, the pod will fail with an `UnexpectedAdmissionError`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 8, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 8, 2023
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@swatisehgal force-pushed the device-mgr-recovery-wip branch from 95acf5c to 919bbe8 on March 9, 2023 19:05
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 9, 2023
@swatisehgal changed the title from "[WIP] node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2" to "node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2" on Mar 9, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 9, 2023
@swatisehgal force-pushed the device-mgr-recovery-wip branch from 919bbe8 to 54d7083 on March 9, 2023 19:32
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@swatisehgal force-pushed the device-mgr-recovery-wip branch from 54d7083 to cd9bfcb on March 9, 2023 21:41
@swatisehgal (Contributor, Author)

/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e

@swatisehgal (Contributor, Author)

/test pull-kubernetes-unit

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, swatisehgal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 2, 2023
@k8s-ci-robot k8s-ci-robot merged commit 1241ddc into kubernetes:master May 2, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 2, 2023
@swatisehgal swatisehgal deleted the device-mgr-recovery-wip branch May 3, 2023 12:38
MarSik added a commit to MarSik/kubernetes that referenced this pull request Jun 7, 2023
.. if the checkpoint file indicates there should be running
device plugins around.

This could have been a "hot" kubelet restart that checks that all running pods have their devices ready and evicts them if that is not the case.

However, the device plugins need a bit of time to properly re-register after losing the connection to kubelet.

The behavior is a side effect of a more stringent device health checking from
kubernetes#116376
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubelet
area/test
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/bug: Categorizes issue or PR as related to a bug.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

device manager can incorrectly recover devices on node restart
7 participants