-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376
node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2 #116376
Conversation
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial |
95acf5c
to
919bbe8
Compare
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial |
919bbe8
to
54d7083
Compare
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial |
54d7083
to
cd9bfcb
Compare
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial |
/test pull-kubernetes-unit |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mrunalp, swatisehgal The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
.. if the checkpoint file indicates there should be running device plugins around. This could have been a "hot" kubelet restart that will check all running pods have their devices ready and evict them if that is not the case. However the device plugins need a bit of a time to properly re-register after losing the connection to kubelet. The behavior is a side effect of a more stringent device health checking from kubernetes#116376
What type of PR is this?
/kind bug
What this PR does / why we need it:
This is an attempt to merge changes proposed as part of #114640. The PR was reverted due to serial test failures and this PR updates device plugin and device manager e2e tests to avoid serial test failures.
Core changes were squshed into a single commit: dc1a592632
node: device-mgr: Handle recovery by checking if healthy devices exist
which was already approved in #114640 (0aa6a57, a799ffb and 5b2a3db).Other commits are related to device plugin and device manager e2e tests. The PR adds a bunch of test cases to ensure comprehensive testing but the main focus is to recreate a scenario (where on kubelet restart application pod requesting devices comes up before device plugin pod re-registers itself) resulting in pod admission error.
Rationale of the PR as captured in the previous PR:
In case of node reboot/kubelet restart, the flow of events involves obtaining the state from the checkpoint file followed by setting the
healthDevices
/unhealthyDevices
to its zero value. This is done to allow the device plugin to re-register itself so that capacity can be updated appropriately.During the allocation phase, we need to check if the resources requested by the pod have been registered AND healthy devices are present on the node to be allocated.
Also we need to move this check above
needed==0
where needed is required - devices allocated to the container (which is obtained from the checkpoint file) because even in cases where no additional devices have to be allocated (as they were pre-allocated), we still need to make sure that the devices that were previously allocated are healthy.For more details refer to the comment here: #109595 (comment).
Which issue(s) this PR fixes:
Fixes #109595 and (possibly) #107928 (TODO: need to check the latter)
Special notes for your reviewer:
NOTE:
This PR depends on node: e2e device plugin test improvements #117057 for device plugin e2e test refactoring work and builds on top of that.
This PR depends on sample device plugin changes proposed in: node: device-mgr: sample device plugin: Add support to control registration process #115107.
Those changes are included here (first two commits) as they are needed for e2e test implementation.
NOTE: On node reboot, in case the application pod requesting devices starts after the device plugin pod providing those devices as has been described in #109595 and #107928 we exit early as there are pre-allocated devices without checking if the devices have registered and are healthy. This PR adds both those checks.
Because of this change, in cases where application pods are started before device plugin pod (say after node reboot), because devices are not healthy, the pod would fail with
UnexpectedAdmissionError
error. If the pod is part of a deployment, another pod would be created but that would stay inPending
state waiting for the pod to be scheduled to the node. This pod would goRunning
after the node capacity is updated followed by start up of device plugin pod and its registration.The pod which fails at admission time continues to exist on the node and needs to be removed manually:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: