[Failing Test] [sig-node] [NodeFeature:DevicePlugin] [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) #128443

Open
bart0sh opened this issue Oct 30, 2024 · 12 comments · May be fixed by #129010
Assignees
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
kind/flake: Categorizes issue or PR as related to a flaky test.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@bart0sh
Contributor

bart0sh commented Oct 30, 2024

Which jobs are flaking?

pull-kubernetes-node-kubelet-containerd-flaky

Which tests are flaking?

E2eNode Suite: [It] [sig-node] Device Plugin [NodeFeature:DevicePlugin] [Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [Flaky]

Since when has it been flaking?

I've noticed it only once today (Oct 30 2024), but I can see similar failures in other jobs, e.g. #128114

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-node-kubelet-containerd-flaky/1851562955540271104

Reason for failure (if possible)

At first glance, the failure could be caused by a bug in the device plugin registration process that prevents the plugin from re-registering after a kubelet restart. It doesn't happen consistently, though.

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@bart0sh bart0sh added the kind/flake label Oct 30, 2024
@k8s-ci-robot k8s-ci-robot added the sig/node label Oct 30, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage label Oct 30, 2024
@bart0sh bart0sh changed the title [Flaking Test] [sig-node] [NodeFeature:DevicePlugin] [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [Failing Test] [sig-node] [NodeFeature:DevicePlugin] [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) Nov 10, 2024
@bart0sh
Contributor Author

bart0sh commented Nov 10, 2024

/cc @SergeyKanzhelev @ffromani @kannon92 @pacoxu

The Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) test case is not flaking but consistently failing. Looking at the test and the kubelet code, I'm not sure I fully understand the test logic, in particular this part. How can we expect the podresources API to return device assignments if the kubelet is not connected to the device plugin after its restart?

Here is a quote from the log (with additional debugging output) from my environment with my comments:

  1. Device assignments are present in the checkpoint, and the kubelet reads them from there when restarted:
I1110 21:39:16.042459    4945 pod_devices.go:233] "Get checkpoint entry" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" resourceName="example.com/resource" deviceIDs={"-1":["Dev-1"]} allocated="\x12\x18\n\n/tmp/Dev-1\x12\n/tmp/Dev-1"
I1110 21:39:16.042499    4945 pod_devices.go:243] ">>> fromCheckpointData: inserted device" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" resourceName="example.com/resource" deviceIDs={"-1":["Dev-1"]} allocResp="&ContainerAllocateResponse{Envs:map[string]string{},Mounts:[]*Mount{&Mount{ContainerPath:/tmp/Dev-1,HostPath:/tmp/Dev-1,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},CDIDevices:[]*CDIDevice{},}" pdev.devs={"1d67afb3-2852-4368-b7a0-0be7dccb4bec":{"device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2":{"example.com/resource":{}}}}
I1110 21:39:16.042599    4945 manager.go:535] ">>> readCheckpoint" podDevices=[{"PodUID":"1d67afb3-2852-4368-b7a0-0be7dccb4bec","ContainerName":"device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2","ResourceName":"example.com/resource","DeviceIDs":{"-1":["Dev-1"]},"AllocResp":"EhgKCi90bXAvRGV2LTESCi90bXAvRGV2LTE="}] registeredDevs={"example.com/resource":["Dev-1","Dev-2"]} allocatedDevices={"example.com/resource":{"Dev-1":{}}}
I1110 21:39:16.966102    4945 manager.go:586] "Found pre-allocated devices for resource on pod" resourceName="example.com/resource" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" devices=["Dev-1"]
  2. However, the pod is considered orphaned after the admission failure and is scheduled for removal:
I1110 21:39:16.966193    4945 kubelet.go:2356] "Pod admission denied" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" pod="device-plugin-errors-2623/device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" reason="UnexpectedAdmissionError" message="Allocate failed due to no healthy devices present; cannot allocate unhealthy devices example.com/resource, which is unexpected"
I1110 21:39:16.972463    4945 status_manager.go:911] "Patch status for pod" pod="device-plugin-errors-2623/device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" patch="{\"metadata\":{\"uid\":\"1d67afb3-2852-4368-b7a0-0be7dccb4bec\"},\"status\":{\"conditions\":null,\"containerStatuses\":null,\"hostIP\":null,\"hostIPs\":null,\"podIP\":null,\"podIPs\":null}}"
I1110 21:39:18.028603    4945 kubelet_pods.go:2516] "Orphaned pod found, removing pod cgroups" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec"
I1110 21:39:26.032579    4945 status_manager.go:911] "Patch status for pod" pod="device-plugin-errors-2623/device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" patch="{\"metadata\":{\"uid\":\"1d67afb3-2852-4368-b7a0-0be7dccb4bec\"},\"status\":{\"message\":\"Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices example.com/resource, which is unexpected\",\"phase\":\"Failed\",\"reason\":\"UnexpectedAdmissionError\"}}"
I1110 21:39:26.922229    4945 manager.go:570] "Pods to be removed" podUIDs=["1d67afb3-2852-4368-b7a0-0be7dccb4bec"]
  3. After that, the podresources List API returns an empty device list for the test pod, as the device manager no longer has any info about it:
I1110 21:39:26.922318    4945 pod_devices.go:382] ">>> getContainerDevices" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" pdev.devs={}
I1110 21:39:26.922329    4945 pod_devices.go:385] ">>> getContainerDevices: pod doesn't exist" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec"

@ffromani Does the above make sense to you, or am I missing something obvious?
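
For reference, here is a minimal, self-contained sketch of how device assignments can be queried via the kubelet podresources v1 List API, which is what the test relies on. It assumes the default pod-resources socket path and uses the v1 client directly rather than the e2e_node test helpers; it is only an illustration.

```go
// Minimal sketch: list device assignments via the kubelet podresources v1 API.
// Assumes the default socket path /var/lib/kubelet/pod-resources/kubelet.sock.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet's pod-resources unix socket.
	conn, err := grpc.DialContext(ctx, "unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("podresources List failed: %v", err)
	}

	// Print the device assignment per pod/container, as the test inspects it.
	for _, pod := range resp.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				fmt.Printf("%s/%s container %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), cnt.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```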

@bart0sh
Contributor Author

bart0sh commented Nov 10, 2024

/kind failing-test

@k8s-ci-robot k8s-ci-robot added the kind/failing-test label Nov 10, 2024
@ffromani
Contributor

The test was working until #126343, which highlighted that the test is not well written.

The spirit of the test is to check that if the node is rebooted without being drained, the kubelet recreates the pods when it restarts and the pods keep their previous device assignment. This encodes a user scenario reported some time ago.

We should probably update the test; I'll have a look ASAP.

@pacoxu
Member

pacoxu commented Nov 11, 2024

@bart0sh
Contributor Author

bart0sh commented Nov 11, 2024

@ffromani

the test was working until #126343

JFYI: reverting #126343 doesn't fix the test for me.

@ffromani
Contributor

@ffromani

the test was working until #126343

JFYI: reverting #126343 doesn't fix the test for me.

Point taken. #126343 is IMO the final bit which demonstrates this test needs review/rewrite, because the preconditions this test was taking for granted are very likely no longer valid (and just to be obvious: #126343 is IMO the right thing to do, and by all means let's keep it).

@bart0sh
Contributor Author

bart0sh commented Nov 11, 2024

As far as I can see in the testgrid, the test was green on 29 Oct 2024. #126343 was merged back in August, so it couldn't be the reason for the recent failures.

@ffromani
Contributor

ffromani commented Nov 11, 2024

Thanks for checking, I was wrong. Will restart from square one :)
(I still doubt the test is valid nowadays, but let's put that aside for now.)

ffromani added a commit to ffromani/kubernetes that referenced this issue Nov 19, 2024
Please see k/k issue kubernetes#128443 for more context.

A device manager test is permafailing, and has been for a few
weeks. Skip the offending test until we have time to dig in
and understand the failure. Considering recent changes, there is
a fair chance the test is now obsolete because its preconditions
are now void, which is what causes the permafail.
We may need to update or remove it.

Signed-off-by: Francesco Romani <fromani@redhat.com>
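
(For illustration only: a minimal, self-contained sketch of what skipping a permafailing node-e2e spec can look like in Ginkgo. This is an assumed pattern, not the actual change from the referenced commit.)

```go
// Illustration of parking a permafailing Ginkgo spec until it can be reworked.
// Assumed pattern, not the code from the referenced commit.
package e2enodesketch

import (
	"context"
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestDevicePluginSketch(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "device plugin sketch")
}

var _ = ginkgo.Describe("Device Plugin", func() {
	ginkgo.It("Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration)", func(ctx context.Context) {
		// Park the spec while the failure described in #128443 is investigated.
		ginkgo.Skip("permafailing, tracked in kubernetes/kubernetes#128443")
		// original test body would run here
	})
})
```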
@ffromani
Contributor

/triage accepted
/assign

@k8s-ci-robot k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 20, 2024
@ffromani ffromani moved this from Triage to Issues - In progress in SIG Node CI/Test Board Nov 20, 2024
ffromani added a commit to ffromani/kubernetes that referenced this issue Nov 20, 2024
@kannon92
Contributor

We have not merged #128870, so I don't expect this to be fixed yet. The main option seems to be to disable this test.

@bart0sh
Contributor Author

bart0sh commented Nov 27, 2024

@kannon92 @dims @ffromani reverting part of #127506 seems to fix this issue. PTAL at #128995

ffromani added a commit to ffromani/kubernetes that referenced this issue Nov 27, 2024
We have an e2e test which tries to ensure device plugin assignments to pods
are kept across node reboots. This test has been permafailing for many weeks
at the time of writing (xref: kubernetes#128443).

The problem is that closer inspection reveals the test was well intentioned,
but puzzling:
the test runs a pod, then restarts the kubelet, then _expects the pod to
end up in admission failure_ and yet _ensures the device assignment is
kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder whether this means the device will be kept busy forever.

Luckily, this is not the case. The test, however, embodied the kubelet
behavior of the time, in turn caused by kubernetes#103979.

The device manager used to record the last admitted pod and forcibly add it
to the list of active pods. The retention logic had space for exactly one
pod: the last one which attempted admission.

This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549,
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported as
allocated to the failed pod.

This fact was in turn leveraged by the test in question:
the test uses the podresources API to learn about the device assignment,
and because of the chain of events above, the pod failed admission yet
was still reported as owning the device.

What happened, however, was that the next pod trying admission would have
replaced the previous pod in the device manager data, so the previous
pod was no longer forced into the active list and its
assignment was correctly cleared once the cleanup code ran.
The cleanup code runs, among other things, every time the device
manager is asked to allocate devices and every time the podresources API
queries the device assignment.

Later, in PR kubernetes#120661,
the forced retention logic was removed from all the resource managers,
thus also from the device manager, and this is what caused the permafailure.

Because of all of the above, it should be evident that the e2e test was
actually enforcing a very specific and not really working-as-intended
behavior, which was also quite puzzling for users.

The best we can do is to fix the test to record and ensure that
pods which failed admission _do not_ retain their device assignment.

Unfortunately, we _cannot_ guarantee the desirable property that
pods which go running retain their device assignment across node reboots.

In the kubelet restart flow, all pods race to be admitted. There is no
order enforced between device plugin pods and application pods, unless
an application pod is lucky enough to _lose_ the race with both
the device plugin (which needs to go running before the app pod does) and
_also_ with the kubelet (which needs to mark devices healthy before the pod
tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
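
To make the proposed fix concrete, here is a hedged sketch (not the actual PR code) of the check described above: after the kubelet restart, a pod that failed admission must no longer be reported by the podresources API as owning any device. The package and function names are hypothetical.

```go
// Hedged sketch of the relaxed expectation described above; not the PR code.
// Package and function names are hypothetical.
package devicecheck

import (
	"fmt"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectNoDeviceRetained returns an error if the given (admission-rejected)
// pod is still reported by the podresources List response as owning devices
// of the given resource.
func expectNoDeviceRetained(resp *podresourcesv1.ListPodResourcesResponse, podNamespace, podName, resourceName string) error {
	for _, pod := range resp.GetPodResources() {
		if pod.GetNamespace() != podNamespace || pod.GetName() != podName {
			continue
		}
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				if dev.GetResourceName() == resourceName && len(dev.GetDeviceIds()) > 0 {
					return fmt.Errorf("pod %s/%s container %s still owns %v of %q after failed admission",
						podNamespace, podName, cnt.GetName(), dev.GetDeviceIds(), resourceName)
				}
			}
		}
	}
	return nil
}
```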
@ffromani ffromani linked a pull request Nov 27, 2024 that will close this issue