[Failing Test] [sig-node] [NodeFeature:DevicePlugin] [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) #128443

Open
bart0sh opened this issue Oct 30, 2024 · 12 comments · May be fixed by #129010
Assignees
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
kind/flake: Categorizes issue or PR as related to a flaky test.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@bart0sh
Contributor

bart0sh commented Oct 30, 2024

Which jobs are flaking?

pull-kubernetes-node-kubelet-containerd-flaky

Which tests are flaking?

E2eNode Suite: [It] [sig-node] Device Plugin [NodeFeature:DevicePlugin] [Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [Flaky]

Since when has it been flaking?

I've noticed it only once today (Oct 30 2024), but I can see similar failures in other jobs, e.g. #128114

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-kubernetes-node-kubelet-containerd-flaky/1851562955540271104

Reason for failure (if possible)

At first glance, the failure could be caused by a bug in the device plugin registration process that prevents the plugin from re-registering after a kubelet restart. It doesn't happen consistently, though.

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@bart0sh bart0sh added the kind/flake label Oct 30, 2024
@k8s-ci-robot k8s-ci-robot added the sig/node label Oct 30, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage label Oct 30, 2024
@bart0sh bart0sh changed the title [Flaking Test] [sig-node] [NodeFeature:DevicePlugin] [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [Failing Test] [sig-node] [NodeFeature:DevicePlugin] [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) Nov 10, 2024
@bart0sh
Contributor Author

bart0sh commented Nov 10, 2024

/cc @SergeyKanzhelev @ffromani @kannon92 @pacoxu

The Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) test case is not flaking but consistently failing. Looking at the test and the kubelet code, I'm not sure I fully understand the test logic, in particular this part. How can we expect the podresources API to return device assignments if the kubelet is not connected to the device plugin after its restart?

Here is a quote from the log (with additional debugging output) from my environment with my comments:

  1. Device assignments are present in the checkpoint, and the kubelet reads them from there when restarted:
I1110 21:39:16.042459    4945 pod_devices.go:233] "Get checkpoint entry" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" resourceName="example.com/resource" deviceIDs={"-1":["Dev-1"]} allocated="\x12\x18\n\n/tmp/Dev-1\x12\n/tmp/Dev-1"
I1110 21:39:16.042499    4945 pod_devices.go:243] ">>> fromCheckpointData: inserted device" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" resourceName="example.com/resource" deviceIDs={"-1":["Dev-1"]} allocResp="&ContainerAllocateResponse{Envs:map[string]string{},Mounts:[]*Mount{&Mount{ContainerPath:/tmp/Dev-1,HostPath:/tmp/Dev-1,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},CDIDevices:[]*CDIDevice{},}" pdev.devs={"1d67afb3-2852-4368-b7a0-0be7dccb4bec":{"device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2":{"example.com/resource":{}}}}
I1110 21:39:16.042599    4945 manager.go:535] ">>> readCheckpoint" podDevices=[{"PodUID":"1d67afb3-2852-4368-b7a0-0be7dccb4bec","ContainerName":"device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2","ResourceName":"example.com/resource","DeviceIDs":{"-1":["Dev-1"]},"AllocResp":"EhgKCi90bXAvRGV2LTESCi90bXAvRGV2LTE="}] registeredDevs={"example.com/resource":["Dev-1","Dev-2"]} allocatedDevices={"example.com/resource":{"Dev-1":{}}}
I1110 21:39:16.966102    4945 manager.go:586] "Found pre-allocated devices for resource on pod" resourceName="example.com/resource" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" devices=["Dev-1"]
  2. However, the pod is considered orphaned after the admission failure and is scheduled for removal:
I1110 21:39:16.966193    4945 kubelet.go:2356] "Pod admission denied" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" pod="device-plugin-errors-2623/device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" reason="UnexpectedAdmissionError" message="Allocate failed due to no healthy devices present; cannot allocate unhealthy devices example.com/resource, which is unexpected"
I1110 21:39:16.972463    4945 status_manager.go:911] "Patch status for pod" pod="device-plugin-errors-2623/device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" patch="{\"metadata\":{\"uid\":\"1d67afb3-2852-4368-b7a0-0be7dccb4bec\"},\"status\":{\"conditions\":null,\"containerStatuses\":null,\"hostIP\":null,\"hostIPs\":null,\"podIP\":null,\"podIPs\":null}}"
I1110 21:39:18.028603    4945 kubelet_pods.go:2516] "Orphaned pod found, removing pod cgroups" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec"
I1110 21:39:26.032579    4945 status_manager.go:911] "Patch status for pod" pod="device-plugin-errors-2623/device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" patch="{\"metadata\":{\"uid\":\"1d67afb3-2852-4368-b7a0-0be7dccb4bec\"},\"status\":{\"message\":\"Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices example.com/resource, which is unexpected\",\"phase\":\"Failed\",\"reason\":\"UnexpectedAdmissionError\"}}"
I1110 21:39:26.922229    4945 manager.go:570] "Pods to be removed" podUIDs=["1d67afb3-2852-4368-b7a0-0be7dccb4bec"]
  3. After that, the podresources List API returns an empty device list for the test pod, as the device manager no longer has any info about it:
I1110 21:39:26.922318    4945 pod_devices.go:382] ">>> getContainerDevices" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec" containerName="device-plugin-test-a547be91-2af5-4511-aec7-8dadbf2001a2" pdev.devs={}
I1110 21:39:26.922329    4945 pod_devices.go:385] ">>> getContainerDevices: pod doesn't exist" podUID="1d67afb3-2852-4368-b7a0-0be7dccb4bec"

@ffromani Does the above make sense to you, or am I missing something obvious?
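
For reference, here is a minimal, self-contained sketch of how device assignments can be queried via the kubelet podresources v1 List API, which is what the test relies on. It assumes the default pod-resources socket path and uses the v1 client directly rather than the e2e_node test helpers; it is only an illustration.

```go
// Minimal sketch: list device assignments via the kubelet podresources v1 API.
// Assumes the default socket path /var/lib/kubelet/pod-resources/kubelet.sock.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Dial the kubelet's pod-resources unix socket.
	conn, err := grpc.DialContext(ctx, "unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("failed to dial podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		log.Fatalf("podresources List failed: %v", err)
	}

	// Print the device assignment per pod/container, as the test inspects it.
	for _, pod := range resp.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				fmt.Printf("%s/%s container %s: %s -> %v\n",
					pod.GetNamespace(), pod.GetName(), cnt.GetName(),
					dev.GetResourceName(), dev.GetDeviceIds())
			}
		}
	}
}
```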

@bart0sh
Contributor Author

bart0sh commented Nov 10, 2024

/kind failing-test

@k8s-ci-robot k8s-ci-robot added the kind/failing-test label Nov 10, 2024
@ffromani
Contributor

The test was working until #126343, which highlighted that the test is not well written.

The spirit of the test is to check that if the node is rebooted without being drained, the kubelet recreates the pods when it restarts and the pods keep their previous device assignment. This encodes a user scenario reported some time ago.

We should probably update the test; I'll have a look ASAP.

@pacoxu
Member

pacoxu commented Nov 11, 2024

@bart0sh
Contributor Author

bart0sh commented Nov 11, 2024

@ffromani

the test was working until #126343

JFYI: reverting #126343 doesn't fix the test for me.

@ffromani
Contributor

@ffromani

the test was working until #126343

JFYI: reverting #126343 doesn't fix the test for me.

Point taken. #126343 is IMO the final bit which demonstrates this test needs review/rewrite, because the preconditions this test was taking for granted are very likely no longer valid (and just to be obvious: #126343 is IMO the right thing to do, and by all means let's keep it).

@bart0sh
Contributor Author

bart0sh commented Nov 11, 2024

As far as I can see in the testgrid, the test was green on 29 Oct 2024. #126343 was merged back in August, so it couldn't be the reason for the recent failures.

@ffromani
Contributor

ffromani commented Nov 11, 2024

Thanks for checking, I was wrong. Will restart from square one :)
(I still doubt the test is valid nowadays, but let's put that aside for now.)

ffromani added a commit to ffromani/kubernetes that referenced this issue Nov 19, 2024
Please see k/k issue kubernetes#128443 for more context.

A device manager test is permafailing, and has been for a few
weeks. Skip the offending test until we have time to dig in
and understand the failure. Considering recent changes, there is
a fair chance the test is now obsolete because its preconditions
are now void, which is what causes the permafail.
We may need to update or remove it.

Signed-off-by: Francesco Romani <fromani@redhat.com>
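
(For illustration only: a minimal, self-contained sketch of what skipping a permafailing node-e2e spec can look like in Ginkgo. This is an assumed pattern, not the actual change from the referenced commit.)

```go
// Illustration of parking a permafailing Ginkgo spec until it can be reworked.
// Assumed pattern, not the code from the referenced commit.
package e2enodesketch

import (
	"context"
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestDevicePluginSketch(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "device plugin sketch")
}

var _ = ginkgo.Describe("Device Plugin", func() {
	ginkgo.It("Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration)", func(ctx context.Context) {
		// Park the spec while the failure described in #128443 is investigated.
		ginkgo.Skip("permafailing, tracked in kubernetes/kubernetes#128443")
		// original test body would run here
	})
})
```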
@ffromani
Contributor

/triage accepted
/assign

@k8s-ci-robot k8s-ci-robot added the triage/accepted label and removed the needs-triage label Nov 20, 2024
@ffromani ffromani moved this from Triage to Issues - In progress in SIG Node CI/Test Board Nov 20, 2024
ffromani added a commit to ffromani/kubernetes that referenced this issue Nov 20, 2024
@kannon92
Contributor

We have not merged #128870, so I don't expect this to be fixed yet. The main option seems to be to disable this test.

@bart0sh
Contributor Author

bart0sh commented Nov 27, 2024

@kannon92 @dims @ffromani reverting part of #127506 seems to fix this issue. PTAL at #128995

ffromani added a commit to ffromani/kubernetes that referenced this issue Nov 27, 2024
We have an e2e test which tries to ensure device plugin assignments to pods
are kept across node reboots. This test has been permafailing for many weeks
at the time of writing (xref: kubernetes#128443).

The problem is that closer inspection reveals the test was well intentioned,
but puzzling:
the test runs a pod, then restarts the kubelet, then _expects the pod to
end up in admission failure_ and yet _ensures the device assignment is
kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder whether this means the device will be kept busy forever.

Luckily, this is not the case. The test, however, embodied the kubelet
behavior of the time, in turn caused by kubernetes#103979.

The device manager used to record the last admitted pod and forcibly add it
to the list of active pods. The retention logic had space for exactly one
pod: the last one which attempted admission.

This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549,
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported as
allocated to the failed pod.

This fact was in turn leveraged by the test in question:
the test uses the podresources API to learn about the device assignment,
and because of the chain of events above, the pod failed admission yet
was still reported as owning the device.

What happened, however, was that the next pod trying admission would have
replaced the previous pod in the device manager data, so the previous
pod was no longer forced into the active list and its
assignment was correctly cleared once the cleanup code ran.
The cleanup code runs, among other things, every time the device
manager is asked to allocate devices and every time the podresources API
queries the device assignment.

Later, in PR kubernetes#120661,
the forced retention logic was removed from all the resource managers,
thus also from the device manager, and this is what caused the permafailure.

Because of all of the above, it should be evident that the e2e test was
actually enforcing a very specific and not really working-as-intended
behavior, which was also quite puzzling for users.

The best we can do is to fix the test to record and ensure that
pods which failed admission _do not_ retain their device assignment.

Unfortunately, we _cannot_ guarantee the desirable property that
pods which go running retain their device assignment across node reboots.

In the kubelet restart flow, all pods race to be admitted. There is no
order enforced between device plugin pods and application pods, unless
an application pod is lucky enough to _lose_ the race with both
the device plugin (which needs to go running before the app pod does) and
_also_ with the kubelet (which needs to mark devices healthy before the pod
tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
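
To make the proposed fix concrete, here is a hedged sketch (not the actual PR code) of the check described above: after the kubelet restart, a pod that failed admission must no longer be reported by the podresources API as owning any device. The package and function names are hypothetical.

```go
// Hedged sketch of the relaxed expectation described above; not the PR code.
// Package and function names are hypothetical.
package devicecheck

import (
	"fmt"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectNoDeviceRetained returns an error if the given (admission-rejected)
// pod is still reported by the podresources List response as owning devices
// of the given resource.
func expectNoDeviceRetained(resp *podresourcesv1.ListPodResourcesResponse, podNamespace, podName, resourceName string) error {
	for _, pod := range resp.GetPodResources() {
		if pod.GetNamespace() != podNamespace || pod.GetName() != podName {
			continue
		}
		for _, cnt := range pod.GetContainers() {
			for _, dev := range cnt.GetDevices() {
				if dev.GetResourceName() == resourceName && len(dev.GetDeviceIds()) > 0 {
					return fmt.Errorf("pod %s/%s container %s still owns %v of %q after failed admission",
						podNamespace, podName, cnt.GetName(), dev.GetDeviceIds(), resourceName)
				}
			}
		}
	}
	return nil
}
```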
@ffromani ffromani linked a pull request Nov 27, 2024 that will close this issue