
Do not clear state of pods pending admission for CPU/Memory/Device manager #103979

Merged: 3 commits, Aug 9, 2021

Conversation


@cynepco3hahue commented Jul 28, 2021

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

Do not remove admitted pods from the state until the pod has passed the admission phase, in the following managers (a conceptual sketch follows the list):

  • CPU manager
  • Memory manager
  • Device manager
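
Conceptually, the fix amounts to bookkeeping like the following Go sketch. Only the updateAdmittedPods signature is confirmed by the diff under review below; the struct layout and the method body are assumptions for illustration (v1 is k8s.io/api/core/v1, types is k8s.io/apimachinery/pkg/types):

// Sketch only: keep pods that passed admission but are not yet in
// activePods() visible to the manager, so removeStaleState does not
// free their just-allocated exclusive resources in that window.
type manager struct {
    mutex        sync.Mutex
    admittedPods map[types.UID]*v1.Pod // admitted, not yet in activePods()
    activePods   func() []*v1.Pod
    // ... remaining manager fields ...
}

func (m *manager) updateAdmittedPods(pod *v1.Pod) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    // Drop tracked pods that have since become active...
    for _, p := range m.activePods() {
        delete(m.admittedPods, p.UID)
    }
    // ...and remember the pod currently going through admission.
    if pod != nil {
        m.admittedPods[pod.UID] = pod
    }
}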

Which issue(s) this PR fixes:

Fixes #103952

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixes a 1.22 regression in kubelet pod admission

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Signed-off-by: Artyom Lukianov <alukiano@redhat.com>

@k8s-ci-robot added the release-note-none, size/M, kind/bug, kind/regression, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, area/kubelet, and sig/node labels Jul 28, 2021
@k8s-ci-robot requested review from klueska and resouer July 28, 2021
@k8s-ci-robot removed the do-not-merge/needs-sig label Jul 28, 2021
@k8s-ci-robot added the size/L label and removed the size/M label Jul 28, 2021
@cynepco3hahue (Author)

@smarterclayton @klueska @fromanirh FYI

@cynepco3hahue (Author)

/test pull-kubernetes-node-kubelet-serial-cpu-manager

// Consider pods pending admission, in addition to active pods, when
// deciding which state entries are stale.
activeAndAdmittedPods := m.activePods()
for _, pod := range m.admittedPods {
    activeAndAdmittedPods = append(activeAndAdmittedPods, pod)
}
Contributor

I think this is only partially correct. You need to prune the admitted pods map to include only pods that are still in either config or the pod worker, but not in both (a pod that is in the pod worker is by definition admitted). You also want to exclude pods that are no longer running due to termination, which means we will never start their containers again and can reuse those CPUs. You then need to take that pruned list of admitted pods and add it to active pods to get the list of "pods that should be considered to be allocated guaranteed resources".

I think we'll need to add a method on the kubelet to make this easier; I'm looking at what that would be right now.
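
A rough sketch of the pruning described above, reusing the admittedPods map from the diff; the three predicates are hypothetical stand-ins for whatever kubelet method would eventually expose this information:

// Hypothetical sketch; none of these predicate helpers exist under
// these names in the kubelet.
func pruneAdmittedPods(
    admitted map[types.UID]*v1.Pod,
    inConfig func(*v1.Pod) bool,     // pod still present in config sources
    inPodWorker func(*v1.Pod) bool,  // known to the pod worker => admitted
    isTerminated func(*v1.Pod) bool, // will never start containers again
) {
    for uid, pod := range admitted {
        // A pod present in both config and the pod worker no longer
        // needs a separate record: the pod worker already tracks it.
        if inConfig(pod) && inPodWorker(pod) {
            delete(admitted, uid)
            continue
        }
        // Terminated pods never run containers again, so their
        // exclusive CPUs can be reused.
        if isTerminated(pod) {
            delete(admitted, uid)
        }
    }
}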

Contributor

Looking at this, I think we have a bug: these components need to know about terminating pods that have been force deleted, which activePods() does not include. @Random-Liu I think that components like the cpu_manager, memory_manager, qos_container_manager, and device_manager actually need to know about pods that have been force deleted but are still running. To do that, I'm going to need to enable the pod worker to share that info safely (and consistently, I think).

@klueska (Contributor), Jul 29, 2021

I'm not sure what you suggest here is necessary. As you mentioned in your comment here, all we should really need to care about is whether the pod is:

(1) in the active pods list; or
(2) some pod currently waiting to be admitted

Assuming logic exists to atomically add an admitted pod to the active pod list before starting the next iteration of the admission loop, I believe this should be sufficient. If it doesn't make it into the active list, then we shouldn't be tracking it anyway, and if it's ever removed from the list, we should be done tracking it.

We then just need to track a single variable holding the "currently being admitted" pod, overwritten each time a new iteration of the admit loop starts.

The check in removeStaleState() would then remove any state not currently associated with a pod in the active list + this new variable.

Of course this is all contingent on what I said before:

Assuming logic exists to atomically add an admitted pod to the active pod list before starting the next iteration of the admission loop, I believe this should be sufficient.

Maybe there's good reason not to do this?
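
A minimal sketch of that suggestion, assuming a single pendingAdmissionPod field replaces the admittedPods map (the merged code ended up with this single-slot shape, but the identifiers here are illustrative):

func (m *manager) setPodPendingAdmission(pod *v1.Pod) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    // Overwritten on every admission attempt: only the pod currently in
    // the admit loop needs protection from cleanup.
    m.pendingAdmissionPod = pod
}

func (m *manager) removeStaleState() {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    podsToRetain := m.activePods()
    if m.pendingAdmissionPod != nil {
        podsToRetain = append(podsToRetain, m.pendingAdmissionPod)
    }
    // Free any per-pod state whose owner is not in podsToRetain.
    // ... existing cleanup over m.state ...
}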

Contributor

(1) in the active pods list; or

The bug is that the active pods list does not include force deleted pods that may take tens of minutes to complete that may still have running containers that are pinned to certain CPUs.

Contributor

Is that a problem, though? I would think we can then safely remove the CPUs assigned to them, and the next time around the reconcile loop their containers will be moved onto non-exclusive CPUs.

@ffromani (Contributor), Aug 2, 2021

Although, I guess if activePods() used to include terminated (but still running) pods before, then we never would have removed stale state about them here, making us "think" their CPUs were free (even though containers are still running on them).

AFAIU/AFAICT this is the most pressing question now. I fully agree with the general idea of restoring the old behaviour (which, we learned, does not cover some important cases) and iterating later to actually cover those cases. The former is important and urgent; the latter is important as well, but less urgent.

Contributor

ActivePods() never returned pods that were force deleted. So you have always been broken on that.

Author

In this case, can we at least restore the old behavior in the current PR? Once we have the infrastructure to get force-deleted pods, we can improve the logic in the resource managers.

@cynepco3hahue (Author), Aug 3, 2021

I opened an issue to track the force-deleted pods problem: #104099

Contributor

I agree this does not block this change, and the fix for #104099 should probably make the code changes here simpler (but we don't have to do it now).

@ffromani (Contributor)

/assign @klueska
Kevin, this is the current direction for fixing the resource managers.

@@ -430,3 +452,19 @@ func (m *manager) GetAllocatableMemory() []state.Block {
func (m *manager) GetMemory(podUID, containerName string) []state.Block {
    return m.state.GetMemoryBlocks(podUID, containerName)
}

func (m *manager) updateAdmittedPods(pod *v1.Pod) {
Contributor

minor: should we try to generalize this logic?

Author

We can try to generalize it, but it is a small chunk of code, so for now I do not want to overcomplicate it with additional structures and interfaces.
There are other places we could generalize as well; it is worth creating a separate PR for all the generalization changes.

Contributor

fine for me.
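
For reference, the generalization floated above could look roughly like the following; this is purely hypothetical and, per the exchange, was deliberately left for a separate PR:

// Hypothetical shared tracker each resource manager could embed
// instead of duplicating the updateAdmittedPods bookkeeping.
type admissionTracker struct {
    mu      sync.Mutex
    pending *v1.Pod // last pod to attempt admission
}

func (t *admissionTracker) SetPendingAdmission(pod *v1.Pod) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.pending = pod
}

// PodsToRetain returns the pods whose resource state must be kept:
// everything active plus the pod currently going through admission.
func (t *admissionTracker) PodsToRetain(active []*v1.Pod) []*v1.Pod {
    t.mu.Lock()
    defer t.mu.Unlock()
    out := make([]*v1.Pod, 0, len(active)+1)
    out = append(out, active...)
    if t.pending != nil {
        out = append(out, t.pending)
    }
    return out
}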

@ehashman (Member) left a comment

/priority important-soon
/triage accepted

@k8s-ci-robot added the priority/important-soon and triage/accepted labels and removed the needs-priority and needs-triage labels Aug 2, 2021
@cynepco3hahue (Author)

There's no bot support in upstream k8s, you can run the ./hack/cherry_pick_pull.sh script (or I can)

Thanks for the information, I will do it.

k8s-ci-robot added a commit that referenced this pull request Aug 16, 2021
…f-#103979-release-1.22

Automated cherry pick of #103979: cpu manager: do not clean admitted pods from the state
@smarterclayton (Contributor)

Note that I have a PR I'll open soon that correctly accounts for all admitted and still-running pods from the pod worker (i.e., any force-deleted pod may still be terminating, so GetActivePods is incorrect today). When that lands it will fix the problem mentioned in the review threads where GetActivePods() is currently not broad enough.

@cynepco3hahue (Author)

Note that I have a PR I'll open soon that correctly accounts for all admitted and still-running pods from the pod worker (i.e., any force-deleted pod may still be terminating, so GetActivePods is incorrect today). When that lands it will fix the problem mentioned in the review threads where GetActivePods() is currently not broad enough.

Great, thanks for the update.

@k8s-ci-robot added the release-note label and removed the release-note-none label Sep 9, 2023
likakuli added a commit to likakuli/kubernetes that referenced this pull request Sep 14, 2023
Signed-off-by: likakuli <1154584512@qq.com>
likakuli added a commit to likakuli/kubernetes that referenced this pull request Dec 16, 2023
Signed-off-by: likakuli <1154584512@qq.com>
likakuli added a commit to likakuli/kubernetes that referenced this pull request Jun 4, 2024
Signed-off-by: likakuli <1154584512@qq.com>
likakuli added a commit to likakuli/kubernetes that referenced this pull request Oct 8, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 8, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 10, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 10, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 14, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 23, 2024
Signed-off-by: likakuli <1154584512@qq.com>
felipeagger pushed a commit to felipeagger/kubernetes that referenced this pull request Nov 6, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 27, 2024
We have an e2e test which tries to ensure that device plugin assignments to pods are kept across node reboots, and this test has been permafailing for many weeks at the time of writing (xref: kubernetes#128443).

Closer inspection reveals the test was well intentioned, but puzzling: it runs a pod, restarts the kubelet, then _expects the pod to end up in admission failure_ and yet _ensures the device assignment is kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder whether this means the device will be kept busy forever.

Luckily, this is not the case. The test, however, embodied the behavior of the kubelet at the time, in turn caused by kubernetes#103979.

The device manager used to record the last admitted pod and forcibly add it to the list of active pods. The retention logic had space for exactly one pod: the last one which attempted admission.

This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549,
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported as allocated to the failed pod.

This fact was in turn leveraged by the test in question: the test uses the podresources API to learn about the device assignment, and because of the chain of events above, the pod failed admission yet was still reported as owning the device.

What happened, however, was that the next pod trying admission would replace the previous pod in the device manager data. The previous pod was then no longer forced into the active list, so its assignment was correctly cleared once the cleanup code ran; and the cleanup code runs, among other things, every time the device manager is asked to allocate devices and every time the podresources API queries the device assignment.

Later, in PR kubernetes#120661, the forced retention logic was removed from all the resource managers, thus also from the device manager, and this is what caused the permafailure.

Given all of the above, it should be evident that the e2e test was actually enforcing a very specific, not-working-as-intended behavior, which was also quite puzzling for users.

The best we can do is fix the test to record and ensure that pods which failed admission _do not_ retain device assignments.

Unfortunately, we _cannot_ guarantee the desirable property that pods that go running retain their device assignment across node reboots.

In the kubelet restart flow, all pods race to be admitted, and no order is enforced between device plugin pods and application pods. An application pod keeps its assignment only if it is lucky enough to _lose_ the race with both the device plugin (which must go running before the app pod does) and the kubelet (which needs to mark devices healthy before the pod tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 27, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 27, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 28, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 28, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 28, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Dec 4, 2024