
Do not clear state of pods pending admission for CPU/Memory/Device manager #103979

Merged: 3 commits, Aug 9, 2021

Conversation


@cynepco3hahue commented Jul 28, 2021

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

Do not remove admitted pods from the state until the pod has passed the admission phase, in the following managers (a conceptual sketch follows the list):

  • CPU manager
  • Memory manager
  • Device manager
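
Conceptually, the fix amounts to bookkeeping like the following Go sketch. Only the updateAdmittedPods signature is confirmed by the diff under review below; the struct layout and the method body are assumptions for illustration (v1 is k8s.io/api/core/v1, types is k8s.io/apimachinery/pkg/types):

// Sketch only: keep pods that passed admission but are not yet in
// activePods() visible to the manager, so removeStaleState does not
// free their just-allocated exclusive resources in that window.
type manager struct {
    mutex        sync.Mutex
    admittedPods map[types.UID]*v1.Pod // admitted, not yet in activePods()
    activePods   func() []*v1.Pod
    // ... remaining manager fields ...
}

func (m *manager) updateAdmittedPods(pod *v1.Pod) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    // Drop tracked pods that have since become active...
    for _, p := range m.activePods() {
        delete(m.admittedPods, p.UID)
    }
    // ...and remember the pod currently going through admission.
    if pod != nil {
        m.admittedPods[pod.UID] = pod
    }
}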

Which issue(s) this PR fixes:

Fixes #103952

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixes a 1.22 regression in kubelet pod admission

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Signed-off-by: Artyom Lukianov <alukiano@redhat.com>

@k8s-ci-robot added the release-note-none, size/M, kind/bug, kind/regression, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, area/kubelet, and sig/node labels Jul 28, 2021
@k8s-ci-robot requested review from klueska and resouer July 28, 2021
@k8s-ci-robot removed the do-not-merge/needs-sig label Jul 28, 2021
@k8s-ci-robot added the size/L label and removed the size/M label Jul 28, 2021
@cynepco3hahue (Author)

@smarterclayton @klueska @fromanirh FYI

@cynepco3hahue (Author)

/test pull-kubernetes-node-kubelet-serial-cpu-manager

// Consider pods pending admission, in addition to active pods, when
// deciding which state entries are stale.
activeAndAdmittedPods := m.activePods()
for _, pod := range m.admittedPods {
    activeAndAdmittedPods = append(activeAndAdmittedPods, pod)
}
Contributor

I think this is only partially correct. You need to prune the admitted pods map to include only pods that are still in either config or the pod worker, but not in both (a pod that is in the pod worker is by definition admitted). You also want to exclude pods that are no longer running due to termination, which means we will never start their containers again and can reuse those CPUs. You then need to take that pruned list of admitted pods and add it to active pods to get the list of "pods that should be considered to be allocated guaranteed resources".

I think we'll need to add a method on the kubelet to make this easier; I'm looking at what that would be right now.
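
A rough sketch of the pruning described above, reusing the admittedPods map from the diff; the three predicates are hypothetical stand-ins for whatever kubelet method would eventually expose this information:

// Hypothetical sketch; none of these predicate helpers exist under
// these names in the kubelet.
func pruneAdmittedPods(
    admitted map[types.UID]*v1.Pod,
    inConfig func(*v1.Pod) bool,     // pod still present in config sources
    inPodWorker func(*v1.Pod) bool,  // known to the pod worker => admitted
    isTerminated func(*v1.Pod) bool, // will never start containers again
) {
    for uid, pod := range admitted {
        // A pod present in both config and the pod worker no longer
        // needs a separate record: the pod worker already tracks it.
        if inConfig(pod) && inPodWorker(pod) {
            delete(admitted, uid)
            continue
        }
        // Terminated pods never run containers again, so their
        // exclusive CPUs can be reused.
        if isTerminated(pod) {
            delete(admitted, uid)
        }
    }
}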

Contributor

Looking at this, I think we have a bug: these components need to know about terminating pods that have been force deleted, which activePods() does not include. @Random-Liu I think that components like the cpu_manager, memory_manager, qos_container_manager, and device_manager actually need to know about pods that have been force deleted but are still running. To do that, I'm going to need to enable the pod worker to share that info safely (and consistently, I think).

@klueska (Contributor), Jul 29, 2021

I'm not sure what you suggest here is necessary. As you mentioned in your comment here, all we should really need to care about is whether the pod is:

(1) in the active pods list; or
(2) some pod currently waiting to be admitted

Assuming logic exists to atomically add an admitted pod to the active pod list before starting the next iteration of the admission loop, I believe this should be sufficient. If it doesn't make it into the active list, then we shouldn't be tracking it anyway, and if it's ever removed from the list, we should be done tracking it.

We then just need to track a single variable holding the "currently being admitted" pod, overwritten each time a new iteration of the admit loop starts.

The check in removeStaleState() would then remove any state not currently associated with a pod in the active list + this new variable.

Of course this is all contingent on what I said before:

Assuming logic exists to atomically add an admitted pod to the active pod list before starting the next iteration of the admission loop, I believe this should be sufficient.

Maybe there's good reason not to do this?
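
A minimal sketch of that suggestion, assuming a single pendingAdmissionPod field replaces the admittedPods map (the merged code ended up with this single-slot shape, but the identifiers here are illustrative):

func (m *manager) setPodPendingAdmission(pod *v1.Pod) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    // Overwritten on every admission attempt: only the pod currently in
    // the admit loop needs protection from cleanup.
    m.pendingAdmissionPod = pod
}

func (m *manager) removeStaleState() {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    podsToRetain := m.activePods()
    if m.pendingAdmissionPod != nil {
        podsToRetain = append(podsToRetain, m.pendingAdmissionPod)
    }
    // Free any per-pod state whose owner is not in podsToRetain.
    // ... existing cleanup over m.state ...
}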

Contributor

(1) in the active pods list; or

The bug is that the active pods list does not include force deleted pods that may take tens of minutes to complete that may still have running containers that are pinned to certain CPUs.

Contributor

Is that a problem, though? I would think we can then safely remove the CPUs assigned to them, and the next time around the reconcile loop their containers will be moved onto non-exclusive CPUs.

@ffromani (Contributor), Aug 2, 2021

Although, I guess if activePods() used to include terminated (but still running) pods before, then we never would have removed stale state about them here, making us "think" their CPUs were free (even though containers are still running on them).

AFAIU/AFAICT this is the most pressing question now. I fully agree with the general idea of restoring the old behaviour (which, we learned, does not cover some important cases) and iterating later to actually cover those cases. The former is important and urgent; the latter is important as well, but less urgent.

Contributor

ActivePods() never returned pods that were force deleted. So you have always been broken on that.

Author

In this case, can we at least restore the old behavior in the current PR? Once we have the infrastructure to get force-deleted pods, we can improve the logic in the resource managers.

@cynepco3hahue (Author), Aug 3, 2021

I opened an issue to track the force-deleted pods problem: #104099

Contributor

I agree this does not block this change, and the fix for #104099 should probably make the code changes here simpler (but we don't have to do it now).

@ffromani (Contributor)

/assign @klueska
Kevin, this is the current direction for fixing the resource managers.

@@ -430,3 +452,19 @@ func (m *manager) GetAllocatableMemory() []state.Block {
func (m *manager) GetMemory(podUID, containerName string) []state.Block {
    return m.state.GetMemoryBlocks(podUID, containerName)
}

func (m *manager) updateAdmittedPods(pod *v1.Pod) {
Contributor

minor: should we try to generalize this logic?

Author

We can try to generalize it, but it is a small chunk of code, so for now I do not want to overcomplicate it with additional structures and interfaces.
There are other places we could generalize as well; it is worth creating a separate PR for all the generalization changes.

Contributor

fine for me.
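
For reference, the generalization floated above could look roughly like the following; this is purely hypothetical and, per the exchange, was deliberately left for a separate PR:

// Hypothetical shared tracker each resource manager could embed
// instead of duplicating the updateAdmittedPods bookkeeping.
type admissionTracker struct {
    mu      sync.Mutex
    pending *v1.Pod // last pod to attempt admission
}

func (t *admissionTracker) SetPendingAdmission(pod *v1.Pod) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.pending = pod
}

// PodsToRetain returns the pods whose resource state must be kept:
// everything active plus the pod currently going through admission.
func (t *admissionTracker) PodsToRetain(active []*v1.Pod) []*v1.Pod {
    t.mu.Lock()
    defer t.mu.Unlock()
    out := make([]*v1.Pod, 0, len(active)+1)
    out = append(out, active...)
    if t.pending != nil {
        out = append(out, t.pending)
    }
    return out
}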

@ehashman (Member) left a comment

/priority important-soon
/triage accepted

@k8s-ci-robot added the priority/important-soon and triage/accepted labels and removed the needs-priority and needs-triage labels Aug 2, 2021
@cynepco3hahue (Author)

There's no bot support in upstream k8s, you can run the ./hack/cherry_pick_pull.sh script (or I can)

Thanks for the information, I will do it.

k8s-ci-robot added a commit that referenced this pull request Aug 16, 2021
…f-#103979-release-1.22

Automated cherry pick of #103979: cpu manager: do not clean admitted pods from the state
@smarterclayton (Contributor)

Note that I have a PR I'll open soon that correctly accounts for all admitted and still-running pods from the pod worker (i.e., any force-deleted pod may still be terminating, so GetActivePods is incorrect today). When that lands it will fix the problem mentioned in the review threads where GetActivePods() is currently not broad enough.

@cynepco3hahue (Author)

Note that I have a PR I'll open soon that correctly accounts for all admitted and still-running pods from the pod worker (i.e., any force-deleted pod may still be terminating, so GetActivePods is incorrect today). When that lands it will fix the problem mentioned in the review threads where GetActivePods() is currently not broad enough.

Great, thanks for the update.

@k8s-ci-robot added the release-note label and removed the release-note-none label Sep 9, 2023
likakuli added a commit to likakuli/kubernetes that referenced this pull request Sep 14, 2023
Signed-off-by: likakuli <1154584512@qq.com>
likakuli added a commit to likakuli/kubernetes that referenced this pull request Dec 16, 2023
Signed-off-by: likakuli <1154584512@qq.com>
likakuli added a commit to likakuli/kubernetes that referenced this pull request Jun 4, 2024
Signed-off-by: likakuli <1154584512@qq.com>
likakuli added a commit to likakuli/kubernetes that referenced this pull request Oct 8, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 8, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 10, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 10, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 14, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani pushed a commit to ffromani/kubernetes that referenced this pull request Oct 23, 2024
Signed-off-by: likakuli <1154584512@qq.com>
felipeagger pushed a commit to felipeagger/kubernetes that referenced this pull request Nov 6, 2024
Signed-off-by: likakuli <1154584512@qq.com>
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 27, 2024
We have an e2e test which tries to ensure that device plugin assignments to pods are kept across node reboots, and this test has been permafailing for many weeks at the time of writing (xref: kubernetes#128443).

Closer inspection reveals the test was well intentioned, but puzzling: it runs a pod, restarts the kubelet, then _expects the pod to end up in admission failure_ and yet _ensures the device assignment is kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder whether this means the device will be kept busy forever.

Luckily, this is not the case. The test, however, embodied the behavior of the kubelet at the time, in turn caused by kubernetes#103979.

The device manager used to record the last admitted pod and forcibly add it to the list of active pods. The retention logic had space for exactly one pod: the last one which attempted admission.

This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549,
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported as allocated to the failed pod.

This fact was in turn leveraged by the test in question: the test uses the podresources API to learn about the device assignment, and because of the chain of events above, the pod failed admission yet was still reported as owning the device.

What happened, however, was that the next pod trying admission would replace the previous pod in the device manager data. The previous pod was then no longer forced into the active list, so its assignment was correctly cleared once the cleanup code ran; and the cleanup code runs, among other things, every time the device manager is asked to allocate devices and every time the podresources API queries the device assignment.

Later, in PR kubernetes#120661, the forced retention logic was removed from all the resource managers, thus also from the device manager, and this is what caused the permafailure.

Given all of the above, it should be evident that the e2e test was actually enforcing a very specific, not-working-as-intended behavior, which was also quite puzzling for users.

The best we can do is fix the test to record and ensure that pods which failed admission _do not_ retain device assignments.

Unfortunately, we _cannot_ guarantee the desirable property that pods that go running retain their device assignment across node reboots.

In the kubelet restart flow, all pods race to be admitted, and no order is enforced between device plugin pods and application pods. An application pod keeps its assignment only if it is lucky enough to _lose_ the race with both the device plugin (which must go running before the app pod does) and the kubelet (which needs to mark devices healthy before the pod tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 27, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 27, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 28, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 28, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Nov 28, 2024
ffromani added a commit to ffromani/kubernetes that referenced this pull request Dec 4, 2024