Do not clear state of pods pending admission for CPU/Memory/Device manager #103979
Conversation
Force-pushed from 82276cd to a016c0c
@smarterclayton @klueska @fromanirh FYI
/test pull-kubernetes-node-kubelet-serial-cpu-manager
activeAndAdmittedPods := m.activePods()
for _, pod := range m.admittedPods {
    activeAndAdmittedPods = append(activeAndAdmittedPods, pod)
}
I think this is only partially correct. You need to prune the admitted pods map to only include those that are still in either config or the pod worker but not in both (a pod that is in the pod worker is by definition admitted). You also want to exclude pods that are no longer running due to termination (which means we will never start their containers again, so we can reuse those CPUs). You then need to take that pruned list of admitted pods and add it to active pods to get the list of "pods that should be considered to be allocated guaranteed resources".
I think we'll need to add a method on kubelet to make this easier - looking at what that would be right now.
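A minimal sketch of that pruning, assuming a hypothetical podStillNeedsResources helper backed by the pod worker; none of these names come from this PR, they are illustrative only:

// Illustrative sketch only, not the PR's actual code.
package resourcemanager

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
)

type manager struct {
    activePods   func() []*v1.Pod
    admittedPods map[types.UID]*v1.Pod
    // podStillNeedsResources is assumed to be wired to pod worker state; it
    // returns false once a pod is terminated and will never run again.
    podStillNeedsResources func(uid types.UID) bool
}

// podsToConsider prunes the admitted-pods map and merges it with the active
// pods, yielding the set whose resource assignments must be preserved.
func (m *manager) podsToConsider() []*v1.Pod {
    active := m.activePods()
    seen := make(map[types.UID]struct{}, len(active))
    for _, pod := range active {
        seen[pod.UID] = struct{}{}
    }

    result := active
    for uid, pod := range m.admittedPods {
        if _, ok := seen[uid]; ok {
            // Already covered by activePods(); no need to track it separately.
            delete(m.admittedPods, uid)
            continue
        }
        if !m.podStillNeedsResources(uid) {
            // Terminated and never restarting: its CPUs/devices can be reused.
            delete(m.admittedPods, uid)
            continue
        }
        result = append(result, pod)
    }
    return result
}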
Looking at this, I think we have a bug in that these components need to know about pods that are terminating after having been force deleted (which activePods() does not include). @Random-Liu I think that components like the cpu_manager, memory_manager, qos_container_manager, and device_manager actually need to know about pods that have been force deleted but are still running. To do that, I'm going to need to enable the pod worker to share that info safely (and consistently, I think).
I'm not sure what you suggest here is necessary. As you mentioned in your comment here, all we should really need to care about is whether the pod is:
(1) in the active pods list; or
(2) some pod currently waiting to be admitted
Assuming logic exists to atomically add an admitted pod to the active pod list before starting the next iteration of the admission loop, I believe this should be sufficient. If it doesn't make it into the active list, then we shouldn't be tracking it anyway, and if it's ever removed from the list, we should be done tracking it.
We then just need to track a single variable holding the "currently being admitted" pod, which gets overwritten each time a new iteration of the admit loop starts.
The check in removeStaleState() would then remove any state not currently associated with a pod in the active list + this new variable.
Of course this is all contingent on what I said before:
Assuming logic exists to atomically add an admitted pod to the active pod list before starting the next iteration of the admission loop, I believe this should be sufficient.
Maybe there's good reason not to do this?
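For reference, a minimal sketch of that single-variable approach, under the assumption that the manager's state exposes the pod UIDs it tracks; the names here are illustrative, not the merged API:

// Illustrative sketch of the single-variable approach.
package resourcemanager

import (
    "sync"

    v1 "k8s.io/api/core/v1"
)

// podState is a stand-in for the managers' checkpointed assignment state.
type podState interface {
    GetPodUIDs() []string
    Delete(podUID string)
}

type manager struct {
    mutex               sync.Mutex
    activePods          func() []*v1.Pod
    state               podState
    pendingAdmissionPod *v1.Pod // overwritten at the start of each admit iteration
}

// setPodPendingAdmission records the pod currently going through admission.
func (m *manager) setPodPendingAdmission(pod *v1.Pod) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    m.pendingAdmissionPod = pod
}

// removeStaleState drops state for any pod that is neither active nor the
// one currently pending admission.
func (m *manager) removeStaleState() {
    m.mutex.Lock()
    defer m.mutex.Unlock()

    keep := make(map[string]struct{})
    for _, pod := range m.activePods() {
        keep[string(pod.UID)] = struct{}{}
    }
    if m.pendingAdmissionPod != nil {
        keep[string(m.pendingAdmissionPod.UID)] = struct{}{}
    }
    for _, uid := range m.state.GetPodUIDs() {
        if _, ok := keep[uid]; !ok {
            m.state.Delete(uid)
        }
    }
}

The key invariant in this sketch is that setPodPendingAdmission runs before the manager's allocation step for that pod, so removeStaleState can never observe a just-admitted pod that is not yet in activePods().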
(1) in the active pods list; or
The bug is that the active pods list does not include force deleted pods, which may take tens of minutes to complete termination and may still have running containers pinned to certain CPUs.
Is that a problem though? I would think we can safely remove the CPUs assigned to them then, and the next time around the reconcile loop, their containers will be moved onto non-exclusive CPUs.
Although I guess if activePods() used to include terminated (but still running) pods before, then we never would have removed stale state about them here, making us "think" their CPUs were free (even though containers are still running on them).
As far as I understand/can tell, this is the most pressing question now. I fully agree with the general idea of restoring the old behaviour (which, we learned, does not cover some important cases) and iterating later to actually cover them. The former is important and urgent, the latter important as well but less urgent.
ActivePods() never returned pods that were force deleted. So you have always been broken on that.
In this case, can we at least restore the old behavior under the current PR? Once we have the infrastructure to get force deleted pods, we can improve the logic in the resource managers.
I opened an issue to track the force deleted pods problem: #104099
I agree this does not block this change, and the fix for #104099 will probably make the code changes here simpler (but we don't have to do it now).
/assign @klueska
@@ -430,3 +452,19 @@ func (m *manager) GetAllocatableMemory() []state.Block {
func (m *manager) GetMemory(podUID, containerName string) []state.Block {
    return m.state.GetMemoryBlocks(podUID, containerName)
}

func (m *manager) updateAdmittedPods(pod *v1.Pod) {
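Since the diff above is truncated, here is a hypothetical reconstruction of what a helper like updateAdmittedPods could do, assuming the memory manager keeps an admittedPods map keyed by pod UID; the body is an assumption for illustration, not the actual change:

// Hypothetical reconstruction, not the actual diff (which is truncated above).
package memorymanager

import (
    v1 "k8s.io/api/core/v1"
)

type manager struct {
    activePods   func() []*v1.Pod
    admittedPods map[string]*v1.Pod
}

// updateAdmittedPods remembers a pod that passed this manager's allocation
// step but is not yet visible through activePods(), and forgets previously
// admitted pods that activePods() now covers.
func (m *manager) updateAdmittedPods(pod *v1.Pod) {
    active := make(map[string]struct{})
    for _, p := range m.activePods() {
        active[string(p.UID)] = struct{}{}
    }

    for uid := range m.admittedPods {
        if _, ok := active[uid]; ok {
            delete(m.admittedPods, uid)
        }
    }

    if _, ok := active[string(pod.UID)]; !ok {
        m.admittedPods[string(pod.UID)] = pod
    }
}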
minor: should we try to generalize this logic?
We can try to generalize it, but it is a small chunk of code, so for now I do not want to overcomplicate it with additional structures and interfaces. We have some additional places that we can generalize, and it is worth creating a separate PR for all the generalization changes.
Fine with me.
/priority important-soon
/triage accepted
Thanks for the information, I will do it.
…f-#103979-release-1.22 Automated cherry pick of #103979: cpu manager: do not clean admitted pods from the state
Note that I have a PR I'll open soon that correctly accounts for all admitted and still-running pods from the pod worker (i.e., any force deleted pod may still be terminating, so GetActivePods is incorrect today). When that lands it will fix the problem mentioned in the review threads where GetActivePods() is currently not broad enough.
Great, thanks for the update.
We have an e2e test which tries to ensure that device plugin assignments to pods are kept across node reboots, and this test has been permafailing for many weeks at the time of writing (xref: kubernetes#128443).

Problem is: closer inspection reveals the test was well intentioned, but puzzling. The test runs a pod, then restarts the kubelet, then _expects the pod to end up in admission failure_ and yet _ensures the device assignment is kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder whether this means the device will be kept busy forever. Luckily, this is not the case. The test, however, embodied the behavior of the kubelet at the time, in turn caused by kubernetes#103979. The device manager used to record the last admitted pod and forcibly add it to the list of active pods. The retention logic had space for exactly one pod, the last one which attempted admission. This retention prevented the cleanup code (see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549 compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549) from clearing the registration, so the device was still (mis)reported as allocated to the failed pod.

This fact was in turn leveraged by the test in question: the test uses the podresources API to learn about the device assignment, and because of the chain of events above the pod failed admission yet was still reported as owning the device.

What happened, however, was that the next pod trying admission would replace the previous pod in the device manager data, so the previous pod was no longer forced into the active list and its assignments were correctly cleared once the cleanup code ran. And the cleanup code runs, among other things, every time the device manager is asked to allocate devices and every time the podresources API queries the device assignment.

Later, in PR kubernetes#120661, the forced retention logic was removed from all the resource managers, thus also from the device manager, and this is what caused the permafailure.

Because of all the above, it should be evident that the e2e test was actually enforcing a very specific and not really working-as-intended behavior, which was also quite puzzling for users. The best we can do is to fix the test to record and ensure that pods which did fail admission _do not_ retain device assignment.

Unfortunately, we _cannot_ guarantee the desirable property that pods which go running retain their device assignment across node reboots. In the kubelet restart flow, all pods race to be admitted. There's no order enforced between device plugin pods and application pods, unless an application pod is lucky enough to _lose_ the race with both the device plugin (which needs to go running before the app pod does) and _also_ with the kubelet (which needs to set devices healthy before the pod tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
What type of PR is this?
/kind bug
/kind regression
What this PR does / why we need it:
Do not remove admitted pods from the state until the pod has passed the admission phase.
Which issue(s) this PR fixes:
Fixes #103952
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Signed-off-by: Artyom Lukianov alukiano@redhat.com