KEP-4540: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing #4540 #4541
Conversation
jingczhang commented on Mar 7, 2024 (edited)
- One-line PR description: Adding new KEP
- Issue link: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing #4540
- Code PR link: KEP-4540: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing kubernetes#127483
- HTML quick link: https://github.com/nokia/kubernetes-enhancements/tree/4540-strict-cpu-reservation/keps/sig-node/4540-strict-cpu-reservation
Welcome @jingczhang!
Hi @jingczhang. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Initial PR to get reviewers' agreement and guidance. The detailed design will be populated after that.
/ok-to-test
Completed the KEP writing. Created the initial kubelet PR to get reviewers' comments and guidance. More test cases will be added.
No objections to a new policy option, given that it is self-contained and an opt-in capability. We need to put some thought into how some corner cases are handled, especially when a node is under pressure.
In addition to that, we should explicitly capture that this policy option works seamlessly with the other policy options, and have this as an item in our test plan.
> ### Risks and Mitigations
>
> The risks of adding this new feature are quite low.
> It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
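For context, here is a minimal kubelet configuration sketch of how an option like this would be switched on behind the alpha gate; the CPU range and exact values are illustrative assumptions, not settings taken from the KEP.

```yaml
# Hypothetical kubelet configuration sketch; values are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CPUManagerPolicyAlphaOptions: true   # alpha policy options are disabled by default
cpuManagerPolicy: static               # policy options only apply to the static policy
cpuManagerPolicyOptions:
  strict-cpu-reservation: "true"       # the option proposed by this KEP
reservedSystemCPUs: "0-3"              # CPUs set aside for system daemons and interrupt processing
```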
Currently, with the presence of `reservedSystemCPUs` under `defaultCPUset`, we have the ability to accommodate best-effort and burstable pods on a node even if it is resource constrained (e.g. Guaranteed pods exclusively occupying the majority of CPUs). But with this option enabled, we could end up with a lot of evictions.
The eviction manager on the node prioritizes pods based on their QoS class. We should be okay in the case of standalone deployments, as the pods could go to other nodes that have enough resources, but what worries me is that a lot of infrastructure components (e.g. device plugins) are deployed as DaemonSets and do not necessarily belong to the Guaranteed QoS class (I took a quick look at the SR-IOV device plugin: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/v3.6.2/deployments/sriovdp-daemonset.yaml#L49-L54; we can perhaps find other such examples). We could end up in a never-ending cycle of the kubelet eviction manager evicting pods and the scheduler rescheduling the pods respawned on the node by the DaemonSet controller.
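For illustration only, here is a hypothetical device-plugin DaemonSet of the kind described above (not the actual SR-IOV manifest): requests lower than limits place its pods in the Burstable QoS class, so under node pressure the eviction manager may evict them before Guaranteed pods.

```yaml
# Hypothetical infrastructure DaemonSet; names and values are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-device-plugin
  template:
    metadata:
      labels:
        app: example-device-plugin
    spec:
      containers:
      - name: plugin
        image: registry.example.com/device-plugin:latest   # placeholder image
        resources:
          requests:
            cpu: 250m        # requests != limits -> Burstable, not Guaranteed
            memory: 40Mi
          limits:
            cpu: "1"
            memory: 200Mi
```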
What is our mitigation strategy for this?
High-performance workloads, the typical kind that use SR-IOV/DPDK, are also typically deployed with Guaranteed QoS, so they are not affected when the `strict-cpu-reservation` option is enabled.
You enable `strict-cpu-reservation` when you have to protect infrastructure services, i.e. the systemd daemons and interrupt processing, from bursty workloads. Telco, where we do our business, is a typical use case. If you don't have this issue, then don't enable the `strict-cpu-reservation` option.
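As a concrete illustration of that point, here is a hypothetical Guaranteed QoS pod: integer CPU requests equal to limits, so under the static policy it receives exclusive CPUs and is unaffected by the size of the shared pool. Names and values are placeholders, not part of the KEP.

```yaml
# Hypothetical high-performance pod; names and values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-workload
spec:
  containers:
  - name: app
    image: registry.example.com/dpdk-app:latest   # placeholder image
    resources:
      requests:
        cpu: "4"        # integer CPUs, requests == limits -> Guaranteed QoS,
        memory: 8Gi     # exclusive CPU allocation under the static policy
      limits:
        cpu: "4"
        memory: 8Gi
```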
Will add feature interaction test cases with the existing policy options.
I concur with @swatisehgal. We explored a very similar concept in the not-so-distant past and exhausting the burstable pool is something which I believe should not be done lightly. In this case at least we have an option (vs a built-in default behavior) which is a step in the right direction, but I still believe we need some form of protection to prevent exhaustion of the shared CPU pool. We will perhaps need a new option or to overload the semantics of existing options.
> We explored a very similar concept in the not-so-distant past and exhausting the burstable pool is something which I believe should not be done lightly. In this case at least we have an option (vs a built-in default behavior) which is a step in the right direction, but I still believe we need some form of protection to prevent exhaustion of the shared CPU pool. We will perhaps need a new option or to overload the semantics of existing options.
Exactly my point!
I understand that the intention here is to protect infrastructure services (systemd daemons and interrupt processing), but my comment above was about the potential impact on infrastructure pods deployed via DaemonSets, which might not always belong to the Guaranteed QoS class, especially when they are running on a node that has resource contention. Kubernetes doesn't have a way to distinguish between application pods and infrastructure pods.
If we don't want to restrict `reservedSystemCPUs` to OS processes / systemd services, we can simply not opt in to this option, but DaemonSets are often used to deploy infrastructure pods on all the nodes, irrespective of the configured kubelet policy or policy option.
Typically, in telco use cases, the goal is to maximize CPU utilization by providing CPUs for exclusive allocation to high-performance workloads. I would like us to consider, and capture in the KEP, what happens to BE and BU pods when high-performance workloads have been allocated all the CPUs available to them (total - reserved). With the current proposal, it sounds like we could end up with an empty `defaultCPUSet`, which IMO should be avoided.
Does CPU exhaustion on a node result in the eviction of such pods (even if they are important infra pods)? Could we end up in a never-ending cycle of the kubelet eviction manager evicting pods and the scheduler rescheduling the pods respawned on the node by the DaemonSet controller?
OK, this is my take on the discussion:
(1) From the scheduler's point of view, there is a single set of CPUs on which to schedule pods, namely the node allocatable = total - reserved. It is at the node level that burstable and best-effort pods are allowed to run on the reserved cores. Yes, it is very common that people (our customers) are surprised that burstable and best-effort pods are allowed to run on the reserved cores, since it goes against "common" sense. The `strict-cpu-reservation` feature effectively removes this discrepancy.
(2) However, there is the concern of no longer being able to "hide" burstable and best-effort pods on the reserved cores, so we will add `minSharedCPUs`. Introducing `minSharedCPUs` will impact the scheduler because it is a discrepancy that has to be taken care of. This scheduler impact changes the `strict-cpu-reservation` feature from trivial to at least non-trivial (a numeric illustration is sketched below).
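A hypothetical numeric illustration of the two points above, assuming a 32-CPU node with 4 reserved CPUs; the numbers are illustrative assumptions, not values from the KEP.

```yaml
capacityCPUs: 32
reservedSystemCPUs: 4                    # e.g. "0-3"
allocatableCPUs: 28                      # capacity - reserved; what the scheduler sees (point 1)
# Shared (default) CPU pool before any exclusive allocations:
sharedPoolToday: 32                      # reserved cores still admit burstable/best-effort pods
sharedPoolWithStrictCpuReservation: 28   # reserved cores excluded, hence the concern in point 2
```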
Please review and correct my understanding.
Your understanding is accurate, thanks for summarizing this.
And, is it acceptable to document the scheduler impact in the alpha version while only doing the scheduler code change in the beta version, for the sake of simpler feature interaction and given that the feature flag is false by default in the alpha version?
My take would be that correct scheduling behavior is a very critical aspect of this feature and needs to be handled at the alpha stage itself. We need the right end-to-end behavior in case the feature is enabled at the alpha stage.
SIG Scheduling's opinion should be weighed in on this subject as well.
Hi @ffromani, thank you for replying and for confirming that it is a valid option to keep the feature off forever (Alpha/Beta/GA).
Hi @swatisehgal, thank you for replying and for confirming that we are aligned on the feature having no eviction impact.
What other feature impact do you think could involve the scheduler? At this moment I don't see an obvious scheduler impact, so I do not have any in the KEP itself. How do I invite the scheduler maintainers for comments? I did "/add-sig scheduling". Please advise.
I'm thinking about exhaustion of the shared CPU pool and its implications. With this change, can we ever reach a point at which there are no CPUs left in the shared pool?
Adding `minSharedCPUs` (final name TBD) will prevent that, effectively partitioning the CPUs in the node into 3 pools. But the existence of that third pool needs to be exposed to enable the scheduler (and the system in general) to make the correct decisions. If nothing else, the CPUs available for exclusive allocation won't be `capacity - reserved` as it is today, but will be `capacity - reserved - minSharedCPUs`.
IOW, I don't see (yet?) a way to implement robust system behavior without somehow exposing the existence of the `minSharedCPUs` pool size to the rest of the system, which justifies the scheduler changes.
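To make that arithmetic concrete, here is a hypothetical three-pool split once a `minSharedCPUs`-style option exists; the numbers are illustrative assumptions, not values from the KEP.

```yaml
capacityCPUs: 32
reservedSystemCPUs: 4          # system daemons and interrupt processing only
minSharedCPUs: 2               # guaranteed floor for the shared pool (burstable/best-effort pods)
exclusiveAllocatableCPUs: 26   # capacity - reserved - minSharedCPUs (today: capacity - reserved = 28)
```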
> thank you for replying and for confirming that it is a valid option to keep the feature off forever (Alpha/Beta/GA).
Since this is not the norm (typically a feature is enabled by default from Beta onwards), I would suggest explicitly discussing this during the PRR (Production Readiness Review) stage.
> How do I invite the scheduler maintainers for comments? I did "/add-sig scheduling". Please advise.
I can see that SIG Scheduling is already cc'd on the KEP; they should triage it soon. If we don't hear from them in a week or so, please give a nudge on Slack (https://kubernetes.slack.com/archives/C09TP78DV).
Please run
the general direction LGTM but some key details will likely need more work
Updated "RIsks and Mitigations" section for impact on reduced shared pool size. Please review and advise. |
You may want to look at https://github.com/kubernetes/enhancements/pull/4810/files since both of these cover similar settings in the CPUManager, so you can re-use some of the answers here as well.
One small bit on "Non-goals"; the rest seems OK to me.
Hi @soltysh, is the KEP all good from the PRR point of view? Let me know if additional corrections are needed. Thank you so much.
I believe there are still some inconsistencies which are blocking PRR on this KEP. As previously stated, I'd suggest reading through https://github.com/kubernetes/enhancements/pull/4810/files to get a feel for what these answers should be.
/approve
the PRR
Co-authored-by: Maciej Szulik <soltysh@gmail.com>
/lgtm
/assign @mrunalp
/hold cancel
/assign @derekwaynecarr
/unassign @mrunalp
/approve
thank you for the simplification @jingczhang @ffromani, this looks great.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: derekwaynecarr, jingczhang, soltysh
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.