
KEP-4540: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing #4540 #4541

Merged: 8 commits into kubernetes:master, Oct 10, 2024

Conversation

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 7, 2024
@k8s-ci-robot (Contributor)

Welcome @jingczhang!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @jingczhang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 7, 2024
@k8s-ci-robot k8s-ci-robot requested a review from dchen1107 March 7, 2024 08:35
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 7, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 7, 2024
@jingczhang (Contributor Author)

Initial PR to get reviewers' agreement and guidance. The detailed design will be populated after that.

@bart0sh (Contributor)

bart0sh commented Mar 13, 2024

/ok-to-test
/cc @ffromani @swatisehgal

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 18, 2024
@jingczhang (Contributor Author)

Completed the KEP writing. Created the initial kubelet PR to get reviewers' comments and guidance. More test cases will be added.

@swatisehgal (Contributor) left a comment:

No objections to a new policy option, given that it is self-contained and an opt-in capability. We need to put some thought into how some corner cases are handled, especially when a node is under pressure.

In addition to that, we should explicitly capture that this policy option would work seamlessly with other policy options and have this as an item in our test plan.

Comment on lines 84 to 109
### Risks and Mitigations

The risks of adding this new feature are quite low.
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
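
For context, a minimal Go sketch of what the option under discussion changes. It is purely illustrative and not the kubelet implementation; the helper name and flag plumbing are assumptions, and only `k8s.io/utils/cpuset` is a real package. With the option off, reservedSystemCPUs remain in the shared (default) pool; with it on, they are carved out for system daemons and interrupt processing only.

```go
// Illustrative sketch only (not the merged kubelet code): how a
// "strict-cpu-reservation" style option could change which CPUs end up in the
// shared (default) pool handed to non-exclusive containers.
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// defaultCPUSet is a hypothetical helper: with strict reservation, the
// reserved CPUs are excluded from the shared pool; otherwise (today's
// behavior) they stay in it, so burstable/best-effort pods may run on them.
func defaultCPUSet(allCPUs, reserved cpuset.CPUSet, strictReservation bool) cpuset.CPUSet {
	if strictReservation {
		return allCPUs.Difference(reserved)
	}
	return allCPUs
}

func main() {
	all := cpuset.New(0, 1, 2, 3, 4, 5, 6, 7)
	reserved := cpuset.New(0, 1) // reservedSystemCPUs
	fmt.Println("option off:", defaultCPUSet(all, reserved, false)) // 0-7
	fmt.Println("option on: ", defaultCPUSet(all, reserved, true))  // 2-7
}
```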

Contributor:

Currently, with reservedSystemCPUs present in the defaultCPUSet, we have the ability to accommodate best-effort and burstable pods on a node even if it is resource-constrained (e.g. Guaranteed pods exclusively occupying the majority of CPUs). But with this option enabled, we could end up with a lot of evictions.

The eviction manager on the node prioritizes pods based on their QoS class. We should be okay in the case of standalone deployments, as the pods could go to other nodes that have enough resources, but what worries me is that there are a lot of infrastructure components (e.g. device plugins) deployed as DaemonSets that might not necessarily belong to the Guaranteed QoS class (I took a quick look at the SR-IOV device plugin: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/v3.6.2/deployments/sriovdp-daemonset.yaml#L49-L54; we can perhaps find other such examples). We could end up in a never-ending cycle of the kubelet eviction manager evicting pods and the scheduler rescheduling the pods respawned on the node by the DaemonSet controller.

What is our mitigation strategy for this?
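
To make the eviction-ordering concern above concrete, a simplified sketch follows. The real kubelet eviction manager ranks pods on more than QoS alone (usage relative to requests, pod priority), so this only illustrates the point that non-Guaranteed infrastructure pods are evicted before Guaranteed workloads; the pod names are hypothetical.

```go
// Simplified illustration of QoS-based eviction ordering; not the kubelet
// eviction manager's actual ranking logic.
package main

import (
	"fmt"
	"sort"
)

type qosClass int

const (
	bestEffort qosClass = iota // evicted first under node pressure
	burstable
	guaranteed // evicted last
)

type pod struct {
	name string
	qos  qosClass
}

// rankForEviction orders pods from first-evicted to last-evicted by QoS class.
func rankForEviction(pods []pod) []pod {
	sort.SliceStable(pods, func(i, j int) bool { return pods[i].qos < pods[j].qos })
	return pods
}

func main() {
	pods := []pod{
		{"dpdk-workload", guaranteed},
		{"sriov-device-plugin", burstable}, // infra pod from a DaemonSet, not Guaranteed
		{"batch-job", bestEffort},
	}
	for _, p := range rankForEviction(pods) {
		fmt.Println(p.name) // batch-job, sriov-device-plugin, dpdk-workload
	}
}
```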

Contributor Author:

High-performance workloads, the typical kind that uses SR-IOV/DPDK, are also typically deployed with Guaranteed QoS, so they are not affected when the strict-cpu-reservation option is enabled.
You enable 'strict-cpu-reservation' when you have to protect infrastructure services, i.e. all the systemd daemons and interrupt processing, from bursty workloads. Telco, where we do our business, is a typical use case. If you don't have this issue, then don't enable the 'strict-cpu-reservation' option.

Contributor Author:

Will add feature interaction test cases with the existing policy options.

Contributor:

I concur with @swatisehgal. We explored a very similar concept in the not-so-distant past, and exhausting the burstable pool is something which I believe should not be done lightly. In this case at least we have an option (vs. a built-in default behavior), which is a step in the right direction, but I still believe we need some form of protection to prevent exhaustion of the shared CPU pool. We will perhaps need a new option or to overload the semantics of existing options.

@swatisehgal (Contributor), Mar 19, 2024:

We explored a very similar concept in the not-so-distant past, and exhausting the burstable pool is something which I believe should not be done lightly. In this case at least we have an option (vs. a built-in default behavior), which is a step in the right direction, but I still believe we need some form of protection to prevent exhaustion of the shared CPU pool. We will perhaps need a new option or to overload the semantics of existing options.

Exactly my point!

I understand that the intention here is to protect infrastructure services (systemd daemons and interrupt processing), but my comment above was about the potential impact on infrastructure pods deployed via DaemonSets, which might not always belong to the Guaranteed QoS class, especially when they are running on a node that has resource contention. Kubernetes doesn't have a way to distinguish between application pods and infrastructure pods.

If you don't want to restrict reservedSystemCPUs to OS processes / systemd services, you can simply not opt in to this option, but DaemonSets are often used for deploying infrastructure pods on all nodes irrespective of the configured kubelet policy or policy option.

Typically in telco use cases, the goal is to maximize CPU utilization by providing CPUs for exclusive allocation to high-performance workloads. I would like us to consider, and capture in the KEP, what happens to BE and BU pods when high-performance workloads have been allocated all the available CPUs (total - reserved). With the current proposal it sounds like we could end up with an empty defaultCPUSet, which IMO should be avoided.

Does CPU exhaustion on a node result in eviction of such pods (even if they are important infra pods)? Could we end up in a never-ending cycle of eviction by the kubelet eviction manager and the scheduler rescheduling the pods respawned on the node by the DaemonSet controller?

Contributor:

OK, this is my take on the discussion:

(1) From the scheduler's point of view, there is a single set of CPUs for scheduling pods, which is the node allocatable = total - reserved. At the node level, however, burstable and best-effort pods are allowed to run on the reserved cores. It is very common that people (our customers) are surprised burstable and best-effort pods are allowed to run on the reserved cores, since it goes against "common" sense. The 'strict-cpu-reservation' option effectively removes this discrepancy.

(2) However, there is the concern of no longer being able to "hide" burstable and best-effort pods on the reserved cores, so we will add minSharedCPUs. Introducing minSharedCPUs will impact the scheduler because it is a discrepancy that has to be taken care of. This scheduler impact changes the 'strict-cpu-reservation' feature from trivial to at least non-trivial.

Please review and correct my understanding.

Your understanding is accurate, thanks for summarizing this.
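
As a worked illustration of the arithmetic summarized above (the numbers and the helper function are hypothetical, chosen only to show how the shared pool can be exhausted under strict reservation):

```go
// Illustrative arithmetic only: how the shared (default) pool shrinks as
// Guaranteed pods take CPUs exclusively, with and without strict reservation.
package main

import "fmt"

func sharedPoolSize(capacity, reserved, exclusivelyAllocated int, strict bool) int {
	if strict {
		// Reserved CPUs are excluded from the shared pool.
		return capacity - reserved - exclusivelyAllocated
	}
	// Today's behavior: reserved CPUs remain usable by burstable/best-effort pods.
	return capacity - exclusivelyAllocated
}

func main() {
	capacity, reserved := 32, 2
	allocatable := capacity - reserved // what the scheduler sees: 30

	// Guaranteed pods request all 30 allocatable CPUs exclusively.
	fmt.Println("shared pool, option off:", sharedPoolSize(capacity, reserved, allocatable, false)) // 2
	fmt.Println("shared pool, option on: ", sharedPoolSize(capacity, reserved, allocatable, true))  // 0
}
```

The zero-CPU result in the strict case is exactly the exhaustion scenario that motivates the minSharedCPUs discussion below.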

And is it acceptable to document the scheduler impact in the alpha version while only making the scheduler code change in the beta version, for the sake of simpler feature interaction and given that the feature flag is false by default in the alpha version?

My take would be that correct scheduling behavior is a very critical aspect of this feature and needs to be handled at the alpha stage itself. We need the right end-to-end behavior in case the feature is enabled at the alpha stage.

SIG Scheduling should weigh in on this subject as well.

Contributor Author:

Hi @ffromani thank you for replying and confirming it is a valid option to keep the feature off forever (Alpha/Beta/GA).

Contributor Author:

Hi @swatisehgal, thank you for replying and confirming we are aligned that the feature has no eviction impact.
What other feature impact do you think could involve the scheduler? At this moment, I don't see an obvious scheduler impact, so I have not covered any in the KEP itself. How do I invite scheduler maintainers for comments? I did "/add-sig scheduling". Please advise.

Contributor:

I'm thinking about exhaustion of the shared CPU pool and its implications. With this change, can we ever reach a point at which there are no CPUs left in the shared pool?

Adding minSharedCPUs (final name TBD) will prevent that, effectively partitioning the CPUs on the node into 3 pools. But the existence of that third pool needs to be exposed to enable the scheduler (and the system in general) to make the correct decisions. If nothing else, the CPUs available for exclusive allocation won't be `capacity - reserved` as they are today, but `capacity - reserved - minSharedCPUs`.

In other words, I don't see (yet?) a way to implement robust system behavior without somehow exposing the existence and size of the minSharedCPUs pool to the rest of the system, which justifies the scheduler changes.
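
A minimal sketch of the three-pool partition described above, assuming a hypothetical minSharedCPUs option (final name TBD in this thread); the struct and function names are illustrative only, not a proposed API:

```go
// Illustrative sketch of partitioning node CPUs into three pools once a
// minSharedCPUs-style floor exists; not proposed kubelet code.
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

type pools struct {
	reserved  cpuset.CPUSet // system daemons and interrupt processing only
	minShared cpuset.CPUSet // floor kept for burstable/best-effort pods
	exclusive cpuset.CPUSet // available for exclusive allocation to Guaranteed pods
}

func partition(all, reserved, minShared cpuset.CPUSet) pools {
	return pools{
		reserved:  reserved,
		minShared: minShared,
		// Exclusive-allocatable is no longer capacity - reserved but
		// capacity - reserved - minSharedCPUs, which is why the scheduler
		// would need to know about the third pool.
		exclusive: all.Difference(reserved).Difference(minShared),
	}
}

func main() {
	all := cpuset.New(0, 1, 2, 3, 4, 5, 6, 7)
	p := partition(all, cpuset.New(0), cpuset.New(1, 2))
	fmt.Printf("reserved=%s minShared=%s exclusive=%s\n", p.reserved, p.minShared, p.exclusive)
	// reserved=0 minShared=1-2 exclusive=3-7
}
```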

Contributor:

thank you for replying and confirming it is a valid option to keep the feature off forever (Alpha/Beta/GA).

Since this is not the norm (typically a feature is enabled by default from Beta onwards), I would suggest explicitly discussing this during the PRR (Production Readiness Review) stage.

How do I invite scheduler maintainers for comments? I did "/add-sig scheduling". Please advise.

I can see that SIG Scheduling is already cc'd on the KEP; they should triage it soon. If we don't hear from them in a week or so, please give them a nudge on Slack (https://kubernetes.slack.com/archives/C09TP78DV).

@swatisehgal (Contributor)

Please run make update-toc to ensure that the Table of Contents is updated based on the content of the enhancement.

@ffromani (Contributor) left a comment:

The general direction LGTM, but some key details will likely need more work.

@jingczhang (Contributor Author)

Updated "RIsks and Mitigations" section for impact on reduced shared pool size. Please review and advise.

@soltysh (Contributor)

soltysh commented Oct 7, 2024

You may want to look at https://github.com/kubernetes/enhancements/pull/4810/files since both of these KEPs cover similar settings in the CPUManager, so you can reuse some of the answers here as well.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2024
@jingczhang (Contributor Author)

Hi @soltysh @kad, thank you for your review. I have pushed an update per your comments; please check.

@kad (Member) left a comment:

One small bit on "Non-goals"; the rest seems OK to me.

@jingczhang (Contributor Author)

Hi @pacoxu @kad, thank you for the review. I have updated the KEP to incorporate your comments; please check:
54ec392

@jingczhang (Contributor Author)

jingczhang commented Oct 9, 2024

Hi @soltysh, is the KEP all good from the PRR point of view? Let me know if additional corrections are needed. Thank you so much.

@soltysh (Contributor) left a comment:

I believe there are still some inconsistencies which are blocking PRR on this KEP. As previously stated, I'd suggest reading through https://github.com/kubernetes/enhancements/pull/4810/files to get a feel for what these answers should be.

@jingczhang (Contributor Author)

jingczhang commented Oct 9, 2024

Hi @soltysh, I have updated the KEP per your questions. Please let me know if additional corrections are needed; I really want to catch the v1.32 train. Thank you.
5ae6f80

@soltysh (Contributor) left a comment:

/approve
the PRR

Co-authored-by: Maciej Szulik <soltysh@gmail.com>
@ffromani (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2024
@ffromani (Contributor)

/assign @mrunalp

@ffromani (Contributor)

/hold cancel
no real reason to hold anymore

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 10, 2024
@ffromani (Contributor)

/unassign @mrunalp

@derekwaynecarr (Member)

/approve
/lgtm

Thank you for the simplification @jingczhang @ffromani, this looks great.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, jingczhang, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 4a7f28a into kubernetes:master Oct 10, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 10, 2024