
KEP-4540: Add CPUManager policy option to restrict reservedSystemCPUs to system daemons and interrupt processing #4540 #4541

Merged: 8 commits into kubernetes:master, Oct 10, 2024

Conversation

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 7, 2024
@k8s-ci-robot (Contributor)

Welcome @jingczhang!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor)

Hi @jingczhang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 7, 2024
@k8s-ci-robot k8s-ci-robot requested a review from dchen1107 March 7, 2024 08:35
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Mar 7, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 7, 2024
@jingczhang (Contributor Author)

Initial PR to get reviewers' agreement and guidance. The detailed design will be populated after that.

@bart0sh (Contributor)

bart0sh commented Mar 13, 2024

/ok-to-test
/cc @ffromani @swatisehgal

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 13, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 18, 2024
@jingczhang (Contributor Author)

Completed the KEP writing. Created the initial kubelet PR to get reviewers' comments and guidance. More test cases will be added.

@swatisehgal (Contributor) left a comment:

No objections to a new policy option, given that it is self-contained and an opt-in capability. We need to put some thought into how some corner cases are handled, especially when a node is under pressure.

In addition to that, we should explicitly capture that this policy option would work seamlessly with other policy options and have this as an item in our test plan.

Comment on lines 84 to 109
### Risks and Mitigations

The risks of adding this new feature are quite low.
It is isolated to a specific policy option within the `CPUManager`, and is protected both by the option itself, as well as the `CPUManagerPolicyAlphaOptions` feature gate (which is disabled by default).
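
For context, a minimal Go sketch of what the option under discussion changes. It is purely illustrative and not the kubelet implementation; the helper name and flag plumbing are assumptions, and only `k8s.io/utils/cpuset` is a real package. With the option off, reservedSystemCPUs remain in the shared (default) pool; with it on, they are carved out for system daemons and interrupt processing only.

```go
// Illustrative sketch only (not the merged kubelet code): how a
// "strict-cpu-reservation" style option could change which CPUs end up in the
// shared (default) pool handed to non-exclusive containers.
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// defaultCPUSet is a hypothetical helper: with strict reservation, the
// reserved CPUs are excluded from the shared pool; otherwise (today's
// behavior) they stay in it, so burstable/best-effort pods may run on them.
func defaultCPUSet(allCPUs, reserved cpuset.CPUSet, strictReservation bool) cpuset.CPUSet {
	if strictReservation {
		return allCPUs.Difference(reserved)
	}
	return allCPUs
}

func main() {
	all := cpuset.New(0, 1, 2, 3, 4, 5, 6, 7)
	reserved := cpuset.New(0, 1) // reservedSystemCPUs
	fmt.Println("option off:", defaultCPUSet(all, reserved, false)) // 0-7
	fmt.Println("option on: ", defaultCPUSet(all, reserved, true))  // 2-7
}
```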

Contributor:

Currently, with reservedSystemCPUs present in the defaultCPUSet, we have the ability to accommodate best-effort and burstable pods on a node even if it is resource-constrained (e.g. Guaranteed pods exclusively occupying the majority of CPUs). But with this option enabled, we could end up with a lot of evictions.

The eviction manager on the node prioritizes pods based on their QoS class. We should be okay in the case of standalone deployments, as the pods could go to other nodes that have enough resources, but what worries me is that there are a lot of infrastructure components (e.g. device plugins) deployed as DaemonSets that might not necessarily belong to the Guaranteed QoS class (I took a quick look at the SR-IOV device plugin: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/v3.6.2/deployments/sriovdp-daemonset.yaml#L49-L54; we can perhaps find other such examples). We could end up in a never-ending cycle of the kubelet eviction manager evicting pods and the scheduler rescheduling the pods respawned on the node by the DaemonSet controller.

What is our mitigation strategy for this?
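
To make the eviction-ordering concern above concrete, a simplified sketch follows. The real kubelet eviction manager ranks pods on more than QoS alone (usage relative to requests, pod priority), so this only illustrates the point that non-Guaranteed infrastructure pods are evicted before Guaranteed workloads; the pod names are hypothetical.

```go
// Simplified illustration of QoS-based eviction ordering; not the kubelet
// eviction manager's actual ranking logic.
package main

import (
	"fmt"
	"sort"
)

type qosClass int

const (
	bestEffort qosClass = iota // evicted first under node pressure
	burstable
	guaranteed // evicted last
)

type pod struct {
	name string
	qos  qosClass
}

// rankForEviction orders pods from first-evicted to last-evicted by QoS class.
func rankForEviction(pods []pod) []pod {
	sort.SliceStable(pods, func(i, j int) bool { return pods[i].qos < pods[j].qos })
	return pods
}

func main() {
	pods := []pod{
		{"dpdk-workload", guaranteed},
		{"sriov-device-plugin", burstable}, // infra pod from a DaemonSet, not Guaranteed
		{"batch-job", bestEffort},
	}
	for _, p := range rankForEviction(pods) {
		fmt.Println(p.name) // batch-job, sriov-device-plugin, dpdk-workload
	}
}
```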

Contributor Author:

High-performance workloads, the typical kind that uses SR-IOV/DPDK, are also typically deployed with Guaranteed QoS, so they are not affected when the strict-cpu-reservation option is enabled.
You enable 'strict-cpu-reservation' when you have to protect infrastructure services, i.e. all the systemd daemons and interrupt processing, from bursty workloads. Telco, where we do our business, is a typical use case. If you don't have this issue, then don't enable the 'strict-cpu-reservation' option.

Contributor Author:

Will add feature interaction test cases with the existing policy options.

Contributor:

I concur with @swatisehgal. We explored a very similar concept in the not-so-distant past, and exhausting the burstable pool is something which I believe should not be done lightly. In this case at least we have an option (vs. a built-in default behavior), which is a step in the right direction, but I still believe we need some form of protection to prevent exhaustion of the shared CPU pool. We will perhaps need a new option or to overload the semantics of existing options.

@swatisehgal (Contributor), Mar 19, 2024:

We explored a very similar concept in the not-so-distant past, and exhausting the burstable pool is something which I believe should not be done lightly. In this case at least we have an option (vs. a built-in default behavior), which is a step in the right direction, but I still believe we need some form of protection to prevent exhaustion of the shared CPU pool. We will perhaps need a new option or to overload the semantics of existing options.

Exactly my point!

I understand that the intention here is to protect infrastructure services (systemd daemons and interrupt processing), but my comment above was about the potential impact on infrastructure pods deployed via DaemonSets, which might not always belong to the Guaranteed QoS class, especially when they are running on a node that has resource contention. Kubernetes doesn't have a way to distinguish between application pods and infrastructure pods.

If you don't want to restrict reservedSystemCPUs to OS processes / systemd services, you can simply not opt in to this option, but DaemonSets are often used for deploying infrastructure pods on all nodes irrespective of the configured kubelet policy or policy option.

Typically in telco use cases, the goal is to maximize CPU utilization by providing CPUs for exclusive allocation to high-performance workloads. I would like us to consider, and capture in the KEP, what happens to BE and BU pods when high-performance workloads have been allocated all the available CPUs (total - reserved). With the current proposal it sounds like we could end up with an empty defaultCPUSet, which IMO should be avoided.

Does CPU exhaustion on a node result in eviction of such pods (even if they are important infra pods)? Could we end up in a never-ending cycle of eviction by the kubelet eviction manager and the scheduler rescheduling the pods respawned on the node by the DaemonSet controller?

Contributor:

OK, this is my take on the discussion:

(1) From the scheduler's point of view, there is a single set of CPUs for scheduling pods, which is the node allocatable = total - reserved. At the node level, however, burstable and best-effort pods are allowed to run on the reserved cores. It is very common that people (our customers) are surprised burstable and best-effort pods are allowed to run on the reserved cores, since it goes against "common" sense. The 'strict-cpu-reservation' option effectively removes this discrepancy.

(2) However, there is the concern of no longer being able to "hide" burstable and best-effort pods on the reserved cores, so we will add minSharedCPUs. Introducing minSharedCPUs will impact the scheduler because it is a discrepancy that has to be taken care of. This scheduler impact changes the 'strict-cpu-reservation' feature from trivial to at least non-trivial.

Please review and correct my understanding.

Your understanding is accurate, thanks for summarizing this.
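
As a worked illustration of the arithmetic summarized above (the numbers and the helper function are hypothetical, chosen only to show how the shared pool can be exhausted under strict reservation):

```go
// Illustrative arithmetic only: how the shared (default) pool shrinks as
// Guaranteed pods take CPUs exclusively, with and without strict reservation.
package main

import "fmt"

func sharedPoolSize(capacity, reserved, exclusivelyAllocated int, strict bool) int {
	if strict {
		// Reserved CPUs are excluded from the shared pool.
		return capacity - reserved - exclusivelyAllocated
	}
	// Today's behavior: reserved CPUs remain usable by burstable/best-effort pods.
	return capacity - exclusivelyAllocated
}

func main() {
	capacity, reserved := 32, 2
	allocatable := capacity - reserved // what the scheduler sees: 30

	// Guaranteed pods request all 30 allocatable CPUs exclusively.
	fmt.Println("shared pool, option off:", sharedPoolSize(capacity, reserved, allocatable, false)) // 2
	fmt.Println("shared pool, option on: ", sharedPoolSize(capacity, reserved, allocatable, true))  // 0
}
```

The zero-CPU result in the strict case is exactly the exhaustion scenario that motivates the minSharedCPUs discussion below.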

And is it acceptable to document the scheduler impact in the alpha version while only making the scheduler code change in the beta version, for the sake of simpler feature interaction and given that the feature flag is false by default in the alpha version?

My take would be that correct scheduling behavior is a very critical aspect of this feature and needs to be handled at the alpha stage itself. We need the right end-to-end behavior in case the feature is enabled at the alpha stage.

SIG Scheduling should weigh in on this subject as well.

Contributor Author:

Hi @ffromani thank you for replying and confirming it is a valid option to keep the feature off forever (Alpha/Beta/GA).

Contributor Author:

Hi @swatisehgal, thank you for replying and confirming we are aligned that the feature has no eviction impact.
What other feature impact do you think could involve the scheduler? At this moment, I don't see an obvious scheduler impact, so I have not covered any in the KEP itself. How do I invite scheduler maintainers for comments? I did "/add-sig scheduling". Please advise.

Contributor:

I'm thinking about exhaustion of the shared CPU pool and its implications. With this change, can we ever reach a point at which there are no CPUs left in the shared pool?

Adding minSharedCPUs (final name TBD) will prevent that, effectively partitioning the CPUs on the node into 3 pools. But the existence of that third pool needs to be exposed to enable the scheduler (and the system in general) to make the correct decisions. If nothing else, the CPUs available for exclusive allocation won't be `capacity - reserved` as they are today, but `capacity - reserved - minSharedCPUs`.

In other words, I don't see (yet?) a way to implement robust system behavior without somehow exposing the existence and size of the minSharedCPUs pool to the rest of the system, which justifies the scheduler changes.
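
A minimal sketch of the three-pool partition described above, assuming a hypothetical minSharedCPUs option (final name TBD in this thread); the struct and function names are illustrative only, not a proposed API:

```go
// Illustrative sketch of partitioning node CPUs into three pools once a
// minSharedCPUs-style floor exists; not proposed kubelet code.
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

type pools struct {
	reserved  cpuset.CPUSet // system daemons and interrupt processing only
	minShared cpuset.CPUSet // floor kept for burstable/best-effort pods
	exclusive cpuset.CPUSet // available for exclusive allocation to Guaranteed pods
}

func partition(all, reserved, minShared cpuset.CPUSet) pools {
	return pools{
		reserved:  reserved,
		minShared: minShared,
		// Exclusive-allocatable is no longer capacity - reserved but
		// capacity - reserved - minSharedCPUs, which is why the scheduler
		// would need to know about the third pool.
		exclusive: all.Difference(reserved).Difference(minShared),
	}
}

func main() {
	all := cpuset.New(0, 1, 2, 3, 4, 5, 6, 7)
	p := partition(all, cpuset.New(0), cpuset.New(1, 2))
	fmt.Printf("reserved=%s minShared=%s exclusive=%s\n", p.reserved, p.minShared, p.exclusive)
	// reserved=0 minShared=1-2 exclusive=3-7
}
```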

Contributor:

thank you for replying and confirming it is a valid option to keep the feature off forever (Alpha/Beta/GA).

Since this is not the norm (typically a feature is enabled by default from Beta onwards), I would suggest explicitly discussing this during the PRR (Production Readiness Review) stage.

How do I invite scheduler maintainers for comments? I did "/add-sig scheduling". Please advise.

I can see that SIG Scheduling is already cc'd on the KEP; they should triage it soon. If we don't hear from them in a week or so, please give them a nudge on Slack (https://kubernetes.slack.com/archives/C09TP78DV).

@swatisehgal (Contributor)

Please run make update-toc to ensure that the Table of Contents is updated based on the content of the enhancement.

@ffromani (Contributor) left a comment:

The general direction LGTM, but some key details will likely need more work.

@jingczhang (Contributor Author)

Updated "RIsks and Mitigations" section for impact on reduced shared pool size. Please review and advise.

@soltysh (Contributor)

soltysh commented Oct 7, 2024

You may want to look at https://github.com/kubernetes/enhancements/pull/4810/files since both of these KEPs cover similar settings in the CPUManager, so you can reuse some of the answers here as well.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2024
@jingczhang (Contributor Author)

Hi @soltysh @kad, thank you for your review. I have pushed an update per your comments; please check.

@kad (Member) left a comment:

One small bit on "Non-goals"; the rest seems OK to me.

@jingczhang (Contributor Author)

Hi @pacoxu @kad, thank you for the review. I have updated the KEP to incorporate your comments; please check:
54ec392

@jingczhang (Contributor Author)

jingczhang commented Oct 9, 2024

Hi @soltysh, is the KEP all good from the PRR point of view? Let me know if additional corrections are needed. Thank you so much.

@soltysh (Contributor) left a comment:

I believe there are still some inconsistencies which are blocking PRR on this KEP. As previously stated, I'd suggest reading through https://github.com/kubernetes/enhancements/pull/4810/files to get a feel for what these answers should be.

@jingczhang (Contributor Author)

jingczhang commented Oct 9, 2024

Hi @soltysh, I have updated the KEP per your questions. Please let me know if additional corrections are needed; I really want to catch the v1.32 train. Thank you.
5ae6f80

@soltysh (Contributor) left a comment:

/approve
the PRR

Co-authored-by: Maciej Szulik <soltysh@gmail.com>
@ffromani (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2024
@ffromani (Contributor)

/assign @mrunalp

@ffromani (Contributor)

/hold cancel
no real reason to hold anymore

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 10, 2024
@ffromani (Contributor)

/unassign @mrunalp

@derekwaynecarr (Member)

/approve
/lgtm

Thank you for the simplification @jingczhang @ffromani, this looks great.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, jingczhang, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 4a7f28a into kubernetes:master Oct 10, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 10, 2024