Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2837: Pod level resource limits #1592

Closed
wants to merge 19 commits into from

Conversation

liorokman
Copy link

keps/sig-node: add KEP for shared-burstable-limits - pod level resource limits

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 4, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @liorokman!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @liorokman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 4, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Mar 4, 2020
@liorokman
Copy link
Author

/assign @dchen1107

@liorokman
Copy link
Author

Kind reminder - can anyone review this?

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 6, 2020
@derekwaynecarr
Copy link
Member

@liorokman the PR seems to have carried along a number of other KEPs. Can you update the PR to just have the KEP presented? Are you able to present at a future SIG Node meeting the high level overview of the proposal?

@derekwaynecarr
Copy link
Member

/assign

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the motivation for the feature.

Have you thought about the implications of this feature with vertical pod autoscaling?
Have you thought about how we could map this pod bounding behavior potentially to a RuntimeClass or kubelet configuration rather expose the knob directly in the end-user pod spec? I am not sure the average end-user would understand shareBurstableLimits .


This proposal aims to:

* Allow a `Burstable` pod to define that memory and CPU resource limits should be set on the level of the Pod.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same arguments for the Burstable tier would apply to the Guaranteed tier. I could see users that do not want to specify explicit requests/limits for each container, but just want to bound the pod with fixed resources.

Copy link
Author

@liorokman liorokman Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior for the Guaranteed tier is already exactly this. A pod is in the guaranteed QoS if and only if it always has a request equal to the limit for all the containers in the pod. If this is the case, the current code already sets the pod level cgroup to the sum of the limits for each of the resources.

But because in the Guaranteed case all of the containers also have a limit set to them, setting the pod level cgroup to any limit is meaningless. The pod level cgroup limit is by definition greater than any individual container, which means that the individual container limit would apply before the pod level limit would apply.

A pod that wants to set pod-level fixed resources and not container level resources is by definition not in the Guaranteed QoS level.

<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: unlimited, cpu: unlimited)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for clarity, cpu is not unlimited... it is setting shares, just quota is unbounded.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, I will fix the KEP.

<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: 256M limit, CPU: 1 core)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the biggest benefit for this is when its difficult to size two containers that may individually write to a memory backed volume.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pod resource overhead feature is in flight, it would be good to combine this example with a pod that has a runtimeclass that has an overhead. i also wonder as i type this if its better to associate an attribute on the runtime class to say if the pod should be bounded or not rather than change the pod spec itself. have you explored this idea?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link to the pod resource overhead feature? I'm not familiar with this.

Intuitively the pod level seems to me to be a better fit since this sort of resource allocation is not dependent on any specific runtime, and has no meaning outside of the specific Burstable QoS level. But I'd love to read about the pod resource overhead feature, and give it some thought...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are waiting for merge of the updated Beta docs, I'd suggest checking out the PR: https://github.com/kubernetes/website/pull/19059/files

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll go over this.

keps/sig-node/20200303-shared-pod-limits.md Outdated Show resolved Hide resolved
@liorokman
Copy link
Author

liorokman commented Mar 6, 2020

@liorokman the PR seems to have carried along a number of other KEPs. Can you update the PR to just have the KEP presented?

Done.

@liorokman
Copy link
Author

Are you able to present at a future SIG Node meeting the high level overview of the proposal?

I'd be happy to. When is the next SIG Node meeting?

@liorokman
Copy link
Author

Have you thought about the implications of this feature with vertical pod autoscaling?

If this feature is turned on, the VPA could be set to update the budget of one single container in the pod, and also set to monitor the pod-level cgroup for actual maximum resource consumption. This would make the VPA able to also work correctly for pods with this behavior enabled. In effect, the VPA would work the same, except that it would scale the memory for the entire pod instead of working on the container level.

By allowing the deployment to specify a resource budget for the entire pod and having the Linux kernel play with the details of allowing each container to allocate what it needs, when it needs it, the end result is a more flexible arrangement. This is a win-win scenario:

  • the developer doesn't need to micro-manage container limits but is not forced to grant unlimited resources to critical containers,
  • the application has less disruption because the VPA isn't evicting the pod to fine-tune the specific container limits,
  • the node is less susceptible to the noisy-neighbor problem with pods in the Burstable QoS level (also making the scheduler placement decisions better for this same reason).

Have you thought about how we could map this pod bounding behavior potentially to a RuntimeClass or kubelet configuration rather expose the knob directly in the end-user pod spec? I am not sure the average end-user would understand shareBurstableLimits

I see this feature as specific configuration for a specific pod, not as a general configuration that applies to all of the pods running on the node. I'm not sure how it would map to RuntimeClass either.

Maybe a different attribute name would be clearer? Maybe podResourceSharing instead of shareBurstableLimits?

@dashpole
Copy link
Contributor

dashpole commented Mar 6, 2020

/cc

@k8s-ci-robot k8s-ci-robot requested a review from dashpole March 6, 2020 19:17
@liorokman liorokman force-pushed the shared-burstable-limits branch from b546008 to 73a9b6b Compare March 8, 2020 13:22
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 8, 2020
@thockin
Copy link
Member

thockin commented Jan 16, 2023

@n4j - you want to take it over and get it merged as provisional?


## Summary

Pod resources are currently enforced on a container-by-container case. Each container defines a limit for each managed resource, and the Kubelet translates these limits into the appropriate cgroup definitions. Kubelet creates a three-level deep hierarchy of cgroups for each pod. The top-level is the QoS grouping of the pod (`Guaranteed`, `Burstable`, and `BestEffort`), the second level is the pod itself, and the bottom level are the pod containers. The current Pod API doesn't enable developers to define resource limits at the pod level. This KEP proposes a method for developers to define resource requests and limits on the pod level in addition to the resource requests and limits currently possible on the individual container level.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have the phrase "the Kubelet translates these limits into the appropriate cgroup definitions". For runtimes like kata-containers, there is an additional level of indirection, because each pod is a VM and the cgroups (when they exist) are actually inside the VM.

@vinaykul vinaykul mentioned this pull request Feb 1, 2023
@n4j
Copy link
Member

n4j commented Mar 2, 2023

@n4j - you want to take it over and get it merged as provisional?

@thockin yes

@n4j
Copy link
Member

n4j commented Mar 30, 2023

@thockin Can this be provisionally merged?

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@n4j, I can't approve the PR as it stands - you need to at least set the metadata properly. Also please make sure to use the most up to date KEP template

I just skimmed today, but you could also fix the open comments or put them as UNRESOLVED so we don't lose them

to the resource limit. The current implementation considers a pod to be `Burstable` if at least one container defines resource requests or limits. Any
pods not matching the `Guaranteed` or `Burstable` QoS definitions are placed in the `BestEffort` QoS level.

This proposal suggests modifying these definitions the following:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is awkwardly worded. I think we can find a better way to express it. Something like:

"""

For each core resource type (cpu, memory) a pod may have a pod-scoped resource request and/or limit. Pods which have a defined request or limit for a given resource type are said to "have explicit resources" or "be explicitly resourced". Pods which DO NOT have a defined request or limit for a given resource type are said to "have implicit resources" or "be implicitly resourced".

A pod is considered Guaranteed if, for each core resource type, either of the following is true:

  • It is explicitly resourced for that resource type AND the the pod's request is equal to its limit.
  • It is implicitly resourced for that resource type AND all containers' requests are equal to their limits.

A pod is considered Burstable if, for each core resource type, any of the following is true:

  • It meets the criteria for "Guaranteed" for that resource
  • It is explicitly resourced for that resource type AND the pod's request is specified and less than or equal to its limit.
  • It is implicitly resourced for that resource type AND all containers' requests are specified and less than or equal to their limits.

A pod is considered BestEffort if, for each core resource type:

  • It is implicitly resourced for that resource type AND any container's request is not specified

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe simplify more. Could we say that:

"""
For each core resource type, a Pod's resources are either (evaluated in order):

  • the values specified in the pod's resources stanza
  • the sum of the values specified in all of the pod's containers' resources stanzas, where a single unspecified value means the result is unspecified

A pod is considered guaranteed if, for each core resource type, the pod's request equals the limit.
"""

owning-sig: sig-node
participating-sigs:
- sig-api-machinery
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set this

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2024
@kad
Copy link
Member

kad commented Jan 23, 2024

/cc @kad
/cc @marquiz

@k8s-ci-robot k8s-ci-robot requested review from kad and marquiz January 23, 2024 18:56
than the sum of limit for all the containers, then the limit requested by the containers is unreachable. It becomes impossible for all of the containers to
reach their requested limit, since the sum of available memory becomes the pod-level limit. Note that even though setting `pod.limit[memory] < sum(pod.containers[*].limit[memory])`
would not allow all of the containers to allocate memory up to their limit, this is not considered an error by this proposal (though a warning should be emitted).
If it were considered an error, then any admission controller that injects containers to pods with defined container-level limits might cause the pod to fail in unexpected ways.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually I think the right behavior of these admission controllers would be to add the resources of the injected containers to the pod-level resources.

Comment on lines +172 to +173
If a there are resource request definition on the pod level, even without any definition on any of the included containers, then it should be considered
`Burstable` since there is at least some knowledge as to the amount of resources the pod should receive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do container-level resource requests mean if pod-level requests are set? Are they completely ignored in that case? If so, should that be an error case?

EDIT: Actually, I don't think it should be an error case to account for injected sidecars. Maybe a warning though.

1. Require the pod-level cgroup (if it exists) to apply to the `InitContainers` as well.
1. Allow specifying pod-level restrictions separately for Init containers and sidecars.

Since init containers run sequentially, there is no requirement to share resources with any other container. Therefore the first option is the best - the pod-level cgroup should be restricted only after all of the init containers have finished running.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. I think we should go with the second option for consistency (apply pod-level cgroup to init containers). Consider the case where pod-level limits are defined, and the user didn't bother to set container-level limits. I would expect the pod limits to cap the init containers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallclair I agree with you from the UX point of view.

One problem I see is that if the init container requires much more resources than the regular + sidecar containers, users will be forced to set a very high pod resource limits to accommodate initial surge. This might not be a concern given that we already account for the init container resource requirements during scheduling, but just want to bring it up as it's legit use case.

@thockin
Copy link
Member

thockin commented Feb 7, 2024

I'd be happy to see this topic come bck to life, but we probably need to re-evaluate all preconditions - a lot of things hve changed since it was written :)


A pod is considered `Guaranteed` if either of the following two conditions apply:
1. All of the containers included in the pod define memory and CPU resource requests equal to resource limits
1. The pod includes a pod-level memory and CPU resource request which is equal to the resource limit, even if one or more of the included containers doesn't define either a CPU or a memory request or limit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we address how will pod level resource limits affect ephemeral containers?

Conversely if `pod.limit[memory] > sum(pod.containers[*].limit[memory])` then the excess is not relevant since the individual containers
can never use more than their defined limit, and even if all of the containers use the maximum allowed by their defined memory limit, it can never amount
to the pod-level limit. Setting such a limit on the pod level does not _reserve_ any memory to the pod and will not negatively affect the rest of the
workloads running on the node. Therefore this is not considered an error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be okay (not worth a warning) if we account for ephemeral containers


### User Stories

#### Story 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another use case is to allow better resource utilization since we don't need to specify the limits (i.e., peak usage) for each individual containers. Instead, we do this at the pod level, which allows more flexibility through sharing.

1. Require the pod-level cgroup (if it exists) to apply to the `InitContainers` as well.
1. Allow specifying pod-level restrictions separately for Init containers and sidecars.

Since init containers run sequentially, there is no requirement to share resources with any other container. Therefore the first option is the best - the pod-level cgroup should be restricted only after all of the init containers have finished running.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallclair I agree with you from the UX point of view.

One problem I see is that if the init container requires much more resources than the regular + sidecar containers, users will be forced to set a very high pod resource limits to accommodate initial surge. This might not be a concern given that we already account for the init container resource requirements during scheduling, but just want to bring it up as it's legit use case.

The current implementation (before this proposal) sets up the pod-level cgroup using the maximum between the sum of all the sidecars or the biggest limit for any single Init container.

There are a few options:
1. Limit the pod-level cgroup only after the `InitContainers` have successfully terminated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • What's the behavior if resources are not specified for the init containers?
  • Also, wouldn't this complicate the implementation significantly?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2024
@thockin thockin removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 25, 2024
RomanBednar pushed a commit to RomanBednar/enhancements that referenced this pull request Apr 5, 2024
…-team-lead

add suleymanakbas91 as team lead for edge enablement
@SergeyKanzhelev
Copy link
Member

will this be worked on in 1.31? please let me know so I can mark the KEP #1592 with the correct metadata.

@yujuhong
Copy link
Contributor

will this be worked on in 1.31? please let me know so I can mark the KEP #1592 with the correct metadata.

@ndixita is planning to work on this in 1.31.

@@ -0,0 +1,48 @@
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also add a keps/prod-readiness/sig-auth/1592.yaml file with:

kep-number:1592
alpha:
  approver: "@jpbetz"

? Thanks!

- 2020-06-04 - Updating the KEP after a trial-run on some real-world workloads
- 2020-09-22 - Updating the KEP to allow pod-level resource requests
- 2020-10-07 - Clarified how container-level memory and cpu limits are expected to relate to defined pod limits
- 2020-11-23 - Updated the KEP to the new KEP format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still targeting 1.31 for alpha? If so, please copy in the Production Readiness Review Questionnaire section from the template and fill out the alpha sections.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpbetz #4678 is the revised version of this KEP that is being worked on actively. May I please add you as the PRR reviewer since John might be busy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please sign me up

@ndixita
Copy link
Contributor

ndixita commented Jun 6, 2024

/close

Closing this PR in favor of #4678 to avoid confusion

@k8s-ci-robot
Copy link
Contributor

@ndixita: Closed this PR.

In response to this:

/close

Closing this PR in favor of #4678 to avoid confusion

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.