KEP-2837: Pod level resource limits #1592
Conversation
Welcome @liorokman!
Hi @liorokman. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @dchen1107
Kind reminder - can anyone review this?
@liorokman the PR seems to have carried along a number of other KEPs. Can you update the PR to just have the KEP presented? Are you able to present a high-level overview of the proposal at a future SIG Node meeting?
/assign |
I understand the motivation for the feature.
Have you thought about the implications of this feature with vertical pod autoscaling?
Have you thought about how we could map this pod bounding behavior potentially to a RuntimeClass or kubelet configuration rather than expose the knob directly in the end-user pod spec? I am not sure the average end-user would understand `shareBurstableLimits`.
This proposal aims to:
* Allow a `Burstable` pod to define that memory and CPU resource limits should be set at the level of the Pod.
The same arguments for the Burstable tier would apply to the Guaranteed tier. I could see users that do not want to specify explicit requests/limits for each container, but just want to bound the pod with fixed resources.
The current behavior for the Guaranteed tier is already exactly this. A pod is in the `Guaranteed` QoS if and only if every container in the pod has a request equal to the limit. If this is the case, the current code already sets the pod-level cgroup to the sum of the limits for each of the resources.
But because in the `Guaranteed` case all of the containers also have a limit set on them, setting any limit on the pod-level cgroup is meaningless: the pod-level cgroup limit is by definition greater than any individual container's limit, which means that the individual container limit would apply before the pod-level limit would.
A pod that wants to set pod-level fixed resources and not container-level resources is by definition not in the `Guaranteed` QoS level.
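As a sketch of the current behavior described above (container names, image, and numbers are illustrative, not from the KEP): every container sets requests equal to limits, so the pod-level cgroup ends up sized at the sum of those limits and never binds before the tighter per-container limits.

```yaml
# Illustrative only: a Guaranteed pod under today's rules.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example        # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: 500m, memory: 128Mi}
      limits:   {cpu: 500m, memory: 128Mi}   # request == limit for every resource
  - name: sidecar
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: 250m, memory: 64Mi}
      limits:   {cpu: 250m, memory: 64Mi}
# Pod-level cgroup limits become the sums (cpu: 750m, memory: 192Mi), which the
# per-container limits always undercut.
```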
<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: unlimited, cpu: unlimited)
For clarity, CPU is not unlimited... it is setting shares; just the quota is unbounded.
Correct, I will fix the KEP.
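A brief sketch of that distinction (values illustrative): a container with a CPU request but no CPU limit still gets a CPU weight derived from the request; only the CFS quota is left unset.

```yaml
# Illustrative only: CPU request without a CPU limit.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-cpu-example     # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 500m    # translated into cpu shares (a relative weight)
      # no cpu limit => no CFS quota is set, so usage is not capped,
      # but the pod still competes according to its shares
```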
<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: 256M limit, CPU: 1 core)
The biggest benefit for this is when it's difficult to size two containers that may individually write to a memory-backed volume.
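A sketch of that sizing problem (all names, image, and values are illustrative): two containers share a memory-backed `emptyDir`, so any split of the memory limit between them is guesswork, while a single pod-level limit would bound the combined usage.

```yaml
# Illustrative only: two writers sharing a memory-backed volume.
apiVersion: v1
kind: Pod
metadata:
  name: shared-tmpfs-example      # hypothetical name
spec:
  volumes:
  - name: scratch
    emptyDir:
      medium: Memory              # tmpfs; usage is charged to whichever container writes
  containers:
  - name: producer
    image: registry.k8s.io/pause:3.9
    volumeMounts: [{name: scratch, mountPath: /scratch}]
  - name: consumer
    image: registry.k8s.io/pause:3.9
    volumeMounts: [{name: scratch, mountPath: /scratch}]
# Either container may account for most of the tmpfs usage at any moment, so a
# per-container memory split is arbitrary; a pod-level memory limit is not.
```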
The pod resource overhead feature is in flight; it would be good to combine this example with a pod that has a RuntimeClass with an overhead. I also wonder, as I type this, whether it's better to associate an attribute on the RuntimeClass that says whether the pod should be bounded, rather than change the pod spec itself. Have you explored this idea?
Can you link to the pod resource overhead feature? I'm not familiar with it.
Intuitively, the pod level seems to me to be a better fit, since this sort of resource allocation is not dependent on any specific runtime and has no meaning outside of the `Burstable` QoS level. But I'd love to read about the pod resource overhead feature and give it some thought...
While we are waiting for the updated Beta docs to merge, I'd suggest checking out the PR: https://github.com/kubernetes/website/pull/19059/files
Thanks, I'll go over this.
Done.
I'd be happy to. When is the next SIG Node meeting?
If this feature is turned on, the VPA could be set to update the budget of a single container in the pod, and also to monitor the pod-level cgroup for actual maximum resource consumption. This would allow the VPA to work correctly for pods with this behavior enabled as well. In effect, the VPA would work the same, except that it would scale the memory for the entire pod instead of working at the container level. By allowing the deployment to specify a resource budget for the entire pod and having the Linux kernel handle the details of letting each container allocate what it needs, when it needs it, the end result is a more flexible arrangement. This is a win-win scenario:
I see this feature as specific configuration for a specific pod, not as a general configuration that applies to all of the pods running on the node. I'm not sure how it would map to a RuntimeClass or kubelet configuration. Maybe a different attribute name would be clearer?
/cc
Force-pushed from b546008 to 73a9b6b.
@n4j - you want to take it over and get it merged as provisional?
## Summary
Pod resources are currently enforced on a container-by-container basis. Each container defines a limit for each managed resource, and the Kubelet translates these limits into the appropriate cgroup definitions. The Kubelet creates a three-level-deep hierarchy of cgroups for each pod: the top level is the QoS grouping of the pod (`Guaranteed`, `Burstable`, and `BestEffort`), the second level is the pod itself, and the bottom level is the pod's containers. The current Pod API doesn't enable developers to define resource limits at the pod level. This KEP proposes a method for developers to define resource requests and limits at the pod level in addition to the resource requests and limits currently possible at the individual container level.
I wonder if we should have the phrase "the Kubelet translates these limits into the appropriate cgroup definitions". For runtimes like kata-containers, there is an additional level of indirection, because each pod is a VM and the cgroups (when they exist) are actually inside the VM.
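To make the Summary above concrete, here is a rough sketch of what a pod-scoped resources stanza might look like; the field name, placement, and values are assumptions for illustration, not the API defined by this KEP.

```yaml
# Hypothetical shape, for illustration only.
apiVersion: v1
kind: Pod
metadata:
  name: pod-level-limits-example
spec:
  resources:                     # assumed pod-scoped stanza; name/placement not final
    requests: {cpu: "1", memory: 256Mi}
    limits:   {cpu: "2", memory: 512Mi}
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
  - name: sidecar
    image: registry.k8s.io/pause:3.9
# Neither container sets its own limits; the pod-level limits bound them collectively.
```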
@thockin Can this be provisionally merged?
@n4j, I can't approve the PR as it stands - you need to at least set the metadata properly. Also please make sure to use the most up-to-date KEP template.
I just skimmed today, but you could also fix the open comments or mark them as UNRESOLVED so we don't lose them.
to the resource limit. The current implementation considers a pod to be `Burstable` if at least one container defines resource requests or limits. Any
pods not matching the `Guaranteed` or `Burstable` QoS definitions are placed in the `BestEffort` QoS level.

This proposal suggests modifying these definitions as follows:
This section is awkwardly worded. I think we can find a better way to express it. Something like:
"""
For each core resource type (cpu, memory) a pod may have a pod-scoped resource request and/or limit. Pods which have a defined request or limit for a given resource type are said to "have explicit resources" or "be explicitly resourced". Pods which DO NOT have a defined request or limit for a given resource type are said to "have implicit resources" or "be implicitly resourced".
A pod is considered `Guaranteed` if, for each core resource type, either of the following is true:
- It is explicitly resourced for that resource type AND the pod's request is equal to its limit.
- It is implicitly resourced for that resource type AND all containers' requests are equal to their limits.
A pod is considered `Burstable` if, for each core resource type, any of the following is true:
- It meets the criteria for "Guaranteed" for that resource
- It is explicitly resourced for that resource type AND the pod's request is specified and less than or equal to its limit.
- It is implicitly resourced for that resource type AND all containers' requests are specified and less than or equal to their limits.
A pod is considered `BestEffort` if, for each core resource type:
- It is implicitly resourced for that resource type AND any container's request is not specified
"""
We could maybe simplify more. Could we say that:
"""
For each core resource type, a Pod's resources are either (evaluated in order):
- the values specified in the pod's resources stanza
- the sum of the values specified in all of the pod's containers' resources stanzas, where a single unspecified value means the result is unspecified
A pod is considered `Guaranteed` if, for each core resource type, the pod's request equals the limit.
"""
owning-sig: sig-node
participating-sigs:
- sig-api-machinery
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
set this
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
than the sum of limits for all the containers, then the limit requested by the containers is unreachable. It becomes impossible for all of the containers to
reach their requested limit, since the sum of available memory becomes the pod-level limit. Note that even though setting `pod.limit[memory] < sum(pod.containers[*].limit[memory])`
would not allow all of the containers to allocate memory up to their limit, this is not considered an error by this proposal (though a warning should be emitted).
If it were considered an error, then any admission controller that injects containers into pods with defined container-level limits might cause the pod to fail in unexpected ways.
Eventually I think the right behavior of these admission controllers would be to add the resources of the injected containers to the pod-level resources.
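A sketch of the `pod.limit[memory] < sum(container limits)` case discussed above, e.g. after a mutating webhook injects a sidecar without adjusting the pod-level limit (names, image, and numbers are illustrative, and the pod-scoped stanza is a hypothetical shape):

```yaml
# Illustrative only: pod-level memory limit below the sum of container limits.
apiVersion: v1
kind: Pod
metadata:
  name: undersized-pod-limit-example
spec:
  resources:                     # hypothetical pod-scoped stanza
    limits: {memory: 256Mi}
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 192Mi}
  - name: injected-sidecar       # e.g. added by a mutating admission webhook
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 192Mi}
# sum(container limits) = 384Mi > pod limit = 256Mi: not every container can reach
# its own limit. Per the proposal this is accepted with a warning, not rejected.
```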
If there is a resource request definition on the pod level, even without any definition on any of the included containers, then it should be considered
`Burstable`, since there is at least some knowledge as to the amount of resources the pod should receive.
What do container-level resource requests mean if pod-level requests are set? Are they completely ignored in that case? If so, should that be an error case?
EDIT: Actually, I don't think it should be an error case to account for injected sidecars. Maybe a warning though.
1. Require the pod-level cgroup (if it exists) to apply to the `InitContainers` as well.
1. Allow specifying pod-level restrictions separately for Init containers and sidecars.

Since init containers run sequentially, there is no requirement to share resources with any other container. Therefore the first option is the best - the pod-level cgroup should be restricted only after all of the init containers have finished running.
I disagree. I think we should go with the second option for consistency (apply pod-level cgroup to init containers). Consider the case where pod-level limits are defined, and the user didn't bother to set container-level limits. I would expect the pod limits to cap the init containers.
@tallclair I agree with you from the UX point of view.
One problem I see is that if the init container requires much more resources than the regular + sidecar containers, users will be forced to set very high pod resource limits to accommodate the initial surge. This might not be a concern given that we already account for the init container resource requirements during scheduling, but I just want to bring it up as it's a legit use case.
I'd be happy to see this topic come back to life, but we probably need to re-evaluate all preconditions - a lot of things have changed since it was written :)
A pod is considered `Guaranteed` if either of the following two conditions applies:
1. All of the containers included in the pod define memory and CPU resource requests equal to resource limits
1. The pod includes a pod-level memory and CPU resource request which is equal to the resource limit, even if one or more of the included containers doesn't define either a CPU or a memory request or limit.
Should we address how pod-level resource limits will affect ephemeral containers?
Conversely, if `pod.limit[memory] > sum(pod.containers[*].limit[memory])`, then the excess is not relevant, since the individual containers
can never use more than their defined limit, and even if all of the containers use the maximum allowed by their defined memory limit, it can never amount
to the pod-level limit. Setting such a limit on the pod level does not _reserve_ any memory for the pod and will not negatively affect the rest of the
workloads running on the node. Therefore this is not considered an error.
Might be okay (not worth a warning) if we account for ephemeral containers
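The opposite case, for comparison (illustrative numbers, hypothetical pod-scoped stanza): the pod-level limit exceeds the sum of container limits, so the excess is unreachable by the declared containers (ephemeral containers aside, as noted above) and nothing is reserved.

```yaml
# Illustrative only: pod-level memory limit above the sum of container limits.
apiVersion: v1
kind: Pod
metadata:
  name: oversized-pod-limit-example
spec:
  resources:                     # hypothetical pod-scoped stanza
    limits: {memory: 1Gi}
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 256Mi}
  - name: sidecar
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 256Mi}
# sum(container limits) = 512Mi < pod limit = 1Gi: the extra 512Mi can never be
# used by these containers and reserves nothing, so it is not treated as an error.
```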
### User Stories

#### Story 1
Another use case is to allow better resource utilization, since we don't need to specify the limits (i.e., peak usage) for each individual container. Instead, we do this at the pod level, which allows more flexibility through sharing.
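A sketch of that utilization story (illustrative names and values, hypothetical pod-scoped stanza): two containers that peak at different times share one pod-level budget instead of each being sized for its own peak.

```yaml
# Illustrative only: containers that burst at different times share one budget.
apiVersion: v1
kind: Pod
metadata:
  name: shared-budget-example
spec:
  resources:                     # hypothetical pod-scoped stanza
    limits: {cpu: "2", memory: 512Mi}
  containers:
  - name: web                    # peaks during request bursts
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: 250m, memory: 128Mi}
  - name: batch                  # peaks during periodic jobs
    image: registry.k8s.io/pause:3.9
    resources:
      requests: {cpu: 250m, memory: 128Mi}
# Sizing each container for its own peak would roughly double the declared limits;
# one shared pod-level limit lets the kernel arbitrate whichever container bursts.
```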
The current implementation (before this proposal) sets up the pod-level cgroup using the maximum of the sum of all the sidecars' limits and the biggest limit for any single init container.

There are a few options:
1. Limit the pod-level cgroup only after the `InitContainers` have successfully terminated
- What's the behavior if resources are not specified for the init containers?
- Also, wouldn't this complicate the implementation significantly?
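A worked example of the current sizing rule quoted above (names, image, and numbers are illustrative): the pod-level cgroup is sized at the larger of the sum of the long-running containers' limits and the largest single init container limit.

```yaml
# Illustrative only: how the pod-level cgroup is sized today with init containers.
apiVersion: v1
kind: Pod
metadata:
  name: init-sizing-example
spec:
  initContainers:
  - name: migrate
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 512Mi}    # largest single init container limit
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 128Mi}
  - name: sidecar
    image: registry.k8s.io/pause:3.9
    resources:
      limits: {memory: 128Mi}
# max( sum(app, sidecar) = 256Mi , largest init = 512Mi ) = 512Mi,
# so the pod-level memory limit is sized at 512Mi.
```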
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Will this be worked on in 1.31? Please let me know so I can mark KEP #1592 with the correct metadata.
@@ -0,0 +1,48 @@
---
Would you also add a keps/prod-readiness/sig-auth/1592.yaml file with:
kep-number: 1592
alpha:
  approver: "@jpbetz"
? Thanks!
- 2020-06-04 - Updating the KEP after a trial-run on some real-world workloads
- 2020-09-22 - Updating the KEP to allow pod-level resource requests
- 2020-10-07 - Clarified how container-level memory and cpu limits are expected to relate to defined pod limits
- 2020-11-23 - Updated the KEP to the new KEP format.
Is this still targeting 1.31 for alpha? If so, please copy in the Production Readiness Review Questionnaire section from the template and fill out the alpha sections.
Yes, please sign me up
/close
Closing this PR in favor of #4678 to avoid confusion
@ndixita: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
keps/sig-node: add KEP for shared-burstable-limits - pod level resource limits