- Release Signoff Checklist
- Latest Update [Stalled]
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Work on Memory QoS has been halted because of the issues uncovered during the beta promotion process in K8s 1.28. This section documents the valuable lessons learned from that experience. Note: the feature did not graduate to beta in Kubernetes 1.28.
Initial Plan: Use the cgroup v2 memory.high knob to set a memory throttling limit. As per the initial understanding, setting memory.high would cause memory allocation to be slowed down once the memory usage in a container reached the memory.high level. When memory usage keeps growing beyond memory.max, the kernel triggers an OOM kill.
Actual Finding: According to the test results, it was observed that a container process trying to allocate large chunks of memory stops progressing once the memory.high level is reached and stays stuck indefinitely. Upon further investigation, it was observed that when memory usage within a cgroup reaches the memory.high level, the kernel initiates memory reclaim as expected. However, the process gets stuck because its memory consumption rate is faster than what memory reclaim can recover. This creates a livelock situation where the process rapidly consumes the memory reclaimed by the kernel, causing memory usage to reach the memory.high level again and leading to another round of memory reclaim by the kernel. By increasingly slowing growth in memory usage, it becomes harder and harder for workloads to reach the memory.max intervention point. (Ref: https://lkml.org/lkml/2023/6/1/1300)
Future: memory.high can be used to implement kill policies for userspace OOMs, together with Pressure Stall Information (PSI). When workloads get stuck after their memory usage reaches memory.high, high PSI can be used by a userspace OOM policy to kill such workload(s).
Support memory QoS with cgroups v2.
In the traditional cgroups v1 implementation in Kubernetes, we can only limit CPU resources (e.g. cpu_shares / cpu_set / cpu_quota / cpu_period); memory QoS has not been implemented yet. cgroups v2 brings new capabilities for the memory controller and would help Kubernetes enhance memory isolation quality.
- Provide guarantees around memory availability for pod and container memory requests and limits
- Provide guarantees around memory availability for node resource
- Make use of new cgroup v2 memory knobs (memory.min/memory.high) for pod and container level cgroups
- Make use of new cgroup v2 memory knobs (memory.min) for node level cgroups
- Additional QoS design
- Support QoS for other resources
- Consider the QOSReserved feature
This proposal uses the memory controller of cgroups v2 to support memory QoS, guaranteeing pod/container memory requests/limits and node resources.
Currently we only use memory.limit_in_bytes=sum(pod.spec.containers.resources.limits[memory]) with cgroups v1 and memory.max=sum(pod.spec.containers.resources.limits[memory]) with cgroups v2 to limit memory usage. resources.requests[memory] is not yet used by either cgroups v1 or cgroups v2 to protect memory requests. For memory protection, we use oom_scores to determine the order in which container processes are killed when OOM occurs. Besides, the kubelet can only reserve memory from node allocatable at the node level; there is no other memory protection for node resources.
Some memory protection is therefore missing, which may cause:
- Pod/Container memory requests can't be fully reserved; page cache is at risk of being recycled
- Pod/Container memory allocation is not well protected; allocation latency may occur frequently when node memory nearly runs out
- Container memory overcommit is not throttled, which may increase the risk of node memory pressure
- Memory resources of the node can't be fully retained and protected
Cgroups v2 introduces a better way to protect and guarantee memory quality.
File | Description |
---|---|
memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can’t be increased, the system OOM killer will be invoked. We map it to requests.memory . |
memory.max | memory.max is the memory usage hard limit, acting as the final protection mechanism: If a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may go over the memory.high limit temporarily. When the high limit is used and monitored properly, memory.max serves mainly to provide the final safety net. The default is max. We map it to limits.memory as consistent with existing memory.limit_in_bytes for cgroups v1. |
memory.low | memory.low is the best-effort memory protection, a "soft guarantee" that if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can’t be reclaimed from any unprotected cgroups. Not yet considered for now. |
memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. We use a formula to calculate memory.high depending on limits.memory/node allocatable memory and a memory throttling factor. |
This proposal maps requests.memory to memory.min to protect container memory requests. limits.memory is mapped to memory.max (consistent with the existing memory.limit_in_bytes for cgroups v1; no extra work is needed because this is already implemented for cgroups v2). We also introduce memory.high for the container cgroup to throttle container memory overcommit allocation.
Note: memory.high is set for the container-level cgroup, not the pod-level cgroup. If a container in a pod saw a spike in memory usage, it could cause total pod-level memory usage to reach a memory.high level set at the pod-level cgroup, which would induce throttling in the other containers. Hence, to avoid containers affecting each other, we set memory.high only for the container-level cgroup.
It is based on the formula: memory.high = (limits.memory or node allocatable memory) * memory throttling factor, where the default value of the memory throttling factor is 0.8. For example, if a container has requests.memory=50 and limits.memory=100, and the throttling factor is 0.8, memory.high would be 80. If a container has no memory limit specified, we substitute node allocatable memory for limits.memory and apply the throttling factor of 0.8 to that value.
It must be ensured that memory.high is always greater than memory.min.
Node reserved resources (kube-reserved/system-reserved) are also considered. This is tied to --enforce-node-allocatable, and memory.min will be set accordingly.
Brief map as follows:
type | memory.min | memory.high |
---|---|---|
container | requests.memory | limits.memory/node allocatable memory * memory throttling factor |
pod | sum(requests.memory) | N/A |
node | pods, kube-reserved, system-reserved | N/A |
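To make the Alpha v1.22 behavior concrete, the following is a rough sketch only (not the kubelet's actual code; the function name and the convention of returning -1 for "max" are assumptions of this illustration) of the formula together with the memory.high > memory.min constraint:

```go
package main

import "fmt"

// memoryHighV122 sketches the Alpha v1.22 formula:
//   memory.high = (limits.memory or node allocatable memory) * memory throttling factor
// A return value of -1 stands for "max" (memory.high left unset), which happens
// when the computed value would not be greater than memory.min (requests.memory).
func memoryHighV122(requests, limits, nodeAllocatable int64, factor float64) int64 {
	base := limits
	if base == 0 { // no memory limit specified: fall back to node allocatable memory
		base = nodeAllocatable
	}
	high := int64(factor * float64(base))
	if high <= requests { // memory.high must stay above memory.min
		return -1
	}
	return high
}

func main() {
	// Worked example from the text: requests=50, limits=100, factor 0.8 -> 80.
	fmt.Println(memoryHighV122(50, 100, 1000, 0.8)) // 80
	// Requests close to the limit: 0.8*100=80 <= 85, so no throttling ("max").
	fmt.Println(memoryHighV122(85, 100, 1000, 0.8)) // -1
}
```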
The formula for memory.high for the container cgroup was modified in the Alpha stage of the feature in K8s v1.27. It is now set based on the formula:
memory.high = floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory)) / pageSize] * pageSize, where the default value of the memory throttling factor is 0.9
Note: If a container has no memory limit specified, we substitute node allocatable memory for limits.memory and apply the throttling factor of 0.9 to that value.
The table below runs through examples with different requests.memory values, assuming limits.memory = 1000, a memory throttling factor of 0.9, and a 1Mi pageSize:
requests.memory | memory.high |
---|---|
request 0 | 900 |
request 100 | 910 |
request 200 | 920 |
request 300 | 930 |
request 400 | 940 |
request 500 | 950 |
request 600 | 960 |
request 700 | 970 |
request 800 | 980 |
request 900 | 990 |
request 1000 | 1000 |
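The following sketch (illustrative names only, not kubelet code) reproduces the Alpha v1.27 calculation for a few of the rows above, using Mi units and a 1Mi pageSize as in the table:

```go
package main

import "fmt"

// memoryHighV127 sketches the Alpha v1.27 formula:
//   memory.high = floor[(requests.memory + factor*(limits.memory or node allocatable - requests.memory)) / pageSize] * pageSize
// Values here are in Mi units with pageSize = 1Mi, matching the table above.
func memoryHighV127(requests, limits, nodeAllocatable, pageSize int64, factor float64) int64 {
	base := limits
	if base == 0 { // no memory limit specified: fall back to node allocatable memory
		base = nodeAllocatable
	}
	high := int64(float64(requests) + factor*float64(base-requests))
	return (high / pageSize) * pageSize // round down to a page boundary
}

func main() {
	for _, req := range []int64{0, 100, 500, 800, 1000} {
		fmt.Printf("request %d -> memory.high %d\n",
			req, memoryHighV127(req, 1000, 0, 1, 0.9))
	}
	// request 0 -> 900, request 100 -> 910, request 500 -> 950,
	// request 800 -> 980, request 1000 -> 1000
}
```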
Node reserved resources (kube-reserved/system-reserved) are also considered. This is tied to --enforce-node-allocatable, and memory.min will be set accordingly.
Brief map as follows:
type | memory.min | memory.high |
---|---|---|
container | requests.memory | floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory))/pageSize] * pageSize |
pod | sum(requests.memory) | N/A |
node | pods, kube-reserved, system-reserved | N/A |
The formula for memory.high was changed in K8s v1.27 because the Alpha v1.22 implementation has the following problems:

- It fails to throttle when the requested memory is close to the memory limit (or node allocatable), as it results in memory.high being less than requests.memory. For example, if requests.memory = 85, limits.memory = 100, and we have a throttling factor of 0.8, then as per the Alpha implementation memory.high = memory throttling factor * limits.memory, i.e. memory.high = 80. In this case the level at which throttling is supposed to occur, i.e. memory.high, is less than requests.memory. Hence there won't be any throttling, as the Alpha v1.22 implementation doesn't allow memory.high to be less than the requested memory.
- It could result in early throttling, putting processes under early heavy reclaim pressure. For example:
  - requests.memory = 800Mi, memory throttling factor = 0.8, limits.memory = 1000Mi. As per the Alpha v1.22 implementation, memory.high = memory throttling factor * limits.memory = 0.8 * 1000Mi = 800Mi. This results in early throttling and puts the processes under heavy reclaim pressure at 800Mi memory usage; there is a significant gap of 200Mi between the memory throttling limit (800Mi) and the memory usage hard limit (1000Mi).
  - requests.memory = 500Mi, memory throttling factor = 0.6, limits.memory = 1000Mi. As per the Alpha v1.22 implementation, memory.high = memory throttling factor * limits.memory = 0.6 * 1000Mi = 600Mi. Throttling occurs at 600Mi, which is just 100Mi over the requested memory; there is a significant gap of 400Mi between the memory throttling limit (600Mi) and the memory usage hard limit (1000Mi).
- The default throttling factor of 0.8 may be too aggressive for applications that are latency sensitive and always use memory close to their memory limits. For example, some known Java workloads that use 85% of their memory would start to get throttled once this feature is enabled by default. Hence the default 0.8 memoryThrottlingFactor may not be a good value for many applications, as it induces throttling too early.
Some more examples to compare memory.high using Alpha v1.22 and Alpha v1.27 are listed below:
Request, factor (limits.memory = 1000Mi) | Alpha v1.22: memory.high = memory throttling factor * memory.limit (or node allocatable if memory.limit is not set) | Alpha v1.27: memory.high = floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory))/pageSize] * pageSize, assuming 1Mi pageSize |
---|---|---|
request 500Mi, factor 0.6 | 600Mi (very early throttling when memory usage is just 100Mi above requested memory; 400Mi unused) | 800Mi |
request 800Mi, factor 0.6 | no throttling (600 < 800 i.e. memory.high < memory.request => no throttling) | 920Mi |
request 1Gi, factor 0.6 | max | max |
request 500Mi, factor 0.8 | 800Mi (early throttling at 800Mi, when 200Mi is unused) | 900Mi |
request 850Mi, factor 0.8 | no throttling (800 < 850 i.e. memory.high < memory.request => no throttling) | 970Mi |
request 500Mi, factor 0.4 | no throttling (400 < 500 i.e. memory.high < memory.request => no throttling) | 700Mi
Note: As seen from the examples in the table, the formula used in Alpha v1.27 implementation eliminates the cases of memory.high being less than memory.request. However, it still can result in early throttling if memory throttling factor is set low. Hence, it is recommended to set a high memory throttling factor to avoid early throttling.
In addition to the change in the formula for memory.high, we are also adding support for setting memory.high according to the Quality of Service (QoS) class of the pod. Based on user feedback on Alpha v1.22, some users would like to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling. By making their pods Guaranteed, they will be able to do so. Guaranteed pods, by definition, are not overcommitted, so memory.high does not provide significant value for them.
Following are the different cases for setting memory.high per QoS class (a code sketch follows the list):
- Guaranteed: Guaranteed pods, by their QoS definition, require memory requests = memory limits and are not overcommitted. Hence the MemoryQoS feature is disabled on those pods by not setting memory.high. This ensures that Guaranteed pods can fully use their memory requests up to their set limit and not hit any throttling.
- Burstable: Burstable pods, by their QoS definition, require at least one container in the Pod with a CPU or memory request or limit set.
  - Case I: When requests.memory and limits.memory are set, the formula is used as-is: memory.high = floor[(requests.memory + memory throttling factor * (limits.memory - requests.memory)) / pageSize] * pageSize
  - Case II: When requests.memory is set and limits.memory is not set, we substitute node allocatable memory for limits.memory in the formula: memory.high = floor[(requests.memory + memory throttling factor * (node allocatable memory - requests.memory)) / pageSize] * pageSize
  - Case III: When requests.memory is not set and limits.memory is set, we set requests.memory = 0 in the formula: memory.high = floor[(memory throttling factor * limits.memory) / pageSize] * pageSize
- BestEffort: The pod gets the BestEffort class if limits.memory and requests.memory are not set. We set requests.memory = 0 and substitute node allocatable memory for limits.memory in the formula: memory.high = floor[(memory throttling factor * node allocatable memory) / pageSize] * pageSize
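Below is the sketch referenced above: a hedged illustration (not the kubelet implementation; the function, the string-valued QoS class, and the convention of returning -1 for "max" are assumptions of this example) of how memory.high could be selected per QoS class:

```go
package main

import "fmt"

// qosMemoryHigh sketches the per-QoS-class selection of memory.high described
// above. A return value of -1 stands for "max" (memory.high not set). Values
// are in bytes.
func qosMemoryHigh(qosClass string, requests, limits, nodeAllocatable, pageSize int64, factor float64) int64 {
	switch qosClass {
	case "Guaranteed":
		// Guaranteed pods are not overcommitted; memory.high is not set.
		return -1
	case "Burstable":
		base := limits
		if base == 0 { // Case II: no limit set, substitute node allocatable memory
			base = nodeAllocatable
		}
		// Case III (no request set) is covered by requests == 0.
		high := int64(float64(requests) + factor*float64(base-requests))
		return (high / pageSize) * pageSize
	default: // BestEffort: neither requests.memory nor limits.memory is set
		high := int64(factor * float64(nodeAllocatable))
		return (high / pageSize) * pageSize
	}
}

func main() {
	const mi = int64(1 << 20)
	fmt.Println(qosMemoryHigh("Guaranteed", 1000*mi, 1000*mi, 8000*mi, 4096, 0.9)) // -1 (max)
	fmt.Println(qosMemoryHigh("Burstable", 500*mi, 1000*mi, 8000*mi, 4096, 0.9))   // 950Mi in bytes, page-aligned
	fmt.Println(qosMemoryHigh("BestEffort", 0, 0, 8000*mi, 4096, 0.9))             // 7200Mi in bytes, page-aligned
}
```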
Alternative solutions that were discussed (but not preferred) before finalizing the implementation for memory.high are:
- Allow customers to set memoryThrottlingFactor for each pod in annotations.
  Proposal: Add a new annotation for customers to set memoryThrottlingFactor to override the kubelet-level memoryThrottlingFactor.
  - Pros
    - Allows more flexibility.
    - Can be quickly implemented.
  - Cons
    - Customers might not need per-pod memoryThrottlingFactor configuration.
    - It is too low-level a detail to expose to customers.
- Allow customers to set memoryThrottlingFactor in the pod YAML.
  Proposal: Add a new field in the API for customers to set memoryThrottlingFactor to override the kubelet-level memoryThrottlingFactor.
  - Pros
    - Allows more flexibility.
  - Cons
    - Customers might not need per-pod memoryThrottlingFactor configuration.
    - API changes take a lot of time, and we might eventually realize that customers don't need a per-pod setting.
    - It is too low-level a detail to expose to customers, and it is highly unlikely to get API approval.
[Preferred Alternative]: Considering the cons of the alternatives mentioned above, adding QoS-class-based support for memory.high looks preferable to the other solutions for the following reasons:
- Memory QoS aligns with the QoS classes, which are a widely known concept.
- It is simple to understand, as it requires setting only one kubelet configuration option, the memory throttling factor.
- It doesn't involve API changes, and doesn't expose low-level detail to customers.
The feature was planned to be graduated to Beta in v1.28, but was backed out. See the Latest Update [Stalled] section for more details.
Some workloads are sensitive to memory allocation and availability; slight delays may cause a service outage. In this case, a mechanism is needed to ensure the quality of memory. We must provide guarantees in both of the following aspects:
- Retain memory requests to reduce allocation latency
- Protect memory requests from being reclaimed
The stability of the node is very important to users. As a key resource of the node, the availability of memory is a key factor for node stability. We should do something to protect the node's reserved memory.
The Memory Manager is a new component of the kubelet ecosystem proposed to enable single-NUMA and multi-NUMA guaranteed memory allocation at the topology level. The Memory QoS proposal mainly uses cgroups v2 to improve the quality of memory requests, thereby improving the memory QoS of Guaranteed and Burstable pods and even the entire node.
See also https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager
In cgroups v2, memory.low is designed for best-effort memory protection; it is more of a "soft guarantee": the cgroup's memory won't be reclaimed unless memory can't be reclaimed from any unprotected cgroups. memory.min is more aggressive: it always retains the specified amount of memory, which can never be reclaimed. When the requirement is not satisfied, the system OOM killer will be invoked.
n/a
The main risk of this proposal is throttling applications too early. We intend to mitigate this by (1) setting a memory.high that is closer to the limit and (2) only throttling when usage > request.
- Kernel enables cgroups v2 unified hierarchy
- CRI runtime supports cgroups v2 Unified Spec for container level
- Kubelet enables --enforce-node-allocatable=<pods, kube-reserved, system-reserved>
Set --feature-gates=MemoryQoS=true to enable the feature.
- If the container sets requests.memory, we set memory.min=pod.spec.containers[i].resources.requests[memory] for the container level cgroup
- If any container in the pod sets requests.memory, we set memory.min=sum(pod.spec.containers[i].resources.requests[memory]) for the pod level cgroup
- If the container sets limits.memory, we set memory.high=pod.spec.containers[i].resources.limits[memory] * memory throttling factor for the container level cgroup if memory.high>memory.min
- If the container doesn't set limits.memory, we set memory.high=node allocatable memory * memory throttling factor for the container level cgroup
- If kubelet sets --cgroups-per-qos=true, we set memory.min=sum(pod[i].spec.containers[j].resources.requests[memory]) to make ancestor cgroups propagation effective
- There are no changes regarding the memory limit, i.e. memory.max=memory_limits (same as the existing cgroup v2 implementation)
- If kubelet sets --enforce-node-allocatable=kube-reserved, --kube-reserved=[a] and --kube-reserved-cgroup=[b], we set memory.min=[a] for node level cgroup [b]
- If kubelet sets --enforce-node-allocatable=system-reserved, --system-reserved=[a] and --system-reserved-cgroup=[b], we set memory.min=[a] for node level cgroup [b]
- If kubelet sets --enforce-node-allocatable=pods, we set memory.min=sum(pod[i].spec.containers[j].resources.requests[memory]) for the kubepods cgroup
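For illustration, the aggregation described in the two lists above can be sketched as follows (hypothetical helper names, not kubelet code): the pod-level memory.min is the sum of the containers' memory requests, and the kubepods-level memory.min under --enforce-node-allocatable=pods is the sum across all pods:

```go
package main

import "fmt"

// podMemoryMin sketches the pod-level cgroup value described above:
// memory.min = sum of the containers' memory requests (in bytes).
func podMemoryMin(containerRequests []int64) int64 {
	var sum int64
	for _, r := range containerRequests {
		sum += r
	}
	return sum
}

// kubepodsMemoryMin sketches the node-level (kubepods cgroup) value when
// --enforce-node-allocatable=pods is set: the sum of memory requests across
// all pods on the node.
func kubepodsMemoryMin(pods [][]int64) int64 {
	var sum int64
	for _, p := range pods {
		sum += podMemoryMin(p)
	}
	return sum
}

func main() {
	const mi = int64(1 << 20)
	podA := []int64{100 * mi, 200 * mi} // two containers with memory requests
	podB := []int64{500 * mi}
	fmt.Println(podMemoryMin(podA))                       // 300Mi in bytes
	fmt.Println(kubepodsMemoryMin([][]int64{podA, podB})) // 800Mi in bytes
}
```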
A new Unified field will be added in both the CRI and the QoS Manager for cgroups v2 extra parameters. It is recommended that it have the same semantics as opencontainers/runtime-spec#1040:
- container level: Unified added in LinuxContainerResources
- pod/node level: Unified added in cm.ResourceConfig
Container/Pod:
// Container
/cgroup2/kubepods/pod<UID>/<container-id>/memory.min=pod.spec.containers[i].resources.requests[memory]
/cgroup2/kubepods/pod<UID>/<container-id>/memory.high=(pod.spec.containers[i].resources.limits[memory]/node allocatable memory)*memory throttling factor // Burstable
// Pod
/cgroup2/kubepods/pod<UID>/memory.min=sum(pod.spec.containers[i].resources.requests[memory])
// QoS ancestor cgroup
/cgroup2/kubepods/burstable/memory.min=sum(pod[i].spec.containers[j].resources.requests[memory])
Node:
/cgroup2/kubepods/memory.min=sum(pod[i].spec.containers[j].resources.requests[memory])
/cgroup2/<kube-reserved-cgroup,system-reserved-cgroup>/memory.min=<kube-reserved,system-reserved>
After Kubernetes v1.19, the kubelet can identify cgroups v2 and do the conversion. Since v1.0.0-rc93, runc supports Unified to pass through cgroups v2 parameters. So we use this field to pass memory.min when cgroups v2 mode is detected.
We need to add a new field Unified in the CRI API, which is basically a passthrough for the OCI spec Unified field and has the same semantics: opencontainers/runtime-spec#1040
type LinuxContainerResources struct {
	...
	Unified map[string]string `json:"unified,omitempty"`
}
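As a hedged sketch of how memory.min and memory.high could then be passed through the Unified map (using a simplified local copy of the type rather than the generated CRI struct; the keys follow the cgroup v2 interface file names per the OCI runtime-spec unified semantics, and the byte values are illustrative):

```go
package main

import (
	"fmt"
	"strconv"
)

// LinuxContainerResources mirrors, in simplified form, the CRI type above;
// only the Unified passthrough field is shown here.
type LinuxContainerResources struct {
	Unified map[string]string `json:"unified,omitempty"`
}

func main() {
	const mi = int64(1 << 20)
	memoryMin, memoryHigh := 500*mi, 900*mi // example request / throttle values in bytes

	// Keys are cgroup v2 interface file names; values are their string contents.
	res := LinuxContainerResources{
		Unified: map[string]string{
			"memory.min":  strconv.FormatInt(memoryMin, 10),
			"memory.high": strconv.FormatInt(memoryHigh, 10),
		},
	}
	fmt.Printf("%+v\n", res)
}
```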
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Overall Test plan:
For Alpha, unit tests were added to test functionality for container/pod/node level cgroups with containerd and CRI-O. For the second alpha iteration (1.27), we plan to add new node e2e tests to validate that the MemoryQoS settings are applied correctly.
- pkg/kubelet/cm: 02/09/2023 - 65.6%
n/a: plan to use node e2e tests (see below)
As part of alpha, we plan to add a new node e2e test to validate that the MemoryQoS settings are correctly set on both pods as well as node allocatable. The test will reside in test/e2e_node.
- cgroup_v2 is in Alpha
- Memory QoS is implemented for new feature gate
- Memory QoS is covered by proper tests
- Memory QoS supports containerd, cri-o
- cgroup_v2 is in Beta
- Metrics and graphs to show the amount of reclaim done on a cgroup as it moves from below-request to above-request to throttling
- Memory QoS is covered by unit and e2e-node tests
- Memory QoS supports containerd, cri-o and dockershim
- Expose memory events e.g. memory.high field of memory.events which can inform how many times memory.high was breached and the cgroup was throttled. https://docs.kernel.org/admin-guide/cgroup-v2.html
- cgroup_v2 is in GA
- Memory QoS has been in beta for at least 2 releases
- Memory QoS sees use in 3 projects or articles
- Memory QoS is covered by conformance tests
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: MemoryQoS
  - Components depending on the feature gate: kubelet
Yes, the kubelet will set memory.min for Guaranteed and Burstable pod/container level cgroups. It will also set memory.high for Burstable and BestEffort containers, which may cause memory allocation to be slowed down if the memory usage in a container reaches the memory.high level. memory.min for the QoS or node level cgroup will be set when --cgroups-per-qos or --enforce-node-allocatable is satisfied.
Yes, the related cgroups can be rolled back; memory.min/memory.high will be reset to their default values.
The kubelet will reconcile memory.min/memory.high
with related cgroups.
Yes, some unit tests are exercised with the feature both enabled and disabled to verify proper behavior in both cases. When enabled, we test whether memory.min/memory.high are set to the proper values for workloads and node cgroups. When a transition from enabled to disabled happens, we verify that memory.min/memory.high are reset to their default values.
N/A. There's no API change involved. MemoryQoS is a kubelet-level flag that will be enabled by default in Beta. It doesn't require any special opt-in by the user in their PodSpec.
The kubelet will reconcile memory.min/memory.high
with related cgroups depending on whether the feature gate is enabled or not separately for each node.
Already running workloads will not have memory.min/memory.high set at the pod level. Only memory.min will be set at the node level cgroup when the kubelet restarts. The existing workloads will be impacted only when the kernel isn't able to maintain at least the memory.min level of memory for the non-guaranteed workloads within the node level cgroup.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
An operator could run ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<SOME_ID>.slice on a node with cgroup v2 enabled to confirm the presence of the memory.min file, which tells us that the feature is in use by the workloads.
- [ ] Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: Kernel memory events will be available in kubelet logs via cadvisor. These events will inform about the number of times memory.min and memory.high levels were breached.
N/A. Same as when running without this feature.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details: Not a service
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
The container runtime must also support cgroup v2
No new API calls will be generated.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No, resources like PIDs, sockets, inodes will not be affected. However, additional memory throttling can be experienced which is intended by this feature.
- 2020/03/14: initial proposal
- 2020/05/05: target Alpha to v1.22
- 2023/03/03: target Alpha v2 to v1.27
- 2023/06/14: target Beta to v1.28
The main drawbacks are concerns about unintended memory throttling and additional complexity due to the utilization of several new cgroup v2 based memory controls (e.g. memory.low, memory.high). However, we believe that the impact of unintended throttling will be minimized by a high throttling factor (see above), and the additional complexity is justified by the additional resource management benefits.
Please refer to alternatives mentioned above in the proposal section, which discusses the alternatives and changes from the original alpha design to the newly updated alpha design.
n/a; no new infrastructure is needed. This KEP aims to reuse the existing node e2e jobs and framework.