- Release Signoff Checklist
- Latest Update [Stalled]
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Work on Memory QoS has been halted because of the issues uncovered during the beta promotion process in K8s 1.28. This section documents the valuable lessons learned from that experience. Note: the feature did not graduate to beta in Kubernetes 1.28.
Initial Plan: Use the cgroup v2 memory.high knob to set a memory throttling limit. As per the initial understanding, setting memory.high would cause memory allocation to be slowed down once the memory usage in a container reached the memory.high level. When memory usage keeps growing beyond memory.max, the kernel triggers an OOM kill.
Actual Finding: According to the test results, it was observed that a container process trying to allocate large chunks of memory stops progressing once the memory.high level is reached and stays stuck indefinitely. Upon further investigation, it was observed that when memory usage within a cgroup reaches the memory.high level, the kernel initiates memory reclaim as expected. However, the process gets stuck because its memory consumption rate is faster than what memory reclaim can recover. This creates a livelock situation where the process rapidly consumes the memory reclaimed by the kernel, causing memory usage to reach the memory.high level again and leading to another round of memory reclaim by the kernel. By increasingly slowing growth in memory usage, it becomes harder and harder for workloads to reach the memory.max intervention point. (Ref: https://lkml.org/lkml/2023/6/1/1300)
Future: memory.high can be used to implement kill policies for userspace OOMs, together with Pressure Stall Information (PSI). When workloads get stuck after their memory usage reaches memory.high, high PSI can be used by a userspace OOM policy to kill such workload(s).
Support memory QoS with cgroups v2.
In the traditional cgroups v1 implementation in Kubernetes, we can only limit CPU resources (e.g. cpu_shares / cpu_set / cpu_quota / cpu_period); memory QoS has not been implemented yet. cgroups v2 brings new capabilities for the memory controller and would help Kubernetes enhance memory isolation quality.
- Provide guarantees around memory availability for pod and container memory requests and limits
- Provide guarantees around memory availability for node resource
- Make use of new cgroup v2 memory knobs (memory.min/memory.high) for pod and container level cgroups
- Make use of new cgroup v2 memory knobs (memory.min) for node level cgroups
- Additional QoS design
- Support QoS for other resources
- Consider the QOSReserved feature
This proposal uses the memory controller of cgroups v2 to support memory QoS, guaranteeing pod/container memory requests/limits and node resources.
Currently we only use memory.limit_in_bytes=sum(pod.spec.containers.resources.limits[memory]) with cgroups v1 and memory.max=sum(pod.spec.containers.resources.limits[memory]) with cgroups v2 to limit memory usage. resources.requests[memory] is not yet used by either cgroups v1 or cgroups v2 to protect memory requests. For memory protection, we use oom_scores to determine the order in which container processes are killed when OOM occurs. Besides, the kubelet can only reserve memory from node allocatable at the node level; there is no other memory protection for node resources.
Some memory protection is therefore missing, which may cause:
- Pod/Container memory requests can't be fully reserved; page cache is at risk of being recycled
- Pod/Container memory allocation is not well protected; allocation latency may occur frequently when node memory nearly runs out
- Container memory overcommit is not throttled, which may increase the risk of node memory pressure
- Memory resources of the node can't be fully retained and protected
Cgroups v2 introduces a better way to protect and guarantee memory quality.
File | Description |
---|---|
memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can’t be increased, the system OOM killer will be invoked. We map it to requests.memory . |
memory.max | memory.max is the memory usage hard limit, acting as the final protection mechanism: If a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may go over the memory.high limit temporarily. When the high limit is used and monitored properly, memory.max serves mainly to provide the final safety net. The default is max. We map it to limits.memory as consistent with existing memory.limit_in_bytes for cgroups v1. |
memory.low | memory.low is the best-effort memory protection, a "soft guarantee" that if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can’t be reclaimed from any unprotected cgroups. Not yet considered for now. |
memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. We use a formula to calculate memory.high depending on limits.memory/node allocatable memory and a memory throttling factor. |
This proposal maps requests.memory to memory.min to protect container memory requests. limits.memory is mapped to memory.max (consistent with the existing memory.limit_in_bytes for cgroups v1; no extra work is needed because this is already implemented for cgroups v2). We also introduce memory.high for the container cgroup to throttle container memory overcommit allocation.
Note: memory.high is set for the container-level cgroup, not the pod-level cgroup. If a container in a pod saw a spike in memory usage, it could cause total pod-level memory usage to reach a memory.high level set at the pod-level cgroup, which would induce throttling in the other containers. Hence, to avoid containers affecting each other, we set memory.high only for the container-level cgroup.
It is based on the formula: memory.high = (limits.memory or node allocatable memory) * memory throttling factor, where the default value of the memory throttling factor is 0.8. For example, if a container has requests.memory=50 and limits.memory=100, and the throttling factor is 0.8, memory.high would be 80. If a container has no memory limit specified, we substitute node allocatable memory for limits.memory and apply the throttling factor of 0.8 to that value.
It must be ensured that memory.high is always greater than memory.min.
Node reserved resources (kube-reserved/system-reserved) are also considered. This is tied to --enforce-node-allocatable, and memory.min will be set accordingly.
Brief map as follows:
type | memory.min | memory.high |
---|---|---|
container | requests.memory | limits.memory/node allocatable memory * memory throttling factor |
pod | sum(requests.memory) | N/A |
node | pods, kube-reserved, system-reserved | N/A |
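To make the Alpha v1.22 behavior concrete, the following is a rough sketch only (not the kubelet's actual code; the function name and the convention of returning -1 for "max" are assumptions of this illustration) of the formula together with the memory.high > memory.min constraint:

```go
package main

import "fmt"

// memoryHighV122 sketches the Alpha v1.22 formula:
//   memory.high = (limits.memory or node allocatable memory) * memory throttling factor
// A return value of -1 stands for "max" (memory.high left unset), which happens
// when the computed value would not be greater than memory.min (requests.memory).
func memoryHighV122(requests, limits, nodeAllocatable int64, factor float64) int64 {
	base := limits
	if base == 0 { // no memory limit specified: fall back to node allocatable memory
		base = nodeAllocatable
	}
	high := int64(factor * float64(base))
	if high <= requests { // memory.high must stay above memory.min
		return -1
	}
	return high
}

func main() {
	// Worked example from the text: requests=50, limits=100, factor 0.8 -> 80.
	fmt.Println(memoryHighV122(50, 100, 1000, 0.8)) // 80
	// Requests close to the limit: 0.8*100=80 <= 85, so no throttling ("max").
	fmt.Println(memoryHighV122(85, 100, 1000, 0.8)) // -1
}
```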
The formula for memory.high for the container cgroup was modified in the Alpha stage of the feature in K8s v1.27. It is now set based on the formula:
memory.high = floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory)) / pageSize] * pageSize, where the default value of the memory throttling factor is 0.9
Note: If a container has no memory limit specified, we substitute node allocatable memory for limits.memory and apply the throttling factor of 0.9 to that value.
The table below runs through examples with different requests.memory values, assuming limits.memory = 1000, a memory throttling factor of 0.9, and a 1Mi pageSize:
requests.memory | memory.high |
---|---|
request 0 | 900 |
request 100 | 910 |
request 200 | 920 |
request 300 | 930 |
request 400 | 940 |
request 500 | 950 |
request 600 | 960 |
request 700 | 970 |
request 800 | 980 |
request 900 | 990 |
request 1000 | 1000 |
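The following sketch (illustrative names only, not kubelet code) reproduces the Alpha v1.27 calculation for a few of the rows above, using Mi units and a 1Mi pageSize as in the table:

```go
package main

import "fmt"

// memoryHighV127 sketches the Alpha v1.27 formula:
//   memory.high = floor[(requests.memory + factor*(limits.memory or node allocatable - requests.memory)) / pageSize] * pageSize
// Values here are in Mi units with pageSize = 1Mi, matching the table above.
func memoryHighV127(requests, limits, nodeAllocatable, pageSize int64, factor float64) int64 {
	base := limits
	if base == 0 { // no memory limit specified: fall back to node allocatable memory
		base = nodeAllocatable
	}
	high := int64(float64(requests) + factor*float64(base-requests))
	return (high / pageSize) * pageSize // round down to a page boundary
}

func main() {
	for _, req := range []int64{0, 100, 500, 800, 1000} {
		fmt.Printf("request %d -> memory.high %d\n",
			req, memoryHighV127(req, 1000, 0, 1, 0.9))
	}
	// request 0 -> 900, request 100 -> 910, request 500 -> 950,
	// request 800 -> 980, request 1000 -> 1000
}
```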
Node reserved resources (kube-reserved/system-reserved) are also considered. This is tied to --enforce-node-allocatable, and memory.min will be set accordingly.
Brief map as follows:
type | memory.min | memory.high |
---|---|---|
container | requests.memory | floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory))/pageSize] * pageSize |
pod | sum(requests.memory) | N/A |
node | pods, kube-reserved, system-reserved | N/A |
The formula for memory.high was changed in K8s v1.27 because the Alpha v1.22 implementation has the following problems:

- It fails to throttle when the requested memory is close to the memory limit (or node allocatable), as it results in memory.high being less than requests.memory. For example, if requests.memory = 85, limits.memory = 100, and we have a throttling factor of 0.8, then as per the Alpha implementation memory.high = memory throttling factor * limits.memory, i.e. memory.high = 80. In this case the level at which throttling is supposed to occur, i.e. memory.high, is less than requests.memory. Hence there won't be any throttling, as the Alpha v1.22 implementation doesn't allow memory.high to be less than the requested memory.
- It could result in early throttling, putting processes under early heavy reclaim pressure. For example:
  - requests.memory = 800Mi, memory throttling factor = 0.8, limits.memory = 1000Mi. As per the Alpha v1.22 implementation, memory.high = memory throttling factor * limits.memory = 0.8 * 1000Mi = 800Mi. This results in early throttling and puts the processes under heavy reclaim pressure at 800Mi memory usage; there is a significant gap of 200Mi between the memory throttling limit (800Mi) and the memory usage hard limit (1000Mi).
  - requests.memory = 500Mi, memory throttling factor = 0.6, limits.memory = 1000Mi. As per the Alpha v1.22 implementation, memory.high = memory throttling factor * limits.memory = 0.6 * 1000Mi = 600Mi. Throttling occurs at 600Mi, which is just 100Mi over the requested memory; there is a significant gap of 400Mi between the memory throttling limit (600Mi) and the memory usage hard limit (1000Mi).
- The default throttling factor of 0.8 may be too aggressive for applications that are latency sensitive and always use memory close to their memory limits. For example, some known Java workloads that use 85% of their memory would start to get throttled once this feature is enabled by default. Hence the default 0.8 memoryThrottlingFactor may not be a good value for many applications, as it induces throttling too early.
Some more examples to compare memory.high using Alpha v1.22 and Alpha v1.27 are listed below:
Request, factor (limits.memory = 1000Mi) | Alpha v1.22: memory.high = memory throttling factor * memory.limit (or node allocatable if memory.limit is not set) | Alpha v1.27: memory.high = floor[(requests.memory + memory throttling factor * (limits.memory or node allocatable memory - requests.memory))/pageSize] * pageSize, assuming 1Mi pageSize |
---|---|---|
request 500Mi, factor 0.6 | 600Mi (very early throttling when memory usage is just 100Mi above requested memory; 400Mi unused) | 800Mi |
request 800Mi, factor 0.6 | no throttling (600 < 800 i.e. memory.high < memory.request => no throttling) | 920Mi |
request 1Gi, factor 0.6 | max | max |
request 500Mi, factor 0.8 | 800Mi (early throttling at 800Mi, when 200Mi is unused) | 900Mi |
request 850Mi, factor 0.8 | no throttling (800 < 850 i.e. memory.high < memory.request => no throttling) | 970Mi |
request 500Mi, factor 0.4 | no throttling (400 < 500 i.e. memory.high < memory.request => no throttling) | 700Mi
Note: As seen from the examples in the table, the formula used in Alpha v1.27 implementation eliminates the cases of memory.high being less than memory.request. However, it still can result in early throttling if memory throttling factor is set low. Hence, it is recommended to set a high memory throttling factor to avoid early throttling.
In addition to the change in the formula for memory.high, we are also adding support for setting memory.high according to the Quality of Service (QoS) class of the pod. Based on user feedback on Alpha v1.22, some users would like to opt out of MemoryQoS on a per-pod basis to ensure there is no early memory throttling. By making their pods Guaranteed, they will be able to do so. Guaranteed pods, by definition, are not overcommitted, so memory.high does not provide significant value for them.
Following are the different cases for setting memory.high per QoS class (a code sketch follows the list):
- Guaranteed: Guaranteed pods, by their QoS definition, require memory requests = memory limits and are not overcommitted. Hence the MemoryQoS feature is disabled on those pods by not setting memory.high. This ensures that Guaranteed pods can fully use their memory requests up to their set limit and not hit any throttling.
- Burstable: Burstable pods, by their QoS definition, require at least one container in the Pod with a CPU or memory request or limit set.
  - Case I: When requests.memory and limits.memory are set, the formula is used as-is: memory.high = floor[(requests.memory + memory throttling factor * (limits.memory - requests.memory)) / pageSize] * pageSize
  - Case II: When requests.memory is set and limits.memory is not set, we substitute node allocatable memory for limits.memory in the formula: memory.high = floor[(requests.memory + memory throttling factor * (node allocatable memory - requests.memory)) / pageSize] * pageSize
  - Case III: When requests.memory is not set and limits.memory is set, we set requests.memory = 0 in the formula: memory.high = floor[(memory throttling factor * limits.memory) / pageSize] * pageSize
- BestEffort: The pod gets the BestEffort class if limits.memory and requests.memory are not set. We set requests.memory = 0 and substitute node allocatable memory for limits.memory in the formula: memory.high = floor[(memory throttling factor * node allocatable memory) / pageSize] * pageSize
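Below is the sketch referenced above: a hedged illustration (not the kubelet implementation; the function, the string-valued QoS class, and the convention of returning -1 for "max" are assumptions of this example) of how memory.high could be selected per QoS class:

```go
package main

import "fmt"

// qosMemoryHigh sketches the per-QoS-class selection of memory.high described
// above. A return value of -1 stands for "max" (memory.high not set). Values
// are in bytes.
func qosMemoryHigh(qosClass string, requests, limits, nodeAllocatable, pageSize int64, factor float64) int64 {
	switch qosClass {
	case "Guaranteed":
		// Guaranteed pods are not overcommitted; memory.high is not set.
		return -1
	case "Burstable":
		base := limits
		if base == 0 { // Case II: no limit set, substitute node allocatable memory
			base = nodeAllocatable
		}
		// Case III (no request set) is covered by requests == 0.
		high := int64(float64(requests) + factor*float64(base-requests))
		return (high / pageSize) * pageSize
	default: // BestEffort: neither requests.memory nor limits.memory is set
		high := int64(factor * float64(nodeAllocatable))
		return (high / pageSize) * pageSize
	}
}

func main() {
	const mi = int64(1 << 20)
	fmt.Println(qosMemoryHigh("Guaranteed", 1000*mi, 1000*mi, 8000*mi, 4096, 0.9)) // -1 (max)
	fmt.Println(qosMemoryHigh("Burstable", 500*mi, 1000*mi, 8000*mi, 4096, 0.9))   // 950Mi in bytes, page-aligned
	fmt.Println(qosMemoryHigh("BestEffort", 0, 0, 8000*mi, 4096, 0.9))             // 7200Mi in bytes, page-aligned
}
```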
Alternative solutions that were discussed (but not preferred) before finalizing the implementation for memory.high are:
- Allow customers to set memoryThrottlingFactor for each pod in annotations.
  Proposal: Add a new annotation for customers to set memoryThrottlingFactor to override the kubelet-level memoryThrottlingFactor.
  - Pros
    - Allows more flexibility.
    - Can be quickly implemented.
  - Cons
    - Customers might not need per-pod memoryThrottlingFactor configuration.
    - It is too low-level a detail to expose to customers.
- Allow customers to set memoryThrottlingFactor in the pod YAML.
  Proposal: Add a new field in the API for customers to set memoryThrottlingFactor to override the kubelet-level memoryThrottlingFactor.
  - Pros
    - Allows more flexibility.
  - Cons
    - Customers might not need per-pod memoryThrottlingFactor configuration.
    - API changes take a lot of time, and we might eventually realize that customers don't need a per-pod setting.
    - It is too low-level a detail to expose to customers, and it is highly unlikely to get API approval.
[Preferred Alternative]: Considering the cons of the alternatives mentioned above, adding QoS-class-based support for memory.high looks preferable to the other solutions for the following reasons:
- Memory QoS aligns with the QoS classes, which are a widely known concept.
- It is simple to understand, as it requires setting only one kubelet configuration option, the memory throttling factor.
- It doesn't involve API changes, and doesn't expose low-level detail to customers.
The feature was planned to be graduated to Beta in v1.28, but was backed out. See the Latest Update [Stalled] section for more details.
Some workloads are sensitive to memory allocation and availability; slight delays may cause a service outage. In this case, a mechanism is needed to ensure the quality of memory. We must provide guarantees in both of the following aspects:
- Retain memory requests to reduce allocation latency
- Protect memory requests from being reclaimed
The stability of the node is very important to users. As a key resource of the node, the availability of memory is a key factor for node stability. We should do something to protect the node's reserved memory.
The Memory Manager is a new component of the kubelet ecosystem proposed to enable single-NUMA and multi-NUMA guaranteed memory allocation at the topology level. The Memory QoS proposal mainly uses cgroups v2 to improve the quality of memory requests, thereby improving the memory QoS of Guaranteed and Burstable pods and even the entire node.
See also https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager
In cgroups v2, memory.low is designed for best-effort memory protection; it is more of a "soft guarantee": the cgroup's memory won't be reclaimed unless memory can't be reclaimed from any unprotected cgroups. memory.min is more aggressive: it always retains the specified amount of memory, which can never be reclaimed. When the requirement is not satisfied, the system OOM killer will be invoked.
n/a
The main risk of this proposal is throttling applications too early. We intend to mitigate this by (1) setting a memory.high that is closer to the limit and (2) only throttling when usage > request.
- Kernel enables cgroups v2 unified hierarchy
- CRI runtime supports cgroups v2 Unified Spec for container level
- Kubelet enables --enforce-node-allocatable=<pods, kube-reserved, system-reserved>
Set --feature-gates=MemoryQoS=true to enable the feature.
- If the container sets requests.memory, we set memory.min=pod.spec.containers[i].resources.requests[memory] for the container level cgroup
- If any container in the pod sets requests.memory, we set memory.min=sum(pod.spec.containers[i].resources.requests[memory]) for the pod level cgroup
- If the container sets limits.memory, we set memory.high=pod.spec.containers[i].resources.limits[memory] * memory throttling factor for the container level cgroup if memory.high>memory.min
- If the container doesn't set limits.memory, we set memory.high=node allocatable memory * memory throttling factor for the container level cgroup
- If kubelet sets --cgroups-per-qos=true, we set memory.min=sum(pod[i].spec.containers[j].resources.requests[memory]) to make ancestor cgroups propagation effective
- There are no changes regarding the memory limit, i.e. memory.max=memory_limits (same as the existing cgroup v2 implementation)
- If kubelet sets --enforce-node-allocatable=kube-reserved, --kube-reserved=[a] and --kube-reserved-cgroup=[b], we set memory.min=[a] for node level cgroup [b]
- If kubelet sets --enforce-node-allocatable=system-reserved, --system-reserved=[a] and --system-reserved-cgroup=[b], we set memory.min=[a] for node level cgroup [b]
- If kubelet sets --enforce-node-allocatable=pods, we set memory.min=sum(pod[i].spec.containers[j].resources.requests[memory]) for the kubepods cgroup
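For illustration, the aggregation described in the two lists above can be sketched as follows (hypothetical helper names, not kubelet code): the pod-level memory.min is the sum of the containers' memory requests, and the kubepods-level memory.min under --enforce-node-allocatable=pods is the sum across all pods:

```go
package main

import "fmt"

// podMemoryMin sketches the pod-level cgroup value described above:
// memory.min = sum of the containers' memory requests (in bytes).
func podMemoryMin(containerRequests []int64) int64 {
	var sum int64
	for _, r := range containerRequests {
		sum += r
	}
	return sum
}

// kubepodsMemoryMin sketches the node-level (kubepods cgroup) value when
// --enforce-node-allocatable=pods is set: the sum of memory requests across
// all pods on the node.
func kubepodsMemoryMin(pods [][]int64) int64 {
	var sum int64
	for _, p := range pods {
		sum += podMemoryMin(p)
	}
	return sum
}

func main() {
	const mi = int64(1 << 20)
	podA := []int64{100 * mi, 200 * mi} // two containers with memory requests
	podB := []int64{500 * mi}
	fmt.Println(podMemoryMin(podA))                       // 300Mi in bytes
	fmt.Println(kubepodsMemoryMin([][]int64{podA, podB})) // 800Mi in bytes
}
```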
A new Unified field will be added in both the CRI and the QoS Manager for cgroups v2 extra parameters. It is recommended that it have the same semantics as opencontainers/runtime-spec#1040:
- container level: Unified added in LinuxContainerResources
- pod/node level: Unified added in cm.ResourceConfig
Container/Pod:
// Container
/cgroup2/kubepods/pod<UID>/<container-id>/memory.min=pod.spec.containers[i].resources.requests[memory]
/cgroup2/kubepods/pod<UID>/<container-id>/memory.high=(pod.spec.containers[i].resources.limits[memory]/node allocatable memory)*memory throttling factor // Burstable
// Pod
/cgroup2/kubepods/pod<UID>/memory.min=sum(pod.spec.containers[i].resources.requests[memory])
// QoS ancestor cgroup
/cgroup2/kubepods/burstable/memory.min=sum(pod[i].spec.containers[j].resources.requests[memory])
Node:
/cgroup2/kubepods/memory.min=sum(pod[i].spec.containers[j].resources.requests[memory])
/cgroup2/<kube-reserved-cgroup,system-reserved-cgroup>/memory.min=<kube-reserved,system-reserved>
After Kubernetes v1.19, the kubelet can identify cgroups v2 and do the conversion. Since v1.0.0-rc93, runc supports Unified to pass through cgroups v2 parameters. So we use this field to pass memory.min when cgroups v2 mode is detected.
We need to add a new field Unified in the CRI API, which is basically a passthrough for the OCI spec Unified field and has the same semantics: opencontainers/runtime-spec#1040
type LinuxContainerResources struct {
	...
	Unified map[string]string `json:"unified,omitempty"`
}
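As a hedged sketch of how memory.min and memory.high could then be passed through the Unified map (using a simplified local copy of the type rather than the generated CRI struct; the keys follow the cgroup v2 interface file names per the OCI runtime-spec unified semantics, and the byte values are illustrative):

```go
package main

import (
	"fmt"
	"strconv"
)

// LinuxContainerResources mirrors, in simplified form, the CRI type above;
// only the Unified passthrough field is shown here.
type LinuxContainerResources struct {
	Unified map[string]string `json:"unified,omitempty"`
}

func main() {
	const mi = int64(1 << 20)
	memoryMin, memoryHigh := 500*mi, 900*mi // example request / throttle values in bytes

	// Keys are cgroup v2 interface file names; values are their string contents.
	res := LinuxContainerResources{
		Unified: map[string]string{
			"memory.min":  strconv.FormatInt(memoryMin, 10),
			"memory.high": strconv.FormatInt(memoryHigh, 10),
		},
	}
	fmt.Printf("%+v\n", res)
}
```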
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Overall Test plan:
For Alpha, unit tests were added to test functionality for container/pod/node level cgroups with containerd and CRI-O. For the second alpha iteration (1.27), we plan to add new node e2e tests to validate that the MemoryQoS settings are applied correctly.
- pkg/kubelet/cm: 02/09/2023 - 65.6%
n/a: plan to use node e2e tests (see below)
As part of alpha, we plan to add a new node e2e test to validate that the MemoryQoS settings are correctly set on both pods as well as node allocatable. The test will reside in test/e2e_node.
- cgroup_v2 is in Alpha
- Memory QoS is implemented for new feature gate
- Memory QoS is covered by proper tests
- Memory QoS supports containerd, cri-o
- cgroup_v2 is in Beta
- Metrics and graphs to show the amount of reclaim done on a cgroup as it moves from below-request to above-request to throttling
- Memory QoS is covered by unit and e2e-node tests
- Memory QoS supports containerd, cri-o and dockershim
- Expose memory events e.g. memory.high field of memory.events which can inform how many times memory.high was breached and the cgroup was throttled. https://docs.kernel.org/admin-guide/cgroup-v2.html
- cgroup_v2 is in GA
- Memory QoS has been in beta for at least 2 releases
- Memory QoS sees use in 3 projects or articles
- Memory QoS is covered by conformance tests
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: MemoryQoS
  - Components depending on the feature gate: kubelet
Yes, the kubelet will set memory.min for Guaranteed and Burstable pod/container level cgroups. It will also set memory.high for Burstable and BestEffort containers, which may cause memory allocation to be slowed down if the memory usage in a container reaches the memory.high level. memory.min for the QoS or node level cgroup will be set when --cgroups-per-qos or --enforce-node-allocatable is satisfied.
Yes, the related cgroups can be rolled back; memory.min/memory.high will be reset to their default values.
The kubelet will reconcile memory.min/memory.high
with related cgroups.
Yes, some unit tests are exercised with the feature both enabled and disabled to verify proper behavior in both cases. When enabled, we test whether memory.min/memory.high are set to the proper values for workloads and node cgroups. When a transition from enabled to disabled happens, we verify that memory.min/memory.high are reset to their default values.
N/A. There's no API change involved. MemoryQoS is a kubelet-level flag that will be enabled by default in Beta. It doesn't require any special opt-in by the user in their PodSpec.
The kubelet will reconcile memory.min/memory.high
with related cgroups depending on whether the feature gate is enabled or not separately for each node.
Already running workloads will not have memory.min/memory.high set at the pod level. Only memory.min will be set at the node level cgroup when the kubelet restarts. The existing workloads will be impacted only when the kernel isn't able to maintain at least the memory.min level of memory for the non-guaranteed workloads within the node level cgroup.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
An operator could run ls /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<SOME_ID>.slice on a node with cgroup v2 enabled to confirm the presence of the memory.min file, which tells us that the feature is in use by the workloads.
- [ ] Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: Kernel memory events will be available in kubelet logs via cadvisor. These events will inform about the number of times memory.min and memory.high levels were breached.
N/A. Same as when running without this feature.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details: Not a service
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
The container runtime must also support cgroup v2
No new API calls will be generated.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No, resources like PIDs, sockets, inodes will not be affected. However, additional memory throttling can be experienced which is intended by this feature.
- 2020/03/14: initial proposal
- 2020/05/05: target Alpha to v1.22
- 2023/03/03: target Alpha v2 to v1.27
- 2023/06/14: target Beta to v1.28
The main drawbacks are concerns about unintended memory throttling and additional complexity due to the utilization of several new cgroup v2 based memory controls (e.g. memory.low, memory.high). However, we believe that the impact of unintended throttling will be minimized by a high throttling factor (see above), and the additional complexity is justified by the additional resource management benefits.
Please refer to alternatives mentioned above in the proposal section, which discusses the alternatives and changes from the original alpha design to the newly updated alpha design.
n/a; no new infrastructure is needed. This KEP aims to reuse the existing node e2e jobs and framework.