- Summary
- Motivation
- Goals
- Non-Goals
- User Stories
- Implementation Details
- Design
- Production Readiness Review Questionnaire
- Proposal
- Current cgroups usage and the equivalent in cgroups v2
- Risk and Mitigations
A proposal to add support for cgroups v2 to Kubernetes.
The new kernel cgroups v2 API was declared stable more than two years ago. Newer kernel features such as PSI depend on cgroups v2, and cgroups v1 will eventually become obsolete in its favor. Some distros already use cgroups v2 by default, which prevents Kubernetes from running on them, since it currently requires cgroups v1.
This proposal aims to:
- Add support for cgroups v2 to the Kubelet
- Expose new cgroup2-only features
- Dockershim
- Plugins support
- The Kubelet can run on a host using either cgroups v1 or v2.
Have feature parity between cgroup v2 and v1.
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
The main unit test coverage is in the cgroup manager Kubelet package under pkg/kubelet/cm.
The Kubelet uses the existing libcontainer library to manage cgroups, so we will primarily be targeting integration testing to verify the feature is working as intended. Please see the e2e tests section below.
We would like to ensure that the Kubelet node e2e tests run on OS images of both variants (cgroupv1 and cgroupv2). We currently have the following cgroupv2 jobs:
- https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-e2e / https://storage.googleapis.com/k8s-triage/index.html?job=sig-node-containerd%23cos-cgroupv2-containerd-e2e
- https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-e2e / https://storage.googleapis.com/k8s-triage/index.html?job=cos-cgroupv2-containerd-node-e2e
- https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-features / https://storage.googleapis.com/k8s-triage/index.html?job=cos-cgroupv2-containerd-node-features
- https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-e2e-serial / https://storage.googleapis.com/k8s-triage/index.html?job=cos-cgroupv2-containerd-node-e2e-serial
- https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv2-node-e2e-conformance / https://storage.googleapis.com/k8s-triage/index.html?job=ci-crio-cgroupv2-node-e2e-conformance
As part of graduation to GA, we plan to rename the cgroupv1-based jobs to include cgroupv1 in their names, and to ensure the node e2e test grid has clearly labelled cgroupv1 and cgroupv2 jobs for the SIG Node test categories (e2e, node-e2e, node-features, and node-e2e-serial).
- Alpha: Phase 1 completed and basic support for running Kubernetes on a cgroups v2 host; e2e test coverage or a plan for the failing tests. A good candidate for running cgroup v2 tests is Fedora 31, which has already switched to cgroup v2 by default.
- Beta: e2e test coverage and performance testing. Verify that both the CPU and Memory Manager work.
- GA: Assuming no negative user feedback based on production experience, promote after 2 releases in beta.
N/A. Not relevant to upgrades. If the host is running with cgroup v2 then it will be automatically detected and used.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism: configure the hosts to use cgroup v2
  - Will enabling / disabling the feature require downtime of the control plane? No, each host can be restarted to cgroup v2 separately.
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume the Dynamic Kubelet Config feature is enabled.) It requires downtime of a node, since it needs to be rebooted.
N/A. It must work in the same way as on cgroup v1.
Yes, it is enough to restart the node with cgroup v1.
It should work seamlessly, without any difference.
The same e2e tests that work on cgroup v1 should work on cgroup v2.
N/A. Each node can be configured separately.
N/A. It requires a reboot to be enabled. If the workload accesses the cgroup file system directly, then the workload itself must also support cgroup v2.
Pods not being healthy. One could inspect whether the pods' cgroups are set correctly by referencing the conversion table in this KEP.
N/A. It depends on the node configuration and it is stateless.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
The cgroup file system inside of the containers will use cgroup v2 instead of cgroup v1.
An operator could run cat /proc/self/cgroup on a node to check whether it is running in cgroups v2 mode; on a cgroups v2 host the file typically contains a single entry of the form 0::/<path>.
If the node is using cgroup v2, then the pods running on that node are also using it.
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details: pods are healthy.
N/A. Same as when running on cgroup v1.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details: not a service
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
The container runtime must also support cgroup v2
No
No
No
No
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
N/A
N/A
If SLOs are not being met, reboot the node in cgroup v1 to disable this feature.
The proposal is to implement cgroups v2 in two different phases.
The first phase ensures that any configuration file designed for cgroups v1 will continue to work on cgroups v2.
The second phase requires changes through the entire stack, including the OCI runtime specifications.
At startup, the Kubelet detects which hierarchy the system is using. It checks the file system type of /sys/fs/cgroup (the equivalent of stat -f --format '%T' /sys/fs/cgroup). If the type is cgroup2fs, then the Kubelet will use only cgroups v2 during all its execution.
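For illustration, here is a minimal Go sketch of such a check, assuming the golang.org/x/sys/unix package (the Kubelet's real implementation may differ, e.g. by relying on libcontainer helpers):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isCgroup2UnifiedMode reports whether /sys/fs/cgroup is mounted as
// cgroup2fs, i.e. the host runs in pure cgroups v2 (unified) mode.
func isCgroup2UnifiedMode() (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false, err
	}
	// CGROUP2_SUPER_MAGIC is the cgroup2fs filesystem magic number,
	// the same value reported by `stat -f --format '%T' /sys/fs/cgroup`.
	return st.Type == unix.CGROUP2_SUPER_MAGIC, nil
}

func main() {
	unified, err := isCgroup2UnifiedMode()
	if err != nil {
		panic(err)
	}
	fmt.Println("cgroups v2 unified mode:", unified)
}
```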
The current proposal doesn't aim at deprecating cgroup v1, which must still be supported throughout the stack.
Device plugins that require v2 enablement are out of the scope for this proposal.
In order to support features available only in cgroups v2, the OCI runtime specs must be changed.
New features that are not present in cgroup v1 are out of the scope for this proposal.
The dockershim implementation embedded in the Kubelet won't be supported on cgroup v2.
- CRI-O+crun: support cgroups v2
- runc: since v1.0.0-rc91 experimentally, ready for production in v1.0.0-rc93
- containerd: support cgroup v2 since v1.4.0
- Moby: moby/moby#40174
- OCI runtime spec: support cgroup v2 parameters
- cAdvisor already supports cgroups v2 (google/cadvisor#2309)
| Kubernetes cgroups v1 | Kubernetes cgroups v2 behavior |
| --- | --- |
| CPU stats for Horizontal Pod Autoscaling | No .percpu cpuacct stats. |
| CPU pinning based on integral cores | Cpuset controller available |
| Memory limits | Not changed, different naming |
| PIDs limits | Not changed, same naming |
| hugetlb | Added to linux-next, targeting Linux 5.6 |
A cgroup namespace restricts the view of the cgroups. When unshare(CLONE_NEWCGROUP) is done, the cgroup the process currently resides in becomes the root, and other cgroups won't be visible from the new namespace. It was not enabled by default on cgroup v1 systems, as older kernels lacked support for it.
Privileged pods will still use the host cgroup namespace, so as to have visibility of all the other cgroups.
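For illustration only, a minimal Go sketch (assuming a Linux host, the golang.org/x/sys/unix package, and CAP_SYS_ADMIN) showing how entering a new cgroup namespace changes what /proc/self/cgroup reports:

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Before unsharing, /proc/self/cgroup shows the cgroup path of this
	// process as seen from the host's cgroup namespace.
	before, _ := os.ReadFile("/proc/self/cgroup")
	fmt.Printf("before unshare: %s", before)

	// Create a new cgroup namespace: the cgroup this process currently
	// resides in becomes the root of the new namespace.
	if err := unix.Unshare(unix.CLONE_NEWCGROUP); err != nil {
		fmt.Fprintln(os.Stderr, "unshare failed (needs CAP_SYS_ADMIN):", err)
		os.Exit(1)
	}

	// After unsharing, the same file reports the path relative to the new
	// namespace root, and sibling cgroups are no longer visible.
	after, _ := os.ReadFile("/proc/self/cgroup")
	fmt.Printf("after unshare:  %s", after)
}
```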
We can convert the values passed by Kubernetes from cgroups v1 to cgroups v2, so Kubernetes users don’t have to change what they specify in their manifests.
crun has implemented the conversion as follows:
Memory controller
| OCI (x) | cgroup 2 value (y) | conversion | comment |
| --- | --- | --- | --- |
| limit | memory.max | y = x | |
| swap | memory.swap.max | y = x | |
| reservation | memory.low | y = x | |
PIDs controller
| OCI (x) | cgroup 2 value (y) | conversion | comment |
| --- | --- | --- | --- |
| limit | pids.max | y = x | |
CPU controller
| OCI (x) | cgroup 2 value (y) | conversion | comment |
| --- | --- | --- | --- |
| shares | cpu.weight | y = (1 + ((x - 2) * 9999) / 262142) | convert from [2-262144] to [1-10000] |
| period | cpu.max | y = x | period and quota are written together |
| quota | cpu.max | y = x | period and quota are written together |
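For illustration, a small Go sketch of this CPU conversion (the function names are hypothetical and not taken from crun or runc; they simply encode the formulas in the table above):

```go
package main

import "fmt"

// cpuSharesToWeight maps a cgroup v1 cpu.shares value in [2, 262144]
// to a cgroup v2 cpu.weight value in [1, 10000].
func cpuSharesToWeight(shares uint64) uint64 {
	return 1 + ((shares-2)*9999)/262142
}

// cpuMax renders the cgroup v2 cpu.max value; quota and period are
// written together, with "max" meaning no quota.
func cpuMax(quota int64, period uint64) string {
	if quota <= 0 {
		return fmt.Sprintf("max %d", period)
	}
	return fmt.Sprintf("%d %d", quota, period)
}

func main() {
	fmt.Println(cpuSharesToWeight(1024)) // default v1 shares -> weight 39
	fmt.Println(cpuMax(50000, 100000))   // half a CPU -> "50000 100000"
}
```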
blkio controller
| OCI (x) | cgroup 2 value (y) | conversion | comment |
| --- | --- | --- | --- |
| weight | io.bfq.weight | y = (1 + (x - 10) * 9999 / 990) | convert linearly from [10-1000] to [1-10000] |
| weight_device | io.bfq.weight | y = (1 + (x - 10) * 9999 / 990) | convert linearly from [10-1000] to [1-10000] |
| rbps | io.max | y = x | |
| wbps | io.max | y = x | |
| riops | io.max | y = x | |
| wiops | io.max | y = x | |
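Similarly, a hypothetical Go helper for the weight rows above (a linear mapping from the v1 range [10, 1000] to the v2 range [1, 10000]):

```go
package main

import "fmt"

// blkioWeightToIOWeight maps a cgroup v1 blkio weight in [10, 1000]
// to a cgroup v2 io.bfq.weight value in [1, 10000].
func blkioWeightToIOWeight(weight uint64) uint64 {
	return 1 + ((weight-10)*9999)/990
}

func main() {
	fmt.Println(blkioWeightToIOWeight(500)) // v1 default weight -> 4950
}
```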
cpuset controller
| OCI (x) | cgroup 2 value (y) | conversion | comment |
| --- | --- | --- | --- |
| cpus | cpuset.cpus | y = x | |
| mems | cpuset.mems | y = x | |
hugetlb controller
| OCI (x) | cgroup 2 value (y) | conversion | comment |
| --- | --- | --- | --- |
| <PAGE_SIZE>.limit_in_bytes | hugetlb.<PAGE_SIZE>.max | y = x | |
With this approach cAdvisor would have to read back values from cgroups v2 files (already done).
Kubelet PR: kubernetes/kubernetes#85218
This option means that the values are written directly to cgroups v2 by the runtime. The Kubelet doesn’t do any conversion when setting these values over the CRI. We will need to add a cgroups v2 specific LinuxContainerResources to the CRI.
This depends upon container runtimes like runc and crun being able to write cgroups v2 values directly.
OCI will need support for cgroups v2 and CRI implementations will write to the cgroups v2 section of the new OCI runtime config.json.
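Purely as a hypothetical sketch (this is not the actual CRI API), such a cgroups v2 specific resources message could expose the native v2 fields directly; expressed here as a Go struct for illustration:

```go
// LinuxContainerResourcesV2 is a hypothetical illustration of what a
// cgroups v2 specific CRI resources message might contain. Field names
// mirror the native cgroup v2 interface files instead of the v1 ones.
type LinuxContainerResourcesV2 struct {
	CPUWeight     uint64 // cpu.weight, [1, 10000]
	CPUMax        string // cpu.max, "<quota> <period>" or "max <period>"
	MemoryMax     int64  // memory.max, in bytes
	MemorySwapMax int64  // memory.swap.max, in bytes
	MemoryLow     int64  // memory.low, in bytes
	PidsMax       int64  // pids.max
	CpusetCpus    string // cpuset.cpus
	CpusetMems    string // cpuset.mems
}
```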
Some cgroups v1 features are not available with cgroups v2:
- cpuacct.usage_percpu
- network stats from cgroup
Some cgroups v1 controllers, such as device, net_cls, and net_prio, are not available with the new version. The alternative to these controllers is to use eBPF.