
kubelet/cm: fix bug where kubelet restarts from missing cpuset cgroup #125923

Merged (4 commits) on Oct 11, 2024

Conversation

@haircommander (Contributor) commented Jul 5, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

With the None cpumanager policy, cgroup v2, and the systemd cgroup manager, the kubelet could get into a situation where it believes the cpuset cgroup was created (by libcontainer through the cgroupfs) but systemd has deleted it, since systemd was never asked to create it. This causes one unnecessary restart, because the kubelet fails with

failed to initialize top level QOS containers: root container [kubepods] doesn't exist.

It causes only one restart because the kubelet skips recreating the cgroup if it already exists, but it is still a nuisance, and this PR fixes it.
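To make the failure mode concrete, here is a rough standalone sketch (not the kubelet's actual validation code) of the condition the kubelet trips over: under cgroup v2 the kubepods cgroup directory can exist while the cpuset controller is not enabled for it, because systemd never knew the controller was wanted. The kubepods.slice path is an assumption for a typical systemd-driver setup.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Assumed path for the kubepods cgroup with the systemd cgroup driver.
	const kubepods = "/sys/fs/cgroup/kubepods.slice"

	// cgroup.controllers lists the controllers enabled for this cgroup.
	data, err := os.ReadFile(kubepods + "/cgroup.controllers")
	if err != nil {
		fmt.Println("kubepods cgroup is missing entirely:", err)
		return
	}
	if !strings.Contains(string(data), "cpuset") {
		// This is the state described above: the directory exists, but the
		// cpuset controller was dropped, so the kubelet concludes the cgroup
		// is not set up as expected and restarts.
		fmt.Println("cpuset controller not enabled for kubepods")
		return
	}
	fmt.Println("cpuset controller enabled for kubepods")
}
```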

Which issue(s) this PR fixes:

Fixes #122955

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug where the kubelet transiently fails with `failed to initialize top level QOS containers: root container [kubepods] doesn't exist`, because the cpuset cgroup is deleted on cgroup v2 with the systemd cgroup manager.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 5, 2024
@k8s-ci-robot k8s-ci-robot requested review from matthyx and mtaufen July 5, 2024 17:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 5, 2024
// By default, systemd will not create it, as we've not chosen to delegate it, and we haven't included it in the Apply() request.
// However, this causes a bug where kubelet restarts unnecessarily (cpuset cgroup is created in the cgroupfs, but systemd
// doesn't know about it and deletes it, and then kubelet doesn't continue because the cgroup isn't configured as expected).
// An alternative is to delegate the `cpuset` cgroup to the kubelet, but that would require some plumbing in libcontainer,
Contributor

Just curious, what plumbing do you refer to here? IOW what's missing?

Contributor Author

We currently use libcontainer for all of our cgroup management, and there's no way to set a systemd property through the libcontainer manager. We could use a godbus instance ourselves, but it would take some setup and copied code.
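For context, a rough and untested sketch of that godbus idea, assuming the coreos go-systemd D-Bus bindings. The kubepods.slice unit name and the cpusToBitmask helper are illustrative assumptions, and the AllowedCPUs value is assumed to be the byte bitmask systemd expects for cgroup v2 (bit n set means CPU n is allowed); this is not kubelet code.

```go
package main

import (
	"context"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// cpusToBitmask encodes CPU ids as a byte bitmask (bit n of the mask
// corresponds to CPU n). Illustrative helper, not a library function.
func cpusToBitmask(cpus []int) []byte {
	highest := 0
	for _, c := range cpus {
		if c > highest {
			highest = c
		}
	}
	mask := make([]byte, highest/8+1)
	for _, c := range cpus {
		mask[c/8] |= 1 << (uint(c) % 8)
	}
	return mask
}

func main() {
	ctx := context.Background()
	conn, err := systemddbus.NewWithContext(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Pin CPUs 0-3 on the kubepods slice as a runtime-only property change.
	props := []systemddbus.Property{
		{Name: "AllowedCPUs", Value: godbus.MakeVariant(cpusToBitmask([]int{0, 1, 2, 3}))},
	}
	if err := conn.SetUnitPropertiesContext(ctx, "kubepods.slice", true, props...); err != nil {
		panic(err)
	}
}
```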

pkg/kubelet/cm/types.go (outdated, resolved)
Comment on lines 208 to 218
machineInfo, err := cm.cadvisorInterface.MachineInfo()
if err != nil {
klog.V(4).InfoS("Failed to get machine info to get default cpuset", "error", err)
return cpuset.CPUSet{}
}
topo, err := topology.Discover(machineInfo)
if err != nil {
klog.V(4).InfoS("Failed to get topology info to get default cpuset", "error", err)
return cpuset.CPUSet{}
}
return topo.CPUDetails.CPUs()
Contributor

A weird idea: what if we just take the contents of /sys/fs/cgroup/cpuset.cpus.effective?

Contributor Author

That's an option. I opted for this because it matches what the cpumanager does to initialize the full set of CPUs, and I figured it may be more consistent for the kubelet to gather the CPU list in one way. I'm open to it, though.
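For comparison, a minimal sketch of the cpuset.cpus.effective alternative being discussed (not what this PR does), assuming the default cgroup v2 mount point and the k8s.io/utils/cpuset parser:

```go
package main

import (
	"fmt"
	"os"
	"strings"

	"k8s.io/utils/cpuset"
)

// allCPUsFromCgroupfs reads the root cgroup's effective cpuset instead of
// running cadvisor-based topology discovery.
func allCPUsFromCgroupfs() (cpuset.CPUSet, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective")
	if err != nil {
		return cpuset.CPUSet{}, err
	}
	return cpuset.Parse(strings.TrimSpace(string(data)))
}

func main() {
	cpus, err := allCPUsFromCgroupfs()
	if err != nil {
		fmt.Println("failed to read effective cpuset:", err)
		return
	}
	fmt.Println("all CPUs:", cpus.String())
}
```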

@ffromani (Contributor) commented Jul 6, 2024

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 6, 2024
return &rc
}

func (cm *containerManagerImpl) getAllCPUs() cpuset.CPUSet {
Contributor

Just thinking aloud: are we in the container manager flow before the cpumanager is initialized?
I'd like to explore the option of putting this logic inside cpumanager, so we can avoid peeking at its options from the outside and duplicating the topology discovery logic.
Note: I'm NOT suggesting pivoting to this approach in this PR, just exploring the option (myself) to see how it looks.

Contributor

OK, after a quick check: we do have a cpumanager instance in containerManagerImpl by the time we reach this code, so moving the functionality inside cpumanager and removing quite a bit of duplication is at least possible.

Contributor

sketch (utterly untested, for demo purposes only):

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 8b5049d7d74..e7fb1cdb8aa 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -93,6 +93,10 @@ type Manager interface {
 	// GetCPUAffinity returns cpuset which includes cpus from shared pools
 	// as well as exclusively allocated cpus
 	GetCPUAffinity(podUID, containerName string) cpuset.CPUSet
+
+	// GetAllCPUs returns all the CPUs known by cpumanager, as reported by the
+	// hardware discovery. Maps to the CPU capacity.
+	GetAllCPUs() cpuset.CPUSet
 }
 
 type manager struct {
@@ -136,7 +140,11 @@ type manager struct {
 	// stateFileDirectory holds the directory where the state file for checkpoints is held.
 	stateFileDirectory string
 
-	// allocatableCPUs is the set of online CPUs as reported by the system
+	// allCPUs is the set of online CPUs as reported by the system
+	allCPUs cpuset.CPUSet
+
+	// allocatableCPUs is the set of online CPUs as reported by the system,
+	// and available for allocation, minus the reserved set
 	allocatableCPUs cpuset.CPUSet
 
 	// pendingAdmissionPod contain the pod during the admission phase
@@ -156,6 +164,11 @@ func NewManager(cpuPolicyName string, cpuPolicyOptions map[string]string, reconc
 	var policy Policy
 	var err error
 
+	topo, err = topology.Discover(machineInfo)
+	if err != nil {
+		return nil, err
+	}
+
 	switch policyName(cpuPolicyName) {
 
 	case PolicyNone:
@@ -165,10 +178,6 @@ func NewManager(cpuPolicyName string, cpuPolicyOptions map[string]string, reconc
 		}
 
 	case PolicyStatic:
-		topo, err = topology.Discover(machineInfo)
-		if err != nil {
-			return nil, err
-		}
 		klog.InfoS("Detected CPU topology", "topology", topo)
 
 		reservedCPUs, ok := nodeAllocatableReservation[v1.ResourceCPU]
@@ -205,6 +214,7 @@ func NewManager(cpuPolicyName string, cpuPolicyOptions map[string]string, reconc
 		topology:                   topo,
 		nodeAllocatableReservation: nodeAllocatableReservation,
 		stateFileDirectory:         stateFileDirectory,
+		allCPUs:                    topo.CPUDetails.CPUs(),
 	}
 	manager.sourcesReady = &sourcesReadyStub{}
 	return manager, nil
@@ -339,6 +349,10 @@ func (m *manager) GetAllocatableCPUs() cpuset.CPUSet {
 	return m.allocatableCPUs.Clone()
 }
 
+func (m *manager) GetAllCPUs() cpuset.CPUSet {
+	return m.allCPUs.Clone()
+}
+
 type reconciledContainer struct {
 	podName       string
 	containerName string
diff --git a/pkg/kubelet/cm/node_container_manager_linux.go b/pkg/kubelet/cm/node_container_manager_linux.go
index 9c9c91bc6f2..42d5f8939e3 100644
--- a/pkg/kubelet/cm/node_container_manager_linux.go
+++ b/pkg/kubelet/cm/node_container_manager_linux.go
@@ -31,12 +31,9 @@ import (
 	utilfeature "k8s.io/apiserver/pkg/util/feature"
 	"k8s.io/klog/v2"
 	kubefeatures "k8s.io/kubernetes/pkg/features"
-	"k8s.io/kubernetes/pkg/kubelet/cm/cpumanager"
-	"k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/topology"
 	"k8s.io/kubernetes/pkg/kubelet/events"
 	"k8s.io/kubernetes/pkg/kubelet/stats/pidlimit"
 	kubetypes "k8s.io/kubernetes/pkg/kubelet/types"
-	"k8s.io/utils/cpuset"
 )
 
 const (
@@ -194,9 +191,10 @@ func (cm *containerManagerImpl) getCgroupConfig(rl v1.ResourceList) *ResourceCon
 	// An alternative is to delegate the `cpuset` cgroup to the kubelet, but that would require some plumbing in libcontainer,
 	// and this is sufficient.
 	// Only do so on None policy, as Static policy will do its own updating of the cpuset.
-	if cm.NodeConfig.CPUManagerPolicy == string(cpumanager.PolicyNone) {
+	// Please see the comment on policy none's GetAllocatableCPUs
+	if cm.cpuManager.GetAllocatableCPUs().IsEmpty() {
 		if cm.allCPUs.IsEmpty() {
-			cm.allCPUs = cm.getAllCPUs()
+			cm.allCPUs = cm.cpuManager.GetAllCPUs()
 		}
 		rc.CPUSet = cm.allCPUs
 	}
@@ -204,20 +202,6 @@ func (cm *containerManagerImpl) getCgroupConfig(rl v1.ResourceList) *ResourceCon
 	return &rc
 }
 
-func (cm *containerManagerImpl) getAllCPUs() cpuset.CPUSet {
-	machineInfo, err := cm.cadvisorInterface.MachineInfo()
-	if err != nil {
-		klog.V(4).InfoS("Failed to get machine info to get default cpuset", "error", err)
-		return cpuset.CPUSet{}
-	}
-	topo, err := topology.Discover(machineInfo)
-	if err != nil {
-		klog.V(4).InfoS("Failed to get topology info to get default cpuset", "error", err)
-		return cpuset.CPUSet{}
-	}
-	return topo.CPUDetails.CPUs()
-}
-
 // GetNodeAllocatableAbsolute returns the absolute value of Node Allocatable which is primarily useful for enforcement.
 // Note that not all resources that are available on the node are included in the returned list of resources.
 // Returns a ResourceList.

Contributor Author

Yeah, this is better. I've pushed an adapted version, PTAL.

)

// ResourceConfig holds information about all the supported cgroup resource parameters.
type ResourceConfig struct {
// Memory limit (in bytes).
Memory *int64
// CPU set (number of cpus the cgroup has access to).
Member

Nit.

Suggested change
// CPU set (number of cpus the cgroup has access to).
// CPU set (number of CPUs the cgroup has access to).

For consistency.

Contributor Author

fixed!

@SergeyKanzhelev (Member)

@ffromani lgtm?

@ffromani (Contributor) commented Oct 2, 2024

/lgtm

Sorry folks, this fell through the cracks. I think it's as good as we can make it; I'm happy with the change as-is now.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2024
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 1d20a47287ac4be17ac0d84648727b4a398e2501

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 4, 2024
Comment on lines 100 to 110
// Update the Kubelet configuration.
ginkgo.By("Stopping the kubelet")
startKubelet := stopKubelet()

// wait until the kubelet health check will fail
gomega.Eventually(ctx, func() bool {
return kubeletHealthCheck(kubeletHealthCheckURL)
}).WithTimeout(time.Minute).WithPolling(time.Second).Should(gomega.BeFalseBecause("expected kubelet health check to be failed"))
ginkgo.By("Stopped the kubelet")

framework.ExpectNoError(e2enodekubelet.WriteKubeletConfigFile(oldCfg))
Member

Is this the right ordering? Should we write the file first and then restart? Presumably that would be faster.

Contributor Author

Why would it be faster? If we're waiting sequentially on both, it should take the same amount of time AFAICT.

Member

I mean you wouldn't need to do a separate stop and start, and could just do restartKubelet.

Member

We likely don't run it as a systemd daemon now, but if we do, it will also be more reliable =). But this is hypothetical at this stage.

Contributor Author

Ah, I see. Done.
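For reference, a sketch of the reordered flow, reusing the helpers from the snippet above. restartKubelet stands in for the suite's restart helper, and its exact signature here is an assumption, not the merged test code.

```go
// Write the new kubelet configuration first, then do a single restart.
ginkgo.By("Writing the kubelet configuration")
framework.ExpectNoError(e2enodekubelet.WriteKubeletConfigFile(oldCfg))

ginkgo.By("Restarting the kubelet")
restartKubelet(ctx)

// Wait until the kubelet health check passes again.
gomega.Eventually(ctx, func() bool {
	return kubeletHealthCheck(kubeletHealthCheckURL)
}).WithTimeout(time.Minute).WithPolling(time.Second).Should(gomega.BeTrueBecause("expected kubelet to become healthy after restart"))
```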

With the None cpumanager policy, cgroup v2, and the systemd cgroup manager,
the kubelet could get into a situation where it believes the cpuset cgroup was
created (by libcontainer through the cgroupfs) but systemd has deleted it,
since systemd was never asked to create it. This causes one unnecessary
restart, because the kubelet fails with

`failed to initialize top level QOS containers: root container [kubepods] doesn't exist.`

It causes only one restart because the kubelet skips recreating the cgroup
if it already exists, but it is still a nuisance, and this commit fixes it.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 11, 2024
haircommander and others added 2 commits October 11, 2024 11:29
Authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
with systemd cgroup driver and cpumanager none policy.

This was originally planned to be a correctness check for
https://issues.k8s.io/125923, but it was difficult to reproduce the bug,
so it's now a regression test against it.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
@SergeyKanzhelev (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2024
@k8s-ci-robot

LGTM label has been added.

Git tree hash: 4698e145d83ff71f2ca21776af39783c7ff98dcc

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kwilczynski, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 11, 2024
@k8s-ci-robot

k8s-ci-robot commented Oct 11, 2024

@haircommander: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-node-kubelet-serial-crio-cgroupv1
Commit: 90a70c8
Required: false
Rerun command: /test pull-kubernetes-node-kubelet-serial-crio-cgroupv1

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@SergeyKanzhelev (Member)

/retest

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
area/kubelet
area/test
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/bug (Categorizes issue or PR as related to a bug.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
sig/node (Categorizes an issue or PR as relevant to SIG Node.)
sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Development

Successfully merging this pull request may close these issues.

Kubelet - failed to initialize top level QOS containers: root container [kubepods] doesn't exist
8 participants