kubelet/cm: fix bug where kubelet restarts from missing cpuset cgroup #125923
Conversation
// By default, systemd will not create it, as we've not chosen to delegate it, and we haven't included it in the Apply() request.
// However, this causes a bug where kubelet restarts unnecessarily (cpuset cgroup is created in the cgroupfs, but systemd
// doesn't know about it and deletes it, and then kubelet doesn't continue because the cgroup isn't configured as expected).
// An alternative is to delegate the `cpuset` cgroup to the kubelet, but that would require some plumbing in libcontainer,
Just curious, what plumbing do you refer to here? IOW what's missing?
We currently use libcontainer for all of our cgroup management, and there's no way to set a systemd property through the libcontainer manager. We could just use a godbus instance ourselves, but it'd take some setup and copied code.
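(For context, a hedged sketch of what that plumbing could look like if the kubelet talked to systemd over D-Bus itself instead of going through the libcontainer manager. The unit name and the CPUAccounting property are placeholders for illustration; actually delegating cpuset would involve a different property and value encoding.)

package main

import (
	"context"
	"fmt"

	systemddbus "github.com/coreos/go-systemd/v22/dbus"
	godbus "github.com/godbus/dbus/v5"
)

// setUnitProperty shows the general mechanism: open a connection to systemd's
// D-Bus API and ask it to set a property on a unit.
func setUnitProperty(ctx context.Context, unitName string) error {
	conn, err := systemddbus.NewWithContext(ctx)
	if err != nil {
		return fmt.Errorf("connecting to systemd: %w", err)
	}
	defer conn.Close()

	// CPUAccounting is only an example property; passing true applies the
	// change at runtime rather than persisting it to unit files.
	prop := systemddbus.Property{Name: "CPUAccounting", Value: godbus.MakeVariant(true)}
	return conn.SetUnitPropertiesContext(ctx, unitName, true, prop)
}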
machineInfo, err := cm.cadvisorInterface.MachineInfo()
if err != nil {
	klog.V(4).InfoS("Failed to get machine info to get default cpuset", "error", err)
	return cpuset.CPUSet{}
}
topo, err := topology.Discover(machineInfo)
if err != nil {
	klog.V(4).InfoS("Failed to get topology info to get default cpuset", "error", err)
	return cpuset.CPUSet{}
}
return topo.CPUDetails.CPUs()
A weird idea: what if we just take the contents of /sys/fs/cgroup/cpuset.cpus.effective?
That's an option. I opted for this because it matches what the cpumanager does to initialize the full set of CPUs, and I figure it may be more consistent to have the kubelet gather the CPU list one way. I'm open, though.
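(A minimal sketch of the cgroupfs-based alternative suggested above, assuming the standard cgroup v2 unified mount at /sys/fs/cgroup; the PR keeps the topology.Discover path for consistency with the cpumanager.)

package main

import (
	"os"
	"strings"

	"k8s.io/utils/cpuset"
)

// cpusFromCgroupfs derives the full CPU set from the root of the unified
// cgroup hierarchy instead of from cadvisor machine info plus topology discovery.
func cpusFromCgroupfs() (cpuset.CPUSet, error) {
	// cpuset.cpus.effective lists every CPU the kernel lets cgroups use, e.g. "0-63".
	data, err := os.ReadFile("/sys/fs/cgroup/cpuset.cpus.effective")
	if err != nil {
		return cpuset.CPUSet{}, err
	}
	return cpuset.Parse(strings.TrimSpace(string(data)))
}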
Force-pushed from 60550dc to cfe9fe6.
/triage accepted
	return &rc
}

func (cm *containerManagerImpl) getAllCPUs() cpuset.CPUSet {
Just thinking aloud: are we in the container manager flow before the cpumanager is initialized?
I'd like to explore the option of putting this logic inside the cpumanager, so we can avoid peeking at its options from the outside and duplicating the topology discovery logic.
Note: I'm NOT suggesting we pivot to this approach in this PR, just exploring (for myself) how the option would look.
OK, after a quick check: we do have a cpumanager instance in the containerManagerImpl when we reach this code, so moving the functionality inside cpumanager and removing quite a bit of duplication is at least possible.
sketch (utterly untested, for demo purposes only):
diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 8b5049d7d74..e7fb1cdb8aa 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -93,6 +93,10 @@ type Manager interface {
// GetCPUAffinity returns cpuset which includes cpus from shared pools
// as well as exclusively allocated cpus
GetCPUAffinity(podUID, containerName string) cpuset.CPUSet
+
+ // GetAllCPUs returns all the CPUs known by cpumanager, as reported by the
+ // hardware discovery. Maps to the CPU capacity.
+ GetAllCPUs() cpuset.CPUSet
}
type manager struct {
@@ -136,7 +140,11 @@ type manager struct {
// stateFileDirectory holds the directory where the state file for checkpoints is held.
stateFileDirectory string
- // allocatableCPUs is the set of online CPUs as reported by the system
+ // allCPUs is the set of online CPUs as reported by the system
+ allCPUs cpuset.CPUSet
+
+ // allocatableCPUs is the set of online CPUs as reported by the system,
+ // and available for allocation, minus the reserved set
allocatableCPUs cpuset.CPUSet
// pendingAdmissionPod contain the pod during the admission phase
@@ -156,6 +164,11 @@ func NewManager(cpuPolicyName string, cpuPolicyOptions map[string]string, reconc
var policy Policy
var err error
+ topo, err = topology.Discover(machineInfo)
+ if err != nil {
+ return nil, err
+ }
+
switch policyName(cpuPolicyName) {
case PolicyNone:
@@ -165,10 +178,6 @@ func NewManager(cpuPolicyName string, cpuPolicyOptions map[string]string, reconc
}
case PolicyStatic:
- topo, err = topology.Discover(machineInfo)
- if err != nil {
- return nil, err
- }
klog.InfoS("Detected CPU topology", "topology", topo)
reservedCPUs, ok := nodeAllocatableReservation[v1.ResourceCPU]
@@ -205,6 +214,7 @@ func NewManager(cpuPolicyName string, cpuPolicyOptions map[string]string, reconc
topology: topo,
nodeAllocatableReservation: nodeAllocatableReservation,
stateFileDirectory: stateFileDirectory,
+ allCPUs: topo.CPUDetails().CPUs(),
}
manager.sourcesReady = &sourcesReadyStub{}
return manager, nil
@@ -339,6 +349,10 @@ func (m *manager) GetAllocatableCPUs() cpuset.CPUSet {
return m.allocatableCPUs.Clone()
}
+func (m *manager) GetAllCPUs() cpuset.CPUSet {
+ return m.allCPUs.Clone()
+}
+
type reconciledContainer struct {
podName string
containerName string
diff --git a/pkg/kubelet/cm/node_container_manager_linux.go b/pkg/kubelet/cm/node_container_manager_linux.go
index 9c9c91bc6f2..42d5f8939e3 100644
--- a/pkg/kubelet/cm/node_container_manager_linux.go
+++ b/pkg/kubelet/cm/node_container_manager_linux.go
@@ -31,12 +31,9 @@ import (
utilfeature "k8s.io/apiserver/pkg/util/feature"
"k8s.io/klog/v2"
kubefeatures "k8s.io/kubernetes/pkg/features"
- "k8s.io/kubernetes/pkg/kubelet/cm/cpumanager"
- "k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/topology"
"k8s.io/kubernetes/pkg/kubelet/events"
"k8s.io/kubernetes/pkg/kubelet/stats/pidlimit"
kubetypes "k8s.io/kubernetes/pkg/kubelet/types"
- "k8s.io/utils/cpuset"
)
const (
@@ -194,9 +191,10 @@ func (cm *containerManagerImpl) getCgroupConfig(rl v1.ResourceList) *ResourceCon
// An alternative is to delegate the `cpuset` cgroup to the kubelet, but that would require some plumbing in libcontainer,
// and this is sufficient.
// Only do so on None policy, as Static policy will do its own updating of the cpuset.
- if cm.NodeConfig.CPUManagerPolicy == string(cpumanager.PolicyNone) {
+ // Please see the comment on policy none's GetAllocatableCPUs
+ if cm.cpuManager.GetAllocatableCPUs().IsEmpty() {
if cm.allCPUs.IsEmpty() {
- cm.allCPUs = cm.getAllCPUs()
+ cm.allCPUs = cm.cpuManager.GetAllCPUs()
}
rc.CPUSet = cm.allCPUs
}
@@ -204,20 +202,6 @@ func (cm *containerManagerImpl) getCgroupConfig(rl v1.ResourceList) *ResourceCon
return &rc
}
-func (cm *containerManagerImpl) getAllCPUs() cpuset.CPUSet {
- machineInfo, err := cm.cadvisorInterface.MachineInfo()
- if err != nil {
- klog.V(4).InfoS("Failed to get machine info to get default cpuset", "error", err)
- return cpuset.CPUSet{}
- }
- topo, err := topology.Discover(machineInfo)
- if err != nil {
- klog.V(4).InfoS("Failed to get topology info to get default cpuset", "error", err)
- return cpuset.CPUSet{}
- }
- return topo.CPUDetails.CPUs()
-}
-
// GetNodeAllocatableAbsolute returns the absolute value of Node Allocatable which is primarily useful for enforcement.
// Note that not all resources that are available on the node are included in the returned list of resources.
// Returns a ResourceList.
Yeah, this is better. I've pushed an adapted version, PTAL.
Force-pushed from ac4619a to 0a20211.
pkg/kubelet/cm/types.go (Outdated)
)

// ResourceConfig holds information about all the supported cgroup resource parameters.
type ResourceConfig struct {
	// Memory limit (in bytes).
	Memory *int64
	// CPU set (number of cpus the cgroup has access to).
Nit.

- // CPU set (number of cpus the cgroup has access to).
+ // CPU set (number of CPUs the cgroup has access to).

For consistency.
fixed!
Force-pushed from 0a20211 to aaa141b.
@ffromani lgtm?
/lgtm
Sorry folks, this fell through the cracks. I think it's as good as we can make it; I'm happy with the change as-is now.
LGTM label has been added. Git tree hash: 1d20a47287ac4be17ac0d84648727b4a398e2501
// Update the Kubelet configuration.
ginkgo.By("Stopping the kubelet")
startKubelet := stopKubelet()

// wait until the kubelet health check will fail
gomega.Eventually(ctx, func() bool {
	return kubeletHealthCheck(kubeletHealthCheckURL)
}).WithTimeout(time.Minute).WithPolling(time.Second).Should(gomega.BeFalseBecause("expected kubelet health check to be failed"))
ginkgo.By("Stopped the kubelet")

framework.ExpectNoError(e2enodekubelet.WriteKubeletConfigFile(oldCfg))
Is this the right ordering? Should we write the file first and then restart? Presumably that would be faster.
Why would it be faster? If we're waiting sequentially on both, it should be the same amount of time AFAICT.
I mean you wouldn't need to do a separate stop and start; you could just do restartKubelet.
We likely don't run it as a systemd daemon now, but if we ever do, it will also be more reliable =). But this is hypothetical at this stage.
ah I see, done
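(A rough sketch of the reordered flow, meant to slot into the test snippet quoted above; restartKubelet is the helper referred to in this thread, and its exact signature here is an assumption.)

// Write the restored kubelet configuration first, then bounce the kubelet once,
// instead of the earlier stop -> wait for unhealthy -> write -> start sequence.
ginkgo.By("Restoring the kubelet configuration")
framework.ExpectNoError(e2enodekubelet.WriteKubeletConfigFile(oldCfg))

ginkgo.By("Restarting the kubelet to pick up the restored configuration")
restartKubelet(ctx) // hypothetical signature for the framework helper

// Wait until the kubelet reports healthy again before the test continues.
gomega.Eventually(ctx, func() bool {
	return kubeletHealthCheck(kubeletHealthCheckURL)
}).WithTimeout(2*time.Minute).WithPolling(time.Second).Should(gomega.BeTrueBecause("expected kubelet to become healthy after restart"))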
on None cpumanager policy, cgroupv2, and systemd cgroup manager, kubelet could get into a situation where it believes the cpuset cgroup was created (by libcontainer in the cgroupfs) but systemd has deleted it, as it wasn't requested to create it. This causes one unnecessary restart, as kubelet fails with `failed to initialize top level QOS containers: root container [kubepods] doesn't exist.` This only causes one restart because the kubelet skips recreating the cgroup if it already exists, but it's still a bother and is fixed this way

Signed-off-by: Peter Hunt <pehunt@redhat.com>
Force-pushed from 1631e9e to e4a2e96.
Authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
with systemd cgroup driver and cpumanager none policy. This was originally planned to be a correctness check for https://issues.k8s.io/125923, but it was difficult to reproduce the bug, so it's now a regression test against it.

Signed-off-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
Force-pushed from e4a2e96 to cc87438.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
/lgtm
/approve
LGTM label has been added. Git tree hash: 4698e145d83ff71f2ca21776af39783c7ff98dcc
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kwilczynski, SergeyKanzhelev

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@haircommander: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/retest
What type of PR is this?
/kind bug
What this PR does / why we need it:
On the None cpumanager policy with cgroup v2 and the systemd cgroup manager, the kubelet could get into a situation where it believes the cpuset cgroup was created (by libcontainer in the cgroupfs) but systemd has deleted it, as systemd wasn't asked to create it. This causes one unnecessary restart, as the kubelet fails with

`failed to initialize top level QOS containers: root container [kubepods] doesn't exist.`

This only causes one restart because the kubelet skips recreating the cgroup if it already exists, but it's still a bother, and it is fixed this way.
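(To summarize the shape of the fix, adapted from the sketch discussed in the review above; the wrapper name defaultCPUSetForNodeCgroup is invented here for illustration, while in the PR this logic lives inside getCgroupConfig.)

// On the None policy the cpumanager never reserves or pins CPUs, so its
// allocatable set is empty; in that case the node-level cgroup is pinned to
// every CPU explicitly, so systemd (as the cgroup driver) knows about the
// cpuset instead of deleting a cgroupfs-only cpuset behind the kubelet's back.
func (cm *containerManagerImpl) defaultCPUSetForNodeCgroup(rc *ResourceConfig) {
	if cm.cpuManager.GetAllocatableCPUs().IsEmpty() {
		if cm.allCPUs.IsEmpty() {
			cm.allCPUs = cm.cpuManager.GetAllCPUs()
		}
		rc.CPUSet = cm.allCPUs
	}
}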
Which issue(s) this PR fixes:
Fixes #122955
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: