
[FG:InPlacePodVerticalScaling] Pod CPU limit is not configured to cgroups as calculated #129357

Open
hshiina opened this issue Dec 21, 2024 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@hshiina
Contributor

hshiina commented Dec 21, 2024

What happened?

As a result of #124216, which was introduced in v1.32, the pod-level CPU limit calculated in ResourceConfigForPod() is rounded up to the nearest 10ms (1% of a CPU) by libcontainer when the pod is resized:

  • Resize a pod:
    $ kubectl patch pod resize-pod --subresource=resize --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"cpu":"417m"}}}]}}'
    pod/resize-pod patched
    
  • The container cgroup value is set with 1ms precision:
    $ kubectl exec resize-pod -- cat /sys/fs/cgroup/cpu.max
    41700 100000
    
  • The pod cgroup value is rounded up:
    $ cat /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max
    42000 100000
    

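For context, the 41700 written to the container cgroup is what the kubelet's milliCPU-to-quota conversion produces for 417m. A minimal sketch of that conversion, assuming the default 100ms CFS period and mirroring the formula of MilliCPUToQuota in pkg/kubelet/cm/helpers_linux.go (not the actual kubelet code):

package main

import "fmt"

const (
	quotaPeriod   = 100000 // CFS period in microseconds (100ms)
	milliCPUToCPU = 1000   // millicores per CPU
)

// milliCPUToQuota returns the CFS quota in microseconds per period for a CPU
// limit given in millicores.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriod / milliCPUToCPU
}

func main() {
	fmt.Println(milliCPUToQuota(417)) // 41700, written to cpu.max as "41700 100000"
}
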
When the systemd cgroup driver is used, libcontainer rounds the CPU quota up before passing it to systemd:

// systemd converts CPUQuotaPerSecUSec (microseconds per CPU second) to CPUQuota
// (integer percentage of CPU) internally. This means that if a fractional percent of
// CPU is indicated by Resources.CpuQuota, we need to round up to the nearest
// 10ms (1% of a second) such that child cgroups can set the cpu.cfs_quota_us they expect.
cpuQuotaPerSecUSec = uint64(quota*1000000) / period
if cpuQuotaPerSecUSec%10000 != 0 {
	cpuQuotaPerSecUSec = ((cpuQuotaPerSecUSec / 10000) + 1) * 10000
}
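
Applying this rounding to the 417m example explains the 42000 in the pod cgroup. A worked sketch of the quoted arithmetic, assuming the same 100ms period (the conversion back to cpu.max reflects my reading of what systemd does and is not libcontainer code):

package main

import "fmt"

func main() {
	// 417m CPU -> quota = 41700us per 100000us period (from the kubelet calculation).
	quota, period := int64(41700), uint64(100000)

	// Same round-up as the quoted libcontainer code.
	cpuQuotaPerSecUSec := uint64(quota*1000000) / period // 417000
	if cpuQuotaPerSecUSec%10000 != 0 {
		cpuQuotaPerSecUSec = ((cpuQuotaPerSecUSec / 10000) + 1) * 10000 // 420000
	}

	// systemd converts the per-second value back into a quota for the period,
	// which is what ends up in the pod's cpu.max.
	fmt.Println(cpuQuotaPerSecUSec*period/1000000, period) // 42000 100000
}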

In addition, there seems to be a race in libcontainer: after passing the rounded value to systemd, it writes the unrounded value directly to the cgroup file:

if err := setUnitProperties(m.dbus, getUnitName(m.cgroups), properties...); err != nil {
	return fmt.Errorf("unable to set unit properties: %w", err)
}
return m.fsMgr.Set(r)

So there is also a case where the cgroup value ends up set exactly as calculated. In my testing, decreasing the CPU limit usually hits this case, though I'm not sure why:

  • Decrease the CPU limits:

    $ kubectl patch pod resize-pod --subresource=resize --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"cpu":"365m"}}}]}}'
    pod/resize-pod patched
    
  • Both the container and the pod cgroup values are set with 1ms precision:

    $ kubectl exec resize-pod -- cat /sys/fs/cgroup/cpu.max
    36500 100000
    $ cat /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-burstable.slice/kubelet-kubepods-burstable-pod68a17b59_0d31_40b2_ba86_ea43f3b2f05c.slice/cpu.max
    36500 100000
    
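In other words, a single resize has two writers of the pod's cpu.max: systemd writes the rounded quota, while the direct cgroupfs write uses the exact quota, so the final contents appear to depend on which write lands last. A sketch of the two candidate values for the 365m resize above (same arithmetic as before, not libcontainer code):

package main

import "fmt"

func main() {
	// 365m CPU -> quota = 36500us per 100000us period.
	quota, period := int64(36500), uint64(100000)

	// Candidate written via systemd after the round-up in the quoted code.
	perSec := uint64(quota*1000000) / period // 365000
	if perSec%10000 != 0 {
		perSec = ((perSec / 10000) + 1) * 10000 // 370000
	}
	rounded := perSec * period / 1000000 // 37000

	// Candidate written directly by m.fsMgr.Set(r), with no rounding.
	exact := quota // 36500

	// The 36500 observed above suggests the direct write landed last here.
	fmt.Println(rounded, exact) // 37000 36500
}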

What did you expect to happen?

This tiny gap can practically be ignored. However, we might need to confirm that it doesn't cause an issue similar to #128769.

How can we reproduce it (as minimally and precisely as possible)?

  1. Use the systemd cgroup driver and enable InPlacePodVerticalScaling.
  2. Resize a pod's CPU limit with 1ms precision (for example, 417m).

Anything else we need to know?

No response

Kubernetes version

v1.32

$ kubectl version
Client Version: v1.31.4
Kustomize Version: v5.4.2
Server Version: v1.32.0

Cloud provider

N/A

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@hshiina hshiina added the kind/bug Categorizes issue or PR as related to a bug. label Dec 21, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 21, 2024
@hshiina
Contributor Author

hshiina commented Dec 21, 2024

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 21, 2024
@pacoxu
Member

pacoxu commented Dec 23, 2024

cc @iholder101 @tallclair

@kannon92 kannon92 moved this from Triage to Triaged in SIG Node Bugs Dec 23, 2024
@kannon92
Contributor

I am going to assume this is valid.

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 23, 2024