
OpenShift builds cannot start due to cgroup error (when cgroups are not enabled) #8074

Closed · paralin opened this issue Mar 16, 2016 · 26 comments

@paralin (Contributor) commented Mar 16, 2016

On the default "jessie" deployment for Kubernetes (under AWS), I finally have OpenShift builds working (thank you @liggitt and @ncdc).

It seems now that issue #7297 has resurfaced.

Contents of /sys/fs/cgroup/cpu/:

admin@ip-172-20-0-9:/sys/fs/cgroup/cpu$ ls
cgroup.clone_children  cgroup.sane_behavior  cpuacct.usage         cpu.shares  notify_on_release  tasks
cgroup.procs           cpuacct.stat          cpuacct.usage_percpu  docker      release_agent

Error output from build:

F0316 21:46:24.866361       1 builder.go:204] Error: failed to retrieve cgroup limits: cannot determine cgroup limits: open /sys/fs/cgroup/cpu/cpu.cfs_quota_us: no such file or directory

I would have assumed this was an issue with the privileged flag not working correctly (perhaps because kubelets were configured to disallow privileged pods), but the path does not exist on the host system either, so I think this is a feature that's not enabled in the kernel, or something of that nature.
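The failing read could be made tolerant instead of fatal. Below is a minimal sketch in Go of that idea; `readCgroupInt64` is a hypothetical helper, not OpenShift's actual function, and it simply falls back to a default when the control file is absent (as `cpu.cfs_quota_us` is on a kernel built without `CONFIG_CFS_BANDWIDTH`):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupInt64 reads a single integer from a cgroup v1 control file,
// returning a fallback when the file does not exist rather than aborting.
// Hypothetical helper for illustration only.
func readCgroupInt64(path string, fallback int64) (int64, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return fallback, nil // degrade gracefully instead of failing the build
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// -1 is the CFS convention for "no quota".
	quota, err := readCgroupInt64("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", -1)
	fmt.Println(quota, err)
}
```

Whether silently defaulting is acceptable (versus warning loudly) is debated later in this thread.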

@paralin (Contributor, Author) commented Mar 16, 2016

Okay, the kernel option CONFIG_CFS_BANDWIDTH=y is not enabled on the AWS jessie OS.

This could be classified as a Kubernetes deploy bug, maybe, but I think OpenShift should handle the case where this cgroup feature is not enabled, because end users might deploy to bare metal with an unknown OS installed.

@paralin (Contributor, Author) commented Mar 17, 2016

This particular problem doesn't apply to "wily" on AWS, and I think most newer distributions are unaffected, so I'd say it's lower priority.

@bparees (Contributor) commented Mar 17, 2016

@derekwaynecarr @mrunalp @smarterclayton what would we want to do in the case that the cgroup filesystem is not available? I really don't want to fall back to "no limit" in that scenario, but I'm not sure what else would be appropriate.

Is it even valid to run k8s/openshift w/o the CONFIG_CFS_BANDWIDTH=y kernel option, given that it makes our quotas impossible to enforce (I assume)?

@smarterclayton (Contributor):

In general, I would expect us to degrade gracefully (with a warning) if it's a configuration issue. What concerns me is flakes: cgroups are notoriously racy at the best of times, so what happens if we tolerate missing cgroups and, instead of failing, we continue (giving the admin only a small chance of catching it)? That seems risky.

What might make sense is to have a Kube/Origin flag for "allow-missing-cgroups" that a deployer can opt in to that makes provision for these things not existing. For Origin, it would have to be in the controller config (so that it can be passed to the build pod). But for Kube, seems much more likely to be a node level thing (so maybe a separate flag).

If it's not blocking you now, we could potentially say "barring a new cgroup being introduced that Kube uses but no one supports, we will continue to error out".

@paralin (Contributor, Author) commented Mar 17, 2016

Unfortunately this fails under the default AWS deploy for Kubernetes. @justinsb

Maybe we can escalate this to a "let's fix this in Kubernetes AWS deploy" then, if you want to enforce the option being enabled.

@mrunalp (Member) commented Mar 17, 2016

I think that either our installer or the openshift binary could check for the cgroup subsystems on startup and fail if the expected subsystems aren't found. This can be checked by parsing either /proc/self/cgroup or /proc/cgroups.
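A minimal sketch of that startup check, parsing the /proc/cgroups format (columns: subsys_name, hierarchy, num_cgroups, enabled). The helper name is illustrative, not the installer's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// hasCgroupSubsystems parses the text of /proc/cgroups and reports whether
// every required controller is both present and enabled (last column == 1).
func hasCgroupSubsystems(procCgroups string, required ...string) bool {
	enabled := map[string]bool{}
	for _, line := range strings.Split(procCgroups, "\n") {
		fields := strings.Fields(line)
		// Skip the "#subsys_name ..." header and malformed lines.
		if len(fields) == 4 && !strings.HasPrefix(fields[0], "#") {
			enabled[fields[0]] = fields[3] == "1"
		}
	}
	for _, name := range required {
		if !enabled[name] {
			return false
		}
	}
	return true
}

func main() {
	sample := "#subsys_name hierarchy num_cgroups enabled\n" +
		"cpu 2 64 1\ncpuacct 2 64 1\nmemory 3 64 0"
	fmt.Println(hasCgroupSubsystems(sample, "cpu", "cpuacct")) // true
	fmt.Println(hasCgroupSubsystems(sample, "memory"))         // false: present but disabled
}
```

In a real check the input would come from reading /proc/cgroups on the node at startup.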

@justinsb:

We can fix this in our debian image, I think - or at least look into doing so. Can you open an issue please?

In the meantime, you might want to use wily (export KUBE_OS_DISTRIBUTION=wily)

@paralin (Contributor, Author) commented Mar 17, 2016

On wily I now almost instantly get this build error:


I0317 19:29:24.730862       1 source.go:197] Downloading "https://github.com/enokd/docker-node-hello.git" ...
W0317 19:29:26.454135       1 common.go:89] An error occurred saving build revision: the server could not find the requested resource (put builds docker-node-hello-4)
Step 1 : FROM centos@sha256:ec1bf627545d77d05270b3bbd32a9acca713189c58bc118f21abd17ff2629e3f
 ---> ed452988fb6e
Step 2 : RUN rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
 ---> Using cache
 ---> c8993aa3f727
Step 3 : RUN yum install -y -q npm
 ---> [Warning] Your kernel does not support swap limit capabilities, memory limited without swap.
 ---> Running in c9ea818b4eb8
F0317 19:29:26.805547       1 builder.go:204] Error: build error: [8] System error: The minimum allowed cpu-shares is 1024

Cpu-shares.

@bparees (Contributor) commented Mar 17, 2016

@paralin can you provide the rest of the build log? I believe we dump out the cgroup settings we are using (at least if you enable loglevel 5 in your build)

@mrunalp (Member) commented Mar 17, 2016

@paralin What version of docker are you running?

@paralin (Contributor, Author) commented Mar 17, 2016

@mrunalp 1.9.1

@paralin (Contributor, Author) commented Mar 17, 2016

@bparees how do I enable loglevel 5 in a build?

@bparees (Contributor) commented Mar 17, 2016

@paralin (Contributor, Author) commented Mar 18, 2016

Big props to your work on the web UI. Much easier to make quick tweaks to things like the trigger list and such.

Log level 5:

I0317 23:24:04.327588       1 builder.go:57] Master version "v1.1.4", Builder version "v1.1.4"
I0317 23:24:04.328387       1 builder.go:145] Running build with cgroup limits: api.CGroupLimits{MemoryLimitBytes:9223372036854771712, CPUShares:2, CPUPeriod:100000, CPUQuota:-1, MemorySwap:9223372036854771712}
I0317 23:24:04.328857       1 source.go:197] Downloading "https://github.com/enokd/docker-node-hello.git" ...
I0317 23:24:04.917862       1 source.go:208] Cloning source from https://github.com/enokd/docker-node-hello.git
W0317 23:24:06.298439       1 common.go:89] An error occurred saving build revision: the server could not find the requested resource (put builds docker-node-hello-6)
Step 1 : FROM centos@sha256:ec1bf627545d77d05270b3bbd32a9acca713189c58bc118f21abd17ff2629e3f
 ---> ed452988fb6e
Step 2 : ENV "BUILD_LOGLEVEL" "2"
 ---> Using cache
 ---> ad6a59f257b6
Step 3 : RUN rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
 ---> Using cache
 ---> 86f9329cc8f9
Step 4 : RUN yum install -y -q npm
 ---> Using cache
 ---> bbcea1291029
Step 5 : ADD . /src
 ---> dc3df86a396f
Removing intermediate container b8fa11a59617
Step 6 : RUN cd /src; npm install
 ---> [Warning] Your kernel does not support swap limit capabilities, memory limited without swap.
 ---> Running in 8acf707954ad
F0317 23:24:06.869910       1 builder.go:204] Error: build error: [8] System error: The minimum allowed cpu-shares is 1024

It's also odd that it says it can't "put builds". This is probably related to another issue I filed regarding the internal Kube resources.
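As a side note on the loglevel-5 output above: the MemoryLimitBytes/MemorySwap value 9223372036854771712 is not arbitrary. It is math.MaxInt64 rounded down to a 4 KiB page boundary, i.e. an "effectively unlimited" sentinel. A quick check in Go:

```go
package main

import (
	"fmt"
	"math"
)

// unlimitedSentinel returns math.MaxInt64 rounded down to a 4 KiB page
// boundary, which matches the 9223372036854771712 seen in the build log
// for MemoryLimitBytes and MemorySwap.
func unlimitedSentinel() int64 {
	return (int64(math.MaxInt64) / 4096) * 4096
}

func main() {
	fmt.Println(unlimitedSentinel())                        // 9223372036854771712
	fmt.Println(unlimitedSentinel() == 9223372036854771712) // true
}
```

So the memory side of the limits struct is effectively "no limit"; the interesting values are CPUShares:2 and CPUQuota:-1.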

@paralin (Contributor, Author) commented Mar 18, 2016

Now it's happening in the pod events too:

Events:
  FirstSeen     LastSeen        Count   From                                                    SubobjectPath                   Type            Reason          Message
  ---------     --------        -----   ----                                                    -------------                   ----            ------          -------
  1m            1m              1       {default-scheduler }                                                                    Normal          Scheduled       Successfully assigned docker-registry-1-deploy to ip-172-20-0-77.us-west-2.compute.internal
  1m            1m              1       {kubelet ip-172-20-0-77.us-west-2.compute.internal}     spec.containers{deployment}     Normal          Pulling         pulling image "openshift/origin-deployer:v1.1.4"
  1m            1m              1       {kubelet ip-172-20-0-77.us-west-2.compute.internal}     spec.containers{deployment}     Normal          Pulled          Successfully pulled image "openshift/origin-deployer:v1.1.4"
  1m            1m              1       {kubelet ip-172-20-0-77.us-west-2.compute.internal}     spec.containers{deployment}     Normal          Created         Created container with docker id a03de03833a2
  1m            1m              1       {kubelet ip-172-20-0-77.us-west-2.compute.internal}     spec.containers{deployment}     Warning         Failed          Failed to start container with docker id a03de03833a2 with error: API error (500): Cannot start container a03de03833a2917d9939904d06f8d834b89a6b0801ea3d2a9e7aeef7f9a470f8: [8] System error: The minimum allowed cpu-shares is 1024

  1m            1m              1       {kubelet ip-172-20-0-77.us-west-2.compute.internal}                                     Warning         FailedSync      Error syncing pod, skipping: failed to "StartContainer" for "deployment" with RunContainerError: "runContainer: API error (500): Cannot start container a03de03833a2917d9939904d06f8d834b89a6b0801ea3d2a9e7aeef7f9a470f8: [8] System error: The minimum allowed cpu-shares is 1024\n"

What a mess. I think the missing-cgroup error happens on Jessie, and this odd cpu-shares error happens under wily.
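One conceivable mitigation for the wily failure would be to clamp the requested shares up to the daemon's floor before starting the container. Whether clamping (rather than erroring) is acceptable is exactly the open design question here, so the sketch below is only an illustration; the constant and function names are hypothetical, not OpenShift's actual fix:

```go
package main

import "fmt"

// minDockerCPUShares mirrors the floor the docker daemon on wily enforces
// ("The minimum allowed cpu-shares is 1024"). Illustrative constant.
const minDockerCPUShares = 1024

// clampCPUShares raises a below-floor shares value (like the CPUShares:2
// seen in the build log) up to the daemon's minimum, leaving larger values
// untouched.
func clampCPUShares(shares int64) int64 {
	if shares < minDockerCPUShares {
		return minDockerCPUShares
	}
	return shares
}

func main() {
	fmt.Println(clampCPUShares(2))    // 1024
	fmt.Println(clampCPUShares(2048)) // 2048
}
```

The trade-off is that clamping silently grants the container more CPU weight than the pod spec requested.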

@bparees (Contributor) commented Mar 18, 2016

@mrunalp @derekwaynecarr any thoughts on "System error: The minimum allowed cpu-shares is 1024"?

@liggitt (Contributor) commented Mar 18, 2016

kubernetes/kubernetes#23143, I think

@mrunalp (Member) commented Mar 18, 2016

I think the Debian version of docker needs the patch that I added to our version for joining all cgroups.


@derekwaynecarr (Member):

@mrunalp

the kubelet checks for the required stuff here on start-up.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L99

@bparees
It's a valid configuration of OpenShift to disable enforcement of CPU limits by passing the kubelet flag --cpu-cfs-quota=false in node-config.yaml.

@derekwaynecarr (Member):

@paralin - are you enabling the --cpu-cfs-quota flag on the kubernetes installs you are testing against?

@bparees (Contributor) commented Mar 18, 2016

Its a valid configuration of OpenShift to disable enforcement of cpu limits by passing the kubelet flag in the node-config.yaml for --cpu-cfs-quota=false

@derekwaynecarr what are the implications of that for me? Presumably the build pod would not have a cgroup limit value set, so we'd just read the quota as "-1" (the file would exist, but with a value of -1) and we'd set the limit on the builder container to -1 as well. Do we need to actually understand that enforcement is disabled and not try to set a cgroup limit at all?
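The "just read -1 and propagate it" behavior can be sketched as follows. This assumes the standard cpu.cfs_quota_us semantics, where -1 means no quota for the period; names are illustrative, not OpenShift's actual build code:

```go
package main

import "fmt"

// quotaToLimit interprets a CFS quota/period pair: a quota of -1 (the value
// cpu.cfs_quota_us holds when no limit is set, e.g. with --cpu-cfs-quota=false)
// is propagated as "no limit" rather than treated as an error.
func quotaToLimit(quotaUS, periodUS int64) string {
	if quotaUS < 0 {
		return "unlimited"
	}
	return fmt.Sprintf("%.2f CPUs", float64(quotaUS)/float64(periodUS))
}

func main() {
	fmt.Println(quotaToLimit(-1, 100000))    // unlimited
	fmt.Println(quotaToLimit(50000, 100000)) // 0.50 CPUs
}
```

Under this reading, disabled enforcement and "no limit set" look identical to the builder, which is why the question of whether it needs to distinguish them matters.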

@derekwaynecarr (Member):

@bparees - I do not think it should impact you.

@derekwaynecarr (Member):

@bparees - I take that back slightly. I guess it's possible that, when not running on RHEL-based distros, we would not find the cpu quota file when validating. I am not sure that is an environment we really support. We support disabling the kubelet from enforcing cpu limits, but I think we still require that the underlying OS could enforce quota.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L137

@paralin (Contributor, Author) commented Mar 21, 2016

@derekwaynecarr if you want to support running under Kubernetes, then you can't assume anything about the host operating system; people could be running on bare metal with specific limitations due to their own specifications or restrictions. So I would suggest gracefully falling back with warnings, much like Redis prints kernel option suggestions on startup when it detects performance-impacting configuration.

@bparees bparees changed the title OpenShift builds cannot start due to cgroup error OpenShift builds cannot start due to cgroup error (when cgroups are not enabled) Jun 10, 2016
@bparees bparees assigned gabemontero and unassigned bparees Oct 9, 2017
@gabemontero (Contributor):

@bparees @derekwaynecarr - with the CRI-O related changes via 1c1f653#diff-bb3dfd2a9b83662fd66fdf176d55aa53 from last fall, we seem to tolerate the lack of cgroup files.

See https://github.com/openshift/origin/blob/master/pkg/build/builder/util_linux.go#L59-L62 in particular

Seems like we can close this out as fixed... agreed, or is there a nuance I'm overlooking in the history here?

thanks

@bparees (Contributor) commented Jan 23, 2018

@gabemontero huh. good for us :) Yeah I guess it can be closed out.

@bparees bparees closed this as completed Jan 23, 2018