OpenShift builds cannot start due to cgroup error (when cgroups are not enabled) #8074
Comments
Okay, the kernel option issue could be classified as a Kubernetes deploy bug, maybe. I think OpenShift should handle the situation where this cgroup feature is not enabled, because end users might deploy to bare metal with an unknown OS installed.
This particular problem doesn't apply to "wily" on AWS, and I think most newer distributions, so I'd say it's lower priority.
@derekwaynecarr @mrunalp @smarterclayton what would we want to do in the case that the cgroup filesystem is not available? I really don't want to fall back to "no limit" in that scenario, but I'm not sure what else would be appropriate. Is it even valid to run k8s/openshift without the CONFIG_CFS_BANDWIDTH=y kernel option, given that it makes our quotas impossible to enforce (I assume)?
In general, I would expect us to gracefully degrade if it's a configuration thing (warn). What I'm concerned with is flakes - cgroups being notoriously racy at the best of times, what happens if we tolerate missing cgroups, but instead of failing, we continue (so the admin has only a small chance of catching it)? That seems risky.

What might make sense is to have a Kube/Origin flag like "allow-missing-cgroups" that a deployer can opt in to, making provision for these things not existing (a sketch follows below). For Origin, it would have to be in the controller config (so that it can be passed to the build pod). But for Kube, it seems much more likely to be a node-level thing (so maybe a separate flag). If it's not blocking you now, we could potentially say "barring a new cgroup being introduced that Kube uses but no one supports, we will continue to error out".
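A sketch of what such an opt-in flag might look like on the node side; the flag name and wiring are hypothetical, taken from the proposal above rather than from any real Kube/Origin option:

```go
package main

import (
	"flag"
	"log"
)

// Hypothetical opt-in flag per the proposal above: tolerate missing cgroup
// subsystems with a warning instead of erroring out at startup.
var allowMissingCgroups = flag.Bool("allow-missing-cgroups", false,
	"tolerate (with a warning) cgroup subsystems that are not mounted")

func main() {
	flag.Parse()
	if *allowMissingCgroups {
		log.Println("warning: missing cgroup subsystems will be tolerated")
	}
}
```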
Unfortunately this fails under the default AWS deploy for Kubernetes. @justinsb maybe we can escalate this to a "let's fix this in the Kubernetes AWS deploy" issue then, if you want to enforce the option being enabled.
I think that either our installer or the openshift binary could check for the cgroup subsystems on startup and fail if the expected subsystems aren't found. This can be checked by parsing either the /proc/self/cgroup or /proc/cgroups file; a sketch is below.
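A minimal sketch of that startup check, assuming we parse /proc/cgroups (whose columns are subsys_name, hierarchy, num_cgroups, enabled); the function name and the required-subsystem list are illustrative, not the actual installer code:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// enabledCgroupSubsystems parses /proc/cgroups and reports, per subsystem,
// whether the kernel has it enabled.
func enabledCgroupSubsystems() (map[string]bool, error) {
	f, err := os.Open("/proc/cgroups")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	subsystems := map[string]bool{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") { // skip the header line
			continue
		}
		// Format: subsys_name  hierarchy  num_cgroups  enabled
		fields := strings.Fields(line)
		if len(fields) < 4 {
			continue
		}
		subsystems[fields[0]] = fields[3] == "1"
	}
	return subsystems, scanner.Err()
}

func main() {
	subs, err := enabledCgroupSubsystems()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/cgroups:", err)
		os.Exit(1)
	}
	for _, required := range []string{"cpu", "cpuacct", "memory"} {
		if !subs[required] {
			fmt.Fprintf(os.Stderr, "required cgroup subsystem %q is missing or disabled\n", required)
			os.Exit(1)
		}
	}
}
```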
We can fix this in our Debian image, I think - or at least look into doing so. Can you open an issue please? In the meantime, you might want to use wily (`export KUBE_OS_DISTRIBUTION=wily`).
On wily I now almost instantly get this build error:

```
Cpu-shares.
```
@paralin can you provide the rest of the build log? I believe we dump out the cgroup settings we are using (at least if you enable loglevel 5 in your build).
@paralin What version of docker are you running?
@mrunalp
@bparees how do I enable loglevel 5 in a build?
Big props to your work on the web UI. Much easier to make quick tweaks to things like the trigger list and such. Log level 5:
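For reference, the documented way to get a log level 5 build is to set the BUILD_LOGLEVEL environment variable on the build strategy; a BuildConfig fragment (assuming a source-strategy build) would look roughly like:

```yaml
# BuildConfig excerpt: bumps the builder's log verbosity to 5.
strategy:
  type: Source
  sourceStrategy:
    env:
      - name: BUILD_LOGLEVEL
        value: "5"
```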
It's also odd that it says it can't "put builds". That is probably related to another issue I filed regarding the internal Kube stuff.
Now it's happening in the pod events too:
What a mess. I think the cgroup not existing happens under Jessie, and this weird cpu-shares error happens under wily.
@mrunalp @derekwaynecarr any thoughts on this?
kubernetes/kubernetes#23143, I think
I think the Debian version of docker needs the patch that I added to our version for joining all cgroups.
The kubelet checks for the required cgroup subsystems here on start-up: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L99 @bparees
@paralin - are you enabling the
@derekwaynecarr what are the implications of that for me? Presumably the build pod would not have a cgroup limit value set, so we'd just read the quota as "-1" (the file would exist, but with a value of -1) and we'd set the limit on the builder container to -1 as well. Do we need to actually understand that enforcement is disabled and not try to set a cgroup limit at all?
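A minimal sketch of the scenario in question, assuming cpu.cfs_quota_us exists and reads -1 when enforcement is disabled; readBuildCPUQuota and the --cpu-quota mapping are illustrative, not the actual builder code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readBuildCPUQuota reads the pod's CFS quota from its cgroup directory.
// A value of -1 means "no quota is being enforced".
func readBuildCPUQuota(cgroupDir string) (int64, error) {
	data, err := os.ReadFile(cgroupDir + "/cpu.cfs_quota_us")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	quota, err := readBuildCPUQuota("/sys/fs/cgroup/cpu")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if quota == -1 {
		// One option: recognize that enforcement is disabled and set no
		// cgroup limit on the builder container at all.
		fmt.Println("no quota enforced; skipping cgroup limit")
		return
	}
	// The other option: mirror the value onto the builder container.
	fmt.Printf("would run the builder with --cpu-quota=%d\n", quota)
}
```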
@bparees - I do not think it should impact you.
@bparees - I take that back slightly. I guess it's possible that when not running on RHEL-based distros we would not find the cpu quota file when validating. I am not sure that is an environment we really support. We support disabling the kubelet from enforcing cpu limits, but I think we still require that the underlying OS be able to enforce quota. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L137
@derekwaynecarr if you want to support running under Kubernetes then you can't assume anything about the host operating system. People could be running on bare metal with specific limitations due to their own specifications, restrictions, etc. So I would suggest gracefully falling back with warnings, much like Redis prints kernel option suggestions on startup when certain performance-impacting issues are found.
@bparees @derekwaynecarr - with the CRI-O related changes via 1c1f653#diff-bb3dfd2a9b83662fd66fdf176d55aa53 from last fall, we seem to tolerate the lack of cgroup files. See https://github.com/openshift/origin/blob/master/pkg/build/builder/util_linux.go#L59-L62 in particular. Seems like we can close this out as fixed ... agreed, or is there a nuance I'm overlooking in reviewing the history here? Thanks.
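A condensed sketch of that tolerant behavior (a hypothetical helper, paraphrased rather than copied from util_linux.go): if the cgroup file is absent, fall back to a default instead of erroring out.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readCgroupInt64 returns the integer stored in a cgroup file, or def if the
// file does not exist (e.g. the subsystem is not mounted on this node).
func readCgroupInt64(path string, def int64) (int64, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return def, nil // tolerate the missing cgroup file
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// -1 conventionally means "no quota" for cpu.cfs_quota_us.
	quota, err := readCgroupInt64("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", -1)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cpu quota:", quota)
}
```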
@gabemontero huh. good for us :) Yeah I guess it can be closed out.
On the default "jessie" deployment for Kubernetes (under AWS), I finally have OpenShift builds working (thank you @liggitt and @ncdc).
It seems now that issue #7297 has resurfaced.
Contents of `/sys/fs/cgroup/cpu/`:

Error output from build:
I would probably have assumed this was an issue with the privileged flag not working correctly (perhaps due to kubelets being configured to disallow privileged pods), but the path does not exist on the host system, so I think this might be a kernel feature that isn't enabled, or something of that nature.