Kernel panic when having a privileged container with docker >= 1.10 #27885
Comments
The Docker issue you pointed to describes a different kernel panic: the one reported in that thread is a NULL pointer dereference with a different stack trace. We need to check whether your panic has been reported before.
|
It seems that @jfrazelle encountered both issues before. |
Ah, that's a bad kernel, I remember that. I think there is a minor release update for it in Ubuntu that is much, much better. |
@rata Thanks for reporting the issue. @Random-Liu and I looked at the initial docker issue, and it looks like there are several kernel panics; both docker 1.10.X and docker 1.11.X on various kernel versions are affected. So far I haven't observed the same failure in our Jenkins tests, so it could be that we paper over the issue somehow. In any case, we should make the problem visible to end users first, and help with the debugging and the fix since it might affect our Kubernetes 1.3 users. Here is the plan I am thinking of:
|
@dchen1107 SGTM! :) |
@girishkalele ohh, sorry. I was in a hurry and they looked similar; I didn't have time to check in detail. @dchen1107: thanks! Is there any way to have some confidence that upgrading to k8s 1.3 won't cause many issues with nodes crashing because of this? I mean, when thinking about upgrading my production cluster to 1.3, I may need to create a new 1.3 cluster, run things there for a few weeks (only to test k8s) and then maybe upgrade? There is no downgrade procedure, right? Also, just curious: is it a problem if docker 1.9 continues to be used, or does 1.3 use some features that require docker > 1.9? Just to know if that is an option too, until the problem is better understood. Maybe the bug is caused by the storage driver in use (and only affects that storage driver). My container was using debian:jessie with docker installed from docker's apt repositories and the daemon simply started. I'm on a mobile connection right now, so I can't check the driver easily; I can check it in a few hours (about 6) when I'm home again. |
1.9 should still be supported.
I think it's aufs based on the kernel log and how you installed docker. :) |
Yes, 1.9.1 is still compatible with the Kubernetes 1.3 release. |
@Random-Liu @dchen1107: awesome, thanks! I'll try using another storage driver and report back if I hit it or not :-) |
It seems kubernetes 1.2.4 in AWS uses docker with AUFS:
Is this the case on GKE and GCE too? I'll check what the default storage driver is in k8s 1.3. |
@rata, Kubernetes today supports 3 different storage drivers: aufs, overlayfs, and devicemapper. On both GKE and GCE, Kubernetes uses aufs. We are switching to overlayfs through a new containervm image (gci), but that process has just started. |
@dchen1107: thanks for the info. It seems difficult for me to use another storage driver: the kube-up setup on AWS uses aufs, and as nodes crash they are recreated with aufs, so it is not easy to use another driver without modifying the Auto Scaling Group. |
@rata @dchen1107 @girishkalele |
I am confused about what is going on here. During the release burndown @mike-saparov mentioned that we are considering recommending Docker v1.9 for k8s v1.3 because of this kernel bug. However, it seems more reasonable to document it and ask the distros to patch their kernels. Can someone give an update on the current thinking for the release? |
@philips: Maybe trying to reproduce it helps you get a better idea? That's the only thing I can add; the rest of this message is mostly about that, so feel free to ignore :-) I can easily reproduce this using a pod with two containers (a rough sketch follows below): a) a privileged container running debian jessie with docker 1.11.2 or 1.10.3 from the docker repos (it happens with both), and b) the docker-gc branch "fixes" from https://github.com/rata/docker-gc (actually, the repo at work has a small script that sleeps and runs docker-gc in an infinite loop, and that is what runs). If I instead use a pod with only one container, with docker >= 1.10 installed on debian jessie listening as a daemon and used over the network to build docker images (just like the two-container pod, but without docker-gc the cache is not deleted), it still crashes after a few days; with docker-gc it crashes much faster. I can upload the Dockerfiles and yamls used if someone wants them. I'm not sure whether this bug has been fixed upstream, whether @jfrazelle, who also saw this, knows a workaround, or whether the tests and other people are using newer docker versions without issues. Maybe the bug is related to something docker-gc does and is unlikely to happen otherwise, but to the best of my knowledge it is not known. Also, kubernetes deletes docker images when there is not enough free space; I'm not sure if that (or something else kubernetes might do that I don't know about) makes it more likely to happen. I won't have time to try to fix the kernel bug (or try newer kernels and see whether it goes away) these days, but I'm happy to help someone reproduce it, upload the dockerfiles and deployments I use, etc. |
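For reference, a minimal sketch of what such a reproduction pod might look like. The image names, the docker-gc wrapper, and the shared socket path are assumptions; the actual Dockerfiles and yamls were not posted in this thread.

```yaml
# Hypothetical reproduction pod: a privileged docker-in-docker builder plus a
# docker-gc sidecar that repeatedly prunes images. All names and images are
# placeholders, not the ones used by the reporter.
apiVersion: v1
kind: Pod
metadata:
  name: dind-builder-repro
spec:
  containers:
  - name: docker-daemon
    image: my-registry/jessie-docker:1.10.3   # debian:jessie with docker installed from docker's apt repo
    securityContext:
      privileged: true                        # required to run a nested docker daemon
    volumeMounts:
    - name: docker-graph
      mountPath: /var/lib/docker              # the nested daemon's image/layer storage
    - name: docker-socket
      mountPath: /var/run
  - name: docker-gc
    image: my-registry/docker-gc:latest       # wrapper that loops: sleep, then run docker-gc
    volumeMounts:
    - name: docker-socket
      mountPath: /var/run                     # talk to the sibling daemon through its socket
  volumes:
  - name: docker-graph
    emptyDir: {}
  - name: docker-socket
    emptyDir: {}
```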
@philips We haven't recommended Docker 1.9 for k8s v1.3 yet, and what we discussed at the burndown meeting has nothing to do with this issue. For this one, we plan to document it and suggest that users of the node-problem-detector upgrade their detector so that the kernel issues are visible to end users; that way users can also understand why their applications are being restarted or their nodes rebooted. At the burndown meeting, we talked about the 1.3 blocker issue #27691. The engineers suspected the issue is either in a Kubernetes component (we changed the entire code path for 1.3) or in the docker runtime code. To narrow it down, we decided to run some tests against docker 1.9.1 and the kubernetes 1.3 beta. |
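For context, node-problem-detector is typically run as a DaemonSet that watches the kernel log and surfaces problems as node conditions and events. A minimal sketch is below; the API version, image tag, and host paths are assumptions and would need to match the cluster's actual setup.

```yaml
# Hypothetical node-problem-detector DaemonSet; image tag and mounts are placeholders.
apiVersion: extensions/v1beta1        # DaemonSet API group used by Kubernetes 1.3-era clusters
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1   # placeholder tag
        securityContext:
          privileged: true            # needed to read the host's kernel log
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log              # where kern.log lives on Debian/Ubuntu node images
```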
@dchen1107 thanks for clarification! |
XREF #27076 |
this error still happens with Docker 1.12 btw. |
Just in case it's useful to someone, I worked around this by writing to an external volume (sketched below). The pod that builds docker images now uses an EBS volume mounted at /var/lib/docker, and this issue has never happened again (so far, at least). That makes sense, as it seemed to be an aufs-related issue, and aufs is no longer used when writing the docker images. |
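A rough sketch of that workaround, assuming a pre-created EBS volume; the pod name, image, and volume ID are placeholders, since the actual spec was not posted.

```yaml
# Hypothetical builder pod with docker's storage directory on a dedicated EBS
# volume instead of the node's aufs-backed container filesystem.
apiVersion: v1
kind: Pod
metadata:
  name: image-builder
spec:
  containers:
  - name: docker-daemon
    image: my-registry/jessie-docker:1.10.3   # placeholder image
    securityContext:
      privileged: true
    volumeMounts:
    - name: docker-graph
      mountPath: /var/lib/docker              # nested docker now writes to the EBS-backed ext4 volume
  volumes:
  - name: docker-graph
    awsElasticBlockStore:
      volumeID: vol-0123456789abcdef0         # placeholder EBS volume ID
      fsType: ext4
```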
We downgraded kubernetes to 1.2.6 but kept using docker 1.12, and the problem disappeared, so it's a kubernetes 1.3 issue. |
On Fri, Aug 05, 2016 at 04:15:11AM -0700, Nugroho Herucahyono wrote:
A kernel bug seems more like a kernel issue :) What kernel version are you on? Can you upgrade your kernel and see if the issue persists? |
the same error happened:
server info: |
Anyone still observing this behavior? |
@cmluciano With the workaround I posted it doesn't happen, and it seems that with newer kernels it also doesn't happen. Are you seeing it? Which k8s, docker, and kernel versions? |
I have not, wondering if this issue should be closed |
@cmluciano oh, good point. Will close it, it can be reopened if relevant. Thanks! |
Hi,
I'm using a privileged container in a kubernetes pod to build images. The container runs docker 1.10.3. I'm using kubernetes 1.2.4 on AWS (set up with kube-up).
From time to time, a node crashes. The output of the last crash is at the end.
It seems this is the bug reported here, and it may be related to using docker >= 1.10 on the debian jessie kernel (although that is not confirmed): moby/moby#21081
If this is the case, THIS PROBABLY AFFECTS kubernetes 1.3, which is due to be released.
cc @justinsb