Bumping ContainerVM image is breaking kubemark #25949

Closed
lavalamp opened this issue May 20, 2016 · 20 comments
Assignees: dchen1107
Labels: area/docker, kind/flake, priority/important-soon

Comments

@lavalamp
Member

Failed four times in a row now.

https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/3307/

17:56:39 Timeout waiting for all HollowNodes to become Running
17:56:39 Found only 374 ready Nodes while waiting for 500.
17:56:39 + dump_cluster_logs_and_exit
17:56:39 + local -r exit_status=1
17:56:39 + [[ -x cluster/log-dump.sh ]]
17:56:39 + ./cluster/log-dump.sh /workspace/_artifacts
@lavalamp added the priority/critical-urgent and kind/flake labels May 20, 2016
@lavalamp
Member Author

One of these PRs is the culprit:

Bump GCE ContainerVM to container-v1-3-v20160517
Add an int64 quantity implementation
Update use of Quantity in other classes
Handle resource.Quantity changes in eviction thresholds
Disable ppc64le for now

@lavalamp
Member Author

ok, all the hollow node pods got scheduled...

@lavalamp
Member Author

Does the hollow master of the interior cluster not get logged?

@hongchaodeng
Contributor

> Does the hollow master of the interior cluster not get logged?

What do you mean by that?
I usually log in to the hollow master and look at the logs in "/var/log". log-dump.sh might work, but you need to set up the kubectl config first.
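
For illustration, a rough sketch of that workflow (the kubeconfig and log paths here are assumptions, not taken from this thread - adjust them to wherever your kubemark setup writes things):

export KUBECONFIG=/path/to/kubemark-kubeconfig     # hypothetical path to the inner (kubemark) master's kubeconfig
kubectl get nodes --no-headers | grep -cw Ready    # how many hollow nodes actually registered
# or ssh to the kubemark master and read the component logs directly, e.g.:
#   less /var/log/kube-apiserver.log               # path assumed from the usual GCE master layout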

@lavalamp
Member Author

@wojtek-t @gmarek -- folks in WAW

@lavalamp
Member Author

@hongchaodeng that's the outer master - is there no inner master? What do the hollow nodes register with?

@wojtek-t
Member

Yes - the hollow nodes register with the inner master.

Let me take a look.

@bprashanth
Contributor

fishy logs
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-kubemark-500-gce/3307/artifacts/kubemark-500-minion-iw4q/docker.log

time="2016-05-20T00:42:41.621114779Z" level=error msg="Handler for POST /containers/0435ece7dc6418063290220351e75a358edd46ac4b0657538f866da67046c866/start returned error: rpc error: code = 2 desc = \"oci runtime error: could not synchronise with container process: could not create session key: disk quota exceeded\"" 
time="2016-05-20T00:42:41.668931826Z" level=warning msg="Security options with `:` as a separator are deprecated and will be completely unsupported in 1.13, use `=` instead." 
time="2016-05-20T00:42:41Z" level=error msg="containerd: start container" error="oci runtime error: could not synchronise with container process: could not create session key: disk quota exceeded" id=fce1cb0ec7b63422e4754e86e3d4ae2fbdfbb749fbdf542b42d14dd1c8519073 

@wojtek-t
Member

Looking into it.

It seems that, consistently, only ~370-380 kubelets end up registered with the kubemark master.

@wojtek-t
Member

What @bprashanth posted above looks like the problem.
I logged into a random machine and there were a bunch of containers in the Created state, but they never actually started.
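
For reference, those stuck containers can be listed with the stock docker CLI on the minion - a minimal sketch:

docker ps -a --filter status=created             # containers that were created but never started
docker ps -aq --filter status=created | wc -l    # how many of them there are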

@wojtek-t
Member

To be honest, those errors suggest that maybe this PR is in fact the culprit:

Bump GCE ContainerVM to container-v1-3-v20160517

I will try to verify it.

@wojtek-t
Member

In other words, it seems to me that the problem is somewhere around kubelet/docker/node, and the Quantity change by @smarterclayton doesn't seem to affect that area.

On the other hand, the ContainerVM bump clearly touches that part of the system.

@wojtek-t
Member

So far I've verified that I get exactly the same issues when running locally.
I'm going to revert the ContainerVM bump locally and see if that helps.

@wojtek-t
Member

OK - so I locally reverted the ContainerVM bump PR:
#25843

and this fixes the problem for me locally.
I'm going to revert that PR.
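
For anyone repeating the local check: a rough sketch, assuming the usual kubemark entry points and with a placeholder for whatever merge commit #25843 landed as (substitute the real hash):

git revert -m 1 <merge-commit-of-#25843>    # hypothetical placeholder, not a real ref
test/kubemark/stop-kubemark.sh              # assumed kubemark scripts; adjust to your tree
test/kubemark/start-kubemark.sh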

@wojtek-t changed the title from "kubemark-500 Timeout waiting for all HollowNodes to become Running" to "Bumping ContainerVM image is breaking kubemark" May 20, 2016
@wojtek-t
Member

OK - reverting to the previous ContainerVM image solved the problem.

So we clearly have some incompatibility between Docker 1.9 and 1.11 which I don't have time to debug.
If we want to enable Docker 1.11 we definitely need to understand it (and if it's a configuration issue, fix kubemark).

So I'm reassigning this to @dchen1107 for a decision on how to proceed.

@wojtek-t added the priority/important-soon label and removed the priority/critical-urgent label May 20, 2016
@wojtek-t assigned dchen1107 and unassigned wojtek-t May 20, 2016
@yujuhong
Contributor

kubelet.log is full of errors like this, as the other issue suggested:

time="2016-05-20T00:48:41Z" level=error msg="containerd: start container" error="oci runtime error: could not synchronise with container process: could not create session key: disk quota exceeded" id=834e73748fd5462f76359ca413bf6902cb4316a670dc9f3a2053577b2fc9556a 

It seems to be caused by the fact that runc creates a unique session key for each container.
opencontainers/runc#726
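
In case it helps with the debugging: the "disk quota exceeded" here is EDQUOT from the kernel keyring (each container start creates a new session key), not a filesystem quota. A rough sketch of how to inspect the quota on a node, plus a workaround-style bump whose values are assumptions rather than a vetted fix:

cat /proc/key-users                                          # per-uid keyring usage vs. quota
sysctl kernel.keys.root_maxkeys kernel.keys.root_maxbytes    # everything here runs as root, so the root quota is the one that fills up
sysctl -w kernel.keys.root_maxkeys=1000000                   # assumed values - pick your own
sysctl -w kernel.keys.root_maxbytes=25000000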

@dchen1107
Member

Didn't see this issue earlier - I posted an update at #25951 (comment)

@dchen1107
Member

If we reached that quota limit, how come our docker performance tests & e2e density tests didn't catch the issue?

@dchen1107
Member

I am closing this one since the image has been reverted and the flakiness caused by docker 1.11.1 is gone. Please move discussion related to docker 1.11.1 validation / testing to #23397, and discussion of which docker release goes with Kubernetes 1.3 to #25893.
