Bumping ContainerVM image is breaking kubemark #25949

Closed
lavalamp opened this issue May 20, 2016 · 20 comments
Assignees: dchen1107
Labels: area/docker, kind/flake, priority/important-soon

Comments

@lavalamp
Member

Failed four times in a row now.

https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-kubemark-500-gce/3307/

17:56:39 Timeout waiting for all HollowNodes to become Running
17:56:39 Found only 374 ready Nodes while waiting for 500.
17:56:39 + dump_cluster_logs_and_exit
17:56:39 + local -r exit_status=1
17:56:39 + [[ -x cluster/log-dump.sh ]]
17:56:39 + ./cluster/log-dump.sh /workspace/_artifacts
@lavalamp added the priority/critical-urgent and kind/flake labels May 20, 2016
@lavalamp
Member Author

One of these PRs is the culprit:

Bump GCE ContainerVM to container-v1-3-v20160517
Add an int64 quantity implementation
Update use of Quantity in other classes
Handle resource.Quantity changes in eviction thresholds
Disable ppc64le for now

@lavalamp
Member Author

ok, all the hollow node pods got scheduled...

@lavalamp
Member Author

Does the hollow master of the interior cluster not get logged?

@hongchaodeng
Contributor

> Does the hollow master of the interior cluster not get logged?

What do you mean by that?
I usually log in to the hollow master and look at the logs in "/var/log". log-dump.sh might work, but you need to set up the kubectl config first.
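
For illustration, a rough sketch of that workflow (the kubeconfig and log paths here are assumptions, not taken from this thread - adjust them to wherever your kubemark setup writes things):

export KUBECONFIG=/path/to/kubemark-kubeconfig     # hypothetical path to the inner (kubemark) master's kubeconfig
kubectl get nodes --no-headers | grep -cw Ready    # how many hollow nodes actually registered
# or ssh to the kubemark master and read the component logs directly, e.g.:
#   less /var/log/kube-apiserver.log               # path assumed from the usual GCE master layout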

@lavalamp
Member Author

@wojtek-t @gmarek -- folks in WAW

@lavalamp
Member Author

@hongchaodeng that's the outer master - is there no inner master? What do the hollow nodes register with?

@wojtek-t
Member

Yes - the hollow nodes register with the inner master.

Let me take a look.

@bprashanth
Contributor

fishy logs
https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-kubemark-500-gce/3307/artifacts/kubemark-500-minion-iw4q/docker.log

time="2016-05-20T00:42:41.621114779Z" level=error msg="Handler for POST /containers/0435ece7dc6418063290220351e75a358edd46ac4b0657538f866da67046c866/start returned error: rpc error: code = 2 desc = \"oci runtime error: could not synchronise with container process: could not create session key: disk quota exceeded\"" 
time="2016-05-20T00:42:41.668931826Z" level=warning msg="Security options with `:` as a separator are deprecated and will be completely unsupported in 1.13, use `=` instead." 
time="2016-05-20T00:42:41Z" level=error msg="containerd: start container" error="oci runtime error: could not synchronise with container process: could not create session key: disk quota exceeded" id=fce1cb0ec7b63422e4754e86e3d4ae2fbdfbb749fbdf542b42d14dd1c8519073 

@wojtek-t
Member

Looking into it.

It seems that, consistently, only ~370-380 kubelets end up registered with the kubemark master.

@wojtek-t
Member

What @bprashanth posted above looks like the problem.
I logged into a random machine and there were a bunch of containers in the Created state, but they never actually started.
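
For reference, those stuck containers can be listed with the stock docker CLI on the minion - a minimal sketch:

docker ps -a --filter status=created             # containers that were created but never started
docker ps -aq --filter status=created | wc -l    # how many of them there are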

@wojtek-t
Member

To be honest, those errors suggest that maybe this PR is in fact the culprit:

Bump GCE ContainerVM to container-v1-3-v20160517

I will try to verify it.

@wojtek-t
Member

In other words, it seems to me that the problem is somewhere around kubelet/docker/node, and the Quantity change by @smarterclayton doesn't seem to affect that area.

On the other hand, the ContainerVM bump clearly touches that part of the system.

@wojtek-t
Member

So far I've verified that I get exactly the same issues when running locally.
I'm going to revert the ContainerVM bump locally and see if that helps.

@wojtek-t
Member

OK - so I locally reverted the ContainerVM bump PR:
#25843

and this fixes the problem for me locally.
I'm going to revert that PR.
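
For anyone repeating the local check: a rough sketch, assuming the usual kubemark entry points and with a placeholder for whatever merge commit #25843 landed as (substitute the real hash):

git revert -m 1 <merge-commit-of-#25843>    # hypothetical placeholder, not a real ref
test/kubemark/stop-kubemark.sh              # assumed kubemark scripts; adjust to your tree
test/kubemark/start-kubemark.sh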

@wojtek-t changed the title from "kubemark-500 Timeout waiting for all HollowNodes to become Running" to "Bumping ContainerVM image is breaking kubemark" May 20, 2016
@wojtek-t
Member

OK - reverting to the previous ContainerVM image solved the problem.

So we clearly have some incompatibility between Docker 1.9 and 1.11 which I don't have time to debug.
If we want to enable Docker 1.11 we definitely need to understand it (and if it's a configuration issue, fix kubemark).

So I'm reassigning this to @dchen1107 for a decision on how to proceed.

@wojtek-t added the priority/important-soon label and removed the priority/critical-urgent label May 20, 2016
@wojtek-t assigned dchen1107 and unassigned wojtek-t May 20, 2016
@yujuhong
Contributor

kubelet.log is full of errors like this, as the other issue suggested:

time="2016-05-20T00:48:41Z" level=error msg="containerd: start container" error="oci runtime error: could not synchronise with container process: could not create session key: disk quota exceeded" id=834e73748fd5462f76359ca413bf6902cb4316a670dc9f3a2053577b2fc9556a 

It seems to be caused by the fact that runc creates a unique session key for each container.
opencontainers/runc#726
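
In case it helps with the debugging: the "disk quota exceeded" here is EDQUOT from the kernel keyring (each container start creates a new session key), not a filesystem quota. A rough sketch of how to inspect the quota on a node, plus a workaround-style bump whose values are assumptions rather than a vetted fix:

cat /proc/key-users                                          # per-uid keyring usage vs. quota
sysctl kernel.keys.root_maxkeys kernel.keys.root_maxbytes    # everything here runs as root, so the root quota is the one that fills up
sysctl -w kernel.keys.root_maxkeys=1000000                   # assumed values - pick your own
sysctl -w kernel.keys.root_maxbytes=25000000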

@dchen1107
Member

Didn't see this issue earlier - I posted an update at #25951 (comment)

@dchen1107
Member

If we reached that quota limit, how come our docker performance tests & e2e density tests didn't catch the issue?

@dchen1107
Member

I am closing this one since the image has been reverted and the flakiness caused by docker 1.11.1 is gone. Please move discussion related to docker 1.11.1 validation / testing to #23397, and discussion of which docker release goes with Kubernetes 1.3 to #25893.
