e2e flake: Daemon set should run and stop complex daemon #16623

Closed
a-robinson opened this issue Oct 30, 2015 · 21 comments
Assignees: mikedanese
Labels: kind/flake, priority/important-soon

Comments

@a-robinson (Contributor)

This has been flaking pretty often lately - in 5 out of the last 11 runs of kubernetes-e2e-gce-flaky, to be specific. I've copied the output of one such failure run below.

Sorry if there's already an issue tracking this; I couldn't find anything other than #14072, which is for a different test.

@mikedanese

14:35:47 Daemon set 
14:35:47   should run and stop complex daemon
14:35:47   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
14:35:47 [BeforeEach] Daemon set
14:35:47   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:51
14:35:47 STEP: Creating a kubernetes client
14:35:47 >>> testContext.KubeConfig: /var/lib/jenkins/jobs/kubernetes-e2e-gce-flaky/workspace/.kube/config
14:35:47 STEP: Building a namespace api object
14:35:47 Oct 30 14:35:47.266: INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-daemonsets-40nfz
14:35:47 Oct 30 14:35:47.268: INFO: Get service account default in ns e2e-tests-daemonsets-40nfz failed, ignoring for 2s: serviceaccounts "default" not found
14:35:49 Oct 30 14:35:49.271: INFO: Service account default in ns e2e-tests-daemonsets-40nfz with secrets found. (2.005191042s)
14:35:49 STEP: Waiting for a default service account to be provisioned in namespace
14:35:49 Oct 30 14:35:49.271: INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-daemonsets-40nfz
14:35:49 Oct 30 14:35:49.273: INFO: Service account default in ns e2e-tests-daemonsets-40nfz with secrets found. (1.930327ms)
14:35:49 [BeforeEach] Daemon set
14:35:49   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:71
14:35:55 [It] should run and stop complex daemon
14:35:55   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
14:35:55 Oct 30 14:35:55.287: INFO: Creating daemon with a node selector daemon-set
14:35:55 STEP: Initially, daemon pods should not be running on any nodes.
14:35:57 Oct 30 14:35:57.297: INFO: nodesToPodCount: map[string]int{}
14:35:57 STEP: Change label of node, check that daemon pod is launched.
14:36:01 Oct 30 14:36:01.318: INFO: nodesToPodCount: map[string]int{}
14:36:03 Oct 30 14:36:03.347: INFO: nodesToPodCount: map[string]int{}
14:36:05 Oct 30 14:36:05.319: INFO: nodesToPodCount: map[string]int{}
14:36:07 Oct 30 14:36:07.319: INFO: nodesToPodCount: map[string]int{}
14:36:09 Oct 30 14:36:09.318: INFO: nodesToPodCount: map[string]int{}
14:36:11 Oct 30 14:36:11.318: INFO: nodesToPodCount: map[string]int{}
14:36:13 Oct 30 14:36:13.350: INFO: nodesToPodCount: map[string]int{}
14:36:15 Oct 30 14:36:15.318: INFO: nodesToPodCount: map[string]int{}
14:36:17 Oct 30 14:36:17.318: INFO: nodesToPodCount: map[string]int{}
14:36:19 Oct 30 14:36:19.318: INFO: nodesToPodCount: map[string]int{}
14:36:21 Oct 30 14:36:21.318: INFO: nodesToPodCount: map[string]int{}
14:36:23 Oct 30 14:36:23.347: INFO: nodesToPodCount: map[string]int{}
14:36:25 Oct 30 14:36:25.319: INFO: nodesToPodCount: map[string]int{}
14:36:27 Oct 30 14:36:27.318: INFO: nodesToPodCount: map[string]int{}
14:36:29 Oct 30 14:36:29.318: INFO: nodesToPodCount: map[string]int{}
14:36:31 Oct 30 14:36:31.350: INFO: nodesToPodCount: map[string]int{}
14:36:33 Oct 30 14:36:33.352: INFO: nodesToPodCount: map[string]int{}
14:36:35 Oct 30 14:36:35.318: INFO: nodesToPodCount: map[string]int{}
14:36:37 Oct 30 14:36:37.318: INFO: nodesToPodCount: map[string]int{}
14:36:39 Oct 30 14:36:39.318: INFO: nodesToPodCount: map[string]int{}
14:36:41 Oct 30 14:36:41.319: INFO: nodesToPodCount: map[string]int{}
14:36:43 Oct 30 14:36:43.318: INFO: nodesToPodCount: map[string]int{}
14:36:45 Oct 30 14:36:45.318: INFO: nodesToPodCount: map[string]int{}
14:36:47 Oct 30 14:36:47.319: INFO: nodesToPodCount: map[string]int{}
14:36:49 Oct 30 14:36:49.318: INFO: nodesToPodCount: map[string]int{}
14:36:51 Oct 30 14:36:51.318: INFO: nodesToPodCount: map[string]int{}
14:36:53 Oct 30 14:36:53.318: INFO: nodesToPodCount: map[string]int{}
14:36:55 Oct 30 14:36:55.318: INFO: nodesToPodCount: map[string]int{}
14:36:57 Oct 30 14:36:57.318: INFO: nodesToPodCount: map[string]int{}
14:36:59 Oct 30 14:36:59.319: INFO: nodesToPodCount: map[string]int{}
14:36:59 Oct 30 14:36:59.321: INFO: nodesToPodCount: map[string]int{}
14:36:59 [AfterEach] Daemon set
14:36:59   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:56
14:37:05 [AfterEach] Daemon set
14:37:05   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:52
14:37:05 STEP: Collecting events from namespace "e2e-tests-daemonsets-40nfz".
14:37:05 Oct 30 14:37:05.380: INFO: POD                                          NODE                   PHASE    GRACE  CONDITIONS
14:37:05 Oct 30 14:37:05.380: INFO: elasticsearch-logging-v1-jmu9g               e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:34:21 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: elasticsearch-logging-v1-z9x37               e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:54 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: fluentd-elasticsearch-e2e-flaky-minion-7zwb  e2e-flaky-minion-7zwb  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:35:48 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: fluentd-elasticsearch-e2e-flaky-minion-lojz  e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:14 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: fluentd-elasticsearch-e2e-flaky-minion-ray5  e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:19 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: heapster-v10-crc4j                           e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:34:03 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: kibana-logging-v1-e6e7p                      e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:34:19 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: kube-dns-v9-dzbwq                            e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:54 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: kube-ui-v3-hx16m                             e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:21 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: l7-lb-controller-nvr5o                       e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:53 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: monitoring-influxdb-grafana-v2-76koa         e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:42 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: 
14:37:05 Oct 30 14:37:05.380: INFO: Waiting up to 1m0s for all nodes to be ready
14:37:05 Oct 30 14:37:05.384: INFO: Node e2e-flaky-minion-7zwb condition 1/2: type: Ready, status: True, reason: "KubeletReady", message: "kubelet is posting ready status", last transition time: 2015-10-30 14:35:08 -0700 PDT
14:37:05 Oct 30 14:37:05.384: INFO: Successfully found node e2e-flaky-minion-7zwb readiness to be true
14:37:05 Oct 30 14:37:05.384: INFO: Node e2e-flaky-minion-lojz condition 1/2: type: Ready, status: True, reason: "KubeletReady", message: "kubelet is posting ready status", last transition time: 2015-10-30 14:27:50 -0700 PDT
14:37:05 Oct 30 14:37:05.384: INFO: Successfully found node e2e-flaky-minion-lojz readiness to be true
14:37:05 Oct 30 14:37:05.384: INFO: Node e2e-flaky-minion-ray5 condition 1/2: type: Ready, status: True, reason: "KubeletReady", message: "kubelet is posting ready status", last transition time: 2015-10-30 14:27:54 -0700 PDT
14:37:05 Oct 30 14:37:05.384: INFO: Successfully found node e2e-flaky-minion-ray5 readiness to be true
14:37:05 STEP: Destroying namespace "e2e-tests-daemonsets-40nfz" for this suite.
14:37:10 
14:37:10 • Failure [83.142 seconds]
14:37:10 Daemon set
14:37:10 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:182
14:37:10   should run and stop complex daemon [It]
14:37:10   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
14:37:10 
14:37:10   error waiting for daemon pods to be running on new nodes
14:37:10   Expected error:
14:37:10       <*errors.errorString | 0xc20814d4d0>: {
14:37:10           s: "timed out waiting for the condition",
14:37:10       }
14:37:10       timed out waiting for the condition
14:37:10   not to have occurred
14:37:10 
14:37:10   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:170
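
(For readers of the log above: the long run of empty nodesToPodCount maps means the step that polls for a daemon pod on the newly labeled node never saw one before the poll window closed, which is where wait.Poll's generic "timed out waiting for the condition" error comes from. Below is a minimal sketch of that kind of check, not the actual daemon_set.go code; the function name, the use of present-day client-go, and the 2s/1m poll settings are assumptions for illustration.)

```go
// Sketch only: poll until every node in nodeNames runs exactly one pod
// matching labelSelector; on timeout, wait returns the
// "timed out waiting for the condition" error seen in the failure above.
package e2esketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func checkDaemonPodsOnNodes(c kubernetes.Interface, ns, labelSelector string, nodeNames []string) error {
	return wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: labelSelector})
		if err != nil {
			return false, err // a hard API error aborts the poll early
		}

		// Build the same kind of nodesToPodCount map that shows up in the log.
		nodesToPodCount := map[string]int{}
		for _, pod := range pods.Items {
			if pod.Status.Phase == corev1.PodRunning {
				nodesToPodCount[pod.Spec.NodeName]++
			}
		}
		fmt.Printf("nodesToPodCount: %#v\n", nodesToPodCount)

		// Done only when each labeled node has exactly one running daemon pod
		// and no pods landed anywhere else; otherwise keep polling.
		for _, name := range nodeNames {
			if nodesToPodCount[name] != 1 {
				return false, nil
			}
		}
		return len(nodesToPodCount) == len(nodeNames), nil
	})
}
```

In the failing run, the condition function keeps returning false because no matching pod ever reaches Running on the labeled node, so the wrapper surfaces the timeout error quoted in the report.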
a-robinson added the priority/important-soon, team/control-plane, and kind/flake labels on Oct 30, 2015
@mikedanese (Member)

Thanks for filing! I'll look into it.

mikedanese self-assigned this on Oct 30, 2015
wojtek-t added the priority/critical-urgent label and removed the priority/important-soon label on Nov 25, 2015
@gmarek (Contributor)

gmarek commented Nov 26, 2015

I haven't seen this test fail for quite some time. Removing it from the flaky suite.

@gmarek (Contributor)

gmarek commented Nov 27, 2015

After moving it back to the parallel build, it flakes roughly 1 in 30 runs.

cc @mikedanese @wojtek-t

gmarek reopened this on Nov 27, 2015
@ghost

ghost commented Dec 7, 2015

The last 150 runs of this test in parallel have succeeded:

http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-parallel/12529/testReport/junit/%28root%29/Kubernetes%20e2e%20suite/Daemon_set_should_run_and_stop_complex_daemon/history/

i.e. 0% flaky.

Prior to that (i.e. before 12/4), it flaked in 3 of 45 runs (7%).
I'm assuming that something merged that fixed this. Closing for now.

ghost closed this as completed on Dec 7, 2015
@janetkuo (Member)

janetkuo commented Dec 9, 2015

I've seen this test fail more than once recently. I checked the kubernetes-pull-build-test-e2e-gce test results, and this test failed 7 times in the last 150 runs.

janetkuo reopened this on Dec 9, 2015
@wojtek-t (Member)

Here is a new example of the failure:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-parallel/12635/

@mikedanese - can you please take a look?

@mikedanese (Member)

Yeah, I've spent some time looking at this today. 210 runs and no flakes yet, but I will continue to observe.

@wojtek-t (Member)

@wojtek-t (Member)

@gmarek (Contributor)

gmarek commented Dec 11, 2015

Moving to flaky. Again. #18573

@mikedanese (Member)

Thanks. I ran this 2076 times overnight and it didn't flake. I think it might be interfering with a test that's in GCE_PARALLEL and not in GCE_PARALLEL_FLAKY, hence not flaking when it's in the flaky suite. I will keep investigating today.

@wojtek-t (Member)

Yeah - I'm pretty sure it's some kind of interaction. Maybe some timeouts are too small? Or something like that?
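
(If the root cause turns out to be a too-tight window rather than a genuine scheduling problem, the usual remedy is to widen the poll settings around the daemon-pod check. A hedged sketch of where such a knob would live; dsRetryPeriod, dsRetryTimeout, and their values are assumptions here, not the test's actual identifiers.)

```go
// Sketch only: pull the retry period and timeout into one place so they can
// be tuned if the parallel suite needs more headroom.
package e2esketch

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	dsRetryPeriod  = 2 * time.Second
	dsRetryTimeout = 5 * time.Minute // wider than the roughly one minute of polling in the failing run above
)

// waitForDaemonPodsOnNewNodes runs an arbitrary readiness check in a poll
// loop and wraps a timeout in the error message seen in the failure output.
func waitForDaemonPodsOnNewNodes(check wait.ConditionFunc) error {
	if err := wait.Poll(dsRetryPeriod, dsRetryTimeout, check); err != nil {
		return fmt.Errorf("error waiting for daemon pods to be running on new nodes: %v", err)
	}
	return nil
}
```

A longer timeout only hides the flake if the daemon pods genuinely do come up later, so it's worth confirming from the kubelet and controller logs that the pod eventually started after the test gave up.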

@wojtek-t (Member)

@mikedanese - did you have a chance to look into it?

@mikedanese (Member)

This hasn't flaked in a week.

@spxtr (Contributor)

spxtr commented Dec 29, 2015

This hasn't flaked in the last 50 builds on gce-flaky or gce-parallel-flaky. Should it be moved back?

@mikedanese (Member)

@spxtr #19028

@ixdy (Member)

ixdy commented Jan 13, 2016

Flaked again on #19614 (comment):

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
error waiting for daemon pods to be running on new nodes
Expected error:
    <*errors.errorString | 0xc20819e110>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

ixdy reopened this on Jan 13, 2016
@mikedanese (Member)

This test has failed 0 times in the last 100 runs of kubernetes-e2e-gce and twice in the last 651 runs of kubernetes-e2e-gce-parallel. That's pretty good in comparison to other tests. I'm going to make this not a P0.

mikedanese added the priority/important-soon label and removed the priority/critical-urgent label on Jan 13, 2016
@mikedanese (Member)

This test remains 100% not flaky in e2e-gce and greater than 99% not flaky in gce-parallel. I'm probably going to deprioritize this until v1.2 is out the door.

mikedanese added the priority/backlog and priority/important-soon labels and removed the priority/important-soon and priority/backlog labels on Jan 21, 2016
@dchen1107 (Member)

@mikedanese (Member)

Fixed by #22219
