e2e flake: Daemon set should run and stop complex daemon #16623

Closed
a-robinson opened this issue Oct 30, 2015 · 21 comments
Assignees: mikedanese
Labels: kind/flake, priority/important-soon

Comments

@a-robinson (Contributor)

This has been flaking pretty often lately - in 5 out of the last 11 runs of kubernetes-e2e-gce-flaky, to be specific. I've copied the output of one such failure run below.

Sorry if there's already an issue tracking this; I couldn't find anything other than #14072, which is for a different test.

@mikedanese

14:35:47 Daemon set 
14:35:47   should run and stop complex daemon
14:35:47   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
14:35:47 [BeforeEach] Daemon set
14:35:47   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:51
14:35:47 STEP: Creating a kubernetes client
14:35:47 >>> testContext.KubeConfig: /var/lib/jenkins/jobs/kubernetes-e2e-gce-flaky/workspace/.kube/config
14:35:47 STEP: Building a namespace api object
14:35:47 Oct 30 14:35:47.266: INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-daemonsets-40nfz
14:35:47 Oct 30 14:35:47.268: INFO: Get service account default in ns e2e-tests-daemonsets-40nfz failed, ignoring for 2s: serviceaccounts "default" not found
14:35:49 Oct 30 14:35:49.271: INFO: Service account default in ns e2e-tests-daemonsets-40nfz with secrets found. (2.005191042s)
14:35:49 STEP: Waiting for a default service account to be provisioned in namespace
14:35:49 Oct 30 14:35:49.271: INFO: Waiting up to 2m0s for service account default to be provisioned in ns e2e-tests-daemonsets-40nfz
14:35:49 Oct 30 14:35:49.273: INFO: Service account default in ns e2e-tests-daemonsets-40nfz with secrets found. (1.930327ms)
14:35:49 [BeforeEach] Daemon set
14:35:49   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:71
14:35:55 [It] should run and stop complex daemon
14:35:55   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
14:35:55 Oct 30 14:35:55.287: INFO: Creating daemon with a node selector daemon-set
14:35:55 STEP: Initially, daemon pods should not be running on any nodes.
14:35:57 Oct 30 14:35:57.297: INFO: nodesToPodCount: map[string]int{}
14:35:57 STEP: Change label of node, check that daemon pod is launched.
14:36:01 Oct 30 14:36:01.318: INFO: nodesToPodCount: map[string]int{}
14:36:03 Oct 30 14:36:03.347: INFO: nodesToPodCount: map[string]int{}
14:36:05 Oct 30 14:36:05.319: INFO: nodesToPodCount: map[string]int{}
14:36:07 Oct 30 14:36:07.319: INFO: nodesToPodCount: map[string]int{}
14:36:09 Oct 30 14:36:09.318: INFO: nodesToPodCount: map[string]int{}
14:36:11 Oct 30 14:36:11.318: INFO: nodesToPodCount: map[string]int{}
14:36:13 Oct 30 14:36:13.350: INFO: nodesToPodCount: map[string]int{}
14:36:15 Oct 30 14:36:15.318: INFO: nodesToPodCount: map[string]int{}
14:36:17 Oct 30 14:36:17.318: INFO: nodesToPodCount: map[string]int{}
14:36:19 Oct 30 14:36:19.318: INFO: nodesToPodCount: map[string]int{}
14:36:21 Oct 30 14:36:21.318: INFO: nodesToPodCount: map[string]int{}
14:36:23 Oct 30 14:36:23.347: INFO: nodesToPodCount: map[string]int{}
14:36:25 Oct 30 14:36:25.319: INFO: nodesToPodCount: map[string]int{}
14:36:27 Oct 30 14:36:27.318: INFO: nodesToPodCount: map[string]int{}
14:36:29 Oct 30 14:36:29.318: INFO: nodesToPodCount: map[string]int{}
14:36:31 Oct 30 14:36:31.350: INFO: nodesToPodCount: map[string]int{}
14:36:33 Oct 30 14:36:33.352: INFO: nodesToPodCount: map[string]int{}
14:36:35 Oct 30 14:36:35.318: INFO: nodesToPodCount: map[string]int{}
14:36:37 Oct 30 14:36:37.318: INFO: nodesToPodCount: map[string]int{}
14:36:39 Oct 30 14:36:39.318: INFO: nodesToPodCount: map[string]int{}
14:36:41 Oct 30 14:36:41.319: INFO: nodesToPodCount: map[string]int{}
14:36:43 Oct 30 14:36:43.318: INFO: nodesToPodCount: map[string]int{}
14:36:45 Oct 30 14:36:45.318: INFO: nodesToPodCount: map[string]int{}
14:36:47 Oct 30 14:36:47.319: INFO: nodesToPodCount: map[string]int{}
14:36:49 Oct 30 14:36:49.318: INFO: nodesToPodCount: map[string]int{}
14:36:51 Oct 30 14:36:51.318: INFO: nodesToPodCount: map[string]int{}
14:36:53 Oct 30 14:36:53.318: INFO: nodesToPodCount: map[string]int{}
14:36:55 Oct 30 14:36:55.318: INFO: nodesToPodCount: map[string]int{}
14:36:57 Oct 30 14:36:57.318: INFO: nodesToPodCount: map[string]int{}
14:36:59 Oct 30 14:36:59.319: INFO: nodesToPodCount: map[string]int{}
14:36:59 Oct 30 14:36:59.321: INFO: nodesToPodCount: map[string]int{}
14:36:59 [AfterEach] Daemon set
14:36:59   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:56
14:37:05 [AfterEach] Daemon set
14:37:05   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:52
14:37:05 STEP: Collecting events from namespace "e2e-tests-daemonsets-40nfz".
14:37:05 Oct 30 14:37:05.380: INFO: POD                                          NODE                   PHASE    GRACE  CONDITIONS
14:37:05 Oct 30 14:37:05.380: INFO: elasticsearch-logging-v1-jmu9g               e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:34:21 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: elasticsearch-logging-v1-z9x37               e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:54 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: fluentd-elasticsearch-e2e-flaky-minion-7zwb  e2e-flaky-minion-7zwb  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:35:48 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: fluentd-elasticsearch-e2e-flaky-minion-lojz  e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:14 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: fluentd-elasticsearch-e2e-flaky-minion-ray5  e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:19 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: heapster-v10-crc4j                           e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:34:03 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: kibana-logging-v1-e6e7p                      e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:34:19 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: kube-dns-v9-dzbwq                            e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:54 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: kube-ui-v3-hx16m                             e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:21 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: l7-lb-controller-nvr5o                       e2e-flaky-minion-ray5  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:53 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: monitoring-influxdb-grafana-v2-76koa         e2e-flaky-minion-lojz  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2015-10-30 14:28:42 -0700 PDT  }]
14:37:05 Oct 30 14:37:05.380: INFO: 
14:37:05 Oct 30 14:37:05.380: INFO: Waiting up to 1m0s for all nodes to be ready
14:37:05 Oct 30 14:37:05.384: INFO: Node e2e-flaky-minion-7zwb condition 1/2: type: Ready, status: True, reason: "KubeletReady", message: "kubelet is posting ready status", last transition time: 2015-10-30 14:35:08 -0700 PDT
14:37:05 Oct 30 14:37:05.384: INFO: Successfully found node e2e-flaky-minion-7zwb readiness to be true
14:37:05 Oct 30 14:37:05.384: INFO: Node e2e-flaky-minion-lojz condition 1/2: type: Ready, status: True, reason: "KubeletReady", message: "kubelet is posting ready status", last transition time: 2015-10-30 14:27:50 -0700 PDT
14:37:05 Oct 30 14:37:05.384: INFO: Successfully found node e2e-flaky-minion-lojz readiness to be true
14:37:05 Oct 30 14:37:05.384: INFO: Node e2e-flaky-minion-ray5 condition 1/2: type: Ready, status: True, reason: "KubeletReady", message: "kubelet is posting ready status", last transition time: 2015-10-30 14:27:54 -0700 PDT
14:37:05 Oct 30 14:37:05.384: INFO: Successfully found node e2e-flaky-minion-ray5 readiness to be true
14:37:05 STEP: Destroying namespace "e2e-tests-daemonsets-40nfz" for this suite.
14:37:10 
14:37:10 • Failure [83.142 seconds]
14:37:10 Daemon set
14:37:10 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:182
14:37:10   should run and stop complex daemon [It]
14:37:10   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
14:37:10 
14:37:10   error waiting for daemon pods to be running on new nodes
14:37:10   Expected error:
14:37:10       <*errors.errorString | 0xc20814d4d0>: {
14:37:10           s: "timed out waiting for the condition",
14:37:10       }
14:37:10       timed out waiting for the condition
14:37:10   not to have occurred
14:37:10 
14:37:10   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:170
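
(For readers of the log above: the long run of empty nodesToPodCount maps means the step that polls for a daemon pod on the newly labeled node never saw one before the poll window closed, which is where wait.Poll's generic "timed out waiting for the condition" error comes from. Below is a minimal sketch of that kind of check, not the actual daemon_set.go code; the function name, the use of present-day client-go, and the 2s/1m poll settings are assumptions for illustration.)

```go
// Sketch only: poll until every node in nodeNames runs exactly one pod
// matching labelSelector; on timeout, wait returns the
// "timed out waiting for the condition" error seen in the failure above.
package e2esketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func checkDaemonPodsOnNodes(c kubernetes.Interface, ns, labelSelector string, nodeNames []string) error {
	return wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: labelSelector})
		if err != nil {
			return false, err // a hard API error aborts the poll early
		}

		// Build the same kind of nodesToPodCount map that shows up in the log.
		nodesToPodCount := map[string]int{}
		for _, pod := range pods.Items {
			if pod.Status.Phase == corev1.PodRunning {
				nodesToPodCount[pod.Spec.NodeName]++
			}
		}
		fmt.Printf("nodesToPodCount: %#v\n", nodesToPodCount)

		// Done only when each labeled node has exactly one running daemon pod
		// and no pods landed anywhere else; otherwise keep polling.
		for _, name := range nodeNames {
			if nodesToPodCount[name] != 1 {
				return false, nil
			}
		}
		return len(nodesToPodCount) == len(nodeNames), nil
	})
}
```

In the failing run, the condition function keeps returning false because no matching pod ever reaches Running on the labeled node, so the wrapper surfaces the timeout error quoted in the report.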
a-robinson added the priority/important-soon, team/control-plane, and kind/flake labels on Oct 30, 2015
@mikedanese (Member)

Thanks for filing! I'll look into it.

mikedanese self-assigned this on Oct 30, 2015
wojtek-t added the priority/critical-urgent label and removed the priority/important-soon label on Nov 25, 2015
@gmarek (Contributor)

gmarek commented Nov 26, 2015

I haven't seen this test fail for quite some time. Removing it from the flaky suite.

@gmarek (Contributor)

gmarek commented Nov 27, 2015

After moving it back to the parallel build, it flakes roughly 1 in 30 runs.

cc @mikedanese @wojtek-t

gmarek reopened this on Nov 27, 2015
@ghost

ghost commented Dec 7, 2015

The last 150 runs of this test in parallel have succeeded:

http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-parallel/12529/testReport/junit/%28root%29/Kubernetes%20e2e%20suite/Daemon_set_should_run_and_stop_complex_daemon/history/

i.e. 0% flaky.

Prior to that (i.e. before 12/4), it flaked in 3 of 45 runs (7%).
I'm assuming that something merged that fixed this. Closing for now.

ghost closed this as completed on Dec 7, 2015
@janetkuo (Member)

janetkuo commented Dec 9, 2015

I've seen this test fail more than once recently. I checked the kubernetes-pull-build-test-e2e-gce test results, and this test failed 7 times in the last 150 runs.

janetkuo reopened this on Dec 9, 2015
@wojtek-t (Member)

Here is a new example of the failure:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-parallel/12635/

@mikedanese - can you please take a look?

@mikedanese (Member)

Yeah, I've spent some time looking at this today. 210 runs and no flakes yet, but I will continue to observe.

@wojtek-t (Member)

@wojtek-t (Member)

@gmarek (Contributor)

gmarek commented Dec 11, 2015

Moving to flaky. Again. #18573

@mikedanese (Member)

Thanks. I ran this 2076 times overnight and it didn't flake. I think it might be interfering with a test that's in GCE_PARALLEL and not in GCE_PARALLEL_FLAKY, hence not flaking when it's in the flaky suite. I will keep investigating today.

@wojtek-t (Member)

Yeah - I'm pretty sure it's some kind of interaction. Maybe some timeouts are too small? Or something like that?
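
(If the root cause turns out to be a too-tight window rather than a genuine scheduling problem, the usual remedy is to widen the poll settings around the daemon-pod check. A hedged sketch of where such a knob would live; dsRetryPeriod, dsRetryTimeout, and their values are assumptions here, not the test's actual identifiers.)

```go
// Sketch only: pull the retry period and timeout into one place so they can
// be tuned if the parallel suite needs more headroom.
package e2esketch

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	dsRetryPeriod  = 2 * time.Second
	dsRetryTimeout = 5 * time.Minute // wider than the roughly one minute of polling in the failing run above
)

// waitForDaemonPodsOnNewNodes runs an arbitrary readiness check in a poll
// loop and wraps a timeout in the error message seen in the failure output.
func waitForDaemonPodsOnNewNodes(check wait.ConditionFunc) error {
	if err := wait.Poll(dsRetryPeriod, dsRetryTimeout, check); err != nil {
		return fmt.Errorf("error waiting for daemon pods to be running on new nodes: %v", err)
	}
	return nil
}
```

A longer timeout only hides the flake if the daemon pods genuinely do come up later, so it's worth confirming from the kubelet and controller logs that the pod eventually started after the test gave up.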

@wojtek-t (Member)

@mikedanese - did you have a chance to look into it?

@mikedanese (Member)

This hasn't flaked in a week.

@spxtr (Contributor)

spxtr commented Dec 29, 2015

This hasn't flaked in the last 50 builds on gce-flaky or gce-parallel-flaky. Should it be moved back?

@mikedanese (Member)

@spxtr #19028

@ixdy (Member)

ixdy commented Jan 13, 2016

Flaked again on #19614 (comment):

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_set.go:181
error waiting for daemon pods to be running on new nodes
Expected error:
    <*errors.errorString | 0xc20819e110>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
not to have occurred

ixdy reopened this on Jan 13, 2016
@mikedanese (Member)

This test has failed 0 times in the last 100 runs of kubernetes-e2e-gce and twice in the last 651 runs of kubernetes-e2e-gce-parallel. That's pretty good in comparison to other tests. I'm going to make this not a P0.

mikedanese added the priority/important-soon label and removed the priority/critical-urgent label on Jan 13, 2016
@mikedanese (Member)

This test remains 100% not flaky in e2e-gce and greater than 99% not flaky in gce-parallel. I'm probably going to deprioritize this until v1.2 is out the door.

mikedanese added the priority/backlog and priority/important-soon labels and removed the priority/important-soon and priority/backlog labels on Jan 21, 2016
@dchen1107 (Member)

@mikedanese (Member)

Fixed by #22219
