
e2e flake: Pod Disks tests are very flaky #26076

Closed
piosz opened this issue May 23, 2016 · 17 comments · Fixed by #26100
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@piosz
Member

piosz commented May 23, 2016

The following tests failed multiple times in both the GCE and GKE suites:

Pod Disks should schedule a pod w/two RW PDs both mounted to one container
Pod Disks should schedule a pod w/ a RW PD shared between multiple containers

The error message is the same:

```console
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
Expected error:
    <*exec.ExitError | 0xc8207d7c60>: {
        ProcessState: {
            pid: 5426,
            status: 256,
            rusage: {
                Utime: {Sec: 0, Usec: 92000},
                Stime: {Sec: 0, Usec: 16000},
                Maxrss: 29244, Ixrss: 0, Idrss: 0, Isrss: 0,
                Minflt: 2153, Majflt: 0, Nswap: 0,
                Inblock: 0, Oublock: 0,
                Msgsnd: 0, Msgrcv: 0, Nsignals: 0,
                Nvcsw: 725, Nivcsw: 5,
            },
        },
        Stderr: nil,
    }
    exit status 1
not to have occurred
```
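For context, the "Expected error ... not to have occurred" phrasing above is the standard Gomega failure message for an assertion that an error did not occur. A minimal sketch of a check of that shape (the helper name and arguments are hypothetical, not the actual test/e2e/pd.go code):

```go
// Minimal sketch only: expectKubectlExecSucceeds is a hypothetical helper, not
// the actual pd.go code; it shows the kind of assertion that produces the
// "Expected error ... not to have occurred" failure quoted above.
package e2e

import (
	"os/exec"

	. "github.com/onsi/gomega"
)

// expectKubectlExecSucceeds runs a kubectl command and fails the spec if it
// exits non-zero, which surfaces an *exec.ExitError like the one in the issue.
func expectKubectlExecSucceeds(kubectlArgs ...string) string {
	out, err := exec.Command("kubectl", kubectlArgs...).CombinedOutput()
	Expect(err).NotTo(HaveOccurred(), "kubectl output: %s", string(out))
	return string(out)
}
```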

Logs:
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5896
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5899
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5903
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5904

https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4491
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4495
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4496
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4497
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4500
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4503

@piosz piosz added the priority/critical-urgent, team/cluster, and kind/flake labels May 23, 2016
@piosz
Member Author

piosz commented May 23, 2016

Assigning @thockin for fix or triage.
cc @kubernetes/sig-storage

@rootfs
Contributor

rootfs commented May 23, 2016

I can work on this, as I am writing tests for OpenStack in pd.go.

@piosz
Member Author

piosz commented May 23, 2016

That would be great!

@thockin
Member

thockin commented May 23, 2016

Huamin,

Thanks for jumping on this. Please consider it a priority: flaky tests are killing us, and if this turns out to be real, we need it fixed.


@rootfs
Contributor

rootfs commented May 23, 2016

sure

@spxtr
Contributor

spxtr commented May 23, 2016

These two tests are so flaky that they will block the merge queue for over half of the time it takes to merge a fix. If you think the fix will take more than a day, I'd suggest moving these to the flaky suite until they're fixed.

@ixdy
Member

ixdy commented May 23, 2016

I'd suggest tagging them flaky ASAP. Whenever they get fixed they can be untagged.
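For reference, "tagging them flaky" here means adding a [Flaky] tag to the Ginkgo spec description so the regular jobs skip the test and the flaky job picks it up via --ginkgo.skip / --ginkgo.focus regexes. A minimal sketch, with the spec text taken from this issue and everything else illustrative rather than the actual pd.go change:

```go
// Minimal sketch, not the actual pd.go change: only the [Flaky] tag in the
// spec description moves the test between the regular and flaky CI jobs,
// which include or exclude specs with --ginkgo.focus='\[Flaky\]' /
// --ginkgo.skip='\[Flaky\]'.
package e2e

import . "github.com/onsi/ginkgo"

var _ = Describe("Pod Disks", func() {
	It("should schedule a pod w/two RW PDs both mounted to one container [Slow] [Flaky]", func() {
		// test body unchanged; the tag only affects which suite runs it
	})
})
```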

@pmorie
Member

pmorie commented May 23, 2016

I am okay with moving these to the flaky suite until we can get some focus on it. @childsb

@spxtr
Contributor

spxtr commented May 23, 2016

PR is #26089, will likely need a manual merge since the SQ is so blocked.

@saad-ali
Member

Hang on, this looks like it was broken by a PR merged on Sunday: #21709

Test run http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-slow/5873/ and earlier are green.
Test run http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-slow/5874/ (after this PR was merged) and later are flaky.

CC @swagiaal

I'll prepare a rollback.

@rootfs
Contributor

rootfs commented May 23, 2016

@saad-ali good catch, I'll take a look at the GCE attacher.

@rootfs
Contributor

rootfs commented May 23, 2016

@saad-ali give me some time to figure out a fix

@spxtr
Contributor

spxtr commented May 24, 2016

I think the appropriate thing to do here was to move them into the flaky suite or revert the offending PR right away, not wait several hours for the fix to pass code review and CI.

@saad-ali
Member

Agreed, this could've been handled better.

@saad-ali
Member

The PR marking the tests as flaky (#26089) has been merged. @rootfs will follow up with his PR (#26100) to see if he can fix the test; if so, he will move the tests back out of flaky.

@lavalamp
Member

Occurrences tracked automatically in #26127. I don't care if you leave this one open too.

@saad-ali
Member

Closing this in favor of the autogenerated #26127

k8s-github-robot pushed a commit that referenced this issue May 24, 2016
Automatic merge from submit-queue

In e2e tests, when kubectl exec fails to find the container to run a command, it should retry

Fixes #26076
Without retrying on a "container not found" error, the `Pod Disks` test failed with the following error:
```console
[k8s.io] Pod Disks 
  should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:108
STEP: Creating a kubernetes client
May 23 19:18:02.254: INFO: >>> TestContext.KubeConfig: /root/.kube/config

STEP: Building a namespace api object
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:69
[It] should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
STEP: creating PD1
May 23 19:18:06.678: INFO: Successfully created a new PD: "rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c".
STEP: creating PD2
May 23 19:18:11.216: INFO: Successfully created a new PD: "rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c".
May 23 19:18:11.216: INFO: PD Read/Writer Iteration #0
STEP: submitting host0Pod to kubernetes
W0523 19:18:11.279910    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: writing a file in the container
May 23 19:18:39.088: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1394466581702052925' > '/testpd1/tracker0''
May 23 19:18:40.250: INFO: Wrote value: "1394466581702052925" to PD1 ("rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: writing a file in the container
May 23 19:18:40.251: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1740704063962701662' > '/testpd2/tracker0''
May 23 19:18:41.433: INFO: Wrote value: "1740704063962701662" to PD2 ("rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: reading a file in the container
May 23 19:18:41.433: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:42.585: INFO: Read file "/testpd1/tracker0" with content: 1394466581702052925

STEP: reading a file in the container
May 23 19:18:42.585: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd2/tracker0'
May 23 19:18:43.779: INFO: Read file "/testpd2/tracker0" with content: 1740704063962701662

STEP: deleting host0Pod
May 23 19:18:44.048: INFO: PD Read/Writer Iteration #1
STEP: submitting host0Pod to kubernetes
W0523 19:18:44.132475    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: reading a file in the container
May 23 19:18:45.186: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:46.290: INFO: error running kubectl exec to read file: exit status 1
stdout=
stderr=error: error executing remote command: error executing command in container: container not found ("mycontainer")
)
May 23 19:18:46.290: INFO: Error reading file: exit status 1
May 23 19:18:46.290: INFO: Unexpected error occurred: exit status 1
```
I've now run the e2e PD test with this fix 5 times and no longer see any failures.
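A rough sketch of the retry idea described above (the helper name, the error-substring check, and the retry bounds are illustrative, not the exact #26100 change): when `kubectl exec` fails because the container is not found yet, retry for a bounded time instead of failing the test on the first attempt.

```go
// Hedged sketch only: kubectlExecWithRetry and its retry bounds are
// illustrative, not the actual #26100 code. It retries `kubectl exec` while
// the failure is the transient "container not found" error seen in the log.
package e2e

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func kubectlExecWithRetry(namespace, pod, container string, cmd ...string) (string, error) {
	args := append([]string{"exec", "--namespace=" + namespace, pod, "-c=" + container, "--"}, cmd...)
	deadline := time.Now().Add(30 * time.Second)
	for {
		out, err := exec.Command("kubectl", args...).CombinedOutput()
		if err == nil {
			return string(out), nil
		}
		// Only the transient "container not found" error is worth retrying;
		// any other failure, or running past the deadline, fails immediately.
		if !strings.Contains(string(out), "container not found") || time.Now().After(deadline) {
			return string(out), fmt.Errorf("kubectl exec failed: %v, output: %s", err, out)
		}
		time.Sleep(2 * time.Second)
	}
}
```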