
e2e flake: Pod Disks tests are very flaky #26076

Closed
piosz opened this issue May 23, 2016 · 17 comments · Fixed by #26100
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@piosz
Member

piosz commented May 23, 2016

The following tests failed multiple times in both the GCE and GKE suites:

Pod Disks should schedule a pod w/two RW PDs both mounted to one container
Pod Disks should schedule a pod w/ a RW PD shared between multiple containers

The error message is the same:

```console
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
Expected error:
    <*exec.ExitError | 0xc8207d7c60>: {
        ProcessState: {
            pid: 5426,
            status: 256,
            rusage: {
                Utime: {Sec: 0, Usec: 92000},
                Stime: {Sec: 0, Usec: 16000},
                Maxrss: 29244, Ixrss: 0, Idrss: 0, Isrss: 0,
                Minflt: 2153, Majflt: 0, Nswap: 0,
                Inblock: 0, Oublock: 0,
                Msgsnd: 0, Msgrcv: 0, Nsignals: 0,
                Nvcsw: 725, Nivcsw: 5,
            },
        },
        Stderr: nil,
    }
    exit status 1
not to have occurred
```
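For context, the "Expected error ... not to have occurred" phrasing above is the standard Gomega failure message for an assertion that an error did not occur. A minimal sketch of a check of that shape (the helper name and arguments are hypothetical, not the actual test/e2e/pd.go code):

```go
// Minimal sketch only: expectKubectlExecSucceeds is a hypothetical helper, not
// the actual pd.go code; it shows the kind of assertion that produces the
// "Expected error ... not to have occurred" failure quoted above.
package e2e

import (
	"os/exec"

	. "github.com/onsi/gomega"
)

// expectKubectlExecSucceeds runs a kubectl command and fails the spec if it
// exits non-zero, which surfaces an *exec.ExitError like the one in the issue.
func expectKubectlExecSucceeds(kubectlArgs ...string) string {
	out, err := exec.Command("kubectl", kubectlArgs...).CombinedOutput()
	Expect(err).NotTo(HaveOccurred(), "kubectl output: %s", string(out))
	return string(out)
}
```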

Logs:
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5896
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5899
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5903
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/5904

https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4491
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4495
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4496
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4497
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4500
https://console.cloud.google.com/storage/browser/kubernetes-jenkins/logs/kubernetes-e2e-gke-slow/4503

@piosz piosz added the priority/critical-urgent, team/cluster, and kind/flake labels May 23, 2016
@piosz
Member Author

piosz commented May 23, 2016

Assigning @thockin for fix or triage.
cc @kubernetes/sig-storage

@rootfs
Contributor

rootfs commented May 23, 2016

I can work on this, as I am writing tests for OpenStack in pd.go.

@piosz
Member Author

piosz commented May 23, 2016

That would be great!

@thockin
Member

thockin commented May 23, 2016

Huamin,

Thanks for jumping on this. Please consider it a priority: flaky tests are killing us, and if this turns out to be real, we need it fixed.


@rootfs
Contributor

rootfs commented May 23, 2016

sure

@spxtr
Contributor

spxtr commented May 23, 2016

These two tests are so flaky that they will block the merge queue for over half of the time it takes to merge a fix. If you think the fix will take more than a day, I'd suggest moving these to the flaky suite until they're fixed.

@ixdy
Member

ixdy commented May 23, 2016

I'd suggest tagging them flaky ASAP. Whenever they get fixed they can be untagged.
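For reference, "tagging them flaky" here means adding a [Flaky] tag to the Ginkgo spec description so the regular jobs skip the test and the flaky job picks it up via --ginkgo.skip / --ginkgo.focus regexes. A minimal sketch, with the spec text taken from this issue and everything else illustrative rather than the actual pd.go change:

```go
// Minimal sketch, not the actual pd.go change: only the [Flaky] tag in the
// spec description moves the test between the regular and flaky CI jobs,
// which include or exclude specs with --ginkgo.focus='\[Flaky\]' /
// --ginkgo.skip='\[Flaky\]'.
package e2e

import . "github.com/onsi/ginkgo"

var _ = Describe("Pod Disks", func() {
	It("should schedule a pod w/two RW PDs both mounted to one container [Slow] [Flaky]", func() {
		// test body unchanged; the tag only affects which suite runs it
	})
})
```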

@pmorie
Member

pmorie commented May 23, 2016

I am okay with moving these to the flaky suite until we can get some focus on it. @childsb

@spxtr
Contributor

spxtr commented May 23, 2016

PR is #26089, will likely need a manual merge since the SQ is so blocked.

@saad-ali
Member

Hang on, this looks like it was broken by a PR merged on Sunday: #21709

Test run http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-slow/5873/ and earlier are green.
Test run http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-slow/5874/ (after this PR was merged) and later are flaky.

CC @swagiaal

I'll prepare a rollback.

@rootfs
Contributor

rootfs commented May 23, 2016

@saad-ali good catch, I'll take a look at the GCE attacher.

@rootfs
Contributor

rootfs commented May 23, 2016

@saad-ali give me some time to figure out a fix

@spxtr
Contributor

spxtr commented May 24, 2016

I think the appropriate thing to do here was to move them into the flaky suite or revert the offending PR right away, not wait several hours for the fix to pass code review and CI.

@saad-ali
Member

Agreed, this could've been handled better.

@saad-ali
Member

The PR marking the tests as flaky (#26089) has been merged. @rootfs will follow up with his PR (#26100) to see if he can fix the test; if so, he will move the tests back out of flaky.

@lavalamp
Member

Occurrences tracked automatically in #26127. I don't care if you leave this one open too.

@saad-ali
Member

Closing this in favor of the autogenerated #26127

k8s-github-robot pushed a commit that referenced this issue May 24, 2016
Automatic merge from submit-queue

In e2e tests, when kubectl exec fails to find the container to run a command, it should retry

Fixes #26076
Without retrying on a "container not found" error, the `Pod Disks` test failed with the following error:
```console
[k8s.io] Pod Disks 
  should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:108
STEP: Creating a kubernetes client
May 23 19:18:02.254: INFO: >>> TestContext.KubeConfig: /root/.kube/config

STEP: Building a namespace api object
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:69
[It] should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
STEP: creating PD1
May 23 19:18:06.678: INFO: Successfully created a new PD: "rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c".
STEP: creating PD2
May 23 19:18:11.216: INFO: Successfully created a new PD: "rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c".
May 23 19:18:11.216: INFO: PD Read/Writer Iteration #0
STEP: submitting host0Pod to kubernetes
W0523 19:18:11.279910    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: writing a file in the container
May 23 19:18:39.088: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1394466581702052925' > '/testpd1/tracker0''
May 23 19:18:40.250: INFO: Wrote value: "1394466581702052925" to PD1 ("rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: writing a file in the container
May 23 19:18:40.251: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1740704063962701662' > '/testpd2/tracker0''
May 23 19:18:41.433: INFO: Wrote value: "1740704063962701662" to PD2 ("rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: reading a file in the container
May 23 19:18:41.433: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:42.585: INFO: Read file "/testpd1/tracker0" with content: 1394466581702052925

STEP: reading a file in the container
May 23 19:18:42.585: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd2/tracker0'
May 23 19:18:43.779: INFO: Read file "/testpd2/tracker0" with content: 1740704063962701662

STEP: deleting host0Pod
May 23 19:18:44.048: INFO: PD Read/Writer Iteration #1
STEP: submitting host0Pod to kubernetes
W0523 19:18:44.132475    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: reading a file in the container
May 23 19:18:45.186: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:46.290: INFO: error running kubectl exec to read file: exit status 1
stdout=
stderr=error: error executing remote command: error executing command in container: container not found ("mycontainer")
)
May 23 19:18:46.290: INFO: Error reading file: exit status 1
May 23 19:18:46.290: INFO: Unexpected error occurred: exit status 1
```
I've now run the e2e PD test with this fix 5 times and no longer see any failures.
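A rough sketch of the retry idea described above (the helper name, the error-substring check, and the retry bounds are illustrative, not the exact #26100 change): when `kubectl exec` fails because the container is not found yet, retry for a bounded time instead of failing the test on the first attempt.

```go
// Hedged sketch only: kubectlExecWithRetry and its retry bounds are
// illustrative, not the actual #26100 code. It retries `kubectl exec` while
// the failure is the transient "container not found" error seen in the log.
package e2e

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func kubectlExecWithRetry(namespace, pod, container string, cmd ...string) (string, error) {
	args := append([]string{"exec", "--namespace=" + namespace, pod, "-c=" + container, "--"}, cmd...)
	deadline := time.Now().Add(30 * time.Second)
	for {
		out, err := exec.Command("kubectl", args...).CombinedOutput()
		if err == nil {
			return string(out), nil
		}
		// Only the transient "container not found" error is worth retrying;
		// any other failure, or running past the deadline, fails immediately.
		if !strings.Contains(string(out), "container not found") || time.Now().After(deadline) {
			return string(out), fmt.Errorf("kubectl exec failed: %v, output: %s", err, out)
		}
		time.Sleep(2 * time.Second)
	}
}
```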