
in e2e test, when kubectl exec fails to find the container to run a command, it should retry #26100

Merged: 1 commit into kubernetes:master on May 24, 2016

Conversation

@rootfs (Contributor) commented May 23, 2016

Fixes #26076.
Without retrying on a "container not found" error, the Pod Disks test failed with the following error:

[k8s.io] Pod Disks 
  should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:108
STEP: Creating a kubernetes client
May 23 19:18:02.254: INFO: >>> TestContext.KubeConfig: /root/.kube/config

STEP: Building a namespace api object
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:69
[It] should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
STEP: creating PD1
May 23 19:18:06.678: INFO: Successfully created a new PD: "rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c".
STEP: creating PD2
May 23 19:18:11.216: INFO: Successfully created a new PD: "rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c".
May 23 19:18:11.216: INFO: PD Read/Writer Iteration #0
STEP: submitting host0Pod to kubernetes
W0523 19:18:11.279910    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: writing a file in the container
May 23 19:18:39.088: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1394466581702052925' > '/testpd1/tracker0''
May 23 19:18:40.250: INFO: Wrote value: "1394466581702052925" to PD1 ("rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: writing a file in the container
May 23 19:18:40.251: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1740704063962701662' > '/testpd2/tracker0''
May 23 19:18:41.433: INFO: Wrote value: "1740704063962701662" to PD2 ("rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: reading a file in the container
May 23 19:18:41.433: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:42.585: INFO: Read file "/testpd1/tracker0" with content: 1394466581702052925

STEP: reading a file in the container
May 23 19:18:42.585: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd2/tracker0'
May 23 19:18:43.779: INFO: Read file "/testpd2/tracker0" with content: 1740704063962701662

STEP: deleting host0Pod
May 23 19:18:44.048: INFO: PD Read/Writer Iteration #1
STEP: submitting host0Pod to kubernetes
W0523 19:18:44.132475    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: reading a file in the container
May 23 19:18:45.186: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:46.290: INFO: error running kubectl exec to read file: exit status 1
stdout=
stderr=error: error executing remote command: error executing command in container: container not found ("mycontainer")
)
May 23 19:18:46.290: INFO: Error reading file: exit status 1
May 23 19:18:46.290: INFO: Unexpected error occurred: exit status 1

I've now run the e2e PD test with this fix 5 times and no longer see any failures.
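For readers skimming the thread, here is a minimal, self-contained sketch of the retry idea. It is not the actual kubectlExecWithRetry change; the execWithRetry helper, the fake exec, and the retry bound are purely illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// execWithRetry keeps re-running execFn while stderr reports
// "container not found", which shows up transiently while a pod with the
// same name is being torn down and recreated.
func execWithRetry(maxRetries int, execFn func() (stdout, stderr []byte, err error)) ([]byte, []byte, error) {
	var stdout, stderr []byte
	var err error
	for i := 0; i < maxRetries; i++ {
		stdout, stderr, err = execFn()
		if err != nil && strings.Contains(strings.ToLower(string(stderr)), "container not found") {
			fmt.Printf("Warning: kubectl exec hit 'container not found', retrying (attempt %d)\n", i+1)
			continue
		}
		break
	}
	return stdout, stderr, err
}

func main() {
	attempts := 0
	// Fake exec that fails twice with the error from the log above, then succeeds.
	fakeExec := func() ([]byte, []byte, error) {
		attempts++
		if attempts < 3 {
			return nil, []byte(`error executing command in container: container not found ("mycontainer")`), fmt.Errorf("exit status 1")
		}
		return []byte("1394466581702052925"), nil, nil
	}
	out, _, err := execWithRetry(5, fakeExec)
	fmt.Printf("out=%q err=%v attempts=%d\n", out, err, attempts)
}
```

Running it prints the value on the third attempt; the real kubectlExecWithRetry shells out to kubectl exec rather than calling a fake function.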

@saad-ali (Member)

LGTM. Thanks @rootfs

@saad-ali added the lgtm and priority/critical-urgent labels on May 23, 2016
@k8s-github-robot added the size/XS and release-note-label-needed labels and removed the lgtm label on May 23, 2016
@saad-ali assigned saad-ali and unassigned ixdy on May 23, 2016
@saad-ali added the release-note-none and lgtm labels and removed the release-note-label-needed label on May 23, 2016
@ncdc (Member) commented May 23, 2016

kubectl exec won't work unless the pod is Running. How is it that the pod is Running but a container can't be found?

@rootfs

@@ -477,6 +477,13 @@ func kubectlExecWithRetry(namespace string, podName, containerName string, args
continue
}
}
if err != nil {
Review comment (Member):

There's already a check for a non-nil error right before this. I would move the strings.Contains check you're adding into that block instead of adding another one, assuming we decide to keep this.
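One way to follow this suggestion, sketched here as a hedged illustration rather than the merged code, is to put the retryability decision in a small predicate that the existing non-nil-error branch calls. Both message fragments come from this thread; the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// retryableKubectlExecError reports whether a failed kubectl exec looks
// transient. Folding the decision into one helper keeps the retry loop's
// single err != nil branch simple, in the spirit of the suggestion above.
// Illustrative only, not the merged code.
func retryableKubectlExecError(stderr []byte) bool {
	s := strings.ToLower(string(stderr))
	return strings.Contains(s, "i/o timeout") || strings.Contains(s, "container not found")
}

func main() {
	fmt.Println(retryableKubectlExecError([]byte(`container not found ("mycontainer")`))) // true
	fmt.Println(retryableKubectlExecError([]byte("permission denied")))                   // false
}
```

Whether "i/o timeout" should also trigger the back-off added later is debated further down the thread.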

@saad-ali removed the lgtm label on May 23, 2016
@saad-ali (Member)
From @ncdc on kubernetes-dev Slack:

this is potentially because exec doesn’t yet support looking a pod up by uid
[2:13] the test creates, deletes, recreates the same pod name multiple times
[2:13] i think what’s happening is exec is finding the wrong pod while the old one is being torn down, maybe?
[2:14] we could change the test to vary the pod name each time through the loop
[2:15] this sets up the pod https://github.com/kubernetes/kubernetes/blob/master/test/e2e/pd.go#L179
it does assign it a name that includes a random uuid, but that is only ever set up once, outside of the for loop
[2:16] [...] i’d maybe suggest trying this instead of retrying exec on ‘container not found’

This indicates that the pod is being torn down right after WaitForPodRunningSlow succeeds and right before WriteFileViaContainer, which is a symptom of the race conditions between the orphaned-volumes loop and the pod-creation loop that were held together with glue.

This fix will mask the issue by retrying until the pod with the same name is recreated. But it does indicate that #26076 destabilized the PD mount/unmount code.

To unblock the submit queue, I'll LGTM this.

I am working on a kubelet volume manager that will fix the actual races. Along with that change, I will revert this change, verify the races are fixed, and modify the tests further as needed.
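For reference, here is a tiny hypothetical sketch of the alternative ncdc floats above: generate a fresh pod name on each pass through the test loop instead of reusing one name for the whole test. The base UUID is taken from the log earlier in this thread, and the loop body is a placeholder for the real pd.go test steps.

```go
package main

import "fmt"

// Hypothetical sketch only: a unique pod name per iteration means kubectl exec
// can never resolve a previous incarnation that is still being torn down.
func main() {
	const baseID = "16d3653c-211b-11e6-a3ff-b8ca3a62792c" // example UUID from the log above
	const numIterations = 3
	for i := 0; i < numIterations; i++ {
		podName := fmt.Sprintf("pd-test-%s-%d", baseID, i)
		fmt.Println("create, write/read via exec, then delete:", podName)
	}
}
```

A unique name per iteration removes the ambiguity exec hits while the previous pod of the same name is still terminating.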

@saad-ali added the lgtm label on May 23, 2016
@saad-ali (Member)

@k8s-oncall Could you please manually merge this?

	if strings.Contains(strings.ToLower(string(stdErrBytes)), "container not found") {
		// Retry on "container not found" errors
		Logf("Warning: kubectl exec encountered container not found.\nerr=%v\nstdout=%v\nstderr=%v)", err, string(stdOutBytes), string(stdErrBytes))
		continue
Review comment (Member):

should there be a sleep before trying again?

Reply (Member):

Yes, there should

@saad-ali removed the lgtm label on May 24, 2016
@saad-ali (Member)

@rootfs, could you address the open comments? In the meantime, we're going to move the tests to flaky (#26076).

Commit added: "…ould retry"
Signed-off-by: Huamin Chen <hchen@redhat.com>
@rootfs (Contributor, Author) commented May 24, 2016

@saad-ali, feedback addressed. I have seen the PD test run 10 times over 3 hours without any failure.

	if strings.Contains(strings.ToLower(string(stdErrBytes)), "container not found") {
		// Retry on "container not found" errors
		Logf("Warning: kubectl exec encountered container not found.\nerr=%v\nstdout=%v\nstderr=%v)", err, string(stdOutBytes), string(stdErrBytes))
		time.Sleep(2 * time.Second)
Review comment (Member):

Sleep should happen for both "i/o timeout" and "container not found", no?

Reply (Contributor, Author):

I haven't seen any i/o timeout yet; if it happens, I am not sure a sleep would help.

Reply (Member):

Ack
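As an aside, and not what this PR merges, the same back-off-and-retry covering both transient messages discussed above could also be expressed with the wait.Poll helper from the Kubernetes util package. The import path, the 1-minute timeout, and the runExec stand-in below are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"time"

	"k8s.io/kubernetes/pkg/util/wait"
)

// runExec is a hypothetical stand-in for shelling out to kubectl exec and
// capturing its output, as the real e2e helper does.
func runExec() (stdout, stderr []byte, err error) {
	return []byte("ok"), nil, nil
}

// execWithPoll retries the exec on either transient message using wait.Poll
// instead of a manual sleep-and-continue loop. Illustrative only.
func execWithPoll() (string, string, error) {
	var out, errOut []byte
	var execErr error
	pollErr := wait.Poll(2*time.Second, 1*time.Minute, func() (bool, error) {
		out, errOut, execErr = runExec()
		if execErr != nil {
			s := strings.ToLower(string(errOut))
			if strings.Contains(s, "i/o timeout") || strings.Contains(s, "container not found") {
				return false, nil // transient: poll again after the interval
			}
		}
		return true, nil // success or a non-retryable error: stop polling
	})
	if pollErr != nil {
		return string(out), string(errOut), pollErr
	}
	return string(out), string(errOut), execErr
}

func main() {
	stdout, _, err := execWithPoll()
	fmt.Println(stdout, err)
}
```

wait.Poll waits the interval before each attempt, so the explicit time.Sleep goes away.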

@saad-ali (Member)

LGTM

@saad-ali added the lgtm label on May 24, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented May 24, 2016

GCE e2e build/test passed for commit 56719f8.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot merged commit db74335 into kubernetes:master on May 24, 2016
Labels
lgtm, priority/critical-urgent, release-note-none, size/XS
Development
Successfully merging this pull request may close this issue: e2e flake: Pod Disks tests are very flaky (#26076)
7 participants