
in e2e test, when kubectl exec fails to find the container to run a command, it should retry #26100

Merged: 1 commit into kubernetes:master on May 24, 2016

Conversation

@rootfs (Contributor) commented May 23, 2016

Fixes #26076.
Without retrying on a "container not found" error, the Pod Disks test failed with the following error:

[k8s.io] Pod Disks 
  should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:108
STEP: Creating a kubernetes client
May 23 19:18:02.254: INFO: >>> TestContext.KubeConfig: /root/.kube/config

STEP: Building a namespace api object
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Pod Disks
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:69
[It] should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:271
STEP: creating PD1
May 23 19:18:06.678: INFO: Successfully created a new PD: "rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c".
STEP: creating PD2
May 23 19:18:11.216: INFO: Successfully created a new PD: "rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c".
May 23 19:18:11.216: INFO: PD Read/Writer Iteration #0
STEP: submitting host0Pod to kubernetes
W0523 19:18:11.279910    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: writing a file in the container
May 23 19:18:39.088: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1394466581702052925' > '/testpd1/tracker0''
May 23 19:18:40.250: INFO: Wrote value: "1394466581702052925" to PD1 ("rootfs-e2e-11dd5f5b-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: writing a file in the container
May 23 19:18:40.251: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- /bin/sh -c echo '1740704063962701662' > '/testpd2/tracker0''
May 23 19:18:41.433: INFO: Wrote value: "1740704063962701662" to PD2 ("rootfs-e2e-141f062d-211b-11e6-a3ff-b8ca3a62792c") from pod "pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c" container "mycontainer"
STEP: reading a file in the container
May 23 19:18:41.433: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:42.585: INFO: Read file "/testpd1/tracker0" with content: 1394466581702052925

STEP: reading a file in the container
May 23 19:18:42.585: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd2/tracker0'
May 23 19:18:43.779: INFO: Read file "/testpd2/tracker0" with content: 1740704063962701662

STEP: deleting host0Pod
May 23 19:18:44.048: INFO: PD Read/Writer Iteration #1
STEP: submitting host0Pod to kubernetes
W0523 19:18:44.132475    4984 request.go:347] Field selector: v1 - pods - metadata.name - pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c: need to check if this is versioned correctly.
STEP: reading a file in the container
May 23 19:18:45.186: INFO: Running '/srv/dev/kubernetes/_output/dockerized/bin/linux/amd64/kubectl kubectl --server=https://130.211.199.187 --kubeconfig=/root/.kube/config exec --namespace=e2e-tests-pod-disks-3t3g8 pd-test-16d3653c-211b-11e6-a3ff-b8ca3a62792c -c=mycontainer -- cat /testpd1/tracker0'
May 23 19:18:46.290: INFO: error running kubectl exec to read file: exit status 1
stdout=
stderr=error: error executing remote command: error executing command in container: container not found ("mycontainer")
)
May 23 19:18:46.290: INFO: Error reading file: exit status 1
May 23 19:18:46.290: INFO: Unexpected error occurred: exit status 1

I've now run the e2e PD test with this fix 5 times and no longer see any failures.
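For readers skimming the thread, here is a minimal, self-contained sketch of the retry idea. It is not the actual kubectlExecWithRetry change; the execWithRetry helper, the fake exec, and the retry bound are purely illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// execWithRetry keeps re-running execFn while stderr reports
// "container not found", which shows up transiently while a pod with the
// same name is being torn down and recreated.
func execWithRetry(maxRetries int, execFn func() (stdout, stderr []byte, err error)) ([]byte, []byte, error) {
	var stdout, stderr []byte
	var err error
	for i := 0; i < maxRetries; i++ {
		stdout, stderr, err = execFn()
		if err != nil && strings.Contains(strings.ToLower(string(stderr)), "container not found") {
			fmt.Printf("Warning: kubectl exec hit 'container not found', retrying (attempt %d)\n", i+1)
			continue
		}
		break
	}
	return stdout, stderr, err
}

func main() {
	attempts := 0
	// Fake exec that fails twice with the error from the log above, then succeeds.
	fakeExec := func() ([]byte, []byte, error) {
		attempts++
		if attempts < 3 {
			return nil, []byte(`error executing command in container: container not found ("mycontainer")`), fmt.Errorf("exit status 1")
		}
		return []byte("1394466581702052925"), nil, nil
	}
	out, _, err := execWithRetry(5, fakeExec)
	fmt.Printf("out=%q err=%v attempts=%d\n", out, err, attempts)
}
```

Running it prints the value on the third attempt; the real kubectlExecWithRetry shells out to kubectl exec rather than calling a fake function.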

@saad-ali (Member)

LGTM. Thanks @rootfs

@saad-ali added the lgtm and priority/critical-urgent labels on May 23, 2016
@k8s-github-robot added the size/XS and release-note-label-needed labels and removed the lgtm label on May 23, 2016
@saad-ali assigned saad-ali and unassigned ixdy on May 23, 2016
@saad-ali added the release-note-none and lgtm labels and removed the release-note-label-needed label on May 23, 2016
@ncdc (Member) commented May 23, 2016

kubectl exec won't work unless the pod is Running. How is it that the pod is Running but a container can't be found?

@rootfs

@@ -477,6 +477,13 @@ func kubectlExecWithRetry(namespace string, podName, containerName string, args
continue
}
}
if err != nil {
Review comment (Member):

There's already a check for a non-nil error right before this. I would move the strings.Contains check you're adding into that block instead of adding another one, assuming we decide to keep this.
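One way to follow this suggestion, sketched here as a hedged illustration rather than the merged code, is to put the retryability decision in a small predicate that the existing non-nil-error branch calls. Both message fragments come from this thread; the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// retryableKubectlExecError reports whether a failed kubectl exec looks
// transient. Folding the decision into one helper keeps the retry loop's
// single err != nil branch simple, in the spirit of the suggestion above.
// Illustrative only, not the merged code.
func retryableKubectlExecError(stderr []byte) bool {
	s := strings.ToLower(string(stderr))
	return strings.Contains(s, "i/o timeout") || strings.Contains(s, "container not found")
}

func main() {
	fmt.Println(retryableKubectlExecError([]byte(`container not found ("mycontainer")`))) // true
	fmt.Println(retryableKubectlExecError([]byte("permission denied")))                   // false
}
```

Whether "i/o timeout" should also trigger the back-off added later is debated further down the thread.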

@saad-ali removed the lgtm label on May 23, 2016
@saad-ali (Member)
From @ncdc on kubernetes-dev Slack:

this is potentially because exec doesn’t yet support looking a pod up by uid
[2:13] the test creates, deletes, recreates the same pod name multiple times
[2:13] i think what’s happening is exec is finding the wrong pod while the old one is being torn down, maybe?
[2:14] we could change the test to vary the pod name each time through the loop
[2:15] this sets up the pod https://github.com/kubernetes/kubernetes/blob/master/test/e2e/pd.go#L179
it does assign it a name that includes a random uuid, but that is only ever set up once, outside of the for loop
[2:16] [...] i’d maybe suggest trying this instead of retrying exec on ‘container not found’

This indicates that the pod is being torn down right after WaitForPodRunningSlow succeeds and right before WriteFileViaContainer, which is a symptom of the race conditions between the orphaned-volumes loop and the pod-creation loop that were held together with glue.

This fix will mask the issue by retrying until the pod with the same name is recreated. But it does indicate that #26076 destabilized the PD mount/unmount code.

To unblock the submit queue, I'll LGTM this.

I am working on a kubelet volume manager that will fix the actual races. Along with that change, I will revert this change, verify the races are fixed, and modify the tests further as needed.
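For reference, here is a tiny hypothetical sketch of the alternative ncdc floats above: generate a fresh pod name on each pass through the test loop instead of reusing one name for the whole test. The base UUID is taken from the log earlier in this thread, and the loop body is a placeholder for the real pd.go test steps.

```go
package main

import "fmt"

// Hypothetical sketch only: a unique pod name per iteration means kubectl exec
// can never resolve a previous incarnation that is still being torn down.
func main() {
	const baseID = "16d3653c-211b-11e6-a3ff-b8ca3a62792c" // example UUID from the log above
	const numIterations = 3
	for i := 0; i < numIterations; i++ {
		podName := fmt.Sprintf("pd-test-%s-%d", baseID, i)
		fmt.Println("create, write/read via exec, then delete:", podName)
	}
}
```

A unique name per iteration removes the ambiguity exec hits while the previous pod of the same name is still terminating.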

@saad-ali added the lgtm label on May 23, 2016
@saad-ali (Member)

@k8s-oncall Could you please manually merge this?

	if strings.Contains(strings.ToLower(string(stdErrBytes)), "container not found") {
		// Retry on "container not found" errors
		Logf("Warning: kubectl exec encountered container not found.\nerr=%v\nstdout=%v\nstderr=%v)", err, string(stdOutBytes), string(stdErrBytes))
		continue
Review comment (Member):

should there be a sleep before trying again?

Reply (Member):

Yes, there should

@saad-ali removed the lgtm label on May 24, 2016
@saad-ali (Member)

@rootfs, could you address the open comments? In the meantime, we're going to move the tests to flaky (#26076).

Commit added: "…ould retry"
Signed-off-by: Huamin Chen <hchen@redhat.com>
@rootfs (Contributor, Author) commented May 24, 2016

@saad-ali, feedback addressed. I have seen the PD test run 10 times over 3 hours without any failure.

	if strings.Contains(strings.ToLower(string(stdErrBytes)), "container not found") {
		// Retry on "container not found" errors
		Logf("Warning: kubectl exec encountered container not found.\nerr=%v\nstdout=%v\nstderr=%v)", err, string(stdOutBytes), string(stdErrBytes))
		time.Sleep(2 * time.Second)
Review comment (Member):

Sleep should happen for both "i/o timeout" and "container not found", no?

Reply (Contributor, Author):

I haven't seen any i/o timeout yet; if it happens, I am not sure a sleep would help.

Reply (Member):

Ack
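As an aside, and not what this PR merges, the same back-off-and-retry covering both transient messages discussed above could also be expressed with the wait.Poll helper from the Kubernetes util package. The import path, the 1-minute timeout, and the runExec stand-in below are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"time"

	"k8s.io/kubernetes/pkg/util/wait"
)

// runExec is a hypothetical stand-in for shelling out to kubectl exec and
// capturing its output, as the real e2e helper does.
func runExec() (stdout, stderr []byte, err error) {
	return []byte("ok"), nil, nil
}

// execWithPoll retries the exec on either transient message using wait.Poll
// instead of a manual sleep-and-continue loop. Illustrative only.
func execWithPoll() (string, string, error) {
	var out, errOut []byte
	var execErr error
	pollErr := wait.Poll(2*time.Second, 1*time.Minute, func() (bool, error) {
		out, errOut, execErr = runExec()
		if execErr != nil {
			s := strings.ToLower(string(errOut))
			if strings.Contains(s, "i/o timeout") || strings.Contains(s, "container not found") {
				return false, nil // transient: poll again after the interval
			}
		}
		return true, nil // success or a non-retryable error: stop polling
	})
	if pollErr != nil {
		return string(out), string(errOut), pollErr
	}
	return string(out), string(errOut), execErr
}

func main() {
	stdout, _, err := execWithPoll()
	fmt.Println(stdout, err)
}
```

wait.Poll waits the interval before each attempt, so the explicit time.Sleep goes away.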

@saad-ali (Member)

LGTM

@saad-ali added the lgtm label on May 24, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented May 24, 2016

GCE e2e build/test passed for commit 56719f8.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot merged commit db74335 into kubernetes:master on May 24, 2016
Labels
lgtm, priority/critical-urgent, release-note-none, size/XS
Development
Successfully merging this pull request may close this issue: e2e flake: Pod Disks tests are very flaky (#26076)
7 participants