
[k8s.io] Restart [Disruptive] should restart all nodes and ensure all nodes and pods recover #37202

Closed
Random-Liu opened this issue Nov 21, 2016 · 2 comments · Fixed by #37203
Assignees: Random-Liu
Labels: area/kubelet, area/test, kind/flake, release-blocker, sig/node
Milestone: v1.5

Comments

Random-Liu (Member) commented Nov 21, 2016

The restart test is broken by #37070.

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:124
Expected error:
    <*errors.errorString | 0xc420cd42a0>: {
        s: "couldn't find 28 pods within 5m0s; last error: expected to find 28 pods but found only 29",
    }
    couldn't find 28 pods within 5m0s; last error: expected to find 28 pods but found only 29
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:119
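
For context, the failing check is essentially a poll that waits for an exact pod count and reports the last mismatch on timeout. Here is a minimal sketch of that pattern (the function name, signature, and poll interval are assumptions for illustration, not the actual restart.go code):

```go
package e2e

import (
	"fmt"
	"time"

	"k8s.io/kubernetes/pkg/util/wait"
)

// waitForNPods polls listPods until it reports exactly `expect` pods,
// remembering the last mismatch so it can be surfaced if the timeout expires.
// Hypothetical helper, sketched to match the error text above.
func waitForNPods(listPods func() int, expect int, timeout time.Duration) error {
	var last error
	err := wait.Poll(10*time.Second, timeout, func() (bool, error) {
		if got := listPods(); got != expect {
			last = fmt.Errorf("expected to find %d pods but found only %d", expect, got)
			return false, nil // keep polling until timeout
		}
		return true, nil
	})
	if err != nil {
		return fmt.Errorf("couldn't find %d pods within %v; last error: %v", expect, timeout, last)
	}
	return nil
}
```

Because the check requires an exact match, a stray recreated mirror pod (29 pods instead of the expected 28) fails the test just as a missing pod would.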

The original issue is #34003, and the root cause is explained in #34003 (comment):

The mirror pod of the e2e-image-puller pod was deleted by the node controller in the network partition test. kubelet currently doesn't try to recreate a mirror pod (or even sync the pod) if the pod has already terminated. In the restart test, kubelet got restarted and its in-memory status cache was cleared, so kubelet synced the pod once to regenerate the status, which led to the mirror pod being recreated.

We fixed this by filtering out RestartNever mirror pods in the test. However, #37070 changed the image puller to RestartOnFailure, which broke the workaround.

A quick fix is to filter out non-RestartAlways pods: both RestartNever and RestartOnFailure pods can terminate successfully, and we cannot handle terminated mirror pods very well right now.
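
A minimal sketch of that filter, assuming the 1.5-era `pkg/api` types (the helper name here is hypothetical; the actual change is in #37203):

```go
package e2e

import "k8s.io/kubernetes/pkg/api"

// filterRestartablePods drops pods whose RestartPolicy is not Always.
// RestartNever and RestartOnFailure pods (including mirror pods of such
// static pods) may terminate successfully and never come back after a
// node restart, so the test should not count them in the expected total.
func filterRestartablePods(pods []*api.Pod) []*api.Pod {
	var filtered []*api.Pod
	for _, p := range pods {
		if p.Spec.RestartPolicy == api.RestartPolicyAlways {
			filtered = append(filtered, p)
		}
	}
	return filtered
}
```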

@yujuhong @gmarek
/cc @kubernetes/sig-node

Random-Liu added the area/test, kind/flake, sig/node, and area/kubelet labels Nov 21, 2016
Random-Liu added this to the v1.5 milestone Nov 21, 2016
Random-Liu self-assigned this Nov 21, 2016
gmarek (Contributor) commented Nov 21, 2016

Yeah... I see there are a number of problems here. I looked at the history of the image puller config, and it looked to me like it was 'Never' from the beginning...

Thanks for the fix though.

calebamiles (Contributor) commented

@Random-Liu @yujuhong, @gmarek is this a release blocker for 1.5? Please update the issue ASAP, thanks!

cc: @kubernetes/sig-node, @saad-ali, @dims

k8s-github-robot pushed a commit that referenced this issue Nov 21, 2016
Automatic merge from submit-queue

Filter out non-RestartAlways mirror pod in restart test.

Fixes #37202.

> A quick fix is to filter out non-RestartAlways pods: both RestartNever and RestartOnFailure pods can terminate successfully, and we cannot handle terminated mirror pods very well right now.

@yujuhong @gmarek 
/cc @kubernetes/sig-node