e2e flake: RollingUpdateDeployment should scale up and down in the right order [failed remove pods] #21803
See also #21753.
#21857 should fix this.
Closing, as #21857 is merged.
I believe I saw this again on a run 3 days ago: #22105 (comment)
@nikhiljindal I took a brief look at a more recent failure, #22156 (comment). The test failed because of an orphaned pod.
But, looking at the apiserver logs, it looks like the label update received a 409 (the HTTP code for conflict), followed by a 200 GET, followed by a 200 PUT:
I think that just means we're clobbering the label update, because I see the wrong labels on the orphaned pod long after:
And the test failed without the hash label:
I think the problem is the pod we get here. But this is from staring at the code, so I could be wrong.
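To make the suspected failure mode concrete, here is a minimal client-go sketch (assuming a current client-go API; this is not the code under test) of the conflict-retry pattern: the label mutation is re-applied to the freshly fetched pod on every attempt, so a 409 followed by a 200 GET and a 200 PUT still lands the label. If a retry instead PUTs a copy taken before the label change, the update is silently clobbered, which would leave a pod with stale labels like the orphaned one above.

```go
// Hypothetical sketch of the GET/PUT-on-409 pattern discussed above, using a
// modern client-go API. The important part is that the label mutation is
// re-applied to the freshly fetched pod on every attempt; if a retry PUTs a
// copy that predates the label change, the 200 PUT silently drops the update.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func addHashLabel(ctx context.Context, c kubernetes.Interface, ns, name, hash string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// GET the latest version (this is the 200 GET after the 409).
		pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		// Re-apply the mutation to the fresh object before the PUT.
		if pod.Labels == nil {
			pod.Labels = map[string]string{}
		}
		pod.Labels["pod-template-hash"] = hash
		_, err = c.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err // a conflict error here triggers another GET+PUT attempt
	})
}
```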
Both the failures posted by @bprashanth and @timstclair are the same as #22088, which should be fixed by #22223. Closing this one.
@nikhiljindal So the failure to update labels is unrelated? Without digging too deep into the test, it looks like we updated the RC selector but didn't update the orphaned pod. Am I wrong?
Yes, you are right, @bprashanth. I looked at the failures for #22088 again, and there the remaining pods did have the pod-template-hash label, while here they don't. Thanks for pointing that out. I will try to look at how we ended up with such a pod.
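For illustration, a small sketch of how one could spot such a pod with client-go (the function name and wiring are hypothetical, not the actual test helpers): list the pods matched by the RC's selector and flag any that are missing the pod-template-hash label.

```go
// Hypothetical helper for spotting orphaned pods like the one above: list the
// pods matched by a selector and report any that lack the pod-template-hash
// label. Illustrative only; the real e2e test uses its own helpers.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
)

func podsMissingHashLabel(ctx context.Context, c kubernetes.Interface, ns string, selector labels.Selector) ([]string, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector.String()})
	if err != nil {
		return nil, err
	}
	var missing []string
	for _, p := range pods.Items {
		if _, ok := p.Labels["pod-template-hash"]; !ok {
			missing = append(missing, p.Name)
		}
	}
	return missing, nil
}
```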
Should be fixed by #22305 |
Deployment test failed:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/21399/kubernetes-pull-build-test-e2e-gce/30050/?debugUI=CLOUD
http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/30050/consoleFull
There's a lot going on here, but looking closer at the logs:
In the test:
So the pod was still around from 21:34 -> 21:35 (time skew with the node) and that's why we failed. This is the node: https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/pr-logs/pull/21399/kubernetes-pull-build-test-e2e-gce/30050/artifacts/104.197.246.52%3A22-kubelet.log?debugUI=CLOUD
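For context, the wait the test performs at this point is roughly the following shape (a sketch assuming a plain client-go poll; the real e2e helper differs): it polls until no pods match the old selector, and a pod that sticks around past the timeout, like the one above, fails the test.

```go
// Rough sketch of the kind of wait the test performs here (the actual e2e
// helper differs): poll until no pods match the selector, failing if any,
// like the orphaned pod above, are still around when the timeout expires.
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForPodsGone(c kubernetes.Interface, ns, selector string, timeout time.Duration) error {
	return wait.Poll(2*time.Second, timeout, func() (bool, error) {
		pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		// Done only when every pod matching the old selector has been removed.
		return len(pods.Items) == 0, nil
	})
}
```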
And looking up the container f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864
tl;dr without getting into a more intense debug session:
@janetkuo