e2e flake: RollingUpdateDeployment should scale up and down in the right order [failed remove pods] #21803

Closed
bprashanth opened this issue Feb 23, 2016 · 10 comments

@bprashanth
Contributor

Deployment test failed:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/21399/kubernetes-pull-build-test-e2e-gce/30050/?debugUI=CLOUD
http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/30050/consoleFull

• Failure [94.575 seconds]
Deployment
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:72
  RollingUpdateDeployment should scale up and down in the right order [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:50

  Feb 23 13:35:27.835: Failed to remove deployment redis-deployment-2 pods!

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:196

There's a lot going on here, but looking closer at the logs:

In the test:

Feb 23 13:34:57.311: INFO: ensuring deployment redis-deployment-2 rcs were deleted
Feb 23 13:34:57.518: INFO: ensuring deployment redis-deployment-2 pods were deleted
Feb 23 13:35:27.835: FAIL: Failed to remove deployment redis-deployment-2 pods!

So the pod was still around from 21:34 to 21:35 (accounting for time skew with the node), and that's why we failed. This is the node's kubelet log: https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/pr-logs/pull/21399/kubernetes-pull-build-test-e2e-gce/30050/artifacts/104.197.246.52%3A22-kubelet.log?debugUI=CLOUD

I0223 21:34:52.995950    3376 server.go:594] Event(api.ObjectReference{Kind:"Pod", 
Namespace:"e2e-tests-deployment-fkfv5", Name:"redis-deployment-2-tztd7-obyq6", UID:
"40f88db0-da75-11e5-b5ae-42010af00002", APIVersion:"v1", ResourceVersion:"2169", 
FieldPath:"spec.containers{redis}"}): type: 'Warning' reason: 'Failed' Failed to start 
container with docker id a64a18734b36 with error: API error (500): Cannot start container 
a64a18734b36f5238d471da58bbe44141dcc3111f23723497f8007f80dc6b311: cannot join network of a 
non running container: f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864
...
I0223 21:34:53.622433    3376 kubelet.go:2187] Killing unwanted pod "redis-deployment-2-tztd7-
obyq6"

And looking up the container f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864

I0223 21:34:48.067817    3376 manager.go:1519] DNS ResolvConfPath exists: /var/lib/docker/containers/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864/resolv.conf. Will attempt to add ndots option: options ndots:5
I0223 21:34:48.068529    3376 factory.go:78] Using factory "docker" for container "/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864"
I0223 21:34:48.085143    3376 manager.go:799] Added container: "/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864" (aliases: [k8s_POD.6059dfa2_redis-deployment-2-tztd7-obyq6_e2e-tests-deployment-fkfv5_40f88db0-da75-11e5-b5ae-42010af00002_807d5331 f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864], namespace: "docker")
I0223 21:34:48.085983    3376 handler.go:320] Added event &{/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864 2016-02-23 21:34:47.137133486 +0000 UTC containerCreation {<nil>}}
I0223 21:34:48.086109    3376 container.go:386] Start housekeeping for container "/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864"
I0223 21:34:52.327324    3376 manager.go:1307] Killing container "f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864 e2e-tests-deployment-fkfv5/redis-deployment-2-tztd7-obyq6" with 0 second grace period

tl;dr without getting into a more intense debug session:

  1. The kubelet got confused. Perhaps related to #21605 (serve_hostname container exiting immediately).
  2. The deployment test probably should wait to confirm that the pod was ready (from the kubelet logs it doesn't look like redis-deployment-2-tztd7-obyq6 was ever ready); a sketch of such a wait is below.
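For reference, a minimal sketch of what such a readiness wait could look like with client-go. The clientset variable, helper name, and polling interval are illustrative assumptions, not the e2e framework's actual helper:

```go
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodReady polls until the pod reports the Ready condition or the
// timeout expires. Hypothetical helper, for illustration only.
func waitForPodReady(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}
```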

@janetkuo

@bprashanth bprashanth added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. team/ux kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 23, 2016
@bgrant0607
Member

See also #21753

@bgrant0607 bgrant0607 changed the title e2e flake: RollingUpdateDeployment should scale up and down in the right order e2e flake: RollingUpdateDeployment should scale up and down in the right order [failed remove pods] Feb 23, 2016
@nikhiljindal
Contributor

#21857 should fix this

@nikhiljindal nikhiljindal self-assigned this Feb 24, 2016
@nikhiljindal
Contributor

Closing as #21857 is merged.
Please reopen if it still occurs.

@timstclair

I believe I saw this again on a run 3 days ago: #22105 (comment)

@bprashanth
Contributor Author

@nikhiljindal I took a brief look at a more recent failure #22156 (comment)

The test failed because of an orphaned pod.

  • First, there are 2 nginx pods and 1 redis pod created:
Feb 29 14:12:25.268: INFO: At 2016-02-29 14:11:09 -0800 PST - event for nginx-controller: {replicaset-controller } SuccessfulCreate: Created pod: nginx-controller-e651s
Feb 29 14:12:25.268: INFO: At 2016-02-29 14:11:44 -0800 PST - event for nginx-controller: {replicaset-controller } SuccessfulCreate: Created pod: nginx-controller-28gyl
Feb 29 14:12:25.268: INFO: At 2016-02-29 14:11:44 -0800 PST - event for redis-deployment-2-3325255056: {replicaset-controller } SuccessfulCreate: Created pod: redis-deployment-2-3325255056-1iijl
  • Out of the 2 nginx pods (nginx-controller-e651s and nginx-controller-28gyl), nginx-controller-e651s had these labels:
Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-
qpli8 with labels: map[name:sample-pod-2 pod:nginx]. No ReplicaSets found for pod nginx-controller-
e651s, deployment controller will avoid syncing.
  • The controller was updated with the hash label 2516083051:
I0229 22:11:42.858185       7 deployment.go:203] Observed the update of rs nginx-controller's pod template with hash 2516083051.

I0229 22:11:44.030979       7 deployment.go:229] Updated rs nginx-controller's selector and label with hash 2516083051.
I0229 22:11:44.031458       7 deployment.go:219] Labeled rs nginx-controller's pods with hash 2516083051.
I0229 22:11:43.578334       7 deployment.go:256] Labeled pod nginx-controller-e651s with hash 2516083051.

But, looking at the apiserver logs, it looks like the label update received a 409 (HTTP code for conflict), followed by a 200 GET, followed by a 200 PUT:

I0229 22:11:43.291721       6 handlers.go:152] GET /api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s: (18.559912ms) 200 [[kube-controller-manager/v1.2.0 (linux/amd64) kubernetes/06f8da2/deployment-controller] 127.0.0.1:56092]
I0229 22:11:43.576153       6 trace.go:57] Trace "Update /api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s" (started 2016-02-29 22:11:43.309566016 +0000 UTC):
I0229 22:11:43.576926       6 handlers.go:152] PUT /api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s: (267.667254ms) 200 [[kube-controller-manager/v1.2.0 (linux/amd64) kubernetes/06f8da2/deployment-controller] 127.0.0.1:56092]

https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/pr-logs/pull/22156/kubernetes-pull-build-test-e2e-gce/31024/artifacts/104.197.237.228%3A22-kube-apiserver.log?debugUI=CLOUD

I think that just means we're clobbering the label update, because I see the wrong labels on the orphaned pod long after:

I0229 22:11:44.888064       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[pod:nginx name:sample-pod-2]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.
I0229 22:12:30.624607       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[name:sample-pod-2 pod:nginx]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.
I0229 22:12:30.624652       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[pod:nginx name:sample-pod-2]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.
I0229 22:12:30.631001       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[name:sample-pod-2 pod:nginx]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.

And the test failed without the hash label:


• Failure [86.417 seconds]
Deployment
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:72
  RollingUpdateDeployment should scale up and down in the right order [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:50

  Feb 29 14:12:25.219: Err : timed out waiting for the condition
  . Failed to remove deployment redis-deployment-2 pods : &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/namespaces/e2e-tests-deployment-qpli8/pods ResourceVersion:4719} Items:[{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:nginx-controller-e651s GenerateName:nginx-controller- Namespace:e2e-tests-deployment-qpli8 SelfLink:/api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s UID:565894a2-df31-11e5-9b0b-42010af00002 ResourceVersion:3241 Generation:0 CreationTimestamp:2016-02-29 14:11:09 -0800 PST DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[name:sample-pod-2 pod:nginx] 

@bprashanth
Contributor Author

I think the problem is the pod we get here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/deployment/deployment.go#L301
doesn't have the pod-template-hash key:
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/deployment/deployment.go#L250

But this is from staring at the code, so I could be wrong.
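If that's the case, a defensive check for the hash key before treating a pod as labeled might look like the following. Constant and function names here are assumptions for illustration; the real code lives in pkg/util/deployment:

```go
package example

import corev1 "k8s.io/api/core/v1"

// podTemplateHashKey is the label the deployment machinery uses to tie pods
// to a ReplicaSet revision.
const podTemplateHashKey = "pod-template-hash"

// hasPodTemplateHash reports whether the pod already carries the hash label.
func hasPodTemplateHash(pod *corev1.Pod) bool {
	_, ok := pod.Labels[podTemplateHashKey]
	return ok
}
```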

@nikhiljindal
Contributor

Both the failures posted by @bprashanth and @timstclair are the same as #22088, which should be fixed by #22223.

Closing this one.

@bprashanth
Contributor Author

@nikhiljindal so the failure to update labels is unrelated? Without digging too deep into the test, it looks like we updated the rc selector but didn't update the orphaned pod. Am I wrong?

@nikhiljindal
Contributor

Yes, you are right @bprashanth.

I looked at the failures for #22088 again and there the remaining pods did have the pod-template-hash label, while here they don't. Thanks for pointing that out.

I will try to look at how we ended up with such a pod.
I am also trying to debug why adding a label to a pod is timing out: #21753 (comment).

@nikhiljindal
Contributor

Should be fixed by #22305
