e2e flake: RollingUpdateDeployment should scale up and down in the right order [failed remove pods] #21803

Closed
bprashanth opened this issue Feb 23, 2016 · 10 comments

@bprashanth
Contributor

Deployment test failed:
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/21399/kubernetes-pull-build-test-e2e-gce/30050/?debugUI=CLOUD
http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/30050/consoleFull

• Failure [94.575 seconds]
Deployment
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:72
  RollingUpdateDeployment should scale up and down in the right order [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:50

  Feb 23 13:35:27.835: Failed to remove deployment redis-deployment-2 pods!

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:196

There's a lot going on here, but looking closer at the logs:

In the test:

Feb 23 13:34:57.311: INFO: ensuring deployment redis-deployment-2 rcs were deleted
Feb 23 13:34:57.518: INFO: ensuring deployment redis-deployment-2 pods were deleted
Feb 23 13:35:27.835: FAIL: Failed to remove deployment redis-deployment-2 pods!

So the pod was still around from 21:34 to 21:35 (accounting for time skew with the node), and that's why we failed. This is the node's kubelet log: https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/pr-logs/pull/21399/kubernetes-pull-build-test-e2e-gce/30050/artifacts/104.197.246.52%3A22-kubelet.log?debugUI=CLOUD

I0223 21:34:52.995950    3376 server.go:594] Event(api.ObjectReference{Kind:"Pod", 
Namespace:"e2e-tests-deployment-fkfv5", Name:"redis-deployment-2-tztd7-obyq6", UID:
"40f88db0-da75-11e5-b5ae-42010af00002", APIVersion:"v1", ResourceVersion:"2169", 
FieldPath:"spec.containers{redis}"}): type: 'Warning' reason: 'Failed' Failed to start 
container with docker id a64a18734b36 with error: API error (500): Cannot start container 
a64a18734b36f5238d471da58bbe44141dcc3111f23723497f8007f80dc6b311: cannot join network of a 
non running container: f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864
...
I0223 21:34:53.622433    3376 kubelet.go:2187] Killing unwanted pod "redis-deployment-2-tztd7-
obyq6"

And looking up the container f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864

I0223 21:34:48.067817    3376 manager.go:1519] DNS ResolvConfPath exists: /var/lib/docker/containers/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864/resolv.conf. Will attempt to add ndots option: options ndots:5
I0223 21:34:48.068529    3376 factory.go:78] Using factory "docker" for container "/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864"
I0223 21:34:48.085143    3376 manager.go:799] Added container: "/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864" (aliases: [k8s_POD.6059dfa2_redis-deployment-2-tztd7-obyq6_e2e-tests-deployment-fkfv5_40f88db0-da75-11e5-b5ae-42010af00002_807d5331 f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864], namespace: "docker")
I0223 21:34:48.085983    3376 handler.go:320] Added event &{/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864 2016-02-23 21:34:47.137133486 +0000 UTC containerCreation {<nil>}}
I0223 21:34:48.086109    3376 container.go:386] Start housekeeping for container "/f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864"
I0223 21:34:52.327324    3376 manager.go:1307] Killing container "f14f53b81beea84ca81b6ab3f9c01011b62738f57a735f1edb4d44c915fef864 e2e-tests-deployment-fkfv5/redis-deployment-2-tztd7-obyq6" with 0 second grace period

tl;dr without getting into a more intense debug session:

  1. The kubelet got confused. Perhaps related to #21605 (serve_hostname container exiting immediately).
  2. The deployment test probably should wait to confirm that the pod was ready (from the kubelet logs it doesn't look like redis-deployment-2-tztd7-obyq6 was ever ready); a sketch of such a wait is below.
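For reference, a minimal sketch of what such a readiness wait could look like with client-go. The clientset variable, helper name, and polling interval are illustrative assumptions, not the e2e framework's actual helper:

```go
package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodReady polls until the pod reports the Ready condition or the
// timeout expires. Hypothetical helper, for illustration only.
func waitForPodReady(c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pod, err := c.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}
```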

@janetkuo

@bprashanth bprashanth added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. team/ux kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 23, 2016
@bgrant0607
Member

See also #21753

@bgrant0607 bgrant0607 changed the title e2e flake: RollingUpdateDeployment should scale up and down in the right order e2e flake: RollingUpdateDeployment should scale up and down in the right order [failed remove pods] Feb 23, 2016
@nikhiljindal
Contributor

#21857 should fix this

@nikhiljindal nikhiljindal self-assigned this Feb 24, 2016
@nikhiljindal
Contributor

Closing as #21857 is merged.
Please reopen if it still occurs.

@timstclair

I believe I saw this again on a run 3 days ago: #22105 (comment)

@bprashanth
Contributor Author

@nikhiljindal I took a brief look at a more recent failure #22156 (comment)

The test failed because of an orphaned pod.

  • First, there are 2 nginx pods and 1 redis pod created:
Feb 29 14:12:25.268: INFO: At 2016-02-29 14:11:09 -0800 PST - event for nginx-controller: {replicaset-controller } SuccessfulCreate: Created pod: nginx-controller-e651s
Feb 29 14:12:25.268: INFO: At 2016-02-29 14:11:44 -0800 PST - event for nginx-controller: {replicaset-controller } SuccessfulCreate: Created pod: nginx-controller-28gyl
Feb 29 14:12:25.268: INFO: At 2016-02-29 14:11:44 -0800 PST - event for redis-deployment-2-3325255056: {replicaset-controller } SuccessfulCreate: Created pod: redis-deployment-2-3325255056-1iijl
  • Out of the 2 nginx pods (nginx-controller-e651s and nginx-controller-28gyl), nginx-controller-e651s had these labels:
Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-
qpli8 with labels: map[name:sample-pod-2 pod:nginx]. No ReplicaSets found for pod nginx-controller-
e651s, deployment controller will avoid syncing.
  • The controller was updated with the hash label 2516083051:
I0229 22:11:42.858185       7 deployment.go:203] Observed the update of rs nginx-controller's pod template with hash 2516083051.

I0229 22:11:44.030979       7 deployment.go:229] Updated rs nginx-controller's selector and label with hash 2516083051.
I0229 22:11:44.031458       7 deployment.go:219] Labeled rs nginx-controller's pods with hash 2516083051.
I0229 22:11:43.578334       7 deployment.go:256] Labeled pod nginx-controller-e651s with hash 2516083051.

But, looking at the apiserver logs, it looks like the label update received a 409 (HTTP code for conflict), followed by a 200 GET, followed by a 200 PUT:

I0229 22:11:43.291721       6 handlers.go:152] GET /api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s: (18.559912ms) 200 [[kube-controller-manager/v1.2.0 (linux/amd64) kubernetes/06f8da2/deployment-controller] 127.0.0.1:56092]
I0229 22:11:43.576153       6 trace.go:57] Trace "Update /api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s" (started 2016-02-29 22:11:43.309566016 +0000 UTC):
I0229 22:11:43.576926       6 handlers.go:152] PUT /api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s: (267.667254ms) 200 [[kube-controller-manager/v1.2.0 (linux/amd64) kubernetes/06f8da2/deployment-controller] 127.0.0.1:56092]

https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/pr-logs/pull/22156/kubernetes-pull-build-test-e2e-gce/31024/artifacts/104.197.237.228%3A22-kube-apiserver.log?debugUI=CLOUD

I think that just means we're clobbering the label update, because I see the wrong labels on the orphaned pod long after:

I0229 22:11:44.888064       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[pod:nginx name:sample-pod-2]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.
I0229 22:12:30.624607       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[name:sample-pod-2 pod:nginx]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.
I0229 22:12:30.624652       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[pod:nginx name:sample-pod-2]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.
I0229 22:12:30.631001       7 deployment_controller.go:276] Error: could not find ReplicaSet for pod nginx-controller-e651s in namespace e2e-tests-deployment-qpli8 with labels: map[name:sample-pod-2 pod:nginx]. No ReplicaSets found for pod nginx-controller-e651s, deployment controller will avoid syncing.

And the test failed without the hash label:


• Failure [86.417 seconds]
Deployment
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:72
  RollingUpdateDeployment should scale up and down in the right order [It]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/deployment.go:50

  Feb 29 14:12:25.219: Err : timed out waiting for the condition
  . Failed to remove deployment redis-deployment-2 pods : &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink:/api/v1/namespaces/e2e-tests-deployment-qpli8/pods ResourceVersion:4719} Items:[{TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name:nginx-controller-e651s GenerateName:nginx-controller- Namespace:e2e-tests-deployment-qpli8 SelfLink:/api/v1/namespaces/e2e-tests-deployment-qpli8/pods/nginx-controller-e651s UID:565894a2-df31-11e5-9b0b-42010af00002 ResourceVersion:3241 Generation:0 CreationTimestamp:2016-02-29 14:11:09 -0800 PST DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[name:sample-pod-2 pod:nginx] 

@bprashanth
Contributor Author

I think the problem is the pod we get here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/deployment/deployment.go#L301
doesn't have the pod-template-hash key:
https://github.com/kubernetes/kubernetes/blob/master/pkg/util/deployment/deployment.go#L250

But this is from staring at the code, so I could be wrong.
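If that's the case, a defensive check for the hash key before treating a pod as labeled might look like the following. Constant and function names here are assumptions for illustration; the real code lives in pkg/util/deployment:

```go
package example

import corev1 "k8s.io/api/core/v1"

// podTemplateHashKey is the label the deployment machinery uses to tie pods
// to a ReplicaSet revision.
const podTemplateHashKey = "pod-template-hash"

// hasPodTemplateHash reports whether the pod already carries the hash label.
func hasPodTemplateHash(pod *corev1.Pod) bool {
	_, ok := pod.Labels[podTemplateHashKey]
	return ok
}
```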

@nikhiljindal
Contributor

Both the failures posted by @bprashanth and @timstclair are the same as #22088, which should be fixed by #22223.

Closing this one.

@bprashanth
Contributor Author

@nikhiljindal so the failure to update labels is unrelated? Without digging too deep into the test, it looks like we updated the rc selector but didn't update the orphaned pod. Am I wrong?

@nikhiljindal
Contributor

Yes, you are right @bprashanth.

I looked at the failures for #22088 again and there the remaining pods did have the pod-template-hash label, while here they don't. Thanks for pointing that out.

I will try to look at how we ended up with such a pod.
I am also trying to debug why adding a label to a pod is timing out: #21753 (comment).

@nikhiljindal
Contributor

Should be fixed by #22305
