e2e flake: Namespace cannot be deleted in time #26290

Closed
yujuhong opened this issue May 25, 2016 · 31 comments
Labels
area/test kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@yujuhong
Contributor

https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-serial/1356/

[AfterEach] [k8s.io] DaemonRestart [Disruptive]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_restart.go:251

• Failure in Spec Teardown (AfterEach) [368.553 seconds]
[k8s.io] DaemonRestart [Disruptive]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:514
  Kubelet should not restart containers across restart [AfterEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_restart.go:316

  May 24 16:09:21.361: Couldn't delete ns "e2e-tests-daemonrestart-cte80": namespace e2e-tests-daemonrestart-cte80 was not deleted within limit: timed out waiting for the condition, pods remaining: [daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-3ctay daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-5f85d daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-7atpa daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-cla05 daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-doqkv daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-ff5ha daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-grm06 daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-j9agm daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-jj1ju daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-pveid]

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:193

The pods were not deleted in time, causing cascading failures in the following tests.

@yujuhong yujuhong added kind/flake Categorizes issue or PR as related to a flaky test. area/test labels May 25, 2016
@fejta fejta added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 25, 2016
@fejta
Contributor

fejta commented May 25, 2016

Is this team/node or cluster or??

@yujuhong yujuhong added team/control-plane sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 25, 2016
@wojtek-t
Member

Another instance is here:
http://kubekins.dls.corp.google.com/view/Scalability/job/kubernetes-kubemark-500-gce/3375/

BTW - I think we have another issue open for that, but I can't find it now.

@lavalamp lavalamp removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 27, 2016
@lavalamp
Member

This is control plane. @liggitt? @derekwaynecarr? One of you guys want this?

@wojtek-t
Member

So actually the failure from kubemark seems to be a different issue - "event etcd" is not working there, so we weren't able to remove events at all (we weren't even able to list them).
BTW, it's a good question what we should do in such a case. @lavalamp - thoughts?

@derekwaynecarr
Member

@wojtek-t - I don't think we can safely delete a namespace unless all stores reply back.

This is almost always an issue on the kubelet side. I can look next week, but would not object to help from others.

@dims
Member

dims commented Jun 6, 2016

Is this the same as #23514 ?

@derekwaynecarr
Member

@dims - this is different from #23514, which reported an error message showing a double delete request for a namespace.

@derekwaynecarr
Member

I am trying to think of what more comprehensive information we could collect so these are much easier to debug.

I will look at adding a PR tomorrow morning that dumps out more information about pods when they hang. The majority of the time the issue is on the kubelet side, but dumping the state of the pods as seen by the API server would help us know whether the pods ever started, whether images were ever pulled, etc.
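
For reference, a minimal sketch of the kind of dump described here, written against today's client-go rather than the 2016 client libraries; dumpRemainingPods, the kubeconfig path, and the namespace name are illustrative and not part of any actual PR:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// dumpRemainingPods prints the API server's view of the pods left in a
// namespace: phase, deletion timestamp, and per-container state.
func dumpRemainingPods(c kubernetes.Interface, ns string) error {
	pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s: phase=%s deletionTimestamp=%v\n",
			p.Name, p.Status.Phase, p.DeletionTimestamp)
		for _, st := range p.Status.ContainerStatuses {
			// Waiting/Running/Terminated plus restart count tell us whether the
			// container ever started and whether its image was ever pulled.
			fmt.Printf("  container %s: ready=%v restarts=%d state=%+v\n",
				st.Name, st.Ready, st.RestartCount, st.State)
		}
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // path is illustrative
	if err != nil {
		panic(err)
	}
	if err := dumpRemainingPods(kubernetes.NewForConfigOrDie(cfg), "e2e-tests-daemonrestart-cte80"); err != nil {
		panic(err)
	}
}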

@gmarek
Contributor

gmarek commented Jun 8, 2016

@wojtek-t - we can turn on event etcd in kubemark - it's implemented but I think it's not enabled by default. @caesarxuchao - do you think it's possible that it's related to ownerRef?

@wojtek-t
Member

wojtek-t commented Jun 8, 2016

If it's not enabled by default, why would we even require it in that particular run? This sounds strange...

@gmarek
Contributor

gmarek commented Jun 8, 2016

In kubemark - I have no idea what the default config for gce-serial is.

@caesarxuchao
Member

No one is setting ownerRef or OrphanDependents, so this shouldn't be caused by the GC.

I'm assigned to other "couldn't delete ns" flakes on GKE, which doesn't have master logs. This GCE flake is a good sample, so I'll take a look tomorrow.

I took a glance at the API server's log; it hasn't received any deletion requests for the remaining pods, which is weird.

@davidopp
Member

davidopp commented Jun 8, 2016

@caesarxuchao thanks for your comments! Can we reassign this issue to you? (if so, please just reassign it yourself)

@derekwaynecarr
Member

@caesarxuchao - if the pod is already terminating, the issue is that the kubelet needs to send the final delete request (not the namespace controller).

@caesarxuchao caesarxuchao assigned caesarxuchao and unassigned lavalamp Jun 8, 2016
@caesarxuchao
Member

@derekwaynecarr thanks for the hint.

@caesarxuchao
Member

@deads2k could this be fixed by your #25662? The flake occurred before #25662 merged.

@yujuhong
Contributor Author

The failed test:

[BeforeEach] [k8s.io] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:129
STEP: Creating a kubernetes client
Jun 15 13:59:35.196: INFO: >>> kubeConfig: /var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/.kube/config

STEP: Building a namespace api object
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:181
[It] should apply a new configuration to an existing RC
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:446
STEP: creating Redis RC
Jun 15 13:59:35.249: INFO: Running '/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/kubernetes/platforms/linux/amd64/kubectl --server=https://104.155.145.244 --kubeconfig=/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/.kube/config create -f - --namespace=e2e-tests-kubectl-umy2f'
Jun 15 13:59:35.390: INFO: stderr: ""
Jun 15 13:59:35.390: INFO: stdout: "replicationcontroller \"redis-master\" created\n"
STEP: applying a modified configuration
Jun 15 13:59:35.392: INFO: Running '/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/kubernetes/platforms/linux/amd64/kubectl --server=https://104.155.145.244 --kubeconfig=/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/.kube/config apply -f - --namespace=e2e-tests-kubectl-umy2f'
Jun 15 13:59:35.619: INFO: stderr: ""
Jun 15 13:59:35.619: INFO: stdout: "replicationcontroller \"redis-master\" configured\n"
STEP: checking the result
[AfterEach] [k8s.io] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:130
Jun 15 13:59:35.648: INFO: Waiting up to 1m0s for all nodes to be ready
STEP: Destroying namespace "e2e-tests-kubectl-umy2f" for this suite.
Jun 15 14:04:35.684: INFO: Pod e2e-tests-kubectl-umy2f redis-master-wxte5 on node e2e-gce-agent-pr-17-0-minion-group-11xr remains, has deletion timestamp 2016-06-15T14:00:05-07:00
Jun 15 14:04:35.685: INFO: Couldn't delete ns "e2e-tests-kubectl-umy2f": namespace e2e-tests-kubectl-umy2f was not deleted within limit: timed out waiting for the condition, pods remaining: [redis-master-wxte5]


• Failure in Spec Teardown (AfterEach) [300.490 seconds]
[k8s.io] Kubectl client
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:660
  [k8s.io] Kubectl apply [AfterEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:660
    should apply a new configuration to an existing RC
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:446

    Jun 15 14:04:35.685: Couldn't delete ns "e2e-tests-kubectl-umy2f": namespace e2e-tests-kubectl-umy2f was not deleted within limit: timed out waiting for the condition, pods remaining: [redis-master-wxte5]

    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:230

kubelet log:

I0615 20:59:35.639880    3487 config.go:384] Receiving a new pod "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.640051    3487 kubelet.go:2509] SyncLoop (ADD, "api"): "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.640204    3487 kubelet.go:3503] Generating status for "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.640460    3487 volume_manager.go:254] Waiting for volumes to attach and mount for pod "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.657424    3487 manager.go:422] Status for pod "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)" updated successfully: {status:{Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4473220}} LastTransitionTime:{Time:{sec:63601621175 nsec:0 loc:0x4473220}} Reason: Message:} {Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4473220}} LastTransitionTime:{Time:{sec:63601621175 nsec:0 loc:0x4473220}} Reason:ContainersNotReady Message:containers with unready status: [redis-master]} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4473220}} LastTransitionTime:{Time:{sec:63601621175 nsec:0 loc:0x4473220}} Reason: Message:}] Message: Reason: HostIP:10.240.0.8 PodIP: StartTime:0xc82107ae80 InitContainerStatuses:[] ContainerStatuses:[{Name:redis-master State:{Waiting:0xc82107ae40 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:gcr.io/google_containers/redis:e2e ImageID: ContainerID:}]} version:1 podName:redis-master-wxte5 podNamespace:e2e-tests-kubectl-umy2f}
I0615 20:59:35.658707    3487 config.go:269] Setting pods for source api
I0615 20:59:35.659303    3487 kubelet.go:2522] SyncLoop (RECONCILE, "api"): "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.753487    3487 config.go:269] Setting pods for source api
I0615 20:59:35.754113    3487 kubelet.go:2516] SyncLoop (UPDATE, "api"): "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002):DeletionTimestamp=2016-06-15T21:00:05Z"
I0615 20:59:35.896890    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:35.897042    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
I0615 20:59:35.897245    3487 empty_dir.go:248] pod 10f3ad41-333c-11e6-8f34-42010af00002: mounting tmpfs for volume wrapped_default-token-l4vpf with opts []
E0615 20:59:35.902804    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:35.902860    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:35.997239    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:35.997343    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.000521    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.000549    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.097542    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.097665    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.100506    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.100814    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.197903    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.198013    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.202820    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.202849    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.298251    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.298364    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.301125    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.301153    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.398611    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.398710    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.403248    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.403278    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.499278    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.499440    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.502350    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.502379    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.599645    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.599776    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.604393    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.604426    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
...

kubelet saw the pod at 20:59:35.640051 and got the DELETE request at 20:59:35.754113. The gap was less than 1 second. Not sure why the pod was deleted before it was running (/cc @kubernetes/kubectl).

The more serious problem is that kubelet continued trying to mount the volume (seemingly in a very tight loop) and couldn't retrieve the secret. It never stopped trying to mount.
/cc @kubernetes/sig-storage @saad-ali

@yujuhong yujuhong added sig/storage Categorizes an issue or PR as relevant to SIG Storage. team/cluster and removed team/control-plane labels Jun 15, 2016
@yujuhong yujuhong assigned saad-ali and unassigned caesarxuchao Jun 15, 2016
@yujuhong yujuhong added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 15, 2016
@saad-ali
Member

The more serious problem is that kubelet continued trying to mount the volume (seemingly in a very tight loop) and couldn't retrieve the secret. It never stopped trying to mount.

That is expected. The kubelet volume manager runs an asynchronous loop that attempts to mount volumes for pods if they are not mounted (retrying on failure). It does not block any kubelet functionality except the particular pod's start-up goroutine, and even that times out eventually as well. But since it results in API calls, there should be back-off logic: I opened #27492 to track that.
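
For illustration only, a small sketch of the kind of back-off #27492 asks for, built on the apimachinery wait helpers rather than the volume manager's real code; tryMount is a hypothetical stand-in for one SetUp attempt:

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryMount stands in for one SetUp attempt that fetches the secret and mounts
// the volume; here it always fails, like the attempts in the kubelet log.
func tryMount() error {
	return errors.New(`secrets "default-token-l4vpf" not found`)
}

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay between attempts
		Factor:   2.0,                    // double the delay after each failure
		Steps:    6,                      // stop retrying after 6 attempts
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if mountErr := tryMount(); mountErr != nil {
			fmt.Println("mount failed, backing off:", mountErr)
			return false, nil // not done yet; retry after the next interval
		}
		return true, nil // mounted successfully
	})
	fmt.Println("gave up:", err) // a timeout error once Steps are exhausted
}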

The volumeManager.WaitForAttachAndMount() call currently blocks for up to 20 minutes as it waits for volumes to attach/mount. It is set that high because we've seen cloud operations take several minutes to complete for some volume plugins, and as long as the volumes are not ready the pod is not going anywhere. In this case the volumes will never be ready because the secret was deleted. After speaking with @yujuhong, we agreed to drop the timeout to 2 minutes. The idea is that if volumes are not ready after 2 minutes, we'll fail out and let kubelet decide whether the pod is still needed; if it is, kubelet will retry and volumeManager.WaitForAttachAndMount() will be called again.
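
A rough sketch of that "wait with a bounded timeout, then hand control back to kubelet" behavior, assuming a hypothetical volumesReady check; the real WaitForAttachAndMount lives in the kubelet volume manager and is not reproduced here:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// volumesReady stands in for the check against the volume manager's
// actual-state-of-world cache; here it never succeeds, as in this flake.
func volumesReady(podUID string) bool { return false }

// waitForAttachAndMount polls until the pod's volumes are mounted or the
// overall timeout expires, then returns an error so kubelet can decide what
// to do with the pod on its next sync.
func waitForAttachAndMount(podUID string, timeout time.Duration) error {
	err := wait.PollImmediate(300*time.Millisecond, timeout, func() (bool, error) {
		return volumesReady(podUID), nil
	})
	if err != nil {
		return fmt.Errorf("timed out waiting for volumes of pod %q: %v", podUID, err)
	}
	return nil
}

func main() {
	// The real timeout was dropped from 20 minutes to 2 minutes; a few seconds
	// keeps this demo quick.
	fmt.Println(waitForAttachAndMount("10f3ad41-333c-11e6-8f34-42010af00002", 3*time.Second))
}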

@ghost

ghost commented Jun 16, 2016

I've seen several of these today - the latest ones are #27332 and #27504.

Searching through the auto-generated e2e flake issues, I see several others that look like they might have the same underlying cause:

#27507
#27503
#27502
#27390

@caesarxuchao
Member

@saad-ali so it's the volume mount retry loop that prevented kubelet from deleting the pod?

@saad-ali
Member

@saad-ali so it's the volume mount retry loop that prevented kubelet from deleting the pod?

The syncPod method that is used to set up a pod calls the volume manager to make sure that all the volumes for the pod are attached/mounted: volumeManager.WaitForAttachAndMount(). This call has a very long timeout (20 minutes), for the reason explained in the comment above. These pods appear to be waiting for their secret volume to mount, but the mount fails repeatedly because the secret object does not exist (has been deleted?), so volumeManager.WaitForAttachAndMount() continues to wait until its timeout; before it can return a failure to kubelet, the tests hit their own timeout and fail out. As mentioned above, PR #27491 will reduce the timeout, which should imitate the previous behavior more closely.
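
To make the failure mode concrete, here is an illustrative sketch (not the actual secret volume plugin) of why each SetUp attempt fails: the secret is re-fetched from the API server and comes back NotFound because the namespace is being torn down. setUpSecretVolume is hypothetical, and the fake clientset is used only to reproduce the NotFound path:

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// setUpSecretVolume mimics one SetUp attempt: fetch the secret, and fail if it
// no longer exists because the namespace is already being torn down.
func setUpSecretVolume(c kubernetes.Interface, ns, name string) error {
	secret, err := c.CoreV1().Secrets(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			// This is the `secrets "default-token-l4vpf" not found` error seen
			// over and over in the kubelet log; the reconciler just retries.
			return fmt.Errorf("couldn't get secret %s/%s: %v", ns, name, err)
		}
		return err
	}
	// The real plugin would now write secret.Data into the pod's tmpfs volume.
	fmt.Printf("would write %d keys into the volume dir\n", len(secret.Data))
	return nil
}

func main() {
	// A fake clientset with no objects reproduces the NotFound path.
	c := fake.NewSimpleClientset()
	fmt.Println(setUpSecretVolume(c, "e2e-tests-kubectl-umy2f", "default-token-l4vpf"))
}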

@lavalamp
Member

@saad-ali Does that wait get cancelled if the pod is deleted? We need to get this fixed asap or I'm gonna have to roll back your big change :/

@yujuhong
Contributor Author

@saad-ali I think the volume manager can stop trying to mount once the DeletionTimestamp of the pod is set (since the manager gets the pods from kubelet anyway), but I didn't read the code to know whether this would be hard to implement.
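
A sketch of that suggestion, not actual kubelet code: filter out pods that already carry a DeletionTimestamp before handing them to the mount reconciler. podsStillNeedingVolumes is a hypothetical helper, and whether this is easy to wire into the volume manager is exactly the open question here:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podsStillNeedingVolumes drops pods that already carry a DeletionTimestamp,
// so the mount reconciler would stop retrying for them.
func podsStillNeedingVolumes(pods []*v1.Pod) []*v1.Pod {
	var out []*v1.Pod
	for _, p := range pods {
		if p.DeletionTimestamp != nil {
			continue // pod is terminating; no point in mounting its volumes
		}
		out = append(out, p)
	}
	return out
}

func main() {
	now := metav1.Now()
	pods := []*v1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "redis-master-wxte5", DeletionTimestamp: &now}},
		{ObjectMeta: metav1.ObjectMeta{Name: "healthy-pod"}},
	}
	for _, p := range podsStillNeedingVolumes(pods) {
		fmt.Println("still needs volumes:", p.Name)
	}
}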

@derekwaynecarr
Member

+1 on having this cancel on pod deletion, or agree on roll-back. These types of flakes are horrible to debug.


@yujuhong
Contributor Author

To be fair, kubelet doesn't have the ability to cancel a docker operation midway either. If the worker is blocked on pulling an image, kubelet would wait until the pull succeeds before deleting the pod. Ideally we should cancel the sync iteration on a deletion request, but that is not the case today.

The difference here for attachment/mounting is that

  1. The attach/mount is destined to fail because the secret has been deleted, but the pod worker will wait until the timeout hits (while the mounter retries).
  2. The 20 min timeout is too long.

(2) has already been addressed by #27491, which lowered the timeout to 2 min.
For (1), it'd be nice to distinguish between "still trying to mount/attach" and "attempt failed, retrying", and let the function return in the latter case to unblock the pod worker. Alternatively, my previous suggestion should also alleviate the situation.
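
A sketch of what (1) could look like, assuming a hypothetical lastMountError record of the most recent mount result; this is not the volume manager's actual API:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

var attempt int

// lastMountError stands in for a per-volume record of the most recent mount
// result; after a few polls it reports the failure from the kubelet log.
func lastMountError(podUID string) error {
	attempt++
	if attempt >= 3 {
		return fmt.Errorf(`secrets "default-token-l4vpf" not found`)
	}
	return nil
}

// waitForVolumes returns early once an attempt is known to have failed,
// instead of waiting out the full timeout while the mounter keeps retrying.
func waitForVolumes(podUID string, timeout time.Duration) error {
	return wait.PollImmediate(200*time.Millisecond, timeout, func() (bool, error) {
		if err := lastMountError(podUID); err != nil {
			return false, err // "attempt failed, retrying": unblock the pod worker
		}
		// (a successful mount would return true, nil here)
		return false, nil // "still trying to mount/attach": keep waiting
	})
}

func main() {
	fmt.Println(waitForVolumes("redis-master-wxte5", 2*time.Minute))
}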

@saad-ali
Member

Serial tests take a long time to reflect changes, since they run so slowly (3.5 hours or so). PR #27491, which lowered the timeout for waitForAttach, was merged at 10:57 PM last night. There were 2 runs on gce-serial that failed overnight after that because of build issues (one was manually stopped, the other hit the build timeout), then a 7:39 AM run that was green. Same story on the GKE side. So far, lowering the timeout seems to have fixed this test. Will continue to monitor subsequent serial runs.

@saad-ali
Member

Closing this in favor of the automated issue #27502
