e2e flake: Namespace cannot be deleted in time #26290

Closed
yujuhong opened this issue May 25, 2016 · 31 comments
Labels
area/test kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@yujuhong
Contributor

https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-serial/1356/

[AfterEach] [k8s.io] DaemonRestart [Disruptive]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_restart.go:251

• Failure in Spec Teardown (AfterEach) [368.553 seconds]
[k8s.io] DaemonRestart [Disruptive]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:514
  Kubelet should not restart containers across restart [AfterEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/daemon_restart.go:316

  May 24 16:09:21.361: Couldn't delete ns "e2e-tests-daemonrestart-cte80": namespace e2e-tests-daemonrestart-cte80 was not deleted within limit: timed out waiting for the condition, pods remaining: [daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-3ctay daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-5f85d daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-7atpa daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-cla05 daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-doqkv daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-ff5ha daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-grm06 daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-j9agm daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-jj1ju daemonrestart10-a0580743-2203-11e6-a77d-0242ac11000d-pveid]

  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:193

The pods were not deleted in time, causing cascading failures in the following tests.

@yujuhong yujuhong added kind/flake Categorizes issue or PR as related to a flaky test. area/test labels May 25, 2016
@fejta fejta added the sig/node Categorizes an issue or PR as relevant to SIG Node. label May 25, 2016
@fejta
Contributor

fejta commented May 25, 2016

Is this team/node or cluster or??

@yujuhong yujuhong added team/control-plane sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 25, 2016
@wojtek-t
Member

Another instance is here:
http://kubekins.dls.corp.google.com/view/Scalability/job/kubernetes-kubemark-500-gce/3375/

BTW - I think we have another issue open for that, but I can't find it now.

@lavalamp lavalamp removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 27, 2016
@lavalamp
Member

This is control plane. @liggitt? @derekwaynecarr? One of you guys want this?

@wojtek-t
Member

So actually the failure from kubemark seems to be a different issue - "event etcd" is not working there, so we weren't able to remove events at all (we weren't even able to list them).
BTW, it's a good question what we should do in such a case. @lavalamp - thoughts?

@derekwaynecarr
Member

@wojtek-t - I don't think we can safely delete a namespace unless all stores reply back.

This is almost always an issue on the kubelet side. I can look next week, but would not object to help from others.

@dims
Member

dims commented Jun 6, 2016

Is this the same as #23514 ?

@derekwaynecarr
Member

@dims - this is different from #23514, which reported an error message showing a double delete request for a namespace.

@derekwaynecarr
Member

I am trying to think of what more comprehensive information we could collect so these are much easier to debug.

I will look at adding a PR tomorrow morning that dumps out more information about pods when they hang. The majority of the time the issue is on the kubelet side, but dumping the state of the pods as seen by the API server would help us know whether the pods ever started, whether images were ever pulled, etc.
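
For reference, a minimal sketch of the kind of dump described here, written against today's client-go rather than the 2016 client libraries; dumpRemainingPods, the kubeconfig path, and the namespace name are illustrative and not part of any actual PR:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// dumpRemainingPods prints the API server's view of the pods left in a
// namespace: phase, deletion timestamp, and per-container state.
func dumpRemainingPods(c kubernetes.Interface, ns string) error {
	pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s: phase=%s deletionTimestamp=%v\n",
			p.Name, p.Status.Phase, p.DeletionTimestamp)
		for _, st := range p.Status.ContainerStatuses {
			// Waiting/Running/Terminated plus restart count tell us whether the
			// container ever started and whether its image was ever pulled.
			fmt.Printf("  container %s: ready=%v restarts=%d state=%+v\n",
				st.Name, st.Ready, st.RestartCount, st.State)
		}
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // path is illustrative
	if err != nil {
		panic(err)
	}
	if err := dumpRemainingPods(kubernetes.NewForConfigOrDie(cfg), "e2e-tests-daemonrestart-cte80"); err != nil {
		panic(err)
	}
}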

@gmarek
Contributor

gmarek commented Jun 8, 2016

@wojtek-t - we can turn on event etcd in kubemark - it's implemented but I think it's not enabled by default. @caesarxuchao - do you think it's possible that it's related to ownerRef?

@wojtek-t
Member

wojtek-t commented Jun 8, 2016

If it's not enabled by default, why would we even require it in that particular run? This sounds strange...

@gmarek
Contributor

gmarek commented Jun 8, 2016

In kubemark - I have no idea what the default config for gce-serial is.

@caesarxuchao
Member

No one is setting ownerRef or OrphanDependents, so this shouldn't be caused by the GC.

I'm assigned to other "couldn't delete ns" flakes on GKE, which doesn't have master logs. This GCE flake is a good sample, so I'll take a look tomorrow.

I took a glance at the API server's log; it hasn't received any deletion requests for the remaining pods, which is weird.

@davidopp
Member

davidopp commented Jun 8, 2016

@caesarxuchao thanks for your comments! Can we reassign this issue to you? (if so, please just reassign it yourself)

@derekwaynecarr
Member

@caesarxuchao - if the pod is already terminating, the issue is that the kubelet needs to send the final delete request (not the namespace controller).

@caesarxuchao caesarxuchao assigned caesarxuchao and unassigned lavalamp Jun 8, 2016
@caesarxuchao
Member

@derekwaynecarr thanks for the hint.

@caesarxuchao
Member

@deads2k could this be fixed by your #25662? The flake occurred before #25662 merged.

@yujuhong
Contributor Author

The failed test:

[BeforeEach] [k8s.io] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:129
STEP: Creating a kubernetes client
Jun 15 13:59:35.196: INFO: >>> kubeConfig: /var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/.kube/config

STEP: Building a namespace api object
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [k8s.io] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:181
[It] should apply a new configuration to an existing RC
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:446
STEP: creating Redis RC
Jun 15 13:59:35.249: INFO: Running '/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/kubernetes/platforms/linux/amd64/kubectl --server=https://104.155.145.244 --kubeconfig=/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/.kube/config create -f - --namespace=e2e-tests-kubectl-umy2f'
Jun 15 13:59:35.390: INFO: stderr: ""
Jun 15 13:59:35.390: INFO: stdout: "replicationcontroller \"redis-master\" created\n"
STEP: applying a modified configuration
Jun 15 13:59:35.392: INFO: Running '/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/kubernetes/platforms/linux/amd64/kubectl --server=https://104.155.145.244 --kubeconfig=/var/lib/jenkins/workspace/kubernetes-pull-build-test-e2e-gce/.kube/config apply -f - --namespace=e2e-tests-kubectl-umy2f'
Jun 15 13:59:35.619: INFO: stderr: ""
Jun 15 13:59:35.619: INFO: stdout: "replicationcontroller \"redis-master\" configured\n"
STEP: checking the result
[AfterEach] [k8s.io] Kubectl client
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:130
Jun 15 13:59:35.648: INFO: Waiting up to 1m0s for all nodes to be ready
STEP: Destroying namespace "e2e-tests-kubectl-umy2f" for this suite.
Jun 15 14:04:35.684: INFO: Pod e2e-tests-kubectl-umy2f redis-master-wxte5 on node e2e-gce-agent-pr-17-0-minion-group-11xr remains, has deletion timestamp 2016-06-15T14:00:05-07:00
Jun 15 14:04:35.685: INFO: Couldn't delete ns "e2e-tests-kubectl-umy2f": namespace e2e-tests-kubectl-umy2f was not deleted within limit: timed out waiting for the condition, pods remaining: [redis-master-wxte5]


• Failure in Spec Teardown (AfterEach) [300.490 seconds]
[k8s.io] Kubectl client
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:660
  [k8s.io] Kubectl apply [AfterEach]
  /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:660
    should apply a new configuration to an existing RC
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl.go:446

    Jun 15 14:04:35.685: Couldn't delete ns "e2e-tests-kubectl-umy2f": namespace e2e-tests-kubectl-umy2f was not deleted within limit: timed out waiting for the condition, pods remaining: [redis-master-wxte5]

    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:230

kubelet log:

I0615 20:59:35.639880    3487 config.go:384] Receiving a new pod "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.640051    3487 kubelet.go:2509] SyncLoop (ADD, "api"): "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.640204    3487 kubelet.go:3503] Generating status for "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.640460    3487 volume_manager.go:254] Waiting for volumes to attach and mount for pod "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.657424    3487 manager.go:422] Status for pod "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)" updated successfully: {status:{Phase:Pending Conditions:[{Type:Initialized Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4473220}} LastTransitionTime:{Time:{sec:63601621175 nsec:0 loc:0x4473220}} Reason: Message:} {Type:Ready Status:False LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4473220}} LastTransitionTime:{Time:{sec:63601621175 nsec:0 loc:0x4473220}} Reason:ContainersNotReady Message:containers with unready status: [redis-master]} {Type:PodScheduled Status:True LastProbeTime:{Time:{sec:0 nsec:0 loc:0x4473220}} LastTransitionTime:{Time:{sec:63601621175 nsec:0 loc:0x4473220}} Reason: Message:}] Message: Reason: HostIP:10.240.0.8 PodIP: StartTime:0xc82107ae80 InitContainerStatuses:[] ContainerStatuses:[{Name:redis-master State:{Waiting:0xc82107ae40 Running:<nil> Terminated:<nil>} LastTerminationState:{Waiting:<nil> Running:<nil> Terminated:<nil>} Ready:false RestartCount:0 Image:gcr.io/google_containers/redis:e2e ImageID: ContainerID:}]} version:1 podName:redis-master-wxte5 podNamespace:e2e-tests-kubectl-umy2f}
I0615 20:59:35.658707    3487 config.go:269] Setting pods for source api
I0615 20:59:35.659303    3487 kubelet.go:2522] SyncLoop (RECONCILE, "api"): "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002)"
I0615 20:59:35.753487    3487 config.go:269] Setting pods for source api
I0615 20:59:35.754113    3487 kubelet.go:2516] SyncLoop (UPDATE, "api"): "redis-master-wxte5_e2e-tests-kubectl-umy2f(10f3ad41-333c-11e6-8f34-42010af00002):DeletionTimestamp=2016-06-15T21:00:05Z"
I0615 20:59:35.896890    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:35.897042    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
I0615 20:59:35.897245    3487 empty_dir.go:248] pod 10f3ad41-333c-11e6-8f34-42010af00002: mounting tmpfs for volume wrapped_default-token-l4vpf with opts []
E0615 20:59:35.902804    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:35.902860    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:35.997239    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:35.997343    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.000521    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.000549    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.097542    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.097665    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.100506    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.100814    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.197903    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.198013    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.202820    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.202849    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.298251    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.298364    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.301125    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.301153    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.398611    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.398710    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.403248    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.403278    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.499278    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.499440    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.502350    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.502379    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
I0615 20:59:36.599645    3487 reconciler.go:231] MountVolume operation started for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") to pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002"). 
I0615 20:59:36.599776    3487 secret.go:150] Setting up volume default-token-l4vpf for pod 10f3ad41-333c-11e6-8f34-42010af00002 at /var/lib/kubelet/pods/10f3ad41-333c-11e6-8f34-42010af00002/volumes/kubernetes.io~secret/default-token-l4vpf
E0615 20:59:36.604393    3487 secret.go:168] Couldn't get secret e2e-tests-kubectl-umy2f/default-token-l4vpf
E0615 20:59:36.604426    3487 operation_executor.go:627] MountVolume.SetUp failed for volume "kubernetes.io/secret/default-token-l4vpf" (spec.Name: "default-token-l4vpf") pod "10f3ad41-333c-11e6-8f34-42010af00002" (UID: "10f3ad41-333c-11e6-8f34-42010af00002") with: secrets "default-token-l4vpf" not found
...

kubelet saw the pod at 20:59:35.640051 and got the DELETE request at 20:59:35.754113. The gap was less than 1 second. Not sure why the pod was deleted before it was running (/cc @kubernetes/kubectl).

The more serious problem is that kubelet continued trying to mount the volume (seemingly in a very tight loop) and couldn't retrieve the secret. It never stopped trying to mount.
/cc @kubernetes/sig-storage @saad-ali

@yujuhong yujuhong added sig/storage Categorizes an issue or PR as relevant to SIG Storage. team/cluster and removed team/control-plane labels Jun 15, 2016
@yujuhong yujuhong assigned saad-ali and unassigned caesarxuchao Jun 15, 2016
@yujuhong yujuhong added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 15, 2016
@saad-ali
Member

The more serious problem is that kubelet continued trying to mount the volume (seemingly in a very tight loop) and couldn't retrieve the secret. It never stopped trying to mount.

That is expected. The kubelet volume manager runs an asynchronous loop that attempts to mount volumes for pods if they are not mounted (retrying on failure). It does not block any kubelet functionality except the particular pod's start-up goroutine, and even that times out eventually as well. But since it results in API calls, there should be back-off logic: I opened #27492 to track that.
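
For illustration only, a small sketch of the kind of back-off #27492 asks for, built on the apimachinery wait helpers rather than the volume manager's real code; tryMount is a hypothetical stand-in for one SetUp attempt:

package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// tryMount stands in for one SetUp attempt that fetches the secret and mounts
// the volume; here it always fails, like the attempts in the kubelet log.
func tryMount() error {
	return errors.New(`secrets "default-token-l4vpf" not found`)
}

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay between attempts
		Factor:   2.0,                    // double the delay after each failure
		Steps:    6,                      // stop retrying after 6 attempts
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if mountErr := tryMount(); mountErr != nil {
			fmt.Println("mount failed, backing off:", mountErr)
			return false, nil // not done yet; retry after the next interval
		}
		return true, nil // mounted successfully
	})
	fmt.Println("gave up:", err) // a timeout error once Steps are exhausted
}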

The volumeManager.WaitForAttachAndMount() call currently blocks for up to 20 minutes as it waits for volumes to attach/mount. It is set that high because we've seen cloud operations take several minutes to complete for some volume plugins, and as long as the volumes are not ready the pod is not going anywhere. In this case the volumes will never be ready because the secret was deleted. After speaking with @yujuhong, we agreed to drop the timeout to 2 minutes. The idea is that if volumes are not ready after 2 minutes, we'll fail out and let kubelet decide whether the pod is still needed; if it is, kubelet will retry and volumeManager.WaitForAttachAndMount() will be called again.
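
A rough sketch of that "wait with a bounded timeout, then hand control back to kubelet" behavior, assuming a hypothetical volumesReady check; the real WaitForAttachAndMount lives in the kubelet volume manager and is not reproduced here:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// volumesReady stands in for the check against the volume manager's
// actual-state-of-world cache; here it never succeeds, as in this flake.
func volumesReady(podUID string) bool { return false }

// waitForAttachAndMount polls until the pod's volumes are mounted or the
// overall timeout expires, then returns an error so kubelet can decide what
// to do with the pod on its next sync.
func waitForAttachAndMount(podUID string, timeout time.Duration) error {
	err := wait.PollImmediate(300*time.Millisecond, timeout, func() (bool, error) {
		return volumesReady(podUID), nil
	})
	if err != nil {
		return fmt.Errorf("timed out waiting for volumes of pod %q: %v", podUID, err)
	}
	return nil
}

func main() {
	// The real timeout was dropped from 20 minutes to 2 minutes; a few seconds
	// keeps this demo quick.
	fmt.Println(waitForAttachAndMount("10f3ad41-333c-11e6-8f34-42010af00002", 3*time.Second))
}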

@ghost

ghost commented Jun 16, 2016

I've seen several of these today - the latest ones are #27332 and #27504.

Searching through the auto-generated e2e flake issues, I see several others that look like they might have the same underlying cause:

#27507
#27503
#27502
#27390

@caesarxuchao
Member

@saad-ali so it's the volume mount retry loop that prevented kubelet from deleting the pod?

@saad-ali
Member

@saad-ali so it's the volume mount retry loop that prevented kubelet from deleting the pod?

The syncPod method that is used to set up a pod calls the volume manager to make sure that all the volumes for the pod are attached/mounted: volumeManager.WaitForAttachAndMount(). This call has a very long timeout (20 minutes), for the reason explained in the comment above. These pods appear to be waiting for their secret volume to mount, but the mount fails repeatedly because the secret object does not exist (has been deleted?), so volumeManager.WaitForAttachAndMount() continues to wait until its timeout; before it can return a failure to kubelet, the tests hit their own timeout and fail out. As mentioned above, PR #27491 will reduce the timeout, which should imitate the previous behavior more closely.
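
To make the failure mode concrete, here is an illustrative sketch (not the actual secret volume plugin) of why each SetUp attempt fails: the secret is re-fetched from the API server and comes back NotFound because the namespace is being torn down. setUpSecretVolume is hypothetical, and the fake clientset is used only to reproduce the NotFound path:

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/fake"
)

// setUpSecretVolume mimics one SetUp attempt: fetch the secret, and fail if it
// no longer exists because the namespace is already being torn down.
func setUpSecretVolume(c kubernetes.Interface, ns, name string) error {
	secret, err := c.CoreV1().Secrets(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			// This is the `secrets "default-token-l4vpf" not found` error seen
			// over and over in the kubelet log; the reconciler just retries.
			return fmt.Errorf("couldn't get secret %s/%s: %v", ns, name, err)
		}
		return err
	}
	// The real plugin would now write secret.Data into the pod's tmpfs volume.
	fmt.Printf("would write %d keys into the volume dir\n", len(secret.Data))
	return nil
}

func main() {
	// A fake clientset with no objects reproduces the NotFound path.
	c := fake.NewSimpleClientset()
	fmt.Println(setUpSecretVolume(c, "e2e-tests-kubectl-umy2f", "default-token-l4vpf"))
}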

@lavalamp
Member

@saad-ali Does that wait get cancelled if the pod is deleted? We need to get this fixed asap or I'm gonna have to roll back your big change :/

@yujuhong
Contributor Author

@saad-ali I think the volume manager can stop trying to mount once the DeletionTimestamp of the pod is set (since the manager gets the pods from kubelet anyway), but I didn't read the code to know whether this would be hard to implement.
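
A sketch of that suggestion, not actual kubelet code: filter out pods that already carry a DeletionTimestamp before handing them to the mount reconciler. podsStillNeedingVolumes is a hypothetical helper, and whether this is easy to wire into the volume manager is exactly the open question here:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podsStillNeedingVolumes drops pods that already carry a DeletionTimestamp,
// so the mount reconciler would stop retrying for them.
func podsStillNeedingVolumes(pods []*v1.Pod) []*v1.Pod {
	var out []*v1.Pod
	for _, p := range pods {
		if p.DeletionTimestamp != nil {
			continue // pod is terminating; no point in mounting its volumes
		}
		out = append(out, p)
	}
	return out
}

func main() {
	now := metav1.Now()
	pods := []*v1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "redis-master-wxte5", DeletionTimestamp: &now}},
		{ObjectMeta: metav1.ObjectMeta{Name: "healthy-pod"}},
	}
	for _, p := range podsStillNeedingVolumes(pods) {
		fmt.Println("still needs volumes:", p.Name)
	}
}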

@derekwaynecarr
Member

+1 on having this cancel on pod deletion, or agree on roll-back. These types of flakes are horrible to debug.


@yujuhong
Contributor Author

To be fair, kubelet doesn't have the ability to cancel a docker operation midway either. If the worker is blocked on pulling an image, kubelet would wait until the pull succeeds before deleting the pod. Ideally we should cancel the sync iteration on a deletion request, but that is not the case today.

The difference here for attachment/mounting is that

  1. The attach/mount is destined to fail because the secret has been deleted, but the pod worker will wait until the timeout hits (while the mounter retries).
  2. The 20 min timeout is too long.

(2) has already been addressed by #27491, which lowered the timeout to 2 min.
For (1), it'd be nice to distinguish between "still trying to mount/attach" and "attempt failed, retrying", and let the function return in the latter case to unblock the pod worker. Alternatively, my previous suggestion should also alleviate the situation.
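
A sketch of what (1) could look like, assuming a hypothetical lastMountError record of the most recent mount result; this is not the volume manager's actual API:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

var attempt int

// lastMountError stands in for a per-volume record of the most recent mount
// result; after a few polls it reports the failure from the kubelet log.
func lastMountError(podUID string) error {
	attempt++
	if attempt >= 3 {
		return fmt.Errorf(`secrets "default-token-l4vpf" not found`)
	}
	return nil
}

// waitForVolumes returns early once an attempt is known to have failed,
// instead of waiting out the full timeout while the mounter keeps retrying.
func waitForVolumes(podUID string, timeout time.Duration) error {
	return wait.PollImmediate(200*time.Millisecond, timeout, func() (bool, error) {
		if err := lastMountError(podUID); err != nil {
			return false, err // "attempt failed, retrying": unblock the pod worker
		}
		// (a successful mount would return true, nil here)
		return false, nil // "still trying to mount/attach": keep waiting
	})
}

func main() {
	fmt.Println(waitForVolumes("redis-master-wxte5", 2*time.Minute))
}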

@saad-ali
Member

Serial tests take a long time to reflect changes, since they run so slowly (3.5 hours or so). PR #27491, which lowered the timeout for waitForAttach, was merged at 10:57 PM last night. There were 2 runs on gce-serial that failed overnight after that because of build issues (one was manually stopped, the other hit the build timeout), then a 7:39 AM run that was green. Same story on the GKE side. So far, lowering the timeout seems to have fixed this test. Will continue to monitor subsequent serial runs.

@saad-ali
Member

Closing this in favor of the automated issue #27502
