GCE PD Volumes already attached to a node fail with "Error 400: The disk resource is already being used by" node #19953
Comments
Saad's already got 3 flakes; I can do some initial digging. @saad-ali let me know if you already started on this. |
Taking a look |
The test moves a PD between two nodes. The first node attached/detached the PD successfully.
On the second node, the call to attach the PD failed with a backend error:
Despite the failure, the call actually attached the disk to [node-v4cj], so further attempts by
Fix: "disk is already attached" errors should result in success, not failure. |
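In rough terms, that fix amounts to something like the sketch below: the attach path treats GCE's "already being used by this node" error as success. The interface and helper names here are illustrative, not the actual gce_pd plugin code.

```go
package gcepd

import (
	"fmt"
	"strings"
)

// cloudProvider abstracts the single call this sketch needs; the real
// plugin talks to the GCE API.
type cloudProvider interface {
	AttachDisk(diskName, nodeName string) error
}

// attachDisk treats "already attached to this same node" as success,
// since the desired state (disk attached to nodeName) already holds.
func attachDisk(cloud cloudProvider, diskName, nodeName string) error {
	err := cloud.AttachDisk(diskName, nodeName)
	if err == nil {
		return nil
	}
	// GCE's HTTP 400 message names the instance currently using the disk;
	// if that instance is the node we are attaching to, report success.
	if strings.Contains(err.Error(), "already being used by") &&
		strings.Contains(err.Error(), nodeName) {
		return nil
	}
	return fmt.Errorf("failed to attach %s to %s: %v", diskName, nodeName, err)
}
```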
Right |
Saad, how deep is your queue? |
@bprashanth Ya, feel free to take #19574 |
We have a new failure at kubernetes-soak-continuous-e2e-gce/4396/; related output from the failing tests:
|
Another failure instance for a PD test: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-soak-continuous-e2e-gce/4397/ , but a different test case:

Pod Disks should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:266
Expected error:
    <*errors.errorString | 0xc2083d9e50>: {
        s: "gave up waiting for pod 'pd-test-f247fa94-c3bb-11e5-8603-42010af01555' to be 'running' after 15m0s",
    }
    gave up waiting for pod 'pd-test-f247fa94-c3bb-11e5-8603-42010af01555' to be 'running' after 15m0s
not to have occurred

I guess they might have the same cause, so I didn't create a new issue. Feel free to open another one though. |
@kubernetes/rh-storage |
Thanks @dchen1107. Confirmed that both those continuous E2E builds hit the same issue.
Basically this bug exposes the house of cards that is the current state of the attach/detach logic in kubelet. In the existing code there is a race condition between the asynchronous pod creation loop and the orphaned-pod cleanup loop. Although there is some logic to make sure the actual attach/detach operations don't step on each other's toes, there is basically no synchronization between the remaining logic in the two loops responsible for setting up and tearing down volumes, so the loops end up racing.

As currently written, the "fail attach when the disk is already attached" behavior actually masks a very common race that happens when a pod is rapidly created, deleted, and recreated: the second attach operation tends to get triggered slightly before the detach, and in the current code it eventually fails out (since being attached to the same node is considered an error); the detach then goes through and pulls the PD, the attach loop comes back around, retries, and all is well. If we change just this logic ("disk is already attached errors should result in success not failure"), we'll end up exposing a far nastier bug that pulls disks out from under users while they are using them: the disk gets attached, the second attach operation comes in just before the detach and succeeds, then the detach comes along and pulls the disk while it is in use (which looks like data loss to the user).

The correct fix is larger: interlock attach and detach so that they do not operate independently of each other. I'm tackling this as part of the larger attach/detach controller redesign, which likely will not be part of 1.2. Therefore a fix for this is unlikely to be part of 1.2. |
There is similar logic on the GCE PD side to prevent the attach/detach operations from interrupting each other, but that does nothing to guarantee the order of operations, which is the key issue here. That will be addressed by unifying the loops. |
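As a rough illustration of the interlock being described, here is a minimal sketch (hypothetical names, not the eventual controller code) that serializes attach and detach operations on the same volume so they can no longer overlap. The actual redesign goes further: a single loop reconciles desired against actual state and decides which operation should run at all.

```go
package volumeops

import "sync"

// perVolumeSerializer forces attach and detach operations for the same
// volume to run one at a time instead of concurrently. This is a sketch
// of the interlock idea only; ordering decisions still belong to a single
// reconciliation loop comparing desired vs. actual state.
type perVolumeSerializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex // volume name -> its serializing lock
}

func newPerVolumeSerializer() *perVolumeSerializer {
	return &perVolumeSerializer{locks: map[string]*sync.Mutex{}}
}

// run executes op while holding the per-volume lock, so an attach and a
// detach of the same volume can never be in flight at the same time.
func (s *perVolumeSerializer) run(volumeName string, op func() error) error {
	s.mu.Lock()
	l, ok := s.locks[volumeName]
	if !ok {
		l = &sync.Mutex{}
		s.locks[volumeName] = l
	}
	s.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	return op()
}
```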
IIUC, the FIX is O(hard). Is there a short term mitigation? |
Looking into it |
any thoughts on this? If the answer is "not until the real fix", that's what we need to know. |
I played with a bunch of hacks, like skipping the teardown step if there are pending setups in progress for a given volume. But they all end up destabilizing the system even more. I'm tempted to just implement the "real fix", but that seems like a bad idea to get in at the last minute. I'll keep experimenting, and see if I can get something before Friday, but let's go with "not until the real fix" for now. |
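For context, the kind of hack being described might look like the sketch below: the cleanup loop skips TearDown for a volume while a newer SetUp is still in flight. All names are illustrative, and, as noted above, this approach was abandoned because it destabilized the system further.

```go
package volumeops

import "sync"

// pendingSetups is a sketch of the discarded mitigation: track volumes
// with a SetUp in flight, and have the orphaned-pod cleanup loop skip
// TearDown for those volumes.
type pendingSetups struct {
	mu     sync.Mutex
	counts map[string]int // volume name -> in-flight SetUp count
}

func (p *pendingSetups) beginSetUp(volume string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.counts == nil {
		p.counts = map[string]int{}
	}
	p.counts[volume]++
}

func (p *pendingSetups) endSetUp(volume string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.counts[volume] > 0 {
		p.counts[volume]--
	}
}

// shouldSkipTearDown reports whether the cleanup loop should leave the
// volume alone because a newer SetUp is racing with it.
func (p *pendingSetups) shouldSkipTearDown(volume string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.counts[volume] > 0
}
```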
Consider flipping this to a "[Feature:PD]" test if you need time, since it sounds like the problem is well understood |
Proposed design here: #21931. Implementation in progress. |
@saad-ali - any update on this? |
This is breaking often enough that it's bypassing my submit queue optimization - can we move this test to flaky until we get it fixed? |
Specifically the test "Pod Disks should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession" |
Automatic merge from submit-queue

Attach/Detach Controller Kubelet Changes

This PR contains changes to enable the attach/detach controller proposed in #20262. Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by the attach/detach controller. Default enabled.
* Removes all references to the `SafeToDetach` annotation from the controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of the `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
  * There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue kubernetes/node-problem-detector#9 (comment) opened to fix that.
  * There is a bug here in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Has the controller process detaches before attaches, so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fixes misc bugs in the controller.
* Modifies the GCE attacher to remove retries, remove the mutex, and not fail if the volume is already attached or already detached.

Fixes #14642, #19953

```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.

A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" annotation or not.
```
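To make the `VolumesInUse` gating concrete, here is a hedged sketch of the check the controller performs before issuing a detach, written against today's `k8s.io/api/core/v1` types for readability; the helper name is illustrative, not the actual controller code.

```go
package attachdetach

import v1 "k8s.io/api/core/v1"

// safeToDetach sketches the gate described above: only detach a volume
// once the kubelet has removed it from the node's Status.VolumesInUse
// list, i.e. it is unmounted and no longer in use on that node.
func safeToDetach(node *v1.Node, volumeName v1.UniqueVolumeName) bool {
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == volumeName {
			// Kubelet still reports the volume as in use; detaching now
			// could pull the disk out from under a running pod.
			return false
		}
	}
	return true
}
```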
Is this one now fixed since #26351 went in? |
Yes, this is fixed with #26351 |
Just got this error with a 2-node cluster. The number of replicas is 1, and I've used the deployment update type Recreate. From the events I've noticed that the pod is created before the disk gets detached from one node and attached to the other (it gets attached after some time). The documentation doesn't state whether there's a retry mechanism in place. I've checked the attachment of the disk by running:
It may be a Google Container Engine issue. |
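The exact command used above isn't shown, but for reference, one way to inspect which instances a GCE PD is currently attached to is the disk's `users` field in the Compute API. A minimal Go sketch follows; the project, zone, and disk names are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	// Uses Application Default Credentials; project, zone, and disk
	// names below are placeholders.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}
	disk, err := svc.Disks.Get("my-project", "us-central1-a", "my-pd").Context(ctx).Do()
	if err != nil {
		log.Fatalf("getting disk: %v", err)
	}
	// The users field lists the instances the disk is currently attached to.
	fmt.Println("attached to:", disk.Users)
}
```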
I am also seeing this issue with a 1.5 cluster on GCE |
To pile on, I am seeing this issue on a 1.6.2 GKE cluster. We have a persistent group and a preemptible one, and we've started seeing this quite a bit recently. I believe this may be a regression in 1.6; I have seen it happen before, but it usually cleared up on its own. Now it requires manual intervention. |
We're also seeing this on a 1.6.4 GKE cluster. When a pod with a PD is destroyed, the re-created one fails with "The disk resource [...] is already being used by [...]". |
Also seeing this. In my case, it's being attached to a pod run by a job, and when that job runs to completion, the disk isn't being detached. Resource is here: https://github.com/andrewhowdencom/m2onk8s/blob/master/deploy/helm/charts/magento/templates/install.job.yaml It's being consumed by the deployment in that same folder.
@bbhoss @shimizuf perhaps we should open a new issue rather than trying to necro this one? |
Please re-open, this is still happening with v1.6.4 |
Instead of re-opening this bug, I've created a new bug: #48968 Please carry on conversation there |
Original title: e2e failure: Pod Disks should schedule a pod w/ a RW PD, remove it, then schedule it on another host #19953
From http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-1.1/1506/
Pod Disks should schedule a pod w/ a RW PD, remove it, then schedule it on another host
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:124
Expected error:
    <*errors.errorString | 0xc208ce6950>: {
        s: "gave up waiting for pod 'pd-test-893001f0-c07f-11e5-9b35-42010af01555' to be 'running' after 15m0s",
    }
    gave up waiting for pod 'pd-test-893001f0-c07f-11e5-9b35-42010af01555' to be 'running' after 15m0s
not to have occurred
@saad-ali as team/cluster PD expert?