
GCE PD Volumes already attached to a node fail with "Error 400: The disk resource is already being used by" node #19953

Closed
alex-mohr opened this issue Jan 21, 2016 · 40 comments
Labels: area/platform/gce, kind/flake, priority/critical-urgent

Comments

@alex-mohr
Contributor

Original title: e2e failure: Pod Disks should schedule a pod w/ a RW PD, remove it, then schedule it on another host #19953

From http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-1.1/1506/

Pod Disks should schedule a pod w/ a RW PD, remove it, then schedule it on another host

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:124 Expected error: <*errors.errorString | 0xc208ce6950>: { s: "gave up waiting for pod 'pd-test-893001f0-c07f-11e5-9b35-42010af01555' to be 'running' after 15m0s", } gave up waiting for pod 'pd-test-893001f0-c07f-11e5-9b35-42010af01555' to be 'running' after 15m0s not to have occurred

@saad-ali as team/cluster PD expert?

@alex-mohr added the kind/flake label Jan 22, 2016
@bprashanth
Contributor

Saad's already got 3 flakes; I can do some initial digging. @saad-ali let me know if you already started on this.

@bprashanth assigned bprashanth and unassigned saad-ali Jan 22, 2016
@saad-ali assigned saad-ali and unassigned bprashanth Jan 22, 2016
@saad-ali
Member

Taking a look

@saad-ali
Member

The test moves a PD between two nodes.

The first node attached/detached the PD successfully.

...
[node-r7f8] I0121 20:43:12.057247    3491 gce_util.go:148] Successfully attached GCE PD "e2e-81bf8581-c07f-11e5-9b35-42010af01555".
...
[node-r7f8] I0121 20:43:34.420735    3491 gce_util.go:223] Successfully detached GCE PD "e2e-81bf8581-c07f-11e5-9b35-42010af01555".
...

On the second node, the call to attach the PD failed with a backend error:

[node-v4cj] E0121 20:43:44.591983    3490 gce_util.go:136] Error attaching PD "e2e-81bf8581-c07f-11e5-9b35-42010af01555": googleapi: Error 503: Backend Error, backendError

Despite the failure, the call actually attached the disk to [node-v4cj], so further attempts by node-v4cj to attach the disk failed because the disk was already attached:

E0121 20:44:02.578215    3490 gce.go:294] GCE operation failed: googleapi: Error 400: The disk resource 'projects/k8s-jkns-e2e-gke-release-1-1/zones/us-central1-f/disks/e2e-81bf8581-c07f-11e5-9b35-42010af01555' is already being used by 'projects/k8s-jkns-e2e-gke-release-1-1/zones/us-central1-f/instances/gke-gke-release-1-1-62a55764-node-v4cj'

Fix: "disk is already attached" errors should result in success, not failure.

@alex-mohr
Contributor Author

Fix: "disk is already attached" errors should result in success, not failure.
nit: only if the disk is attached to the correct instance?

@saad-ali
Member

nit: only if the disk is attached to the correct instance?

Right
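
For illustration, a minimal sketch of the refined fix being discussed (the `gceDisks` interface and helper names below are hypothetical, not the actual gce_util.go API): the "already being used by" error is only treated as success after confirming the disk is attached to the instance we targeted.

```go
// Sketch only: hypothetical types, not the real pkg/cloudprovider GCE code.
package gcepd

import "fmt"

// gceDisks abstracts the cloud calls this sketch needs.
type gceDisks interface {
	// AttachDisk attaches diskName to instanceName.
	AttachDisk(diskName, instanceName string) error
	// DiskIsAttached reports whether diskName is currently attached to instanceName.
	DiskIsAttached(diskName, instanceName string) (bool, error)
	// IsAlreadyAttachedError reports whether err is the GCE 400
	// "disk resource ... is already being used by ..." error.
	IsAlreadyAttachedError(err error) bool
}

// attachDisk treats "already attached" as success, but only when the disk is
// attached to the instance we intended (the nit raised above).
func attachDisk(cloud gceDisks, diskName, instanceName string) error {
	err := cloud.AttachDisk(diskName, instanceName)
	if err == nil {
		return nil
	}
	if cloud.IsAlreadyAttachedError(err) {
		attached, checkErr := cloud.DiskIsAttached(diskName, instanceName)
		if checkErr == nil && attached {
			// Already attached to the correct instance: not an error.
			return nil
		}
		// Attached to some other instance (or state unknown): still a failure.
	}
	return fmt.Errorf("failed to attach disk %q to instance %q: %v", diskName, instanceName, err)
}
```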

@lavalamp added the priority/important-soon label Jan 25, 2016
@thockin
Member

thockin commented Jan 25, 2016

Saad, how deep is your queue?

@bprashanth
Contributor

I had volunteered on Friday to de-flake #19574 today, unless @saad-ali has other plans for that one by now.

@saad-ali
Member

@bprashanth Ya, feel free to take #19574

@dchen1107
Member

We have a new failure at kubernetes-soak-continuous-e2e-gce/4396, with related output from the failing test:

11:00:13 Jan 25 11:00:13.658: INFO: Waiting for pod pd-test-c5c6aec8-c393-11e5-8a0b-42010af01555 in namespace 'e2e-tests-pod-disks-dqp13' status to be 'running'(found phase: "Pending", readiness: false) (14m58.61188919s elapsed)
11:00:15 Jan 25 11:00:15.658: INFO: Unexpected error occurred: gave up waiting for pod 'pd-test-c5c6aec8-c393-11e5-8a0b-42010af01555' to be 'running' after 15m0s
11:00:15 STEP: cleaning up PD-RW test environment
11:00:31 STEP: Waiting for PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555" to detach from "gce-soak-weekly-minion-atak"
11:00:31 Jan 25 11:00:31.580: INFO: GCE PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555" appears to have successfully detached from "gce-soak-weekly-minion-atak".
11:00:35 E0125 11:00:35.209148   19387 gce.go:328] GCE operation failed: googleapi: Error 400: Invalid value for field 'disk': 'gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555'.
11:00:35 STEP: Waiting for PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555" to detach from "gce-soak-weekly-minion-p7l9"
11:00:35 Jan 25 11:00:35.295: INFO: GCE PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555" appears to have successfully detached from "gce-soak-weekly-minion-p7l9".
11:00:35 STEP: Deleting PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555"
11:00:38 Jan 25 11:00:38.772: INFO: Successfully deleted PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555".
11:00:41 [AfterEach] Pod Disks
11:00:41   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework.go:70
11:00:41 STEP: Collecting events from namespace "e2e-tests-pod-disks-dqp13".
11:00:41 Jan 25 11:00:38.839: INFO: event for pd-test-c5c6aec8-c393-11e5-8a0b-42010af01555: {kubelet gce-soak-weekly-minion-atak} FailedMount: Unable to mount volumes for pod "pd-test-c5c6aec8-c393-11e5-8a0b-42010af01555_e2e-tests-pod-disks-dqp13(c5c6fbf4-c393-11e5-96b2-42010af00002)": Could not attach GCE PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555". Timeout waiting for mount paths to be created.
11:00:41 Jan 25 11:00:38.839: INFO: event for pd-test-c5c6aec8-c393-11e5-8a0b-42010af01555: {kubelet gce-soak-weekly-minion-atak} FailedSync: Error syncing pod, skipping: Could not attach GCE PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555". Timeout waiting for mount paths to be created.
11:00:41 Jan 25 11:00:38.944: INFO: POD                                                NODE                         PHASE    GRACE  CONDITIONS
11:00:41 Jan 25 11:00:38.944: INFO: elasticsearch-logging-v1-1viuc                     gce-soak-weekly-minion-z3wx  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:21 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: elasticsearch-logging-v1-oryp2                     gce-soak-weekly-minion-atak  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:15 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: fluentd-elasticsearch-gce-soak-weekly-minion-atak  gce-soak-weekly-minion-atak  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:36 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: fluentd-elasticsearch-gce-soak-weekly-minion-p7l9  gce-soak-weekly-minion-p7l9  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:29 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: fluentd-elasticsearch-gce-soak-weekly-minion-z3wx  gce-soak-weekly-minion-z3wx  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:29 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: heapster-v11-4sgxh                                 gce-soak-weekly-minion-z3wx  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:39:47 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: kibana-logging-v1-85dy5                            gce-soak-weekly-minion-p7l9  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:26 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: kube-dns-v10-wi4xe                                 gce-soak-weekly-minion-p7l9  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:22 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: kube-proxy-gce-soak-weekly-minion-atak             gce-soak-weekly-minion-atak  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:39:30 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: kube-proxy-gce-soak-weekly-minion-p7l9             gce-soak-weekly-minion-p7l9  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:39:57 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: kube-proxy-gce-soak-weekly-minion-z3wx             gce-soak-weekly-minion-z3wx  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:39:43 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: kube-ui-v4-nhs5j                                   gce-soak-weekly-minion-p7l9  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:39:39 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: l7-lb-controller-93swe                             gce-soak-weekly-minion-z3wx  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-19 04:40:21 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: monitoring-influxdb-grafana-v2-lm7i3               gce-soak-weekly-minion-atak  Running         [{Ready True 0001-01-01 00:00:00 +0000 UTC 2016-01-24 16:04:12 -0800 PST  }]
11:00:41 Jan 25 11:00:38.967: INFO: 
11:00:41 Jan 25 11:00:38.975: INFO: 
11:00:41 Logging node info for node gce-soak-weekly-minion-atak
11:00:41 Jan 25 11:00:38.979: INFO: Node Info: &{{ } {gce-soak-weekly-minion-atak   /api/v1/nodes/gce-soak-weekly-minion-atak 95f11adf-bea9-11e5-96b2-42010af00002 1450313 0 2016-01-19 04:38:47 -0800 PST <nil> <nil> map[kubernetes.io/e2e-7287d155-c057-11e5-8ba7-42010af01555:42 kubernetes.io/e2e-0bea78bb-c0a6-11e5-8432-42010af01555:42 kubernetes.io/e2e-4de81177-c348-11e5-9bc1-42010af01555:42 kubernetes.io/e2e-fb473260-bf4b-11e5-974a-42010af01555:42 kubernetes.io/e2e-7e91c333-beeb-11e5-b287-42010af01555:42 kubernetes.io/e2e-957ab4e0-c16f-11e5-b9fa-42010af01555:42 kubernetes.io/e2e-7200c047-c225-11e5-be03-42010af01555:42 kubernetes.io/e2e-d7dc9d4e-c2d7-11e5-9f89-42010af01555:42 kubernetes.io/e2e-100b9d9c-bec2-11e5-b9b6-42010af01555:42 kubernetes.io/e2e-82f498d4-c129-11e5-87ca-42010af01555:42 kubernetes.io/e2e-a1975164-c18e-11e5-af0d-42010af01555:42 kubernetes.io/e2e-d4a5c6e7-bf83-11e5-9080-42010af01555:42 kubernetes.io/e2e-d9273b79-bfb4-11e5-b36f-42010af01555:42 failure-domain.alpha.kubernetes.io/region:us-central1 kubernetes.io/e2e-b4f9a59b-bffc-11e5-9d6f-42010af01555:42 kubernetes.io/e2e-19cebc84-c2c5-11e5-a965-42010af01555:42 kubernetes.io/e2e-73015689-c1d5-11e5-87fa-42010af01555:42 kubernetes.io/e2e-f955293a-c009-11e5-8b84-42010af01555:42 kubernetes.io/e2e-b428ca98-c375-11e5-a3a0-42010af01555:42 kubernetes.io/e2e-225ef8e1-c04b-11e5-b0f9-42010af01555:42 kubernetes.io/e2e-8caf35fe-c28d-11e5-a35a-42010af01555:42 kubernetes.io/e2e-1c021233-bf57-11e5-88ed-42010af01555:42 kubernetes.io/e2e-22201b6a-c2f2-11e5-86b1-42010af01555:42 kubernetes.io/e2e-3909886c-c35c-11e5-8a0f-42010af01555:42 kubernetes.io/e2e-14f7ff91-bf9b-11e5-a70f-42010af01555:42 kubernetes.io/e2e-89e38130-c29b-11e5-b919-42010af01555:42 failure-domain.alpha.kubernetes.io/zone:us-central1-f kubernetes.io/hostname:gce-soak-weekly-minion-atak kubernetes.io/e2e-d28199c5-beff-11e5-8afe-42010af01555:42 kubernetes.io/e2e-6cbd9004-bec7-11e5-92ad-42010af01555:42 kubernetes.io/e2e-b764753b-c14b-11e5-87e0-42010af01555:42] map[]} {10.252.0.0/24 6846059732836283162 gce://kubernetes-jenkins/us-central1-f/gce-soak-weekly-minion-atak false} {map[pods:{110.000 DecimalSI} cpu:{2.000 DecimalSI} memory:{7863918592.000 BinarySI}] map[cpu:{2.000 DecimalSI} memory:{7863918592.000 BinarySI} pods:{110.000 DecimalSI}]  [{OutOfDisk False 2016-01-25 11:00:30 -0800 PST 2016-01-19 04:38:47 -0800 PST KubeletHasSufficientDisk kubelet has sufficient disk space available} {Ready True 2016-01-25 11:00:30 -0800 PST 2016-01-19 04:39:24 -0800 PST KubeletReady kubelet is posting ready status}] [{InternalIP 10.240.0.4} {ExternalIP 104.197.247.242}] {{10250}} { A7BDDD29-B935-6E0B-6A77-35A3E6D318C1 1597a9b7-e5a8-49c8-b663-c332950e271c 3.16.0-0.bpo.4-amd64 Debian GNU/Linux 7 (wheezy) docker://1.8.3 v1.2.0-alpha.6.204+810544633e3c12 v1.2.0-alpha.6.204+810544633e3c12} [{[gcr.io/google_containers/kube-proxy:175a5358826110d18f6a2b3be1bcf86e] 162771509} {[<none>:<none>] 1113554} {[busybox:latest] 1113554} {[redis:latest] 151314673} {[<none>:<none>] 151314673} {[nginx:latest] 133864957} {[gcr.io/google_containers/elasticsearch:1.8] 410989305} {[gcr.io/google_containers/mounttest:0.6] 2084693} {[gcr.io/google_containers/hostexec:1.2] 13209617} {[gcr.io/google_containers/mounttest-user:0.3] 1718853} {[gcr.io/google_containers/mounttest:0.5] 1718853} {[gcr.io/google_containers/heapster_grafana:v2.1.1] 206641132} {[gcr.io/google_containers/fluentd-elasticsearch:1.12] 551579914} {[gcr.io/google_containers/pause:2.0] 350164} 
{[gcr.io/google_containers/porter:cd5cb5791ebaa8641955f0e8c2a9bed669b1eaab] 5010921} {[gcr.io/google_containers/netexec:1.3.1] 6530936} {[gcr.io/google_containers/goproxy:0.1] 5489816} {[gcr.io/google_containers/portforwardtester:1.0] 2296329} {[gcr.io/google_containers/netexec:1.1] 7280698} {[gcr.io/google_containers/heapster_influxdb:v0.4] 271818313} {[gcr.io/google_containers/netexec:1.0] 7280393} {[gcr.io/google_samples/gb-frontend:v3] 509707703} {[gcr.io/google_samples/gb-redisslave:v1] 109508753} {[gcr.io/google_containers/nettest:1.6] 10374838} {[gcr.io/google_containers/jessie-dnsutils:latest] 190148402} {[gcr.io/google_containers/mounttest:0.2] 1752375} {[gcr.io/google_containers/dnsutils:latest] 141895666} {[gcr.io/google_containers/pause:0.8.0] 241656} {[gcr.io/google_containers/eptest:0.1] 2970692} {[b.gcr.io/k8s_authenticated_test/serve_hostname:1.1 gcr.io/google_containers/serve_hostname:1.1] 4522409} {[gcr.io/google_containers/update-demo:nautilus] 4555533} {[gcr.io/google_containers/update-demo:kitten] 4549069} {[gcr.io/google_containers/nginx:1.7.9] 91664166} {[gcr.io/google_containers/test-webserver:latest] 4534272} {[gcr.io/google_containers/busybox:latest] 2433303} {[gcr.io/google_containers/liveness:latest] 4387474}]}}
11:00:41 Jan 25 11:00:38.993: INFO: 
11:00:41 Logging kubelet events for node gce-soak-weekly-minion-atak
11:00:41 Jan 25 11:00:38.996: INFO: 
11:00:41 Logging pods the kubelet thinks is on node gce-soak-weekly-minion-atak
11:00:41 Jan 25 11:00:39.024: INFO: monitoring-influxdb-grafana-v2-lm7i3 started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.024: INFO: fluentd-elasticsearch-gce-soak-weekly-minion-atak started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.024: INFO: elasticsearch-logging-v1-oryp2 started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.024: INFO: kube-proxy-gce-soak-weekly-minion-atak started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="create_exec"} => 4 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="info"} => 1 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="inspect_container"} => 2 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="inspect_image"} => 34 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="list_containers"} => 12 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="list_images"} => 5 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="logs"} => 24 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="start_container"} => 76 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="stop_container"} => 6477 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: ERROR kubelet_docker_errors{operation_type="version"} => 18 @[0]
11:00:41 Jan 25 11:00:39.128: INFO: 
11:00:41 Latency metrics for node gce-soak-weekly-minion-atak
11:00:41 Jan 25 11:00:39.128: INFO: 
11:00:41 Logging node info for node gce-soak-weekly-minion-p7l9
11:00:41 Jan 25 11:00:39.133: INFO: Node Info: &{{ } {gce-soak-weekly-minion-p7l9   /api/v1/nodes/gce-soak-weekly-minion-p7l9 97966ab3-bea9-11e5-96b2-42010af00002 1450314 0 2016-01-19 04:38:50 -0800 PST <nil> <nil> map[failure-domain.alpha.kubernetes.io/region:us-central1 failure-domain.alpha.kubernetes.io/zone:us-central1-f kubernetes.io/hostname:gce-soak-weekly-minion-p7l9] map[]} {10.245.217.0/24 12319096830899180753 gce://kubernetes-jenkins/us-central1-f/gce-soak-weekly-minion-p7l9 false} {map[pods:{110.000 DecimalSI} cpu:{2.000 DecimalSI} memory:{7863918592.000 BinarySI}] map[memory:{7863918592.000 BinarySI} pods:{110.000 DecimalSI} cpu:{2.000 DecimalSI}]  [{OutOfDisk False 2016-01-25 11:00:35 -0800 PST 2016-01-19 04:38:50 -0800 PST KubeletHasSufficientDisk kubelet has sufficient disk space available} {Ready True 2016-01-25 11:00:35 -0800 PST 2016-01-19 04:39:27 -0800 PST KubeletReady kubelet is posting ready status}] [{InternalIP 10.240.0.5} {ExternalIP 104.197.253.148}] {{10250}} { 8D962FBD-B826-6128-D806-250CE51E6543 4de3b450-51a2-4a63-a664-8fc9f4b3c061 3.16.0-0.bpo.4-amd64 Debian GNU/Linux 7 (wheezy) docker://1.8.3 v1.2.0-alpha.6.204+810544633e3c12 v1.2.0-alpha.6.204+810544633e3c12} [{[gcr.io/google_containers/kube-proxy:175a5358826110d18f6a2b3be1bcf86e] 162771509} {[gcr.io/google_containers/kube2sky:1.12] 24482187} {[gcr.io/google_containers/kube-ui:v4] 5676927} {[gcr.io/google_containers/skydns:2015-10-13-8c72f8c] 40551394} {[gcr.io/google_containers/fluentd-elasticsearch:1.12] 551579914} {[gcr.io/google_containers/pause:2.0] 350164} {[gcr.io/google_containers/netexec:1.0] 7280393} {[gcr.io/google_samples/gb-frontend:v3] 509707703} {[gcr.io/google_containers/nettest:1.6] 10374838} {[gcr.io/google_containers/exechealthz:1.0] 7099444} {[gcr.io/google_containers/kibana:1.3] 396897764} {[gcr.io/google_containers/etcd:2.0.9] 12819040} {[gcr.io/google_containers/pause:0.8.0] 241656} {[gcr.io/google_containers/serve_hostname:1.1 b.gcr.io/k8s_authenticated_test/serve_hostname:1.1] 4522409} {[gcr.io/google_containers/busybox:latest] 2433303}]}}
11:00:41 Jan 25 11:00:39.133: INFO: 
11:00:41 Logging kubelet events for node gce-soak-weekly-minion-p7l9
11:00:41 Jan 25 11:00:39.135: INFO: 
11:00:41 Logging pods the kubelet thinks is on node gce-soak-weekly-minion-p7l9
11:00:41 Jan 25 11:00:39.154: INFO: kibana-logging-v1-85dy5 started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.154: INFO: kube-dns-v10-wi4xe started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.154: INFO: kube-proxy-gce-soak-weekly-minion-p7l9 started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.154: INFO: kube-ui-v4-nhs5j started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.154: INFO: fluentd-elasticsearch-gce-soak-weekly-minion-p7l9 started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.221: INFO: ERROR kubelet_docker_errors{operation_type="info"} => 1 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="inspect_container"} => 1 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="inspect_image"} => 22 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="list_containers"} => 12 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="list_images"} => 5 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="pull_image"} => 2 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="start_container"} => 54 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="stop_container"} => 3684 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: ERROR kubelet_docker_errors{operation_type="version"} => 17 @[0]
11:00:41 Jan 25 11:00:39.222: INFO: 
11:00:41 Latency metrics for node gce-soak-weekly-minion-p7l9
11:00:41 Jan 25 11:00:39.222: INFO: 
11:00:41 Logging node info for node gce-soak-weekly-minion-z3wx
11:00:41 Jan 25 11:00:39.226: INFO: Node Info: &{{ } {gce-soak-weekly-minion-z3wx   /api/v1/nodes/gce-soak-weekly-minion-z3wx 960b2240-bea9-11e5-96b2-42010af00002 1450315 0 2016-01-19 04:38:48 -0800 PST <nil> <nil> map[kubernetes.io/e2e-34446fb8-bf1e-11e5-ba22-42010af01555:42 kubernetes.io/e2e-099fbdde-c251-11e5-816a-42010af01555:42 kubernetes.io/e2e-3de203e7-bfc3-11e5-928e-42010af01555:42 kubernetes.io/e2e-08902a6b-c1bc-11e5-bb08-42010af01555:42 kubernetes.io/e2e-6c60738c-c020-11e5-a295-42010af01555:42 kubernetes.io/e2e-b4231c98-c0f2-11e5-b0c6-42010af01555:42 kubernetes.io/hostname:gce-soak-weekly-minion-z3wx failure-domain.alpha.kubernetes.io/zone:us-central1-f kubernetes.io/e2e-4f74ac43-c238-11e5-a110-42010af01555:42 kubernetes.io/e2e-cc3a0b63-c0b6-11e5-848a-42010af01555:42 kubernetes.io/e2e-e84b8a83-bfd6-11e5-9878-42010af01555:42 kubernetes.io/e2e-f7cec822-c138-11e5-a64c-42010af01555:42 kubernetes.io/e2e-4d68d823-c381-11e5-ba64-42010af01555:42 kubernetes.io/e2e-f5ce6b92-c27c-11e5-a8d8-42010af01555:42 kubernetes.io/e2e-6adeb5fa-c069-11e5-bcd8-42010af01555:42 kubernetes.io/e2e-89f46236-c207-11e5-be5b-42010af01555:42 kubernetes.io/e2e-cf4c75f5-c0d6-11e5-8247-42010af01555:42 kubernetes.io/e2e-cab749bd-c1e5-11e5-8d7f-42010af01555:42 kubernetes.io/e2e-9fbf6a13-c2fe-11e5-837a-42010af01555:42 kubernetes.io/e2e-616eff0f-c07f-11e5-8dd1-42010af01555:42 kubernetes.io/e2e-ed6e690f-c18b-11e5-b2ce-42010af01555:42 failure-domain.alpha.kubernetes.io/region:us-central1 kubernetes.io/e2e-05064625-bf39-11e5-b82a-42010af01555:42 kubernetes.io/e2e-8551b179-c1ff-11e5-aa5f-42010af01555:42 kubernetes.io/e2e-fe3c6e7a-c10a-11e5-9d6f-42010af01555:42 kubernetes.io/e2e-1acbdafd-c31c-11e5-8425-42010af01555:42] map[]} {10.249.224.0/24 17347648558513692064 gce://kubernetes-jenkins/us-central1-f/gce-soak-weekly-minion-z3wx false} {map[cpu:{2.000 DecimalSI} memory:{7863918592.000 BinarySI} pods:{110.000 DecimalSI}] map[memory:{7863918592.000 BinarySI} pods:{110.000 DecimalSI} cpu:{2.000 DecimalSI}]  [{OutOfDisk False 2016-01-25 11:00:37 -0800 PST 2016-01-19 04:38:47 -0800 PST KubeletHasSufficientDisk kubelet has sufficient disk space available} {Ready True 2016-01-25 11:00:37 -0800 PST 2016-01-19 04:39:23 -0800 PST KubeletReady kubelet is posting ready status}] [{InternalIP 10.240.0.3} {ExternalIP 104.197.241.180}] {{10250}} { 07C21C4D-23FF-875D-C06C-201D9F9D5D17 cba94176-d511-4129-9e94-d65d2a942d25 3.16.0-0.bpo.4-amd64 Debian GNU/Linux 7 (wheezy) docker://1.8.3 v1.2.0-alpha.6.204+810544633e3c12 v1.2.0-alpha.6.204+810544633e3c12} [{[gcr.io/google_containers/kube-proxy:175a5358826110d18f6a2b3be1bcf86e] 162771509} {[busybox:latest] 1113554} {[gcr.io/google_containers/heapster:v0.18.4] 37026444} {[<none>:<none>] 151314673} {[redis:latest] 151314673} {[nginx:latest] 133864957} {[gcr.io/google_containers/elasticsearch:1.8] 410989305} {[gcr.io/google_containers/glbc:0.5.1] 224751780} {[gcr.io/google_containers/mounttest:0.6] 2084693} {[gcr.io/google_containers/hostexec:1.2] 13209617} {[gcr.io/google_containers/mounttest-user:0.3] 1718853} {[gcr.io/google_containers/defaultbackend:1.0] 7513643} {[gcr.io/google_containers/mounttest:0.5] 1718853} {[gcr.io/google_containers/fluentd-elasticsearch:1.12] 551579914} {[gcr.io/google_containers/pause:2.0] 350164} {[gcr.io/google_containers/porter:cd5cb5791ebaa8641955f0e8c2a9bed669b1eaab] 5010921} {[gcr.io/google_containers/netexec:1.3.1] 6530936} {[gcr.io/google_containers/goproxy:0.1] 5489816} {[gcr.io/google_containers/portforwardtester:1.0] 2296329} 
{[gcr.io/google_containers/netexec:1.1] 7280698} {[gcr.io/google_containers/netexec:1.0] 7280393} {[gcr.io/google_samples/gb-frontend:v3] 509707703} {[gcr.io/google_samples/gb-redisslave:v1] 109508753} {[gcr.io/google_containers/nettest:1.6] 10374838} {[gcr.io/google_containers/mounttest:0.2] 1752375} {[gcr.io/google_containers/pause:0.8.0] 241656} {[gcr.io/google_containers/eptest:0.1] 2970692} {[b.gcr.io/k8s_authenticated_test/serve_hostname:1.1 gcr.io/google_containers/serve_hostname:1.1] 4522409} {[gcr.io/google_containers/update-demo:kitten] 4549069} {[gcr.io/google_containers/update-demo:nautilus] 4555533} {[gcr.io/google_containers/nginx:1.7.9] 91664166} {[gcr.io/google_containers/test-webserver:latest] 4534272} {[gcr.io/google_containers/busybox:latest] 2433303} {[gcr.io/google_containers/liveness:latest] 4387474}]}}
11:00:41 Jan 25 11:00:39.227: INFO: 
11:00:41 Logging kubelet events for node gce-soak-weekly-minion-z3wx
11:00:41 Jan 25 11:00:39.361: INFO: 
11:00:41 Logging pods the kubelet thinks is on node gce-soak-weekly-minion-z3wx
11:00:41 Jan 25 11:00:39.571: INFO: heapster-v11-4sgxh started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.571: INFO: fluentd-elasticsearch-gce-soak-weekly-minion-z3wx started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.571: INFO: elasticsearch-logging-v1-1viuc started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.571: INFO: l7-lb-controller-93swe started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.571: INFO: kube-proxy-gce-soak-weekly-minion-z3wx started at <nil> (0 container statuses recorded)
11:00:41 Jan 25 11:00:39.880: INFO: ERROR kubelet_docker_errors{operation_type="create_exec"} => 2 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="info"} => 1 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="inspect_container"} => 4 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="inspect_image"} => 35 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="list_containers"} => 12 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="list_images"} => 5 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="logs"} => 26 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="pull_image"} => 1 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="start_container"} => 50 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="stop_container"} => 6612 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: ERROR kubelet_docker_errors{operation_type="version"} => 18 @[0]
11:00:41 Jan 25 11:00:39.881: INFO: 
11:00:41 Latency metrics for node gce-soak-weekly-minion-z3wx
11:00:41 Jan 25 11:00:39.881: INFO: Waiting up to 1m0s for all nodes to be ready
11:00:41 STEP: Destroying namespace "e2e-tests-pod-disks-dqp13" for this suite.
11:00:41 
11:00:41 • Failure [929.205 seconds]
11:00:41 Pod Disks
11:00:41 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:267
11:00:41   should schedule a pod w/ a RW PD, remove it, then schedule it on another host [It]

@dchen1107
Member

Another failure instance for the PD test: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-soak-continuous-e2e-gce/4397/, but a different test case:

Pod Disks should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession [Slow]

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pd.go:266 Expected error: <*errors.errorString | 0xc2083d9e50>: { s: "gave up waiting for pod 'pd-test-f247fa94-c3bb-11e5-8603-42010af01555' to be 'running' after 15m0s", } gave up waiting for pod 'pd-test-f247fa94-c3bb-11e5-8603-42010af01555' to be 'running' after 15m0s not to have occurred

I guess they might have the same cause, so I did not create a new issue. Feel free to create another one though.

@pmorie
Member

pmorie commented Jan 26, 2016

@kubernetes/rh-storage

@saad-ali
Member

Thanks @dchen1107. Confirmed that both those continuous E2E builds hit the same issue.

kubernetes-soak-continuous-e2e-gce/4396:

E0125 18:49:15.515200    3488 gce_util.go:170] Error attaching PD "gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555": googleapi: Error 400: The disk resource 'projects/kubernetes-jenkins/zones/us-central1-f/disks/gce-soak-weekly-c39c1948-c393-11e5-8a0b-42010af01555' is already being used by 'projects/kubernetes-jenkins/zones/us-central1-f/instances/gce-soak-weekly-minion-atak'

Basically this bug exposes the house of cards that is the current state of the attach/detach logic in kubelet. In the existing code there is a race condition between the asynchronous pod creation loop and the orphaned pods cleanup loop. Although there is some logic to make sure that the actual attach/detach operations don't step on each other's toes, there is basically no synchronization between the remaining logic in the two loops responsible for setting up and tearing down volumes, so the loops end up racing. As currently written, the "fail the attach when the disk is already attached" behavior actually masks a very common race condition that happens when a pod is rapidly created, deleted, and recreated: the second attach operation tends to get triggered slightly before the detach operation and, in the current code, eventually fails out (since being attached to the same node is considered an error); then the detach goes through and pulls the PD, the attach loop comes back around and retries, and all is well.

If we change just this logic ("disk is already attached" errors result in success, not failure), we'll end up exposing that far nastier bug, which pulls disks out from under users while they are using them: the disk gets attached, a second attach operation comes in just before the detach and succeeds, then the detach comes along and pulls the disk while it is in use (which looks like data loss to the user).

The correct fix is larger: interlock attach and detach so that they do not operate independently of each other. I'm tackling this as part of the larger attach/detach controller redesign, which likely will not land in 1.2, so a fix for this is unlikely to make 1.2.
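
For reference, the existing per-operation safeguard ("make sure that the actual attach/detach operations don't step on each other's toes") is roughly of this shape; a hedged sketch with made-up types, not the actual kubelet code. Note that it only provides mutual exclusion per volume and says nothing about which of two racing operations runs first, which is the problem described above.

```go
// Sketch only (hypothetical types): a per-volume lock so that an attach and a
// detach for the same volume can never run at the same time. This does NOT
// control which of the two runs first.
package volumeops

import "sync"

// VolumeLocks serializes attach/detach operations per volume name.
type VolumeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewVolumeLocks() *VolumeLocks {
	return &VolumeLocks{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns (creating if needed) the mutex guarding volumeName.
func (v *VolumeLocks) lockFor(volumeName string) *sync.Mutex {
	v.mu.Lock()
	defer v.mu.Unlock()
	l, ok := v.locks[volumeName]
	if !ok {
		l = &sync.Mutex{}
		v.locks[volumeName] = l
	}
	return l
}

// Run executes op while holding the per-volume lock.
func (v *VolumeLocks) Run(volumeName string, op func() error) error {
	l := v.lockFor(volumeName)
	l.Lock()
	defer l.Unlock()
	return op()
}
```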

@jsafrane
Member

@saad-ali, this looks very similar to #19602. I fixed it for Cinder (#19953); I guess a hotfix for GCE would look the same. A proper attach/detach controller will hopefully fix it for all.

@saad-ali
Member

There is similar logic on the GCE PD side to prevent the attach/detach operations from interrupting each other, but that does nothing to guarantee the order of operations--which is the key issue here. That will be addressed by unifying the loops.
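
Very roughly, "unifying the loops" means something like the sketch below (hypothetical types, not the eventual controller implementation): a single reconcile loop owns both the desired and the actual attachment state, and therefore also owns the ordering, issuing detaches before attaches.

```go
// Sketch only: one reconcile loop that owns both attach and detach decisions,
// so ordering is explicit instead of two independent loops racing.
package reconciler

// attacher abstracts the cloud attach/detach calls (hypothetical interface).
type attacher interface {
	Attach(volume, node string) error
	Detach(volume, node string) error
}

// Reconciler compares the volumes pods want (desired state) with the volumes
// the cloud reports as attached (actual state) and issues operations itself.
type Reconciler struct {
	cloud   attacher
	desired map[string]string // volume name -> node it should be attached to
	actual  map[string]string // volume name -> node it is currently attached to
}

// reconcile processes detaches before attaches, so a volume that moved to a
// different node is detached from the old node before being attached anew.
func (r *Reconciler) reconcile() {
	// Detach volumes attached to a node that should no longer have them.
	for vol, node := range r.actual {
		if r.desired[vol] != node {
			if err := r.cloud.Detach(vol, node); err == nil {
				delete(r.actual, vol)
			}
		}
	}
	// Attach volumes that are desired but not attached anywhere yet.
	for vol, node := range r.desired {
		if _, ok := r.actual[vol]; !ok {
			if err := r.cloud.Attach(vol, node); err == nil {
				r.actual[vol] = node
			}
		}
	}
}
```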

@thockin
Member

thockin commented Jan 28, 2016

IIUC, the FIX is O(hard). Is there a short term mitigation?

@saad-ali
Member

IIUC, the FIX is O(hard). Is there a short term mitigation?

Looking into it

@thockin
Member

thockin commented Feb 1, 2016

Any thoughts on this? If the answer is "not until the real fix", that's what we need to know.

@saad-ali
Member

saad-ali commented Feb 1, 2016

I played with a bunch of hacks, like skipping the teardown step if there are pending setups in progress for a given volume, but they all ended up destabilizing the system even more. I'm tempted to just implement the "real fix", but that seems like a bad idea to get in at the last minute. I'll keep experimenting and see if I can get something before Friday, but let's go with "not until the real fix" for now.

@bprashanth
Contributor

Consider flipping this to a "[Feature:PD]" test if you need time, since it sounds like the problem is well understood.

@saad-ali
Member

the fix is complicated and will require big changes to the current volume mount/unmount logic

Proposed design here: #21931. Implementation in progress.

@gmarek
Contributor

gmarek commented May 17, 2016

@saad-ali, any update on this?

@saad-ali
Member

@gmarek lots of work ongoing (e.g. #25457), targeted at 1.3.

@lavalamp
Member

This is breaking often enough that it's bypassing my submit-queue optimization. Can we move this test to flaky until we get it fixed?

@lavalamp
Member

Specifically, the test "Pod Disks should schedule a pod w/two RW PDs both mounted to one container, write to PD, verify contents, delete pod, recreate pod, verify contents, and repeat in rapid succession"

@saad-ali
Member

can we move this test to flaky until we get it fixed

Done, see PR #26089.
There was a PR that went in over the weekend (#21709) that destabilized this test.

k8s-github-robot pushed a commit that referenced this issue Jun 3, 2016

Automatic merge from submit-queue

Attach/Detach Controller Kubelet Changes

This PR contains changes to enable attach/detach controller proposed in #20262.

Specifically it:
* Introduces a new `enable-controller-attach-detach` kubelet flag to enable control by the attach/detach controller. Enabled by default.
* Removes all references to the `SafeToDetach` annotation from the controller.
* Adds the new `VolumesInUse` field to the Node Status API object.
* Modifies the controller to use `VolumesInUse` instead of `SafeToDetach` annotation to gate detachment.
* Modifies kubelet to set `VolumesInUse` before Mount and after Unmount.
  * There is a bug in the `node-problem-detector` binary that causes `VolumesInUse` to get reset to nil every 30 seconds. Issue kubernetes/node-problem-detector#9 (comment) opened to fix that.
  * There is a bug here in the mount/unmount code that prevents resetting `VolumesInUse` in some cases; this will be fixed by the mount/unmount refactor.
* Have controller process detaches before attaches so that volumes referenced by pods that are rescheduled to a different node are detached first.
* Fix misc bugs in controller.
* Modify GCE attacher to: remove retries, remove mutex, and not fail if volume is already attached or already detached.

Fixes #14642, #19953

```release-note
Kubernetes v1.3 introduces a new Attach/Detach Controller. This controller manages attaching and detaching volumes on behalf of nodes that have the "volumes.kubernetes.io/controller-managed-attach-detach" annotation.

A kubelet flag, "enable-controller-attach-detach" (default true), controls whether a node sets the "controller-managed-attach-detach" or not.
```
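
To make the `VolumesInUse` gating concrete, here is a hedged sketch (simplified, hypothetical types, not the real controller code) of the rule described in the bullet list above: a detach is only considered safe once the node stops reporting the volume in its status.

```go
// Sketch only: gating detach on the node's reported VolumesInUse, as the
// attach/detach controller change above describes. Types are simplified.
package detachgate

// nodeStatusReader returns the volumes a node currently reports as in use
// (i.e. the VolumesInUse field of its Node status). Hypothetical interface.
type nodeStatusReader interface {
	VolumesInUse(nodeName string) ([]string, error)
}

// safeToDetach reports whether volumeName may be detached from nodeName:
// only once the kubelet has unmounted it and dropped it from VolumesInUse.
func safeToDetach(status nodeStatusReader, nodeName, volumeName string) (bool, error) {
	inUse, err := status.VolumesInUse(nodeName)
	if err != nil {
		return false, err
	}
	for _, v := range inUse {
		if v == volumeName {
			// Still mounted on the node; detaching now would look like
			// data loss to the user.
			return false, nil
		}
	}
	return true, nil
}
```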
@goltermann
Contributor

Is this one now fixed since #26351 went in?

@saad-ali
Member

saad-ali commented Jun 6, 2016

Yes, this is fixed with #26351

@saad-ali closed this as completed Jun 6, 2016
@orian

orian commented Jan 12, 2017

Just got this error.

With a 2-node cluster. The number of replicas is 1.

I've used the deployment update type: Recreate.

From the events I've noticed that the pod is created before the disk gets detached from one node and attached to the other one (it gets attached after some time).

The documentation doesn't state if there's a retry mechanism in place.

I've checked the disk attachment by running:

gcloud compute disks describe somedisk

It may be a Google Container Engine issue.

@jchauncey

jchauncey commented Apr 19, 2017

I am also seeing this issue with a 1.5 cluster on GCE.

@bbhoss

bbhoss commented May 16, 2017

To pile on, I am seeing this issue on a 1.6.2 GKE cluster. We have a persistent group and a preemptible one, and we've started seeing this quite a bit recently. I believe this may be a regression in 1.6; I have seen it happen before, but it usually cleared up. Now it requires manual intervention.

@shimizuf

We're also seeing this on a 1.6.4 GKE cluster. When a pod with a PD is destroyed, the re-created one fails with "The disk resource [...] is already being used by [...]".

@andrewhowdencom

Also seeing this. In my case, it's being attached to a pod run by a job, and when that job runs to completion, the disk isn't being detached.

Resource is here: https://github.com/andrewhowdencom/m2onk8s/blob/master/deploy/helm/charts/magento/templates/install.job.yaml

It's being consumed by the deployment in that same folder.

 kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:22:08Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

@bbhoss @shimizuf perhaps we should open a new issue rather than trying to necro this one?

@gm42

gm42 commented Jul 7, 2017

Please re-open; this is still happening with v1.6.4.

@saad-ali
Member

Instead of re-opening this bug, I've created a new one: #48968. Please carry on the conversation there.

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Jun 10, 2018
Set etcd DialTimeout, fix etcd start order in all-in-one

Origin-commit: 116b179defb1dcbbe5e734efbee1b038a64fc4c7