This repository has been archived by the owner on Apr 21, 2019. It is now read-only.

Expand petset volume zone spreading #50

Closed
irfanurrehman opened this issue Oct 30, 2017 · 12 comments
Labels
area/cloudprovider, area/federation, area/stateful-apps, lifecycle/rotten, priority/backlog, sig/multicluster, team/cluster (deprecated - do not use)

Comments

@irfanurrehman
Contributor

Issue by bprashanth
Tuesday Jun 21, 2016 at 22:28 GMT
Originally opened as kubernetes/kubernetes#27809


We got petset disk zone spreading in and it's really useful. However, we left a couple of TODOs to follow up on (a quick way to observe the spreading on a live cluster is sketched after this list):

  1. Don't embed a zone scheduler in the pv provisioner (https://github.com/kubernetes/kubernetes/pull/27553/files#diff-b3d75e3586a2c9a5140cd549861da9c0R2094)
  2. Write a unit test that protects the zone spreading from petset implementation changes (AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones kubernetes/kubernetes#27553 (comment))
  3. Maybe a multi-AZ e2e with petset before it goes beta?
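
For reference, a quick way to eyeball the spreading on a live cluster is to compare the zone label on each dynamically provisioned PV against the GCE disks themselves. This is only a sketch; it assumes GCE PD dynamic provisioning and the failure-domain.beta.kubernetes.io/zone label on the PVs (the same label the NoVolumeZoneConflict predicate relies on):

# zone of each dynamically provisioned PV, plus the claim that owns it
$ kubectl get pv -L failure-domain.beta.kubernetes.io/zone

# cross-check against the GCE disks (ZONE column); dynamic PD names start with kubernetes-dynamic-pvc-
$ gcloud compute disks list --filter="name~kubernetes-dynamic-pvc"

If spreading is working, consecutive pets (cassandra-data-0, cassandra-data-1, ...) should land in different zones rather than piling up in one.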

@justinsb

@irfanurrehman irfanurrehman added area/cloudprovider, area/federation, area/stateful-apps, priority/backlog, sig/multicluster, and team/cluster (deprecated - do not use) labels on Oct 30, 2017
@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 02:45 GMT


I am having scaling issues with multi-zone PVCs on GCE. We are getting timeout errors with the PVCs. We have 1000 nodes spread across 3 AZs in us-central1.

Here is an example of how long pod creation is taking:

$ kubectl get po
NAME                    READY     STATUS              RESTARTS   AGE
cassandra-analytics-0   1/1       Running             0          12m
cassandra-analytics-1   1/1       Running             0          7m
cassandra-analytics-2   1/1       Running             0          2m
cassandra-analytics-3   0/1       ContainerCreating   0          28s
cassandra-data-0        1/1       Running             0          12m
cassandra-data-1        1/1       Running             0          5m
cassandra-data-2        0/1       ContainerCreating   0          1m

I have seen pod creation times of 1-2m when everything is in the same zone, but now across multiple zones it is crawling.

$ kubectl describe po cassandra-data-1
Events:
  FirstSeen LastSeen    Count   From                    SubobjectPath           Type        Reason      Message
  --------- --------    -----   ----                    -------------           --------    ------      -------
  7m        7m      1   {default-scheduler }                            Normal      Scheduled   Successfully assigned cassandra-data-1 to kubernetes-minion-group-x8ev
  5m        5m      1   {kubelet kubernetes-minion-group-x8ev}                  Warning     FailedMount Unable to mount volumes for pod "cassandra-data-1_default(5acd02c7-38ea-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-data-1"/"default". list of unattached/unmounted volumes=[cassandra-data]
  5m        5m      1   {kubelet kubernetes-minion-group-x8ev}                  Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-data-1"/"default". list of unattached/unmounted volumes=[cassandra-data]
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Pulling     pulling image "gcr.io/aronchick-apollobit/cassandra-debian:v1.1"
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Pulled      Successfully pulled image "gcr.io/aronchick-apollobit/cassandra-debian:v1.1"
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Created     Created container with docker id 05c294ddd491
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Started     Started container with docker id 05c294ddd491

I am running the latest 1.3 with kubernetes/kubernetes#27553 patched in. I am also provisioning SSD PVCs, but the volumes are created long before the timeouts hit.
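
One way to narrow down whether the slow step is the controller-side attach or the kubelet-side mount is to compare what the node reports as attached with what GCE itself reports. A rough sketch (DISK_NAME and ZONE are placeholders for the kubernetes-dynamic-pvc-<uid> disk backing the PV and its zone):

# volumes the attach/detach controller has recorded as attached to this node
$ kubectl get node kubernetes-minion-group-x8ev -o jsonpath='{.status.volumesAttached[*].name}'

# what GCE reports; the "users" field lists the instances the disk is attached to
$ gcloud compute disks describe DISK_NAME --zone ZONE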

cc: @justinsb @bgrant0607 @saad-ali @erictune

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 04:14 GMT


@saad-ali / @bprashanth

This seems to happen specifically when the pet/pod is deployed in a zone the master is not in. Some spot checking has shown that pods in us-central1-b, the same zone as the master, get deployed quickly, while pods in us-central1-c or us-central1-f hit the "Unable to mount volumes for pod" error from above.
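
A quick way to confirm that pattern (sketch only; assumes the standard zone label the cloud provider puts on nodes and PVs):

# which zone each node is in (the master's zone shows up here too)
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone

# which zone each dynamically provisioned PV landed in
$ kubectl get pv -L failure-domain.beta.kubernetes.io/zone

If the observation above holds, the pods whose PV zone matches the master's zone should be the fast ones.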

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 05:16 GMT


The sync loop in reconciler.go seems to be running for 2min

I0623 05:10:59.821836    3599 kubelet.go:2528] SyncLoop (ADD, "api"): "cassandra-analytics-31_default(dfbe550a-3900-11e6-b7d0-42010a800002)"
I0623 05:10:59.978531    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:10:59.978599    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
E0623 05:10:59.980959    3599 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-23 05:11:00.480952365 +0000 UTC (durationBeforeRetry 500ms). error: Volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002") is not yet attached according to node status.
I0623 05:11:00.078995    3599 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") to pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").
I0623 05:11:00.084887    3599 operation_executor.go:673] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").
I0623 05:11:00.580234    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
E0623 05:11:00.582425    3599 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-23 05:11:01.582420585 +0000 UTC (durationBeforeRetry 1s). error: Volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002") is not yet attached according to node status.

And then it works

I0623 05:15:07.929521    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:15:07.932210    3599 operation_executor.go:897] Controller successfully attached volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:15:08.029886    3599 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") to pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").

Here is a grep for PVC out of the kubelet.log https://gist.github.com/chrislovecnm/50d722b04f73dcfd92800be48c584efa

This process is taking about 30 secs when the master is in the same zone as the node, otherwise it is taking 2-4 min.

This is a big scaling issue. Compare 30 seconds to 2-4 minutes when you are looking at 1000 PVCs to mount.
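
The delays line up with the retry back-off visible in the goroutinemap lines: durationBeforeRetry grows from 500ms to 1s in the snippet above, and a later comment in this thread shows it at 2m0s, so once the controller-side attach is slow the kubelet can also sit out long back-off windows on top of it. To pull the whole progression for one volume out of the kubelet log (log file name as used above; the pvc id is the one from this comment):

$ grep durationBeforeRetry kubelet.log | grep pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002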

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 17:26 GMT


@saad-ali / @bprashanth you want me to open a separate issue for the scaling challenges?

cc: @bgrant0607

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Friday Jun 24, 2016 at 07:14 GMT


Now I ran into another issue with this:

I have 1008 minions (added 8 more because of this problem, and it did not help). I am unable to deploy my last two C* instances; I am stuck at 998 :( The last pod gets stuck in Pending and I am getting the following errors:

$ kubectl describe  po cassandra-data-498
<redacted for brevity>
fit failure on node (kubernetes-minion-group-cxmy): Insufficient CPU
fit failure on node (kubernetes-minion-group-zc10): Insufficient CPU
fit failure on node (kubernetes-minion-group-iyit): Insufficient CPU
fit failure on node (kubernetes-minion-group-nnhm): Insufficient CPU
fit failure on node (kubernetes-minion-group-2k68): NoVolumeZoneConflict
fit failure on node (kubernetes-minion-group-xfnx): Insufficient CPU
fit failure on node (kubernetes-minion-group-hbxt): Insufficient CPU
fit failure on node (kubernetes-minion-group-srdl): Insufficient CPU
fit failure on node (kubernetes-minion-group-fx7b): Insufficient CPU
fit failure on node (kubernetes-minion-group-26wv): Insufficient CPU
fit failure on node (kubernetes-minion-group-nd2g): Insufficient CPU
fit failure on node (kubernetes-minion-group-n2px): NoVolumeZoneConflict
fit failure on node (kubernetes-minion-group-4ndb): Insufficient CPU
fit failure on node (kubernetes-minion-group-7zf8): Insufficient CPU

I have headroom, as I have 8 new nodes in us-central1-a that have nothing on them.

Ideas? Kinda urgent, as my demo is on Tuesday :( I don't think I have the time to recreate all of the C* instances. I have reduced the petset size by one and increased it by one, and K8s keeps putting the volume into zone c. When that happens the pod just stays in Pending.
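
One thing worth checking before adding more nodes is how capacity is split per zone, since the scheduler output above mixes Insufficient CPU (zones with nodes but no headroom) and NoVolumeZoneConflict (nodes in the wrong zone for the already-provisioned volume). A rough sketch, assuming the standard zone label on nodes:

# count nodes per zone; compare against the zone of the PV bound to the stuck pet
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone --no-headers | awk '{print $NF}' | sort | uniq -c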

@irfanurrehman
Contributor Author

Comment by bprashanth
Friday Jun 24, 2016 at 16:08 GMT


I'm going to guess it's complaining that the pod is asking for a volume in a zone without capacity. Zone spreading is pretty dumb. You probably have the nodes in the wrong zone. Once a volume is created in a zone, the pod using it can't schedule anywhere else. Get the PVs for that pet (it looks like they're already created because you scaled up and down):

$ kubectl get pv | awk '{print $5}' | grep 498

should show you the PV, and

$ kubectl get pv PVNAME -o yaml | grep -i zone

should show you the zone if it exists. Then try creating nodes in that zone.

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Friday Jun 24, 2016 at 21:24 GMT


Yep ... added another node in the same zone and I got to 1k. I am losing pets because of app problems, and one went into a CrashLoop state. I deleted the pet, and now I am getting:

$ kubectl describe po cassandra-analytics-112

  FirstSeen LastSeen    Count   From                    SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                    -------------   --------    ------      -------
  8m        8m      1   {default-scheduler }                    Normal      Scheduled   Successfully assigned cassandra-analytics-112 to kubernetes-minion-group-0jg9
  6m        17s     4   {kubelet kubernetes-minion-group-0jg9}          Warning     FailedMount Unable to mount volumes for pod "cassandra-analytics-112_default(13171b3b-3a50-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]
  6m        17s     4   {kubelet kubernetes-minion-group-0jg9}          Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]

kubelet.log

E0624 21:23:22.929674    3622 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-24 21:25:22.92967033 +0000 UTC (durationBeforeRetry 2m0s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002" (volume.spec.Name: "cassandra-analytics") pod "6a577c2c-3a4b-11e6-b7d0-42010a800002" (UID: "6a577c2c-3a4b-11e6-b7d0-42010a800002") with: remove /var/lib/kubelet/pods/6a577c2c-3a4b-11e6-b7d0-42010a800002/volumes/kubernetes.io~gce-pd/pvc-a5f45f53-38f0-11e6-b7d0-42010a800002: directory not empty

So I assume that it is behaving as expected, but I am a dork, and probably should not have deleted the pet.

Options??
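
A note on the error itself: the "directory not empty" in the TearDown error usually means the pod's volume directory on the node still has content in it, for example because the PD is still mounted there. A couple of read-only checks to run on kubernetes-minion-group-0jg9 (sketch; paths taken from the kubelet error above):

$ mount | grep pvc-a5f45f53-38f0-11e6-b7d0-42010a800002
$ ls /var/lib/kubelet/pods/6a577c2c-3a4b-11e6-b7d0-42010a800002/volumes/kubernetes.io~gce-pd/pvc-a5f45f53-38f0-11e6-b7d0-42010a800002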

@irfanurrehman
Contributor Author

cc @bprashanth

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on May 24, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 23, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
