This repository has been archived by the owner on Apr 21, 2019. It is now read-only.

Expand petset volume zone spreading #50

Closed
irfanurrehman opened this issue Oct 30, 2017 · 12 comments
Labels
area/cloudprovider, area/federation, area/stateful-apps, lifecycle/rotten, priority/backlog, sig/multicluster, team/cluster (deprecated - do not use)

Comments

@irfanurrehman
Contributor

Issue by bprashanth
Tuesday Jun 21, 2016 at 22:28 GMT
Originally opened as kubernetes/kubernetes#27809


We got petset disk zone spreading in and it's really useful. However, we left a couple of TODOs to follow up on (a quick way to observe the spreading on a live cluster is sketched after this list):

  1. Don't embed a zone scheduler in the pv provisioner (https://github.com/kubernetes/kubernetes/pull/27553/files#diff-b3d75e3586a2c9a5140cd549861da9c0R2094)
  2. Write a unit test that protects the zone spreading from petset implementation changes (AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones kubernetes/kubernetes#27553 (comment))
  3. Maybe a multi-AZ e2e with petset before it goes beta?
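
For reference, a quick way to eyeball the spreading on a live cluster is to compare the zone label on each dynamically provisioned PV against the GCE disks themselves. This is only a sketch; it assumes GCE PD dynamic provisioning and the failure-domain.beta.kubernetes.io/zone label on the PVs (the same label the NoVolumeZoneConflict predicate relies on):

# zone of each dynamically provisioned PV, plus the claim that owns it
$ kubectl get pv -L failure-domain.beta.kubernetes.io/zone

# cross-check against the GCE disks (ZONE column); dynamic PD names start with kubernetes-dynamic-pvc-
$ gcloud compute disks list --filter="name~kubernetes-dynamic-pvc"

If spreading is working, consecutive pets (cassandra-data-0, cassandra-data-1, ...) should land in different zones rather than piling up in one.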

@justinsb

@irfanurrehman irfanurrehman added area/cloudprovider, area/federation, area/stateful-apps, priority/backlog, sig/multicluster, and team/cluster (deprecated - do not use) labels on Oct 30, 2017
@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 02:45 GMT


I am having scaling issues with multi-zone PVCs on GCE. We are getting timeout errors with the PVCs. We have 1000 nodes spread across 3 AZs in us-central1.

Here is an example of how long pod creation is taking:

$ kubectl get po
NAME                    READY     STATUS              RESTARTS   AGE
cassandra-analytics-0   1/1       Running             0          12m
cassandra-analytics-1   1/1       Running             0          7m
cassandra-analytics-2   1/1       Running             0          2m
cassandra-analytics-3   0/1       ContainerCreating   0          28s
cassandra-data-0        1/1       Running             0          12m
cassandra-data-1        1/1       Running             0          5m
cassandra-data-2        0/1       ContainerCreating   0          1m

I have seen pod creation times of 1-2m when everything is in the same zone, but now across multiple zones it is crawling.

$ kubectl describe po cassandra-data-1
Events:
  FirstSeen LastSeen    Count   From                    SubobjectPath           Type        Reason      Message
  --------- --------    -----   ----                    -------------           --------    ------      -------
  7m        7m      1   {default-scheduler }                            Normal      Scheduled   Successfully assigned cassandra-data-1 to kubernetes-minion-group-x8ev
  5m        5m      1   {kubelet kubernetes-minion-group-x8ev}                  Warning     FailedMount Unable to mount volumes for pod "cassandra-data-1_default(5acd02c7-38ea-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-data-1"/"default". list of unattached/unmounted volumes=[cassandra-data]
  5m        5m      1   {kubelet kubernetes-minion-group-x8ev}                  Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-data-1"/"default". list of unattached/unmounted volumes=[cassandra-data]
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Pulling     pulling image "gcr.io/aronchick-apollobit/cassandra-debian:v1.1"
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Pulled      Successfully pulled image "gcr.io/aronchick-apollobit/cassandra-debian:v1.1"
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Created     Created container with docker id 05c294ddd491
  3m        3m      1   {kubelet kubernetes-minion-group-x8ev}  spec.containers{cassandra}  Normal      Started     Started container with docker id 05c294ddd491

I am running the latest 1.3 with kubernetes/kubernetes#27553 patched in. I am also provisioning SSD PVCs, but the volumes are created long before the timeouts hit.
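
One way to narrow down whether the slow step is the controller-side attach or the kubelet-side mount is to compare what the node reports as attached with what GCE itself reports. A rough sketch (DISK_NAME and ZONE are placeholders for the kubernetes-dynamic-pvc-<uid> disk backing the PV and its zone):

# volumes the attach/detach controller has recorded as attached to this node
$ kubectl get node kubernetes-minion-group-x8ev -o jsonpath='{.status.volumesAttached[*].name}'

# what GCE reports; the "users" field lists the instances the disk is attached to
$ gcloud compute disks describe DISK_NAME --zone ZONE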

cc: @justinsb @bgrant0607 @saad-ali @erictune

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 04:14 GMT


@saad-ali / @bprashanth

This seems to happen specifically when the pet/pod is deployed in a zone the master is not in. Some spot checking has shown that pods in us-central1-b, the same zone as the master, get deployed quickly, while pods in us-central1-c or us-central1-f hit the "Unable to mount volumes for pod" error from above.
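
A quick way to confirm that pattern (sketch only; assumes the standard zone label the cloud provider puts on nodes and PVs):

# which zone each node is in (the master's zone shows up here too)
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone

# which zone each dynamically provisioned PV landed in
$ kubectl get pv -L failure-domain.beta.kubernetes.io/zone

If the observation above holds, the pods whose PV zone matches the master's zone should be the fast ones.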

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 05:16 GMT


The sync loop in reconciler.go seems to be running for 2min

I0623 05:10:59.821836    3599 kubelet.go:2528] SyncLoop (ADD, "api"): "cassandra-analytics-31_default(dfbe550a-3900-11e6-b7d0-42010a800002)"
I0623 05:10:59.978531    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:10:59.978599    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
E0623 05:10:59.980959    3599 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-23 05:11:00.480952365 +0000 UTC (durationBeforeRetry 500ms). error: Volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002") is not yet attached according to node status.
I0623 05:11:00.078995    3599 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") to pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").
I0623 05:11:00.084887    3599 operation_executor.go:673] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/default-token-mu6y5" (spec.Name: "default-token-mu6y5") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").
I0623 05:11:00.580234    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
E0623 05:11:00.582425    3599 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-23 05:11:01.582420585 +0000 UTC (durationBeforeRetry 1s). error: Volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002") is not yet attached according to node status.

And then it works

I0623 05:15:07.929521    3599 reconciler.go:179] VerifyControllerAttachedVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:15:07.932210    3599 operation_executor.go:897] Controller successfully attached volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002")
I0623 05:15:08.029886    3599 reconciler.go:253] MountVolume operation started for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002" (spec.Name: "pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002") to pod "dfbe550a-3900-11e6-b7d0-42010a800002" (UID: "dfbe550a-3900-11e6-b7d0-42010a800002").

Here is a grep for PVC out of the kubelet.log https://gist.github.com/chrislovecnm/50d722b04f73dcfd92800be48c584efa

This process is taking about 30 secs when the master is in the same zone as the node, otherwise it is taking 2-4 min.

This is a big scaling issue. Compare 30 seconds to 2-4 minutes when you are looking at 1000 PVCs to mount.
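
The delays line up with the retry back-off visible in the goroutinemap lines: durationBeforeRetry grows from 500ms to 1s in the snippet above, and a later comment in this thread shows it at 2m0s, so once the controller-side attach is slow the kubelet can also sit out long back-off windows on top of it. To pull the whole progression for one volume out of the kubelet log (log file name as used above; the pvc id is the one from this comment):

$ grep durationBeforeRetry kubelet.log | grep pvc-9c4c75aa-38f0-11e6-b7d0-42010a800002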

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Thursday Jun 23, 2016 at 17:26 GMT


@saad-ali / @bprashanth you want me to open a separate issue for the scaling challenges?

cc: @bgrant0607

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Friday Jun 24, 2016 at 07:14 GMT


Now I ran into another issue with this:

I have 1008 minions (added 8 more because of this problem, and it did not help). I am unable to deploy my last two C* instances; I am stuck at 998 :( The last pod gets stuck in Pending and I am getting the following errors:

$ kubectl describe  po cassandra-data-498
<redacted for brevity>
fit failure on node (kubernetes-minion-group-cxmy): Insufficient CPU
fit failure on node (kubernetes-minion-group-zc10): Insufficient CPU
fit failure on node (kubernetes-minion-group-iyit): Insufficient CPU
fit failure on node (kubernetes-minion-group-nnhm): Insufficient CPU
fit failure on node (kubernetes-minion-group-2k68): NoVolumeZoneConflict
fit failure on node (kubernetes-minion-group-xfnx): Insufficient CPU
fit failure on node (kubernetes-minion-group-hbxt): Insufficient CPU
fit failure on node (kubernetes-minion-group-srdl): Insufficient CPU
fit failure on node (kubernetes-minion-group-fx7b): Insufficient CPU
fit failure on node (kubernetes-minion-group-26wv): Insufficient CPU
fit failure on node (kubernetes-minion-group-nd2g): Insufficient CPU
fit failure on node (kubernetes-minion-group-n2px): NoVolumeZoneConflict
fit failure on node (kubernetes-minion-group-4ndb): Insufficient CPU
fit failure on node (kubernetes-minion-group-7zf8): Insufficient CPU

I have headroom, as I have 8 new nodes in us-central1-a that have nothing on them.

Ideas? Kinda urgent, as my demo is on Tuesday :( I don't think I have the time to recreate all of the C* instances. I have reduced the petset size by one and increased it by one, and K8s keeps putting the volume into zone c. When that happens the pod just stays in Pending.
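
One thing worth checking before adding more nodes is how capacity is split per zone, since the scheduler output above mixes Insufficient CPU (zones with nodes but no headroom) and NoVolumeZoneConflict (nodes in the wrong zone for the already-provisioned volume). A rough sketch, assuming the standard zone label on nodes:

# count nodes per zone; compare against the zone of the PV bound to the stuck pet
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone --no-headers | awk '{print $NF}' | sort | uniq -c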

@irfanurrehman
Contributor Author

Comment by bprashanth
Friday Jun 24, 2016 at 16:08 GMT


I'm going to guess it's complaining that the pod is asking for a volume in a zone without capacity. Zone spreading is pretty dumb. You probably have the nodes in the wrong zone. Once a volume is created in a zone, the pod using it can't schedule anywhere else. Get the PVs for that pet (it looks like they're already created because you scaled up and down):

$ kubectl get pv | awk '{print $5}' | grep 498

should show you the PV, and

$ kubectl get pv PVNAME -o yaml | grep -i zone

should show you the zone if it exists. Then try creating nodes in that zone.

@irfanurrehman
Contributor Author

Comment by chrislovecnm
Friday Jun 24, 2016 at 21:24 GMT


Yep ... added another node in the same zone and I got to 1k. I am losing pets because of app problems, and one went into a CrashLoop state. I deleted the pet, and now I am getting:

$ kubectl describe po cassandra-analytics-112

  FirstSeen LastSeen    Count   From                    SubobjectPath   Type        Reason      Message
  --------- --------    -----   ----                    -------------   --------    ------      -------
  8m        8m      1   {default-scheduler }                    Normal      Scheduled   Successfully assigned cassandra-analytics-112 to kubernetes-minion-group-0jg9
  6m        17s     4   {kubelet kubernetes-minion-group-0jg9}          Warning     FailedMount Unable to mount volumes for pod "cassandra-analytics-112_default(13171b3b-3a50-11e6-b7d0-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]
  6m        17s     4   {kubelet kubernetes-minion-group-0jg9}          Warning     FailedSync  Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "cassandra-analytics-112"/"default". list of unattached/unmounted volumes=[cassandra-analytics]

kubelet.log

E0624 21:23:22.929674    3622 goroutinemap.go:155] Operation for "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002" failed. No retries permitted until 2016-06-24 21:25:22.92967033 +0000 UTC (durationBeforeRetry 2m0s). error: UnmountVolume.TearDown failed for volume "kubernetes.io/gce-pd/kubernetes-dynamic-pvc-a5f45f53-38f0-11e6-b7d0-42010a800002" (volume.spec.Name: "cassandra-analytics") pod "6a577c2c-3a4b-11e6-b7d0-42010a800002" (UID: "6a577c2c-3a4b-11e6-b7d0-42010a800002") with: remove /var/lib/kubelet/pods/6a577c2c-3a4b-11e6-b7d0-42010a800002/volumes/kubernetes.io~gce-pd/pvc-a5f45f53-38f0-11e6-b7d0-42010a800002: directory not empty

So I assume that it is behaving as expected, but I am a dork, and probably should not have deleted the pet.

Options??
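
A note on the error itself: the "directory not empty" in the TearDown error usually means the pod's volume directory on the node still has content in it, for example because the PD is still mounted there. A couple of read-only checks to run on kubernetes-minion-group-0jg9 (sketch; paths taken from the kubelet error above):

$ mount | grep pvc-a5f45f53-38f0-11e6-b7d0-42010a800002
$ ls /var/lib/kubelet/pods/6a577c2c-3a4b-11e6-b7d0-42010a800002/volumes/kubernetes.io~gce-pd/pvc-a5f45f53-38f0-11e6-b7d0-42010a800002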

@irfanurrehman
Contributor Author

cc @bprashanth

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label on May 24, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 23, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
