AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones #27553
Conversation
Force-pushed from ac881af to 50456d5
Force-pushed from 4a2c148 to 3dfce3e
Working on AWS...
Force-pushed from 3dfce3e to cdc5843
Long term we plan on integrating this into the scheduler, but in the short term we use the volume name to place it onto a zone. We hash the volume name so we don't bias to the first few zones. If the volume name "looks like" a PetSet volume name (ending with -<number>) then we use the number as an offset. In that case we hash the base name. Fixes kubernetes#27256
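For illustration, here is a minimal, hedged sketch of that naming heuristic; the function and helper names are mine, not the PR's actual code, and it assumes an FNV hash over a sorted zone list:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZoneForVolume is a hypothetical sketch of the heuristic above: hash the
// volume name to pick a zone, and for PetSet-style names ("<base>-<number>")
// hash only the base and use the number as an offset so members spread out.
func chooseZoneForVolume(zones []string, volumeName string) string {
	if len(zones) == 0 {
		return ""
	}

	hashInput := volumeName
	offset := uint32(0)

	// If the name ends with -<number>, use the number as an offset and hash the base.
	if i := strings.LastIndex(volumeName, "-"); i != -1 {
		if n, err := strconv.ParseUint(volumeName[i+1:], 10, 32); err == nil {
			hashInput = volumeName[:i]
			offset = uint32(n)
		}
	}

	h := fnv.New32()
	h.Write([]byte(hashInput))
	hash := h.Sum32()

	// Sort so the hash -> zone mapping is stable across calls.
	sorted := append([]string(nil), zones...)
	sort.Strings(sorted)
	return sorted[(hash+offset)%uint32(len(sorted))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	for _, name := range []string{"data-db-0", "data-db-1", "data-db-2"} {
		fmt.Println(name, "->", chooseZoneForVolume(zones, name))
	}
}
```

Because the three claims share the base name "data-db", their offsets 0, 1, 2 select three consecutive entries of the sorted zone list, so the volumes land in three distinct zones.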
We had a long-lasting bug which prevented creation of volumes in non-master zones, because the cloudprovider in the volume label admission controller is not initialized with the multizone setting (issue kubernetes#27656). This implements a simple workaround: if the volume is created with the failure-domain zone label, we look for the volume in that zone. This is more efficient, avoids introducing a new semantic, and allows users (and the dynamic provisioner) to create volumes in non-master zones. Fixes kubernetes#27657
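The shape of that workaround can be pictured with a small hedged sketch; the helper name and the label key are my assumptions (the key shown is the beta failure-domain zone label of that era), not a quote of the PR:

```go
// labelZoneFailureDomain is an assumption: the beta zone label key in use around 1.3.
const labelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"

// zonesToSearch is a hypothetical helper: if the volume was created with a zone
// label, look for the disk only in that zone; otherwise fall back to all managed zones.
func zonesToSearch(volumeLabels map[string]string, managedZones []string) []string {
	if zone, ok := volumeLabels[labelZoneFailureDomain]; ok && zone != "" {
		return []string{zone}
	}
	return managedZones
}
```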
Force-pushed from cdc5843 to 9c25665
And it works on GCE also now... multiple bugs being fixed tonight!
	hash = h.Sum32()
}

zoneSlice := zones.List()
Does the spreading property depend on zones.List() returning the zones in the same order every time it's called? If so, do we know this will actually be true?
It does depend on it, but List() is sorted: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/sets/string.go#L167
This could probably use a comment though.
The case where it fails is if the set of zones changes while volumes are being created for a PetSet. I think the most likely case here is that we grow a PetSet from N=1 -> N=3 some time after initial creation, and in the meantime the user has also expanded their k8s cluster. However, I don't really know what we can practically do better here in 1.3 (suggestions welcome). We really should look at the set of PVCs and pick a zone based on the other allocations (like we would with pod anti-affinity), but those scheduler-like changes seem impractical for 1.3 (and we know we want to do them later, which is why we are favoring the name-based hack over an annotation that might constrain us in the future).
I hypothesize though that what we have here will be good enough in practice. I don't think AWS or GCE have any regions with 5 zones. I suspect most HA clusters will be launched into 3 zones, and then the set of zones will not change.
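As a quick illustration of the sorted-ordering point (the import path is the one from that era's tree):

```go
package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/util/sets"
)

func main() {
	// List() returns the members in sorted order, so the hash -> zone mapping
	// stays stable as long as the zone set itself does not change.
	zones := sets.NewString("us-central1-c", "us-central1-a", "us-central1-b")
	fmt.Println(zones.List()) // [us-central1-a us-central1-b us-central1-c]
}
```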
Yeah, that's fine, I was just curious. Thanks for the thorough explanation!
Yeah, I think we can simply document the limitation around changing zones while a PetSet is scaling.
// TODO: Parallelize, although O(zones) so not too bad (N <= 3 typically)
for _, zone := range gce.managedZones {
	// We only retrieve one page in each zone - we only care about existence
	listCall := gce.service.Instances.List(gce.projectID, zone)
Is there a reason we're listing all instances here and only running instances above? Is it simpler to not account for instance state, since one can't guess whether all instances will go down right after we allocate the PV anyway?
I debated this, and am happy to change. I tried to sum it up in "No filter: we assume that a zone is either used or unused". My view was that we should just keep it simple here: if an instance is starting up, we still count its zone as in use. The method above is constrained to exclude such instances, but I thought it simpler to keep this one straightforward for now (we can always filter out instances that are shutting down later...).
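For comparison, a hedged sketch of what a status-filtered check could look like, assuming the google.golang.org/api/compute/v1 client and the GCE v1 filter syntax of the time; the helper name and placement are hypothetical:

```go
package gce // package name and placement are assumptions for this sketch

import (
	compute "google.golang.org/api/compute/v1"
)

// zoneHasRunningInstances sketches the alternative discussed above: count a zone
// as "in use" only if it has at least one RUNNING instance.
func zoneHasRunningInstances(svc *compute.Service, projectID, zone string) (bool, error) {
	listCall := svc.Instances.List(projectID, zone)
	// Server-side filter (GCE v1 filter syntax of the era); excludes stopping/terminated instances.
	listCall = listCall.Filter("status eq RUNNING")
	// Just a minimal set of fields - we only care about existence.
	listCall = listCall.Fields("items(name)")
	res, err := listCall.Do()
	if err != nil {
		return false, err
	}
	return len(res.Items) > 0, nil
}
```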
@quinton-hoole @bprashanth @pmorie This is marked as a release blocker and the author has a lot of other things on his plate. Can one of the reviewers please give an LGTM if all major issues/questions have been addressed, and leave minor nits for follow-up PRs?
Lots of comments describing the heuristics, how it fits together and the limitations. In particular, we can't guarantee correct volume placement if the set of zones is changing between allocating volumes.
@bgrant0607 @smarterclayton @saad-ali help ... this is another that is going to cause me heartburn :( Can we get some momentum on review?
GCE e2e build/test passed for commit dd94997.
// volumes and the health of zones.

// Just a minimal set of fields - we only care about existence
listCall = listCall.Fields("items(name)")
@zmerlynn FYI, another list call on fields, but I think this one's OK since it's essentially just checking for exit code 0.
This looks fine. The function could be a bit more pessimistic about which zones it uses if the nodeInstancePrefix I added in #27741 wasn't actually advisory, but right now we let people join arbitrary garbage to a master, so this sort of check is about all that you can do.
I gotta get this into a branch that I have to cut for a demo @bprashanth @zmerlynn ... look good? @justinsb good to go? Thanks btw ... this would have been a big BOOM for me :P
I'm fine with this; I only had a bunch of nits. I'd really like the unit tests I mentioned so I'm not allowed to silently break this feature. We don't have an e2e either, but I think we can address that through a follow-up (filing an issue).
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit dd94997.
Automatic merge from submit-queue
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.