
AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones #27553

Merged: 3 commits, Jun 22, 2016

Conversation

justinsb (Member):

Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.

We hash the volume name so we don't bias to the first few zones.

If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.
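
A minimal runnable Go sketch of this heuristic (hypothetical helper; the actual implementation lives in pkg/volume and handles more edge cases):

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZone picks a zone for a volume name. If the name ends in -<number>
// (a PetSet-style name), the number is used as an offset so sibling volumes
// land in different zones; hashing the base name keeps different PetSets
// from all starting in the same zone.
func chooseZone(volumeName string, zones []string) string {
	base, index := volumeName, uint32(0)
	if i := strings.LastIndex(volumeName, "-"); i != -1 {
		if n, err := strconv.Atoi(volumeName[i+1:]); err == nil && n >= 0 {
			base, index = volumeName[:i], uint32(n)
		}
	}
	h := fnv.New32()
	h.Write([]byte(base))
	sort.Strings(zones) // spreading depends on a stable zone ordering
	return zones[(h.Sum32()+index)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-west-2a", "us-west-2b", "us-west-2c"}
	for _, name := range []string{"datadir-rd-0", "datadir-rd-1", "datadir-rd-2"} {
		fmt.Printf("%s -> %s\n", name, chooseZone(name, zones))
	}
}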

@k8s-github-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) and release-note-label-needed labels Jun 16, 2016
@justinsb force-pushed the pvc_zone_spreading_2 branch 4 times, most recently from ac881af to 50456d5 (June 16, 2016 19:00)
@justinsb force-pushed the pvc_zone_spreading_2 branch 3 times, most recently from 4a2c148 to 3dfce3e (June 17, 2016 19:29)
@justinsb (Member Author):

Working on AWS...

I0617 19:22:47.328898       6 aws.go:954] Found instances in zones map[us-west-2a:{} us-west-2b:{} us-west-2c:{}]
I0617 19:22:47.328928       6 util.go:223] Detected PetSet-style volume name "datadir-rd-2"; index=2
I0617 19:22:47.328944       6 util.go:236] Creating volume with name "datadir-rd-2"; chose zone="us-west-2a" from zones=["us-west-2a" "us-west-2b" "us-west-2c"]
I0617 19:22:47.330082       6 aws.go:954] Found instances in zones map[us-west-2a:{} us-west-2b:{} us-west-2c:{}]
I0617 19:22:47.330101       6 util.go:223] Detected PetSet-style volume name "datadir-rd-0"; index=0
I0617 19:22:47.330117       6 util.go:236] Creating volume with name "datadir-rd-0"; chose zone="us-west-2b" from zones=["us-west-2a" "us-west-2b" "us-west-2c"]
I0617 19:22:47.354228       6 aws.go:954] Found instances in zones map[us-west-2a:{} us-west-2b:{} us-west-2c:{}]
I0617 19:22:47.354246       6 util.go:223] Detected PetSet-style volume name "datadir-rd-1"; index=1
I0617 19:22:47.354254       6 util.go:236] Creating volume with name "datadir-rd-1"; chose zone="us-west-2c" from zones=["us-west-2a" "us-west-2b" "us-west-2c"]
NAME                                       CAPACITY   ACCESSMODES   STATUS    CLAIM                  REASON    AGE       LABELS
pvc-df9c59ff-34c0-11e6-a694-02263499474d   20Gi       RWO           Bound     default/datadir-rd-0             2m        failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2b
pvc-dfa1fd39-34c0-11e6-a694-02263499474d   20Gi       RWO           Bound     default/datadir-rd-1             2m        failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c
pvc-dfa32aea-34c0-11e6-a694-02263499474d   20Gi       RWO           Bound     default/datadir-rd-2             2m        failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a

@justinsb force-pushed the pvc_zone_spreading_2 branch from 3dfce3e to cdc5843 (June 18, 2016 02:28)
justinsb added 2 commits June 17, 2016 23:27
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.

We hash the volume name so we don't bias to the first few zones.

If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset.  In that case we hash
the base name.

Fixes kubernetes#27256
We had a long-lasting bug which prevented creation of volumes in
non-master zones, because the cloudprovider in the volume label
admission controller is not initialized with the multizone setting
(issue kubernetes#27656).

This implements a simple workaround: if the volume is created with the
failure-domain zone label, we look for the volume in that zone.  This is
more efficient, avoids introducing a new semantic, and allows users (and
the dynamic provisioner) to create volumes in non-master zones.

Fixes kubernetes#27657
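
The shape of the workaround, as an illustrative sketch (label key taken from the PV listings in this thread; function and parameter names are hypothetical, and the real admission-controller and cloudprovider plumbing is more involved):

package main

import "fmt"

const zoneLabel = "failure-domain.beta.kubernetes.io/zone"

// zonesToSearch narrows the volume lookup: if the PV already carries the
// failure-domain zone label, only that zone needs to be checked; otherwise
// fall back to searching all managed zones.
func zonesToSearch(pvLabels map[string]string, managedZones []string) []string {
	if zone := pvLabels[zoneLabel]; zone != "" {
		return []string{zone}
	}
	return managedZones
}

func main() {
	labels := map[string]string{zoneLabel: "us-central1-f"}
	fmt.Println(zonesToSearch(labels, []string{"us-central1-a", "us-central1-b", "us-central1-f"}))
}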
@justinsb force-pushed the pvc_zone_spreading_2 branch from cdc5843 to 9c25665 (June 18, 2016 03:28)
@justinsb (Member Author):

And now it works on GCE too... multiple bugs being fixed tonight!

I0618 04:07:15.910998       5 util.go:223] Detected PetSet-style volume name "datadir-rd-2"; index=2
I0618 04:07:15.911034       5 util.go:236] Creating volume for PVC "datadir-rd-2"; chose zone="us-central1-a" from zones=["us-central1-a" "us-central1-b" "us-central1-f"]
I0618 04:07:15.911949       5 util.go:223] Detected PetSet-style volume name "datadir-rd-0"; index=0
I0618 04:07:15.911982       5 util.go:236] Creating volume for PVC "datadir-rd-0"; chose zone="us-central1-b" from zones=["us-central1-a" "us-central1-b" "us-central1-f"]
I0618 04:07:15.913963       5 util.go:223] Detected PetSet-style volume name "datadir-rd-1"; index=1
I0618 04:07:15.913998       5 util.go:236] Creating volume for PVC "datadir-rd-1"; chose zone="us-central1-f" from zones=["us-central1-a" "us-central1-b" "us-central1-f"]
NAME                                       CAPACITY   ACCESSMODES   STATUS    CLAIM                  REASON    AGE       LABELS
pvc-237b0114-350a-11e6-964c-42010a800002   20Gi       RWO           Bound     default/datadir-rd-0             1m        failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b
pvc-2382b0c2-350a-11e6-964c-42010a800002   20Gi       RWO           Bound     default/datadir-rd-1             1m        failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f
pvc-23862681-350a-11e6-964c-42010a800002   20Gi       RWO           Bound     default/datadir-rd-2             1m        failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a

@justinsb added this to the v1.3 milestone Jun 18, 2016
@justinsb changed the title from "WIP: AWS: Spread PetSet volume creation across zones" to "AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones" Jun 18, 2016
@roberthbailey assigned ghost Jun 18, 2016
@davidopp added the release-note label (Denotes a PR that will be considered when it comes time to generate release notes.) and removed release-note-label-needed Jun 19, 2016
		hash = h.Sum32()
	}

	zoneSlice := zones.List()
Member:

Does the spreading property depend on zones.List() returning the zones in the same order every time it's called? If so, do we know this will actually be true?

justinsb (Member Author):

It does depend on it, but List() is sorted: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/sets/string.go#L167

This could probably use a comment though.

The case where it fails is if the set of zones changes while volumes are being created for a PetSet. I think the most likely case here is that we grow a PetSet from N=1 -> N=3 some time after initial creation, and in the meantime the user has also expanded their k8s cluster. However, I don't really know what we can practically do better here in 1.3 (suggestions welcome). We really should look at the set of PVCs and pick a zone based on the other allocations (like we would with pod anti-affinity), but those scheduler-like changes seem impractical for 1.3 (and we know we want to do them later, which is why we are favoring the name-based hack over an annotation that might constrain us in the future).

I hypothesize though that what we have here will be good enough in practice. I don't think AWS or GCE have any regions with 5 zones. I suspect most HA clusters will be launched into 3 zones, and then the set of zones will not change.
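
To make the "List() is sorted" point concrete, a tiny example (import path as of this PR; the package later moved to k8s.io/apimachinery):

package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/util/sets"
)

func main() {
	zones := sets.NewString("us-west-2c", "us-west-2a", "us-west-2b")
	// List() returns the set's contents sorted, so indexing into it with
	// (hash+index) % len is deterministic across calls, provided the set
	// of zones itself does not change between volume creations.
	fmt.Println(zones.List()) // prints [us-west-2a us-west-2b us-west-2c]
}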

Member:

Yeah, that's fine, I was just curious. Thanks for the thorough explanation!

Contributor:

Yeah, I think we can simply document around scaling zones while a PetSet is scaling.

// TODO: Parallelize, although O(zones) so not too bad (N <= 3 typically)
for _, zone := range gce.managedZones {
	// We only retrieve one page in each zone - we only care about existence
	listCall := gce.service.Instances.List(gce.projectID, zone)
Contributor:

Is there a reason we're listing all instances here, but only running instances above? Is it simpler to not account for instance state, since one can't guess whether all instances will go down right after we allocate the PV anyway?

justinsb (Member Author):

I debated this, and am happy to change. I tried to sum it up in "No filter: we assume that a zone is either used or unused". My view was that we should keep this method simple: if an instance is starting up, we still count its zone as in use. The method above is constrained to exclude such instances, but it seemed simpler to leave this one unfiltered for now (we can always filter out instances that are shutting down later...)
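
Putting the review excerpts together, the zone-discovery loop is roughly the following sketch (receiver and field names assumed from the snippets; error handling simplified, not the exact upstream code):

package gce

import (
	compute "google.golang.org/api/compute/v1"
)

type gceCloud struct {
	service      *compute.Service
	projectID    string
	managedZones []string
}

// usedZones returns the managed zones that contain at least one instance.
// No instance-state filter: a zone counts as used even if its instances
// are still starting up (or shutting down).
func (gce *gceCloud) usedZones() ([]string, error) {
	var zones []string
	for _, zone := range gce.managedZones {
		// One page per zone with minimal fields; we only care about existence.
		listCall := gce.service.Instances.List(gce.projectID, zone)
		listCall = listCall.Fields("items(name)")
		res, err := listCall.Do()
		if err != nil {
			return nil, err
		}
		if len(res.Items) > 0 {
			zones = append(zones, zone)
		}
	}
	return zones, nil
}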

@erictune (Member):

@quinton-hoole @bprashanth @pmorie This is marked as a release blocker and the author has a lot of other things on his plate. Can one of the reviewers please give an LGTM if all major issues/questions have been addressed, and leave minor nits for follow-up PRs?

Lots of comments describing the heuristics, how it fits together and the
limitations.

In particular, we can't guarantee correct volume placement if the set of
zones is changing between allocating volumes.
@chrislovecnm (Contributor):

@bgrant0607 @smarterclayton @saad-ali help ... this is another that is going to cause me heartburn :( Can we get some momentum on review?

@k8s-bot commented Jun 21, 2016:

GCE e2e build/test passed for commit dd94997.

	// volumes and the health of zones.

	// Just a minimal set of fields - we only care about existence
	listCall = listCall.Fields("items(name)")
Contributor:

@zmerlynn FYI, another list call on Fields, but I think this one's OK since it's essentially just checking for exit code 0.

Member:

This looks fine. The function could be a bit more pessimistic about which zones it uses if the nodeInstancePrefix I added in #27741 wasn't actually advisory, but right now we let people join arbitrary garbage to a master, so this sort of check is about all that you can do.

@chrislovecnm (Contributor) commented Jun 21, 2016:

I gotta get this into a branch that I have to cut for a demo

@bprashanth @zmerlynn ... look good?

@justinsb good to go? Thanks btw ... this would have been a big BOOM for me :P

@bprashanth added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) Jun 21, 2016
@bprashanth (Contributor):

I'm fine with this; I only had a bunch of nits. I'd really like the unit tests I mentioned, so that this feature can't be silently broken. We don't have an e2e test either, but I think we can address both through a follow-up (filing an issue).
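
For the record, a sketch of the kind of unit test being asked for, written against the hypothetical chooseZone helper from the sketch earlier in this thread (not the actual test added to the repo):

package main

import "testing"

// TestChooseZoneSpreadsPetSetVolumes checks that sibling PetSet volumes
// (same base name, consecutive ordinals) land in distinct zones.
func TestChooseZoneSpreadsPetSetVolumes(t *testing.T) {
	zones := []string{"us-west-2a", "us-west-2b", "us-west-2c"}
	seen := map[string]bool{}
	for _, name := range []string{"datadir-rd-0", "datadir-rd-1", "datadir-rd-2"} {
		seen[chooseZone(name, zones)] = true
	}
	if len(seen) != len(zones) {
		t.Errorf("expected volumes spread across %d zones, got %d", len(zones), len(seen))
	}
}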

@k8s-github-robot:

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Jun 22, 2016:

GCE e2e build/test passed for commit dd94997.

@k8s-github-robot:

Automatic merge from submit-queue

Labels
lgtm: "Looks good to me"; indicates that a PR is ready to be merged.
release-note: Denotes a PR that will be considered when it comes time to generate release notes.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.