AWS/GCE: Spread PetSet volume creation across zones, create GCE volumes in non-master zones #27553
Conversation
Force-pushed from ac881af to 50456d5
Force-pushed from 4a2c148 to 3dfce3e
Working on AWS...
Force-pushed from 3dfce3e to cdc5843
Long term we plan on integrating this into the scheduler, but in the short term we use the volume name to place it onto a zone. We hash the volume name so we don't bias to the first few zones. If the volume name "looks like" a PetSet volume name (ending with -<number>) then we use the number as an offset. In that case we hash the base name. Fixes kubernetes#27256
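For illustration, here is a minimal, hedged sketch of that naming heuristic; the function and helper names are mine, not the PR's actual code, and it assumes an FNV hash over a sorted zone list:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZoneForVolume is a hypothetical sketch of the heuristic above: hash the
// volume name to pick a zone, and for PetSet-style names ("<base>-<number>")
// hash only the base and use the number as an offset so members spread out.
func chooseZoneForVolume(zones []string, volumeName string) string {
	if len(zones) == 0 {
		return ""
	}

	hashInput := volumeName
	offset := uint32(0)

	// If the name ends with -<number>, use the number as an offset and hash the base.
	if i := strings.LastIndex(volumeName, "-"); i != -1 {
		if n, err := strconv.ParseUint(volumeName[i+1:], 10, 32); err == nil {
			hashInput = volumeName[:i]
			offset = uint32(n)
		}
	}

	h := fnv.New32()
	h.Write([]byte(hashInput))
	hash := h.Sum32()

	// Sort so the hash -> zone mapping is stable across calls.
	sorted := append([]string(nil), zones...)
	sort.Strings(sorted)
	return sorted[(hash+offset)%uint32(len(sorted))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	for _, name := range []string{"data-db-0", "data-db-1", "data-db-2"} {
		fmt.Println(name, "->", chooseZoneForVolume(zones, name))
	}
}
```

Because the three claims share the base name "data-db", their offsets 0, 1, 2 select three consecutive entries of the sorted zone list, so the volumes land in three distinct zones.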
We had a long-lasting bug which prevented creation of volumes in non-master zones, because the cloudprovider in the volume label admission controller is not initialized with the multizone setting (issue kubernetes#27656). This implements a simple workaround: if the volume is created with the failure-domain zone label, we look for the volume in that zone. This is more efficient, avoids introducing a new semantic, and allows users (and the dynamic provisioner) to create volumes in non-master zones. Fixes kubernetes#27657
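The shape of that workaround can be pictured with a small hedged sketch; the helper name and the label key are my assumptions (the key shown is the beta failure-domain zone label of that era), not a quote of the PR:

```go
// labelZoneFailureDomain is an assumption: the beta zone label key in use around 1.3.
const labelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"

// zonesToSearch is a hypothetical helper: if the volume was created with a zone
// label, look for the disk only in that zone; otherwise fall back to all managed zones.
func zonesToSearch(volumeLabels map[string]string, managedZones []string) []string {
	if zone, ok := volumeLabels[labelZoneFailureDomain]; ok && zone != "" {
		return []string{zone}
	}
	return managedZones
}
```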
Force-pushed from cdc5843 to 9c25665
And it works on GCE also now... multiple bugs being fixed tonight!
	hash = h.Sum32()
}

zoneSlice := zones.List()
Does the spreading property depend on zones.List() returning the zones in the same order every time it's called? If so, do we know this will actually be true?
It does depend on it, but List() is sorted: https://github.com/kubernetes/kubernetes/blob/master/pkg/util/sets/string.go#L167
This could probably use a comment though.
The case where it fails is if the set of zones changes while volumes are being created for a PetSet. I think the most likely case here is that we grow a PetSet from N=1 -> N=3 some time after initial creation, and in the meantime the user has also expanded their k8s cluster. However, I don't really know what we can practically do better here in 1.3 (suggestions welcome). We really should look at the set of PVCs and pick a zone based on the other allocations (like we would with pod anti-affinity), but those scheduler-like changes seem impractical for 1.3 (and we know we want to do them later, which is why we are favoring the name-based hack over an annotation that might constrain us in the future).
I hypothesize though that what we have here will be good enough in practice. I don't think AWS or GCE have any regions with 5 zones. I suspect most HA clusters will be launched into 3 zones, and then the set of zones will not change.
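As a quick illustration of the sorted-ordering point (the import path is the one from that era's tree):

```go
package main

import (
	"fmt"

	"k8s.io/kubernetes/pkg/util/sets"
)

func main() {
	// List() returns the members in sorted order, so the hash -> zone mapping
	// stays stable as long as the zone set itself does not change.
	zones := sets.NewString("us-central1-c", "us-central1-a", "us-central1-b")
	fmt.Println(zones.List()) // [us-central1-a us-central1-b us-central1-c]
}
```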
Yeah, that's fine, I was just curious. Thanks for the thorough explanation!
Yeah, I think we can simply document the limitation around changing zones while a PetSet is scaling.
// TODO: Parallelize, although O(zones) so not too bad (N <= 3 typically)
for _, zone := range gce.managedZones {
	// We only retrieve one page in each zone - we only care about existence
	listCall := gce.service.Instances.List(gce.projectID, zone)
Is there a reason we're listing all instances here and only running instances above? Is it simpler to not account for instance state, since one can't guess whether all instances will go down right after we allocate the PV anyway?
I debated this, and am happy to change. I tried to sum it up in "No filter: we assume that a zone is either used or unused". My view was that we should just keep it simple here: if an instance is starting up, we still count its zone as in use. The method above is constrained to exclude such instances, but I thought it simpler to keep this one straightforward for now (we can always filter out instances that are shutting down later...).
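For comparison, a hedged sketch of what a status-filtered check could look like, assuming the google.golang.org/api/compute/v1 client and the GCE v1 filter syntax of the time; the helper name and placement are hypothetical:

```go
package gce // package name and placement are assumptions for this sketch

import (
	compute "google.golang.org/api/compute/v1"
)

// zoneHasRunningInstances sketches the alternative discussed above: count a zone
// as "in use" only if it has at least one RUNNING instance.
func zoneHasRunningInstances(svc *compute.Service, projectID, zone string) (bool, error) {
	listCall := svc.Instances.List(projectID, zone)
	// Server-side filter (GCE v1 filter syntax of the era); excludes stopping/terminated instances.
	listCall = listCall.Filter("status eq RUNNING")
	// Just a minimal set of fields - we only care about existence.
	listCall = listCall.Fields("items(name)")
	res, err := listCall.Do()
	if err != nil {
		return false, err
	}
	return len(res.Items) > 0, nil
}
```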
@quinton-hoole @bprashanth @pmorie This is marked as a release blocker and the author has a lot of other things on his plate. Can one of the reviewers please give an LGTM if all major issues/questions have been addressed, and leave minor nits for follow-up PRs?
Lots of comments describing the heuristics, how it fits together and the limitations. In particular, we can't guarantee correct volume placement if the set of zones is changing between allocating volumes.
@bgrant0607 @smarterclayton @saad-ali help ... this is another that is going to cause me heartburn :( Can we get some momentum on review?
GCE e2e build/test passed for commit dd94997.
// volumes and the health of zones.

// Just a minimal set of fields - we only care about existence
listCall = listCall.Fields("items(name)")
@zmerlynn FYI, another list call on fields, but I think this one's OK since it's essentially just checking for exit code 0.
This looks fine. The function could be a bit more pessimistic about which zones it uses if the nodeInstancePrefix I added in #27741 wasn't actually advisory, but right now we let people join arbitrary garbage to a master, so this sort of check is about all that you can do.
I gotta get this into a branch that I have to cut for a demo @bprashanth @zmerlynn ... look good? @justinsb good to go? Thanks btw ... this would have been a big BOOM for me :P
I'm fine with this; I only had a bunch of nits. I'd really like the unit tests I mentioned so I'm not allowed to silently break this feature. We don't have an e2e either, but I think we can address that through a follow-up (filing an issue).
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit dd94997.
Automatic merge from submit-queue
Long term we plan on integrating this into the scheduler, but in the
short term we use the volume name to place it onto a zone.
We hash the volume name so we don't bias to the first few zones.
If the volume name "looks like" a PetSet volume name (ending with
-<number>) then we use the number as an offset. In that case we hash
the base name.