WIP / PoC: Spread volume creation across zones #27257
Conversation
cc @bprashanth I also thought of another flaw: if we have a large number of AZs, we probably don't want to always allocate PetSets in the first N AZs, so we should probably offset by the hash of the name or similar. But that's easy enough to add if we like the approach.
Force-pushed from a691998 to b4f35f7
GCE e2e build/test failed for commit b4f35f7. Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.
I'm fine with annotations on claims to denote spreading for alpha, but it feels more natural to express spreading the same way with or without volumes (PetSets require spreading even if they're backed by emptyDir). This feels a little like data gravity: if a pod is scheduled to one node/zone/CP, it should stick to that node/zone/CP because that's where its data is, but the initial placement is dictated by something else with broader context. We're essentially embedding a zone scheduler into the PetSet controller by teaching it to assign a zone index on PVCs. Why not teach it to assign spec.nodeName based on index as well? That said, I haven't dug into how we're already doing this in other places, so maybe we're bound by constraints of the cloud.
Well, I presume there's a scheduler priority function to spread PetSet members across nodes/zones (like we have with RCs). I think if they are members of the same service you get this for free anyway. The problem is that once you attach a volume, there's a predicate that gets involved which will override any spreading requirements, because volumes can't be attached across zones. I'm not sure that the data & no-data cases should be treated the same, because in the case of no-data we probably would run two pods in the same zone when a zone goes down, whereas in the case of data we're constrained by the zones in which the volumes exist. But in any case, they can't be treated the same in today's k8s API design :-)
That's pod anti-affinity: you get to say "I want no more than 1 pod on any node with this label" (the label being zone), or supply a hint that won't actually force rescheduling. The Service spreading today is more best-effort.
Hmm, so maybe it's just OK to say that if you want zone spreading with PetSet there's only the one relatively dumb way to get it, and it doesn't play well with any other predicates. Especially if we're already doing that. @kubernetes/sig-scheduling thoughts?
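For context, a minimal sketch of the kind of zone-level pod anti-affinity being described above, written against today's Go client types (which postdate this discussion); the `app` label, package name, and topology key choice are illustrative assumptions, not something from this PR:

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zoneAntiAffinity asks the scheduler not to co-locate pods matching the
// given app label within the same zone (a hard requirement). The topology
// key below is the legacy failure-domain zone label; newer clusters use
// "topology.kubernetes.io/zone".
func zoneAntiAffinity(appLabel string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": appLabel},
					},
					TopologyKey: "failure-domain.beta.kubernetes.io/zone",
				},
			},
		},
	}
}
```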
Is this saying there's a workaround currently? Or are you saying that one way is this PR? Maybe I've just missed a trick (it's not unlikely!)
Could someone clarify whether the PVs you are talking about are created on-demand in the course of scheduling a pet, or are created ahead of time and then the pet references a claim? In the first case, I agree that the solution is a scheduler policy that spreads pets across zones, and then just make sure that the volumes that are created get created in the same zone as the pet. But IIUC Justin is trying to address the second case, where the PetSet doesn't even exist at the time you create the PV.
Yeah, I'm also more concerned about the first. If the user has already created a PV, we must schedule to that zone. Making the dynamic volume provisioner aware of how/where to provision a PV based on where the pod is going complicates things a bit, but feels more extensible (as a gut feeling, not based on practical implications). There's an annotation on the PVC that gets the attention of a PV provisioner running in the kube-controller-manager (https://github.com/kubernetes/contrib/blob/master/pets/mysql/galera/mysql-galera.yaml#L115). If the user is going to hand-create their PVs, they'd just not specify the annotation on the template. The PetSet controller only stamps out PVCs.
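As a rough illustration of the kind of claim the PetSet controller stamps out, here is a sketch in Go using current API types; the alpha storage-class annotation key, the function name, and the example claim name are assumptions for illustration, not code or values taken from this PR:

```go
package provisioning

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newIndexedClaim builds a PVC carrying the annotation that asks the
// dynamic provisioner to create a PV for it.
func newIndexedClaim(name string) *corev1.PersistentVolumeClaim {
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name: name, // e.g. "data-galera-0"; the trailing ordinal comes from the PetSet
			Annotations: map[string]string{
				// Assumed 1.3-era alpha annotation that triggers dynamic provisioning;
				// hand-created, pre-bound PVs would simply omit it.
				"volume.alpha.kubernetes.io/storage-class": "anything",
			},
		},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// A real claim also sets spec.resources.requests.storage; omitted here
			// to keep the sketch independent of API library versions.
		},
	}
}
```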
I may have got the wrong end of the stick, but I'd like the "one command" case for PetSet to spread volumes & pods across zones. Following along with this example https://github.com/kubernetes/contrib/tree/master/pets/zookeeper#bootstrap, my understanding is that this should just work out of the box. Except that in a multi-AZ k8s cluster, any PVs we create will all be in one zone (the active master's zone), so the pods can only be scheduled in that zone. The strawman solution here allows dynamic PVs to be created in multiple zones with a generalized "hint" annotation, and has PetSets set that hint to the index in the PetSet (and the way the hint works, sequential numbers will round-robin across the available zones). A side effect of the PR is that users can also specify the same annotation for PVCs outside of PetSets, and even if they don't, we will spread dynamically provisioned PVs (and thus the pods that use them) across zones. But that's a side benefit of not hard-coding a link between PetSets & multi-zone provisioning.
I don't think the dynamic volume provisioner has any knowledge of the pod, or indeed where that pod will be scheduled? Can you explain a bit more how this would work?
I see. On the one hand, what you're doing makes sense, and we don't currently have a mechanism to solve the (very reasonable) problem you've identified. On the other hand, this solution is a little weird for the reason @bprashanth mentioned -- you're effectively creating a scheduling policy that doesn't live in the scheduler but which dictates (or at least constrains) where pods can schedule. We've talked about integrating pod and dynamic volume scheduling in the past (I believe the use case in the past was to avoid a situation where the volume can schedule or the pod can schedule, but not both -- IIUC both of those are possible today?), and it seems like this is another case where it would be useful. It seems pretty complicated, though.
If the long-term plan is to integrate pod & volume scheduling, that's great. Perhaps then, as a fix until that happens, we don't define an annotation, but instead figure out a side-channel to "smuggle" the PetSet index. I think I should be able to get it from the volume name (but haven't tested it yet). So the logic would be roughly: hash the claim name to pick a zone, and if the name ends with an ordinal index (as PetSet claims do), add that index so sequential claims round-robin across zones.
It's a hack, but it (should) work and means we deliver the right results until the great scheduler rewrite. Or we can smoothly replace it with an annotation later if we want to, without having to rush into anything now. We're also not making any strong guarantees that will constrain us in the future, other than that PetSet volumes normally land in different AZs. (I'm much happier suggesting this for 1.3 than rushing us into an annotation, even if it is an alpha annotation.)
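A minimal sketch (in Go, with hypothetical names) of the side-channel idea being proposed here: recover an ordinal index from the claim name itself, so no new annotation is needed. This illustrates the approach under discussion, not the code in this PR:

```go
package spreading

import (
	"regexp"
	"strconv"
)

// ordinalSuffix matches names that end in "-N", as PetSet-created PVCs do.
var ordinalSuffix = regexp.MustCompile(`^(.*)-(\d+)$`)

// claimOrdinal splits a claim name into its base and trailing ordinal, e.g.
// "data-mysql-2" -> ("data-mysql", 2, true). Hand-created claims without a
// trailing index return ok == false.
func claimOrdinal(claimName string) (base string, index int, ok bool) {
	m := ordinalSuffix.FindStringSubmatch(claimName)
	if m == nil {
		return claimName, 0, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return claimName, 0, false
	}
	return m[1], n, true
}
```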
Can we just have the scheduler write this annotation when it picks a zone?
Can you put up a concept PR, @bprashanth?
I thought the scheduler runs after the PV auto-provisioner, so wouldn't what you suggested be too late?
I thought that limitation was because of the "we must schedule to the zone this volume is in" issue. I'm asking why we can't reverse that in the case where the user has not hand provisioned the volume.
Not sure I can get to it for 1.3; will give it some thought.
Btw, for this to work we need something like data gravity, I think.
I think I'm OK with this PR since it's alpha, so we can rip it out and replace it with a longer-term solution once we figure out what that might be. And it does solve a reasonable problem that @justinsb has pointed out. @bprashanth, what do you think?
My apologies for chiming in so late, @justinsb.
One thing that worries me a bit is capacity management (e.g. we place the PVC in a zone, are then bound to schedule the pod into that zone, but can't because it's out of capacity). But I think we can solve that in a second pass, perhaps together with autoscaling. Both PetSet and multi-zone clusters are still beta, after all.
I spoke to Justin offline; I think he's going to modify this PR and add some TODOs about a better future plan. I'd really like to just say: PetSet gives you ordinality, and clients can use this as they see fit. I'm fine with a short-term alpha workaround. The plan we discussed was to teach the PetSet controller to add a label with the pet's ordinal index.
Opened #27553 with the implementation as discussed with @bprashanth. It is a similar fix to this one, but without committing us to an annotation (and also fixing a more critical issue that blocked any creation of PVs in non-master zones on GCE), so later we can move this into the scheduler so it can schedule pods & volumes together. I was able to use the PVC.Name as the label. In general we now spread volumes across zones by simply hashing the PVC name, but if the PVC name is indexed (ends with -N) we try to round-robin them across zones (we add the index to the hash). PetSets produce indexed PVCs. This doesn't let LoadBalancers target particular pods, but it does fix the general long-standing issue that we create volumes only in the master zone, fixes a GCE bug that prevented volumes being created anywhere but the master zone, and gives users the zone-spreading behaviour they expect from PetSets.
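A minimal sketch of the spreading rule described above (hash the base claim name, add the ordinal if the name is indexed, then take it modulo the number of zones); the function and package names are hypothetical and this is not the actual #27553 code:

```go
package spreading

import (
	"hash/fnv"
	"sort"
)

// chooseZoneForClaim picks a zone deterministically from the claim name:
// the base name is hashed, and the ordinal (if any) is added so that
// consecutive claims from one PetSet round-robin across zones.
func chooseZoneForClaim(claimName string, zones []string) string {
	sorted := append([]string(nil), zones...)
	sort.Strings(sorted) // deterministic ordering regardless of input order
	if len(sorted) == 0 {
		return ""
	}

	base, index, _ := claimOrdinal(claimName) // from the earlier sketch
	h := fnv.New32a()
	h.Write([]byte(base))
	return sorted[(h.Sum32()+uint32(index))%uint32(len(sorted))]
}
```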
Strawman / PoC / WIP for #27256
In order to spread volumes across zones, we define a zone hint annotation on
PVCs. The value can either be an explicit zone name, or it can be an
integer value.
The cloud provider (currently just implemented for AWS) looks at the
zone hint and matches it to the zones:

* If it's an explicit zone name, it uses that zone
* If it's an integer, it chooses the Nth zone mod the number of zones
* Otherwise it hashes the zone hint or volume name, and uses that as the
  index
A PetSet sets the annotation with the index in the PetSet.
The intention here is that we will generally spread across zones, and if
a PetSet is created the volumes should land in different zones (assuming
there are enough zones).
A shortcoming is that if the set of zones changes, the allocations are
no longer consistent. Further, if that happens in the middle of a
PetSet creation, we might allocate volumes in the same zone when we
didn't have to.
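As a rough illustration of the hint-resolution rules listed above, here is a sketch in Go; the function name, the treatment of an empty hint, and the choice of hash are assumptions, not the PR's actual AWS implementation:

```go
package spreading

import (
	"hash/fnv"
	"sort"
	"strconv"
)

// resolveZoneHint maps a zone-hint value to a zone: an exact zone name wins,
// an integer is taken modulo the number of zones, and anything else falls
// back to hashing the hint (or the volume name when the hint is empty).
func resolveZoneHint(hint, volumeName string, zones []string) string {
	sorted := append([]string(nil), zones...)
	sort.Strings(sorted)
	if len(sorted) == 0 {
		return ""
	}

	// Explicit zone name.
	for _, z := range sorted {
		if z == hint {
			return z
		}
	}

	// Integer hint: the Nth zone, mod the number of zones.
	if n, err := strconv.Atoi(hint); err == nil && n >= 0 {
		return sorted[n%len(sorted)]
	}

	// Otherwise hash the hint or the volume name and use that as the index.
	key := hint
	if key == "" {
		key = volumeName
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return sorted[h.Sum32()%uint32(len(sorted))]
}
```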