
WIP / PoC: Spread volume creation across zones #27257

Closed

Conversation

justinsb
Member

Strawman / PoC / WIP for #27256

In order to spread volumes across zones, we define a zone hint annotation on
PVCs. The value can either be an explicit zone name, or it can be an
integer value.

The cloud provider (currently implemented only for AWS) reads the
zone hint and maps it to one of the available zones:

  • If it's an explicit zone name, it uses that zone
  • If it's an integer N, it uses the Nth zone, modulo the number of zones
  • Otherwise it hashes the zone hint (or the volume name) and uses the
    result as the index

A PetSet sets the annotation to the member's index within the PetSet.

The intention here is that we will generally spread across zones, and if
a PetSet is created the volumes should land in different zones (assuming
there are enough zones).

A shortcoming is that if the set of zones changes, the allocations are
no longer consistent. Further, if this happens in the middle of a
PetSet creation, we might allocate volumes in the same zone when we
didn't have to.
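
To make the matching rules above concrete, here is a minimal, hypothetical sketch of how a provisioner could interpret such a hint. The function name, the FNV hash, and the zone list are illustrative only; this is not the AWS code in this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// pickZoneForHint sketches the three rules described above: an explicit zone
// name wins, an integer hint selects the Nth zone modulo the zone count, and
// anything else falls back to hashing the hint (or the volume name) to pick
// an index. Names and hash choice are illustrative, not the PR's code.
func pickZoneForHint(zones []string, hint, volumeName string) string {
	// Rule 1: an explicit zone name selects that zone.
	for _, z := range zones {
		if hint == z {
			return z
		}
	}
	// Rule 2: an integer hint N selects zones[N % len(zones)].
	if n, err := strconv.ParseUint(hint, 10, 32); err == nil {
		return zones[uint32(n)%uint32(len(zones))]
	}
	// Rule 3: hash the hint, or the volume name if no hint was given.
	key := hint
	if key == "" {
		key = volumeName
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return zones[h.Sum32()%uint32(len(zones))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	fmt.Println(pickZoneForHint(zones, "us-east-1b", "pvc-data")) // explicit zone
	fmt.Println(pickZoneForHint(zones, "1", "pvc-data"))          // integer hint
	fmt.Println(pickZoneForHint(zones, "", "pvc-data"))           // hash fallback
}
```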

@justinsb
Member Author

cc @bprashanth

I also thought of another flaw, which is that if we have a large number of AZs we probably don't always want to allocate PetSets in the first N AZs, so we should probably offset by the hash of the name or similar. But that's easy enough to add if we like the approach.

@justinsb justinsb force-pushed the pvc_zone_spreading branch from a691998 to b4f35f7 on June 12, 2016 at 22:08
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Jun 12, 2016
@k8s-bot

k8s-bot commented Jun 12, 2016

GCE e2e build/test failed for commit b4f35f7.

Please reference the list of currently known flakes when examining this failure. If you request a re-test, you must reference the issue describing the flake.

@bprashanth
Contributor

bprashanth commented Jun 12, 2016

I'm fine with annotations on claims to denote spreading for alpha, but it feels more natural to express spreading the same way with or without volumes (petsets require spreading even if they're backed by emptydir). This feels a little like data gravity: if a pod is scheduled to one node/zone/CP, it should stick to that node/zone/CP because that's where its data is, but the initial placement is dictated by something else with broader context.

We're essentially embedding a zone scheduler into the petset controller by teaching it to assign a zone index on pvcs. Why not teach it to assign spec.nodeName based on index as well? That said, I haven't dug into how we're already doing this in other places, so maybe we're bound by constraints of the cloud.

@justinsb
Member Author

Well, I presume there's a scheduler priority function to spread petset members across nodes / zones (like we have with RCs). I think if they are members of the same service you get this for free anyway.

The problem is that once you attach a volume, there's a Predicate that gets involved which will override any spreading requirements, because volumes can't be attached across zones.

I'm not sure that the data & no-data cases should be treated the same, because in the case of no-data we probably would run two pods in the same zone when a zone goes down, whereas in the case of data we're constrained by the zones in which the volumes exist. But in any case, they can't be treated the same in today's k8s API design :-)

@bprashanth
Contributor

Well, I presume there's a scheduler priority function to spread petset members across nodes / zones (like we have with RCs). I think if they are members of the same service you get this for free anyway.

That's pod antiaffinity: you get to say "I want no more than 1 pod on any node with this label" (the label being zone), or supply a hint that won't actually force rescheduling. The Service spreading today is more best-effort.

The problem is that once you attach a volume, there's a Predicate that gets involved which will override any spreading requirements, because volumes can't be attached across zones.

Hmm, so maybe it's just ok to say if you want zone spreading with petset there's only the one relatively dumb way to get it, and it doesn't play well with any other predicates. Especially if we're already doing that. @kubernetes/sig-scheduling thoughts?

@justinsb
Member Author

Hmm, so maybe it's just ok to say if you want zone spreading with petset there's only the one relatively dumb way to get it, and it doesn't play well with any other predicates.

Is this saying there's a workaround currently? Or are you saying that one way is this PR? Maybe I've just missed a trick (it's not unlikely!)

@davidopp
Member

Could someone clarify whether the PVs you are talking about are created on-demand in the course of scheduling a pet, or are created ahead of time and then the pet references a claim?

In the first case, I agree that the solution is a scheduler policy that spreads pets across zones, and then just make sure that the volumes that are created get created in the same zone as the pet. But IIUC Justin is trying to address the second case, where the PetSet doesn't even exist at the time you create the PV.

@bprashanth
Contributor

Could someone clarify whether the PVs you are talking about are created on-demand in the course of scheduling a pet, or are created ahead of time and then the pet references a claim?

Yeah I'm also more concerned about the first. If the user has already created a pv, we must schedule to that zone. Having the dynamic volume provisioner aware of how/where to provision a pv based on where the pod is going complicates things a bit, but feels more extensible (as a gut feeling, not based on practical implications).

There's an annotation on the pvc that gets the attention of a pv provisioner running in the kube-controller-manager (https://github.com/kubernetes/contrib/blob/master/pets/mysql/galera/mysql-galera.yaml#L115). If the user is going to hand-create their pvs, they'd just not specify the annotation on the template. The petset controller only stamps out pvcs.

@justinsb
Member Author

justinsb commented Jun 13, 2016

I may have got the wrong end of the stick, but I'd like the "one command" case for PetSet to spread volumes & pods across zones.

Following along with this example https://github.com/kubernetes/contrib/tree/master/pets/zookeeper#bootstrap, my understanding is that kubectl create -f zookeeper.yaml will create 3 PVCs and 3 pods bound to those PVCs. The PVCs have the volume.alpha.kubernetes.io/storage-class annotation, so will auto-provision PVs. Instant Zookeeper cluster, which is great!

Except that in a multi-AZ k8s cluster, any PVs we create will all be in one zone (the active master's zone). So the pods can only be scheduled in that zone.

The strawman solution here allows dynamic PVs to be created in multiple zones with a generalized "hint" annotation, and has PetSets set that hint to the index in the PetSet (and the way the hint works, sequential numbers will round-robin across the available zones).

A side effect of the PR is that users can also specify the same annotation for PVCs outside of PetSets, and even if they don't, we will spread dynamically provisioned PVs (and thus the pods that use them) across zones. But that's a side-benefit of not hard-coding a link between PetSets & multi-zone provisioning.


Having the dynamic volume provisioner aware of how/where to provision a pv based on where the pod is going complicates things a bit, but feels more extensible (as a gut feeling, not based on practical implications).

I don't think the dynamic volume provisioner has any knowledge of the pod, or indeed where that pod will be scheduled? Can you explain a bit more how this would work?

@davidopp
Member

davidopp commented Jun 13, 2016

I see.

On the one hand, what you're doing makes sense, and we don't currently have a mechanism to solve the (very reasonable) problem you've identified. On the other hand, this solution is a little weird for the reason @bprashanth mentioned -- you're effectively creating a scheduling policy that doesn't live in the scheduler but which dictates (or at least constrains) where pods can schedule. We've talked about integrating pod and dynamic volume scheduling before (I believe the use case was to avoid a situation where the volume can schedule or the pod can schedule, but not both -- IIUC both of those are possible today?), and it seems like this is another case where it would be useful. It seems pretty complicated, though.

@justinsb
Member Author

justinsb commented Jun 13, 2016

If the long term plan is to integrate pod & volume scheduling, that's great. Perhaps then as a fix until this happens we don't define an annotation, but instead we figure out a side-channel to "smuggle" the PetSet index. I think I should be able to get it from the volume name (but haven't tested it yet).

So the logic would be:

  1. Create the volume in zone zones[i % len(zones)]
  2. In the general case i = hash(name) to avoid creating a hotspot in one zone
  3. If it "looks like" a PetSet volume, we extract out the PetSet volume index and use i = hash(basename) + petSetIndex

It's a hack, but it (should) work and means we deliver the right results until the great scheduler rewrite. Or we can smoothly replace it with an annotation later if we want to, without having to rush into anything now. We're also not making any strong guarantees that will constrain us in the future, other than that PetSet volumes normally land in different AZs.

(I'm much happier suggesting this for 1.3 than rushing us into an annotation, even if it is an alpha annotation)
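
A minimal sketch of the logic described in the list above, under the assumption that a PetSet volume can be recognized purely by a trailing "-N" index on its name; the helper names, the regexp heuristic, and the FNV hash are illustrative and not the eventual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"regexp"
	"strconv"
)

// indexedName matches names that end in "-N", the shape of the claims a
// PetSet stamps out (e.g. "datadir-zk-2"). This is the "looks like a PetSet
// volume" heuristic from the comment above.
var indexedName = regexp.MustCompile(`^(.+)-(\d+)$`)

// chooseZone hashes the base name to pick a starting zone, then adds the
// trailing index (if any) so sequential members of a set round-robin across
// the zone list. Hypothetical sketch only.
func chooseZone(zones []string, name string) string {
	base, index := name, uint64(0)
	if m := indexedName.FindStringSubmatch(name); m != nil {
		base = m[1]
		index, _ = strconv.ParseUint(m[2], 10, 32)
	}
	h := fnv.New32a()
	h.Write([]byte(base)) // hash(basename) spreads different sets across zones
	i := uint64(h.Sum32()) + index
	return zones[i%uint64(len(zones))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	// Claims from one PetSet land in consecutive zones; an unindexed claim
	// just hashes to some zone.
	for _, name := range []string{"datadir-zk-0", "datadir-zk-1", "datadir-zk-2", "my-claim"} {
		fmt.Println(name, "->", chooseZone(zones, name))
	}
}
```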

@bprashanth
Contributor

Can we just have the scheduler write this annotation when it picks a zone?

@justinsb
Member Author

Can you put up a concept PR @bprashanth ?

@davidopp
Member

Can we just have the scheduler write this annotation when it picks a zone?

I thought the scheduler runs after the PV auto-provisioner runs, so what you suggested would be too late?

@bprashanth
Contributor

I thought that limitation was because of the "we must schedule to the zone this volume is in" issue. I'm asking why we can't reverse that in the case where the user has not hand provisioned the volume.

Can you put up a concept PR @bprashanth ?

Not sure I can get to it for 1.3, will give it some thought.

@bprashanth
Contributor

Btw, for this to work we need something like data gravity, I think.

@davidopp davidopp added this to the v1.3 milestone Jun 16, 2016
@davidopp
Member

I think I'm OK with this PR since it's alpha so we can rip it out and replace it with a longer-term solution once we figure out what that might be. And it does solve a reasonable problem that @justinsb has pointed out.

@bprashanth what do you think?

@ghost

ghost commented Jun 16, 2016

My apologies for chiming in so late @justinsb

  1. I do like the general idea.
  2. Yes, I'd avoid the annotation, for the reasons @justinsb mentions.
  3. And yes, the hotspot avoidance part would be great, although in practice, given that very few cloud providers have many more than 3 zones per region, and very few petsets will have fewer than 3 pets, I think you'll be reasonably OK even without that.

One thing that worries me a bit is capacity management (e.g. we place the PVC in a zone, are then bound to schedule the pod into that zone, but we can't because it's out of capacity). But I think that we can solve that in a second pass, perhaps together with autoscaling. Both Petset and Multizone Clusters are still Beta, after all.

@bprashanth
Contributor

I spoke to Justin offline; I think he's going to modify this PR and add some TODOs about a better future plan. I'd really like to just say: petset gives you ordinality, and clients can use this as they see fit. I'm fine with a short-term alpha workaround.

The plan we discussed was to teach the petset controller to add a label with name: pet-X which is useful in other scenarios too (users can create a service.type=LoadBalancer over that label if they want a true vm experience). This keeps the zone assignment out of the identity of a pet. The volume controller will parse out the label and map it to a zone.

@justinsb
Member Author

Opened #27553 with the implementation as discussed with @bprashanth. It is a similar fix to this one, but without committing us to an annotation (it also fixes a more critical issue which blocked any creation of PVs in non-master zones on GCE), so later we can move this into the scheduler so it can schedule pods & volumes together.

I was able to use the PVC.Name as the label. In general we now spread volumes across zones by simply hashing the PVC name. But if the PVC name is indexed (ends with -N) then we try to round-robin them across zones (we add the index to the hash). PetSets produce indexed PVCs.

This doesn't let LoadBalancers target particular pods, but it does fix the general long-standing issue that we create volumes only in the master zone, fixes a GCE bug that prevented volumes being created anywhere but the master zone, and gives users the zone-spreading behaviour they expect from PetSets.

@justinsb justinsb closed this Jun 18, 2016