PetSet with multiple PVC #35695
Comments
/cc @kubernetes/sig-apps |
How does your cluster have nodes in multiple zones? Are you using GKE or GCE? I saw you said GCE, but just want to clarify that you are not using GKE. Are you using kubernetes federation, or do you have one cluster into which you added multiple zones? |
GCE, one cluster with multiple zones, configured by following the k8s multizone tutorial.
|
I am having what seems to be the same problem using K8s in AWS.
Creating a petset with two dynamic PVCs causes the two AWS EBS volumes to be created in different Availability Zones, which of course causes my pod to fail scheduling with the following error:
|
#27553 is relevant. But yeah, I'm not sure the "two volumes mounted from the same pod" case is handled. I guess you want the dynamic PV provisioner to somehow know which set of PVs should be co-located in the same zone because they will be used together by some pod? cc/ @justinsb @bprashanth |
Yes I just realised that the 'volumeClaimTemplates' are under .spec, not .containers, so you could potentially create several containers, each using one disk. But in that case wouldn't all of those containers still be scheduled on the same node anyway since they're in the same pod? Would they ever be scheduled across nodes or AZs? Does it ever make sense to create disks across AZs if every instance of that petset has to be scheduled on the same node anyway? I apologise if I'm missing something, I've only worked with Kubernetes for a few months. |
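To make the shape of the problem concrete: all claims produced from a set's volumeClaimTemplates belong to the same pod (pet), and each claim is handed to the dynamic provisioner on its own. The following is a minimal standalone sketch (not the real controller code; the "zk", "data" and "wal" names are made up) of the PVC names a two-template set produces per pet:

```go
package main

import "fmt"

// claimNamesForPet mimics the naming scheme used for PVCs created from
// volumeClaimTemplates: "<claimTemplateName>-<setName>-<ordinal>".
// Each resulting PVC is provisioned independently, which is why the two
// volumes of a single pet can end up in different zones.
func claimNamesForPet(setName string, claimTemplates []string, ordinal int) []string {
	names := make([]string, 0, len(claimTemplates))
	for _, tmpl := range claimTemplates {
		names = append(names, fmt.Sprintf("%s-%s-%d", tmpl, setName, ordinal))
	}
	return names
}

func main() {
	// A set with separate data and WAL disks, three pets.
	for ordinal := 0; ordinal < 3; ordinal++ {
		fmt.Println(claimNamesForPet("zk", []string{"data", "wal"}, ordinal))
	}
	// [data-zk-0 wal-zk-0]
	// [data-zk-1 wal-zk-1]
	// [data-zk-2 wal-zk-2]
}
```

Both claims of pet 0 must be mounted by the same pod, so they have to end up in the same zone, even though nothing ties them together at provisioning time.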
I also seem to be experiencing this issue. We run a cluster across multiple AZs with kops, and we have a petset with 2 volume claims, and it always schedules the two PVs in different zones, thus the pod can never be scheduled. The funny thing is that we do have a 2-PVC petset that doesn't have this issue, and the YAMLs seem pretty much the same. |
By default, PVs are dynamically provisioned in random zones, so you may get lucky and get two PVs for a single pod in the same zone, and the pod can be scheduled. More often you get two PVs in two different zones and a pod that uses them both is not schedulable, at least on GCE. Question is how to fix this. We could mark PVCs created from a pet set with an annotation (say petset name + pet index) so the dynamic provisioner could create all PVs with the same petset name & index in the same zone, however it would need some changes in the petset controller and all provisioners. |
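To illustrate the annotation idea above, here is a hypothetical sketch; the annotation key and helper below do not exist in Kubernetes, they only show what a provisioner-side co-location key could look like:

```go
package main

import "fmt"

// Hypothetical annotation the petset controller could stamp on every PVC it
// creates (petset name + pet index). Neither the key nor this helper exists
// upstream; this only sketches the proposal above.
const petAnnotation = "petset.alpha.kubernetes.io/pet"

// coLocationKey returns the value a dynamic provisioner could group on:
// all PVCs sharing the same key would be provisioned in the same zone.
func coLocationKey(pvcAnnotations map[string]string) (string, bool) {
	key, ok := pvcAnnotations[petAnnotation]
	return key, ok
}

func main() {
	annotations := map[string]string{petAnnotation: "zk-0"} // petset name + pet index
	if key, ok := coLocationKey(annotations); ok {
		fmt.Printf("provision in the zone already chosen for %q\n", key)
	}
}
```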
If someone wanted to steer PVCs on their own, how would they do it? I would prefer this solution to be general enough that anyone doing bulk PVC creation (like a federation server) can do proper zone steering without custom hacks.
|
I plan to allow users to configure a zone/region in PVCs, see kubernetes/community#247, however it's per PVC. If PVCs are created from a template (e.g. in a PetSet), all of them will either end up in the same zone (if it was configured so in the template) or they end up in a random zone (no configuration was present there). Dynamic provisioning is already PetSet-aware, it actively tries to provision PVCs for individual pets in different zones: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util.go#L287 |
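For readers who don't want to follow the link, the heuristic behind that line works roughly like the following heavily simplified sketch; the real util.go code differs in detail, and the zone names and FNV hash here are illustrative only:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// Simplified sketch of the zone-spreading heuristic linked above: for a PVC
// named like "data-zk-3", split off the trailing ordinal, hash the remaining
// prefix, and round-robin (hash + ordinal) over the sorted zone list so that
// consecutive pets land in different zones. Not the exact upstream code.
func chooseZone(pvcName string, zones []string) string {
	sort.Strings(zones)
	hashInput := pvcName
	ordinal := uint32(0)
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil {
			hashInput = pvcName[:i+1] // e.g. "data-zk-"
			ordinal = uint32(n)
		}
	}
	h := fnv.New32a()
	h.Write([]byte(hashInput))
	return zones[(h.Sum32()+ordinal)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	// The two claims of pet 0 hash different prefixes ("data-zk-" vs "wal-zk-"),
	// so nothing guarantees they land in the same zone -- the bug in this issue.
	fmt.Println(chooseZone("data-zk-0", zones), chooseZone("wal-zk-0", zones))
}
```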
@jsafrane, The eventual goal is to not have any PetSet (StatefulSet) specific code, so that the controller is forkable. Have we considered a way to have |
@jsafrane I also suggest that the ChooseZoneForVolume function filter out the zones where the petset's (or other object's) pods cannot be scheduled. |
@foxish, if there is no StatefulSet specific code in provisioning code then the StatefulSet controller must choose the zone. In #37497 we implement the code in provisioners that would allow it, however then the StatefulSet controller needs to get the list of zones from somewhere. And it's possible that the list is only in StorageClass. |
Users shouldn't have to read storage classes in order to create PVCs that spread appropriately. We've already decided that PVCs have labels; that probably implies that we have to have affinity and anti-affinity rules on PVCs.
|
Assuming something like kubernetes/community#306, where PVCs and PVs are used to manage local storage, we need a clear way to specify that PVCs need to bind to local PVs on the same host. This will be important for users who want to run storage systems that benefit from having separate disks for their WALs and/or use multiple disks for their data (e.g. ZooKeeper, Cassandra, Kafka). Required PVC affinity seems like a clear way to specify this. |
I think we need a data gravity discussion here - this is pretty deep into @kubernetes/sig-scheduling-misc's responsibilities w.r.t. deciding how scheduling decisions that depend on data gravity manifest. The StatefulSet controller should not have to do magic in order for data gravity to take effect.
|
As @jsafrane pointed out, the zone selection is actually not random for StatefulSets: we round-robin around the available zones so that not all the StatefulSet pets land in the same zone (see pkg/volume/util.go line 287 at commit 2cb17cc).
This is very much the scheduler's department, and we merged the PR knowing it had limitations that really could only be addressed by the scheduler, but that these were not imminent. So we opted away from doing intrusive changes such as labels, because this would make life harder for the scheduler.

The zone selection is heuristic, because we don't have accurate information - we do the best we can, but we do have a bug where we consider the master's zone, and some users run their masters in separate zones from their nodes. We also have bugs where placement won't be accurate if the set of zones is actively changing. However, fixing this requires figuring out how we identify a master, which has proved difficult - we can't really use Schedulability because the entire notion of schedulability is being replaced by taints. So we would need to know the pod for which we are creating the volume, and we're firmly into scheduler territory again...

But this particular problem (2 volumes on the same pet) is my fault - I had simply not considered it. However, I think the fix could be simple. Per the comment in the code:
i.e. we currently observe names that match the pattern "ClaimName-StatefulSetName-Id", hash the prefix "ClaimName-StatefulSetName-", and then add the index to key into the list of zones. The fix would be to hash only by "StatefulSetName" so that all the claims for a given pet map to the same zone. I'll send a PR right away; I suspect this mini essay will be longer than the fix. I do consider that this is tacky mustaches-and-capes magic, and should be replaced with data gravity when the scheduler can do it. |
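A sketch of the shape of that fix, under the same simplifications as the heuristic sketch earlier in this thread (not the actual upstream diff): when the PVC name looks like "claim-set-ordinal", drop the claim-template segment and hash only the set name, so every claim of the same pet maps to the same zone while different pets still round-robin across zones.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
	"strings"
)

// chooseZoneFixed mirrors the simplified heuristic sketched earlier but
// ignores the claim-template name: "data-zk-0" and "wal-zk-0" both hash "zk-"
// and therefore pick the same zone, while adjacent pets still spread.
// This shows the shape of the fix, not the exact upstream code.
func chooseZoneFixed(pvcName string, zones []string) string {
	sort.Strings(zones)
	hashInput := pvcName
	ordinal := uint32(0)
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil {
			ordinal = uint32(n)
			prefix := pvcName[:i+1] // e.g. "data-zk-"
			if j := strings.Index(prefix, "-"); j != -1 {
				prefix = prefix[j+1:] // drop the claim name -> "zk-"
			}
			hashInput = prefix
		}
	}
	h := fnv.New32a()
	h.Write([]byte(hashInput))
	return zones[(h.Sum32()+ordinal)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	// Both claims of pet 0 now agree on a zone.
	fmt.Println(chooseZoneFixed("data-zk-0", zones) == chooseZoneFixed("wal-zk-0", zones)) // true
	// Adjacent pets still land in different zones.
	fmt.Println(chooseZoneFixed("data-zk-0", zones) == chooseZoneFixed("data-zk-1", zones)) // false
}
```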
Idea: we should have a periodic (daily?) e2e run that does (1) the full test suite including volumes, in (2) HA, with (3) multizone. There's actually a whole grid of these I would like to try out (e.g. now that it is just another flag to activate weave / calico / kopeio-networking, we should test overlay networks and private networks!), but I suggest this would be a great place to start. @zmerlynn - can we make it happen? |
We have some heuristics that ensure that volumes (and hence stateful set pods) are spread out across zones. Sadly they forgot to account for multiple mounts. This PR updates the heuristic to ignore the mount name when we see something that looks like a statefulset volume, thus ensuring that multiple mounts end up in the same AZ. Fix kubernetes#35695
We create a StatefulSet with two volumes per pod. We reuse the zookeeper StatefulSet, but we don't currently use the second volume. That will need to be fixed in the zookeeper image in the contrib repo. However, this should be enough to detect issue kubernetes#35695.
Automatic merge from submit-queue (batch tested with PRs 41667, 41820, 40910, 41645, 41361) Allow multiple mounts in StatefulSet volume zone placement We have some heuristics that ensure that volumes (and hence stateful set pods) are spread out across zones. Sadly they forgot to account for multiple mounts. This PR updates the heuristic to ignore the mount name when we see something that looks like a statefulset volume, thus ensuring that multiple mounts end up in the same AZ. Fix #35695 ```release-note Fix zone placement heuristics so that multiple mounts in a StatefulSet pod are created in the same zone ```
Kubernetes version (use kubectl version):
Environment:
What happened:
Creating a PetSet with multiple PVCs allocates volumes in different zones, and then pod scheduling fails (volume A is in zone A, volume B is in zone B, and no node is in both zones at the same time).
What you expected to happen:
Allocate volumes in the same zone.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know: