
PetSet with multiple PVC #35695

Closed
nvartolomei opened this issue Oct 27, 2016 · 19 comments
Labels
area/stateful-apps sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@nvartolomei
Contributor

nvartolomei commented Oct 27, 2016

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.4+3b417cc", GitCommit:"3b417cc4ccd1b8f38ff9ec96bb50a81ca0ea9d56", GitTreeState:"not a git tree", BuildDate:"2016-10-21T22:34:07Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.3", GitCommit:"4957b090e9a4f6a68b4a40375408fdc74a212260", GitTreeState:"clean", BuildDate:"2016-10-16T06:20:04Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GCE

What happened:
Creating a PetSet with multiple PVCs allocates the volumes in different zones, and pod scheduling then fails (volume A is in zone A, volume B is in zone B, and no node is in both zones at once).

What you expected to happen:
Allocate volumes in the same zone.

How to reproduce it (as minimally and precisely as possible):

apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: monitoring-influxdb-grafana-v4
  namespace: kube-system
  labels: 
    k8s-app: influxGrafana
    version: v4
    kubernetes.io/cluster-service: "true"
spec: 
  serviceName: monitoring-influxdb
  replicas: 1
  template: 
    metadata: 
      labels: 
        k8s-app: influxGrafana
        version: v4
        kubernetes.io/cluster-service: "true"
    spec: 
      containers: 
        - image: gcr.io/google_containers/heapster_influxdb:v0.7
          name: influxdb
          resources:
            # keep request = limit to keep this container in guaranteed class
            limits:
              cpu: 100m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 500Mi
          ports: 
            - containerPort: 8083
            - containerPort: 8086
          volumeMounts:
          - name: influxdb-ps
            mountPath: /data
        - image: gcr.io/kuende-staging1/heapster_grafana:v3.1.1
          name: grafana
          resources:
            # keep request = limit to keep this container in guaranteed class
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
          env:
            # This variable is required to setup templates in Grafana.
            - name: INFLUXDB_SERVICE_URL
              value: http://monitoring-influxdb:8086
              # The following env variables are required to make Grafana accessible via
              # the kubernetes api-server proxy. On production clusters, we recommend
              # removing these env variables, setup auth for grafana, and expose the grafana
              # service using a LoadBalancer or a public IP.
            - name: GF_AUTH_BASIC_ENABLED
              value: "false"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ORG_ROLE
              value: Admin
            - name: GF_SERVER_ROOT_URL
              value: /api/v1/proxy/namespaces/kube-system/services/monitoring-grafana/
          volumeMounts:
          - name: grafana-ps
            mountPath: /var
  volumeClaimTemplates:
    - metadata:
        name: influxdb-ps
        annotations:
          volume.beta.kubernetes.io/storage-class: ssd
      spec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: grafana-ps
        annotations:
          volume.beta.kubernetes.io/storage-class: standard
      spec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi

Anything else we need to know:

@nvartolomei nvartolomei changed the title PetSet with multiple volumes PetSet with multiple PVC Oct 27, 2016
@foxish foxish added sig/apps Categorizes an issue or PR as relevant to SIG Apps. area/controller-manager and removed team/cluster area/kubectl labels Oct 27, 2016
@foxish
Contributor

foxish commented Oct 27, 2016

/cc @kubernetes/sig-apps

@erictune
Member

How does your cluster have nodes in multiple zones? Are you using GKE or GCE? I saw you said GCE, but just want to clarify that you are not using GKE.

Are you using kubernetes federation, or do you have one cluster into which you added multiple zones?

@nvartolomei
Contributor Author

GCE, one cluster with multiple zones, configured by following the k8s multi-zone tutorial.


@leosunmo

I am having what seems to be the same problem using K8s in AWS.

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.5", GitCommit:"5a0a696437ad35c133c0c8493f7e9d22b0f9b81b", GitTreeState:"clean", BuildDate:"2016-10-29T01:38:40Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.5", GitCommit:"5a0a696437ad35c133c0c8493f7e9d22b0f9b81b", GitTreeState:"clean", BuildDate:"2016-10-29T01:32:42Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Creating a petset with two dynamic PVCs causes the two AWS EBS volumes to be created in different Availability Zones, which of course causes my pod to fail scheduling with the following error:

1m  1m  1   {default-scheduler }        Warning FailedScheduling    pod (couchbase-0) failed to fit in any node
fit failure on node (ip-10-106-200-200.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-210-173.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-200-127.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-220-28.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-210-53.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-220-134.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-220-145.ec2.internal): NoVolumeZoneConflict

@davidopp
Member

#27553 is relevant.

But yeah, I'm not sure the "two volumes mounted from the same pod" case is handled. I guess you want the dynamic PV provisioner to somehow know which set of PVs should be co-located in the same zone because they will be used together by some pod?

cc/ @justinsb @bprashanth

@leosunmo

leosunmo commented Nov 15, 2016

Yes, I just realised that the volumeClaimTemplates are under .spec, not .containers, so you could potentially create several containers, each using one disk. But in that case wouldn't all of those containers still be scheduled on the same node anyway, since they're in the same pod? Would they ever be scheduled across nodes or AZs? Does it ever make sense to create disks across AZs if every instance of that petset has to be scheduled on the same node anyway?

I apologise if I'm missing something, I've only worked with Kubernetes for a few months.

@Brian-McM

I also seem to be experiencing this issue. We run a cluster across multiple AZs with kops, and we have a petset with 2 volume claims; it always schedules the two PVs in different zones, so the pod can never be scheduled. The funny thing is that we do have another 2-PVC petset that doesn't have this issue, and the YAMLs seem pretty much the same.

@jsafrane
Member

By default, PVs are dynamically provisioned in random zones, so you may get lucky and get two PVs for a single pod in the same zone, in which case the pod can be scheduled. More often you get two PVs in two different zones, and a pod that uses them both is not schedulable, at least on GCE.

The question is how to fix this. We could mark PVCs created from a pet set with an annotation (say, petset name + pet index) so that the dynamic provisioner could create all PVs with the same petset name & index in the same zone; however, that would need some changes in the petset controller and in all provisioners.
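
A minimal sketch of that annotation idea, with entirely made-up annotation keys (nothing like them exists in any provisioner today); the only point is that every PVC belonging to the same pet would yield the same zone key:

```go
package main

import "fmt"

// Hypothetical annotation keys that the petset controller could stamp onto the
// PVCs it creates from volumeClaimTemplates. The key names are made up for
// this sketch.
const (
	annPetSetName = "volume.alpha.kubernetes.io/petset-name"
	annPetIndex   = "volume.alpha.kubernetes.io/petset-index"
)

// zoneKeyForClaim returns the string a provisioner would hash when choosing a
// zone: every PVC carrying the same petset name + index shares a key (and
// therefore a zone), while unannotated PVCs keep today's per-claim behaviour.
func zoneKeyForClaim(claimName string, annotations map[string]string) string {
	if name, index := annotations[annPetSetName], annotations[annPetIndex]; name != "" && index != "" {
		return name + "-" + index
	}
	return claimName
}

func main() {
	ann := map[string]string{
		annPetSetName: "monitoring-influxdb-grafana-v4",
		annPetIndex:   "0",
	}
	// Both claims of pet 0 resolve to the same key, so a provisioner keyed on
	// it would place their volumes in the same zone.
	fmt.Println(zoneKeyForClaim("influxdb-ps-monitoring-influxdb-grafana-v4-0", ann))
	fmt.Println(zoneKeyForClaim("grafana-ps-monitoring-influxdb-grafana-v4-0", ann))
}
```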

@smarterclayton
Contributor

smarterclayton commented Jan 10, 2017 via email

@jsafrane
Member

I plan to allow users to configure a zone/region in PVCs, see kubernetes/community#247; however, it's per PVC. If PVCs are created from a template (e.g. in a PetSet), all of them will either end up in the same zone (if the template was configured that way) or in a random zone (if no configuration was present).

Dynamic provisioning is already PetSet-aware; it actively tries to provision PVCs for individual pets in different zones: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util.go#L287
The code does not account for one pet having multiple PVCs, so they may end up in different zones. My suggestion in comment #35695 (comment) would amount to a simple change in this code: it would hash PetSet.Name instead of PVC.Name, and all PVs for a pet would be in the same zone.
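
To make that concrete, here is a much-simplified stand-in for the hashing in util.go (not the real code, just its shape): the zone is picked from hash(name prefix) + pet index, and because two claims of the same pet hash different prefixes, they can land in different zones.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
	"strings"
)

// chooseZone is a toy stand-in for the round-robin heuristic: strip a trailing
// "-<index>" from the claim name, hash the remaining prefix, add the index,
// and use the result to pick a zone.
func chooseZone(zones []string, pvcName string) string {
	prefix, index := pvcName, uint32(0)
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil {
			prefix, index = pvcName[:i+1], uint32(n)
		}
	}
	h := fnv.New32a()
	h.Write([]byte(prefix))
	return zones[(h.Sum32()+index)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c"}
	// Both claims belong to pet 0 of the same set, but the hashed prefixes
	// ("influxdb-ps-..." vs "grafana-ps-...") differ, so the zones can differ.
	fmt.Println(chooseZone(zones, "influxdb-ps-monitoring-influxdb-grafana-v4-0"))
	fmt.Println(chooseZone(zones, "grafana-ps-monitoring-influxdb-grafana-v4-0"))
}
```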

@foxish
Contributor

foxish commented Jan 19, 2017

@jsafrane, The eventual goal is to not have any PetSet (StatefulSet) specific code, so that the controller is forkable. Have we considered a way to have volumeClaimTemplates specify the spreading policy, instead of hardcoding that behavior into the PVCs themselves?

@raffaelespazzoli

@jsafrane I also suggest filtering out, in the ChooseZoneForVolume function, those zones where the petset's (or other object's) pods cannot be scheduled.
For example, I have a situation where a zone contains only unschedulable nodes, and a dynamic volume happened to be created there, which obviously resulted in an error.

@jsafrane
Member

jsafrane commented Feb 1, 2017

@foxish, this random selection of zone for a stateful set member was done by @justinsb; perhaps he should explain why it needs specific code for that. Also, choosing the zone based on AWS/GCE instances has proven to be fragile, and we should choose among schedulable Kubernetes nodes instead.

@jsafrane
Member

jsafrane commented Feb 2, 2017

@foxish, if there is no StatefulSet-specific code in the provisioning code, then the StatefulSet controller must choose the zone. In #37497 we implement the code in the provisioners that would allow it; however, the StatefulSet controller then needs to get the list of zones from somewhere, and it's possible that the list is only in the StorageClass.

@smarterclayton
Contributor

smarterclayton commented Feb 2, 2017 via email

@kow3ns
Member

kow3ns commented Feb 2, 2017

Assuming something like kubernetes/community#306, where PVCs and PVs are used to manage local storage, we need a clear way to specify that PVCs need to bind to local PVs on the same host. This will be important for users that want to run storage systems that benefit from having separate disks for their WALs and/or use multiple disks for their data (e.g. ZooKeeper, Cassandra, Kafka). Required PVC affinity seems like a clear way to specify this.

@smarterclayton
Contributor

smarterclayton commented Feb 2, 2017 via email

@justinsb
Member

justinsb commented Feb 3, 2017

As @jsafrane pointed out, the zone selection is actually not random for StatefulSets: we round-robin around the available zones so that not all the StatefulSet pets land in the same zone.

func ChooseZoneForVolume(zones sets.String, pvcName string) string {

This is very much the scheduler's department; we merged the PR knowing it had limitations that could really only be addressed by the scheduler, but that these were not imminent. So we opted against intrusive changes such as labels, because that would make life harder for the scheduler. The zone selection is a heuristic, because we don't have accurate information - we do the best we can, but we do have a bug where we consider the master's zone, and some users run their masters in separate zones from their nodes. We also have bugs where placement won't be accurate if the set of zones is actively changing. However, fixing this requires figuring out how we identify a master, which has proved difficult - we can't really use schedulability, because the entire notion of schedulability is being replaced by taints. So we would need to know the pod for which we are creating the volume, and we're firmly into scheduler territory again...

But ... this particular problem (2 volumes on the same pet) is my fault - I had simply not considered it.

However, I think the fix could be simple. Per the comment in the code:

// Heuristic to make sure that volumes in a StatefulSet are spread across zones
// StatefulSet PVCs are (currently) named ClaimName-StatefulSetName-Id,
// where Id is an integer index

i.e. we currently observe names that match the pattern "ClaimName-StatefulSetName-Id" and hash the prefix "ClaimName-StatefulSetName-", and then add the index to key into the list of zones. The fix would be to hash only by "StatefulSetName" so that Claim1-SS1-1 and Claim2-SS1-1 would both map to hash("SS1") + 1 but Claim1-SS1-2 would map to hash("SS1") + 2.
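
A rough sketch of what that amounts to, assuming the StatefulSet name and pet index have already been extracted from the claim name (the merged fix does that parsing inside ChooseZoneForVolume itself):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// chooseZoneForPet keys the zone purely off the StatefulSet name plus the pet
// index, so Claim1-SS1-1 and Claim2-SS1-1 both map to hash("SS1")+1 (the same
// zone) while Claim1-SS1-2 lands one zone over. Sketch only: the real change
// lives in ChooseZoneForVolume in pkg/volume/util.go.
func chooseZoneForPet(zones []string, statefulSetName string, petIndex uint32) string {
	h := fnv.New32a()
	h.Write([]byte(statefulSetName))
	return zones[(h.Sum32()+petIndex)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-east-1b", "us-east-1c", "us-east-1d"}
	fmt.Println(chooseZoneForPet(zones, "SS1", 1)) // same zone for every claim of pet 1
	fmt.Println(chooseZoneForPet(zones, "SS1", 2)) // pet 2 rotates to the next zone
}
```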

I'll send a PR right away; I suspect this mini essay will be longer than the fix itself.

I do consider that this is tacky mustaches-and-capes magic, and should be replaced with data gravity when the scheduler can do it.

@justinsb
Member

justinsb commented Feb 3, 2017

Idea: we should have a periodic (daily?) e2e run that does (1) the full test suite including volumes, on (2) an HA, (3) multizone cluster (kops create cluster --zones us-east-1b,us-east-1c,us-east-1d --master-count=3 --node-count=3). I am assuming we have an e2e test for a multi-volume StatefulSet - if not, I can write one / some. It would also help a lot with the volume work (cc @jingxu97 @matchstick).

There's actually a whole grid of these I would like to try out (e.g. now that it is just another flag to activate weave / calico / kopeio-networking, we should test overlay networks, and private networks!). But I suggest this would be a great place to start.

@zmerlynn - can we make it happen?

justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 3, 2017
We have some heuristics that ensure that volumes (and hence stateful set
pods) are spread out across zones.  Sadly they forgot to account for
multiple mounts.  This PR updates the heuristic to ignore the mount name
when we see something that looks like a statefulset volume, thus
ensuring that multiple mounts end up in the same AZ.

Fix kubernetes#35695
justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 19, 2017
We have some heuristics that ensure that volumes (and hence stateful set
pods) are spread out across zones.  Sadly they forgot to account for
multiple mounts.  This PR updates the heuristic to ignore the mount name
when we see something that looks like a statefulset volume, thus
ensuring that multiple mounts end up in the same AZ.

Fix kubernetes#35695
justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 19, 2017
We create a StatefulSet with two volumes per pod.  We reuse the
zookeeper StatefulSet, but we don't currently use the second volume.
That will need to be fixed in the zookeeper image in the contrib repo.

However, this should be enough to detect issue kubernetes#35695.
k8s-github-robot pushed a commit that referenced this issue Feb 24, 2017
Automatic merge from submit-queue (batch tested with PRs 41667, 41820, 40910, 41645, 41361)

Allow multiple mounts in StatefulSet volume zone placement

We have some heuristics that ensure that volumes (and hence stateful set
pods) are spread out across zones.  Sadly they forgot to account for
multiple mounts.  This PR updates the heuristic to ignore the mount name
when we see something that looks like a statefulset volume, thus
ensuring that multiple mounts end up in the same AZ.

Fix #35695

```release-note
Fix zone placement heuristics so that multiple mounts in a StatefulSet pod are created in the same zone
```