
PetSet with multiple PVC #35695

Closed
nvartolomei opened this issue Oct 27, 2016 · 19 comments
Labels
area/stateful-apps sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@nvartolomei
Contributor

nvartolomei commented Oct 27, 2016

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.4+3b417cc", GitCommit:"3b417cc4ccd1b8f38ff9ec96bb50a81ca0ea9d56", GitTreeState:"not a git tree", BuildDate:"2016-10-21T22:34:07Z", GoVersion:"go1.7.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.3", GitCommit:"4957b090e9a4f6a68b4a40375408fdc74a212260", GitTreeState:"clean", BuildDate:"2016-10-16T06:20:04Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GCE

What happened:
Creating a PetSet with multiple PVCs allocates the volumes in different zones, and pod scheduling then fails (volume A is in zone A, volume B is in zone B, and no node is in both zones at once).

What you expected to happen:
Allocate volumes in the same zone.

How to reproduce it (as minimally and precisely as possible):

apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: monitoring-influxdb-grafana-v4
  namespace: kube-system
  labels: 
    k8s-app: influxGrafana
    version: v4
    kubernetes.io/cluster-service: "true"
spec: 
  serviceName: monitoring-influxdb
  replicas: 1
  template: 
    metadata: 
      labels: 
        k8s-app: influxGrafana
        version: v4
        kubernetes.io/cluster-service: "true"
    spec: 
      containers: 
        - image: gcr.io/google_containers/heapster_influxdb:v0.7
          name: influxdb
          resources:
            # keep request = limit to keep this container in guaranteed class
            limits:
              cpu: 100m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 500Mi
          ports: 
            - containerPort: 8083
            - containerPort: 8086
          volumeMounts:
          - name: influxdb-ps
            mountPath: /data
        - image: gcr.io/kuende-staging1/heapster_grafana:v3.1.1
          name: grafana
          resources:
            # keep request = limit to keep this container in guaranteed class
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
          env:
            # This variable is required to setup templates in Grafana.
            - name: INFLUXDB_SERVICE_URL
              value: http://monitoring-influxdb:8086
              # The following env variables are required to make Grafana accessible via
              # the kubernetes api-server proxy. On production clusters, we recommend
              # removing these env variables, setup auth for grafana, and expose the grafana
              # service using a LoadBalancer or a public IP.
            - name: GF_AUTH_BASIC_ENABLED
              value: "false"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ORG_ROLE
              value: Admin
            - name: GF_SERVER_ROOT_URL
              value: /api/v1/proxy/namespaces/kube-system/services/monitoring-grafana/
          volumeMounts:
          - name: grafana-ps
            mountPath: /var
  volumeClaimTemplates:
    - metadata:
        name: influxdb-ps
        annotations:
          volume.beta.kubernetes.io/storage-class: ssd
      spec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 10Gi
    - metadata:
        name: grafana-ps
        annotations:
          volume.beta.kubernetes.io/storage-class: standard
      spec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi

Anything else we need to know:

@nvartolomei nvartolomei changed the title PetSet with multiple volumes PetSet with multiple PVC Oct 27, 2016
@foxish foxish added sig/apps Categorizes an issue or PR as relevant to SIG Apps. area/controller-manager and removed team/cluster area/kubectl labels Oct 27, 2016
@foxish
Contributor

foxish commented Oct 27, 2016

/cc @kubernetes/sig-apps

@erictune
Member

How does your cluster have nodes in multiple zones? Are you using GKE or GCE? I saw you said GCE, but just want to clarify that you are not using GKE.

Are you using kubernetes federation, or do you have one cluster into which you added multiple zones?

@nvartolomei
Contributor Author

GCE, one cluster with multiple zones, configured by following the k8s multi-zone tutorial.


@leosunmo

I am having what seems to be the same problem using K8s in AWS.

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.5", GitCommit:"5a0a696437ad35c133c0c8493f7e9d22b0f9b81b", GitTreeState:"clean", BuildDate:"2016-10-29T01:38:40Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.5", GitCommit:"5a0a696437ad35c133c0c8493f7e9d22b0f9b81b", GitTreeState:"clean", BuildDate:"2016-10-29T01:32:42Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Creating a petset with two dynamic PVCs causes the two AWS EBS volumes to be created in different Availability Zones, which of course causes my pod to fail scheduling with the following error:

1m  1m  1   {default-scheduler }        Warning FailedScheduling    pod (couchbase-0) failed to fit in any node
fit failure on node (ip-10-106-200-200.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-210-173.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-200-127.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-220-28.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-210-53.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-220-134.ec2.internal): NoVolumeZoneConflict
fit failure on node (ip-10-106-220-145.ec2.internal): NoVolumeZoneConflict

@davidopp
Member

#27553 is relevant.

But yeah, I'm not sure the "two volumes mounted from the same pod" case is handled. I guess you want the dynamic PV provisioner to somehow know which set of PVs should be co-located in the same zone because they will be used together by some pod?

cc/ @justinsb @bprashanth

@leosunmo

leosunmo commented Nov 15, 2016

Yes, I just realised that the volumeClaimTemplates are under .spec, not .containers, so you could potentially create several containers, each using one disk. But in that case wouldn't all of those containers still be scheduled on the same node anyway, since they're in the same pod? Would they ever be scheduled across nodes or AZs? Does it ever make sense to create disks across AZs if every instance of that petset has to be scheduled on the same node anyway?

I apologise if I'm missing something, I've only worked with Kubernetes for a few months.

@Brian-McM

I also seem to be experiencing this issue. We run a cluster across multiple AZs with kops, and we have a petset with 2 volume claims; it always schedules the two PVs in different zones, so the pod can never be scheduled. The funny thing is that we do have another 2-PVC petset that doesn't have this issue, and the YAMLs seem pretty much the same.

@jsafrane
Member

By default, PVs are dynamically provisioned in random zones, so you may get lucky and get two PVs for a single pod in the same zone, in which case the pod can be scheduled. More often you get two PVs in two different zones, and a pod that uses them both is not schedulable, at least on GCE.

The question is how to fix this. We could mark PVCs created from a pet set with an annotation (say, petset name + pet index) so that the dynamic provisioner could create all PVs with the same petset name & index in the same zone; however, that would need some changes in the petset controller and in all provisioners.
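
A minimal sketch of that annotation idea, with entirely made-up annotation keys (nothing like them exists in any provisioner today); the only point is that every PVC belonging to the same pet would yield the same zone key:

```go
package main

import "fmt"

// Hypothetical annotation keys that the petset controller could stamp onto the
// PVCs it creates from volumeClaimTemplates. The key names are made up for
// this sketch.
const (
	annPetSetName = "volume.alpha.kubernetes.io/petset-name"
	annPetIndex   = "volume.alpha.kubernetes.io/petset-index"
)

// zoneKeyForClaim returns the string a provisioner would hash when choosing a
// zone: every PVC carrying the same petset name + index shares a key (and
// therefore a zone), while unannotated PVCs keep today's per-claim behaviour.
func zoneKeyForClaim(claimName string, annotations map[string]string) string {
	if name, index := annotations[annPetSetName], annotations[annPetIndex]; name != "" && index != "" {
		return name + "-" + index
	}
	return claimName
}

func main() {
	ann := map[string]string{
		annPetSetName: "monitoring-influxdb-grafana-v4",
		annPetIndex:   "0",
	}
	// Both claims of pet 0 resolve to the same key, so a provisioner keyed on
	// it would place their volumes in the same zone.
	fmt.Println(zoneKeyForClaim("influxdb-ps-monitoring-influxdb-grafana-v4-0", ann))
	fmt.Println(zoneKeyForClaim("grafana-ps-monitoring-influxdb-grafana-v4-0", ann))
}
```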

@smarterclayton
Contributor

smarterclayton commented Jan 10, 2017 via email

@jsafrane
Member

I plan to allow users to configure a zone/region in PVCs, see kubernetes/community#247; however, it's per PVC. If PVCs are created from a template (e.g. in a PetSet), all of them will either end up in the same zone (if the template was configured that way) or in a random zone (if no configuration was present).

Dynamic provisioning is already PetSet-aware; it actively tries to provision PVCs for individual pets in different zones: https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util.go#L287
The code does not account for one pet having multiple PVCs, so they may end up in different zones. My suggestion in comment #35695 (comment) would amount to a simple change in this code: it would hash PetSet.Name instead of PVC.Name, and all PVs for a pet would be in the same zone.
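
To make that concrete, here is a much-simplified stand-in for the hashing in util.go (not the real code, just its shape): the zone is picked from hash(name prefix) + pet index, and because two claims of the same pet hash different prefixes, they can land in different zones.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
	"strings"
)

// chooseZone is a toy stand-in for the round-robin heuristic: strip a trailing
// "-<index>" from the claim name, hash the remaining prefix, add the index,
// and use the result to pick a zone.
func chooseZone(zones []string, pvcName string) string {
	prefix, index := pvcName, uint32(0)
	if i := strings.LastIndex(pvcName, "-"); i != -1 {
		if n, err := strconv.Atoi(pvcName[i+1:]); err == nil {
			prefix, index = pvcName[:i+1], uint32(n)
		}
	}
	h := fnv.New32a()
	h.Write([]byte(prefix))
	return zones[(h.Sum32()+index)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c"}
	// Both claims belong to pet 0 of the same set, but the hashed prefixes
	// ("influxdb-ps-..." vs "grafana-ps-...") differ, so the zones can differ.
	fmt.Println(chooseZone(zones, "influxdb-ps-monitoring-influxdb-grafana-v4-0"))
	fmt.Println(chooseZone(zones, "grafana-ps-monitoring-influxdb-grafana-v4-0"))
}
```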

@foxish
Contributor

foxish commented Jan 19, 2017

@jsafrane, The eventual goal is to not have any PetSet (StatefulSet) specific code, so that the controller is forkable. Have we considered a way to have volumeClaimTemplates specify the spreading policy, instead of hardcoding that behavior into the PVCs themselves?

@raffaelespazzoli

@jsafrane I also suggest filtering out, in the ChooseZoneForVolume function, those zones where the petset's (or other object's) pods cannot be scheduled.
For example, I have a situation where a zone contains only unschedulable nodes, and a dynamic volume happened to be created there, which obviously resulted in an error.

@jsafrane
Member

jsafrane commented Feb 1, 2017

@foxish, this random selection of zone for a stateful set member was done by @justinsb; perhaps he should explain why it needs specific code for that. Also, choosing the zone based on AWS/GCE instances has proven to be fragile, and we should choose among schedulable Kubernetes nodes instead.

@jsafrane
Member

jsafrane commented Feb 2, 2017

@foxish, if there is no StatefulSet-specific code in the provisioning code, then the StatefulSet controller must choose the zone. In #37497 we implement the code in the provisioners that would allow it; however, the StatefulSet controller then needs to get the list of zones from somewhere, and it's possible that the list is only in the StorageClass.

@smarterclayton
Contributor

smarterclayton commented Feb 2, 2017 via email

@kow3ns
Member

kow3ns commented Feb 2, 2017

Assuming something like kubernetes/community#306, where PVCs and PVs are used to manage local storage, we need a clear way to specify that PVCs need to bind to local PVs on the same host. This will be important for users that want to run storage systems that benefit from having separate disks for their WALs and/or use multiple disks for their data (e.g. ZooKeeper, Cassandra, Kafka). Required PVC affinity seems like a clear way to specify this.

@smarterclayton
Contributor

smarterclayton commented Feb 2, 2017 via email

@justinsb
Member

justinsb commented Feb 3, 2017

As @jsafrane pointed out, the zone selection is actually not random for StatefulSets: we round-robin around the available zones so that not all the StatefulSet pets land in the same zone.

func ChooseZoneForVolume(zones sets.String, pvcName string) string {

This is very much the scheduler's department; we merged the PR knowing it had limitations that could really only be addressed by the scheduler, but that these were not imminent. So we opted against intrusive changes such as labels, because that would make life harder for the scheduler. The zone selection is a heuristic, because we don't have accurate information - we do the best we can, but we do have a bug where we consider the master's zone, and some users run their masters in separate zones from their nodes. We also have bugs where placement won't be accurate if the set of zones is actively changing. However, fixing this requires figuring out how we identify a master, which has proved difficult - we can't really use schedulability, because the entire notion of schedulability is being replaced by taints. So we would need to know the pod for which we are creating the volume, and we're firmly into scheduler territory again...

But ... this particular problem (2 volumes on the same pet) is my fault - I had simply not considered it.

However, I think the fix could be simple. Per the comment in the code:

// Heuristic to make sure that volumes in a StatefulSet are spread across zones
// StatefulSet PVCs are (currently) named ClaimName-StatefulSetName-Id,
// where Id is an integer index

i.e. we currently observe names that match the pattern "ClaimName-StatefulSetName-Id" and hash the prefix "ClaimName-StatefulSetName-", and then add the index to key into the list of zones. The fix would be to hash only by "StatefulSetName" so that Claim1-SS1-1 and Claim2-SS1-1 would both map to hash("SS1") + 1 but Claim1-SS1-2 would map to hash("SS1") + 2.
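
A rough sketch of what that amounts to, assuming the StatefulSet name and pet index have already been extracted from the claim name (the merged fix does that parsing inside ChooseZoneForVolume itself):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// chooseZoneForPet keys the zone purely off the StatefulSet name plus the pet
// index, so Claim1-SS1-1 and Claim2-SS1-1 both map to hash("SS1")+1 (the same
// zone) while Claim1-SS1-2 lands one zone over. Sketch only: the real change
// lives in ChooseZoneForVolume in pkg/volume/util.go.
func chooseZoneForPet(zones []string, statefulSetName string, petIndex uint32) string {
	h := fnv.New32a()
	h.Write([]byte(statefulSetName))
	return zones[(h.Sum32()+petIndex)%uint32(len(zones))]
}

func main() {
	zones := []string{"us-east-1b", "us-east-1c", "us-east-1d"}
	fmt.Println(chooseZoneForPet(zones, "SS1", 1)) // same zone for every claim of pet 1
	fmt.Println(chooseZoneForPet(zones, "SS1", 2)) // pet 2 rotates to the next zone
}
```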

I'll send a PR right away; I suspect this mini essay will be longer than the fix itself.

I do consider that this is tacky mustaches-and-capes magic, and should be replaced with data gravity when the scheduler can do it.

@justinsb
Member

justinsb commented Feb 3, 2017

Idea: we should have a periodic (daily?) e2e run that does (1) the full test suite including volumes, on (2) an HA, (3) multizone cluster (kops create cluster --zones us-east-1b,us-east-1c,us-east-1d --master-count=3 --node-count=3). I am assuming we have an e2e test for a multi-volume StatefulSet - if not, I can write one / some. It would also help a lot with the volume work (cc @jingxu97 @matchstick).

There's actually a whole grid of these I would like to try out (e.g. now that it is just another flag to activate weave / calico / kopeio-networking, we should test overlay networks, and private networks!). But I suggest this would be a great place to start.

@zmerlynn - can we make it happen?

justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 3, 2017
We have some heuristics that ensure that volumes (and hence stateful set
pods) are spread out across zones.  Sadly they forgot to account for
multiple mounts.  This PR updates the heuristic to ignore the mount name
when we see something that looks like a statefulset volume, thus
ensuring that multiple mounts end up in the same AZ.

Fix kubernetes#35695
justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 19, 2017
We have some heuristics that ensure that volumes (and hence stateful set
pods) are spread out across zones.  Sadly they forgot to account for
multiple mounts.  This PR updates the heuristic to ignore the mount name
when we see something that looks like a statefulset volume, thus
ensuring that multiple mounts end up in the same AZ.

Fix kubernetes#35695
justinsb added a commit to justinsb/kubernetes that referenced this issue Feb 19, 2017
We create a StatefulSet with two volumes per pod.  We reuse the
zookeeper StatefulSet, but we don't currently use the second volume.
That will need to be fixed in the zookeeper image in the contrib repo.

However, this should be enough to detect issue kubernetes#35695.
k8s-github-robot pushed a commit that referenced this issue Feb 24, 2017
Automatic merge from submit-queue (batch tested with PRs 41667, 41820, 40910, 41645, 41361)

Allow multiple mounts in StatefulSet volume zone placement

We have some heuristics that ensure that volumes (and hence stateful set
pods) are spread out across zones.  Sadly they forgot to account for
multiple mounts.  This PR updates the heuristic to ignore the mount name
when we see something that looks like a statefulset volume, thus
ensuring that multiple mounts end up in the same AZ.

Fix #35695

```release-note
Fix zone placement heuristics so that multiple mounts in a StatefulSet pod are created in the same zone
```