Error syncing deployment, replica set already exists #29735

Closed
gigaroby opened this issue Jul 28, 2016 · 42 comments · Fixed by #36584

@gigaroby

gigaroby commented Jul 28, 2016

Env: Kubernetes 1.3.3 on AWS

I ran into the same problem described in #26673, but now it is happening on every deploy.
I would have reopened that issue, but I was not the author so I could not.

The bug shows up because our build runs kubectl rollout status deployment/application-unstable right after kubectl apply -f <new deployment manifest>, and it gets stuck forever because the rollout is never reported as finished (it does finish eventually, though).

Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.615754   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.557964   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.509632   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.456868   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.409374   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.357072   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.306676   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.257030   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.206504   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.157741   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.112898   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.059652   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:27 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:27.007476   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.956218   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.910275   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.857240   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.806429   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.756885   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.706639   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.656417   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.607126   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.557562   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.506025   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.456588   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.422213   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.356907   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.313353   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.257953   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.209894   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.160954   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.107205   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.056387   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
Jul 26 12:42:26 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:26.006022   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-worker-unstable: error creating replica set application-celery-worker-unstable: replicasets "application-celery-worker-unstable-3295993310" already exists
Jul 26 12:42:25 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:25.956830   17798 deployment_controller.go:400] Error syncing deployment default/application-unstable: error creating replica set application-unstable: replicasets "application-unstable-900752437" already exists
Jul 26 12:42:25 ip-x-x-x-x kube-controller-manager[17798]: E0726 12:42:25.906450   17798 deployment_controller.go:400] Error syncing deployment default/application-celery-unstable: error creating replica set application-celery-unstable: replicasets "application-celery-unstable-1891916216" already exists
@adohe-zz

@gigaroby could you provide more details, e.g. your resource file and how to reproduce it?

@gigaroby
Author

@adohe yes, of course:
We have an internal tool that downloads some manifests from S3 into a temporary folder and then calls kubectl apply -f temp_dir. It's hard to narrow it down further because that's all there is to it.
It caused the issue above for 3/4 builds in a row, but the last build worked.
The three deployment files are fairly standard deployment templates; this is an example of one of them:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: application-unstable
  labels:
    project: application-unstable
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: application-revenge
        environment: unstable
        tier: frontend
    spec:
      containers:
        - name: uwsgi
          image: my-index.domain.com/application-unstable:{{ version }}
          env: []
          command:
            - "uwsgi"
          args:
            - "--ini"
            - "/etc/uwsgi.ini"
            - "--processes"
            - "1"
            - "--threads"
            - "3"

        - name: openresty
          image: spaziodati/openresty:latest
          ports:
            - containerPort: 8080
              name: http
              protocol: "TCP"
          command:
            - nginx
          args:
            - -c
            - /etc/conf/nginx.conf
          volumeMounts:
            - name: config-volume
              mountPath: /etc/conf
          livenessProbe:
            httpGet:
              path: /status
              port: 8080
            initialDelaySeconds: 10
            timeoutSeconds: 3

      volumes:
        - name: config-volume
          configMap:
            name: application-openresty-conf

The {{ version }} is filled in by Jenkins at build time before the manifest is uploaded to S3.

I am not sure how to narrow this down further; any suggestions?

@0xmichalis
Contributor

@kubernetes/deployment can somebody triage this?

@arohner

arohner commented Oct 12, 2016

We just started seeing this as well. kube 1.3.6 on AWS.

The cluster was running fine for two weeks, then the problem started on the last CI build yesterday, and shows no signs of fixing itself. Our deploy process is a series of kubectl apply -f commands called from a bash script during CI.

@0xmichalis
Contributor

#28684 should fix the hotloop by using a rate limiter in the deployment controller. It seems to have been included in 1.3.7. Also, "already exists" shouldn't be treated as an error; instead we should return the replica set from the cache at that point (since it already exists). I was handling this error in a recent PR of mine; I will update this issue with the number.
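(As a rough sketch of that second point, with toy types standing in for the API client and the replica set cache, and not the actual controller code, the handling could look like this.)

package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the API server's AlreadyExists error.
var errAlreadyExists = errors.New("replica set already exists")

// cache is a toy stand-in for the replica set store.
type cache map[string]string

func (c cache) create(name, spec string) error {
	if _, ok := c[name]; ok {
		return errAlreadyExists
	}
	c[name] = spec
	return nil
}

// getNewReplicaSet sketches the handling described above: if the create call
// fails because the replica set already exists, return the existing object
// instead of surfacing an error and re-queuing the deployment forever.
func getNewReplicaSet(c cache, name, spec string) (string, error) {
	if err := c.create(name, spec); err != nil {
		if errors.Is(err, errAlreadyExists) {
			return c[name], nil // adopt the existing replica set
		}
		return "", err
	}
	return c[name], nil
}

func main() {
	c := cache{"application-unstable-900752437": "existing replica set"}
	rs, err := getNewReplicaSet(c, "application-unstable-900752437", "new replica set")
	fmt.Println(rs, err) // prints the existing replica set and no error
}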

@jsravn
Contributor

jsravn commented Oct 13, 2016

@Kargakis Are you sure the fix is in 1.3.7? I just hit this problem in a build of 1.3.7, and it kills the master node by pegging a CPU at 100% in this error loop.

edit: it doesn't seem to be on the release-1.3 branch, but is in release-1.4

@0xmichalis
Contributor

Hm, git fooled me. The fix is included in 1.4.

@jsravn
Contributor

jsravn commented Oct 17, 2016

From the code, it looks like the deployment API generates an Adler-32 checksum of the pod template and then uses that to name the new replicaset, deploymentname-<hash>. If a replicaset with that hash already exists, it barfs with the controller-manager loop above. So from a quick read it looks like a hash collision handled poorly. This mirrors what I'm seeing, since it only happens on deployments with a history of >100 deploys. I'm guessing there isn't enough change in the pod spec to get a decent amount of uniqueness out of Adler-32 - in my case only the version changes. It should probably use a better hash function (one that doesn't collide after such a small number of deployments), and/or handle the collision more gracefully.
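(For illustration, here is a minimal, hypothetical Go sketch of the naming scheme described above - hash a serialized pod template with Adler-32 and append the checksum to the deployment name. The helper and the string serialization are made up for this example; the real controller hashes the PodTemplateSpec object itself.)

package main

import (
	"fmt"
	"hash/adler32"
)

// replicaSetName illustrates the "deploymentname-<hash>" scheme: the pod
// template is hashed and the checksum becomes the name suffix.
// Hypothetical helper; the real controller hashes the PodTemplateSpec struct.
func replicaSetName(deploymentName string, serializedPodTemplate []byte) string {
	return fmt.Sprintf("%s-%d", deploymentName, adler32.Checksum(serializedPodTemplate))
}

func main() {
	tmpl := []byte("image: registry/test/cats:v0.535.0")
	fmt.Println(replicaSetName("cats", tmpl)) // e.g. cats-<checksum>
}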

@0xmichalis 0xmichalis self-assigned this Oct 17, 2016
@0xmichalis
Contributor

@jsravn I don't think you would get a hash collision even with >100 deploys (not sure where Adler starts to fail though). Even the slightest change in the pod template changes the hash. Can you provide the pod template of the existing replica set that blocks the deployment controller and also the latest state of your deployment?

I think this issue is simply us not handling AlreadyExists errors when we try to create new replica sets (the replica set cache may get stale).

@jsravn
Contributor

jsravn commented Oct 18, 2016

@Kargakis sure... I admit I may be wrong - that was just my super quick analysis. :) Although adler is pretty terrible as a hash function, and I was a little surprised to see it used - which is why I leapt to that conclusion. I need to try reproducing this to see if it's actually the case - which I'll try now. Here's the replicaset that it complains already exists:

apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  annotations:
    deployment.kubernetes.io/revision: "25"
  creationTimestamp: 2016-09-22T16:00:52Z
  generation: 3
  labels:
    app: cats
    pod-template-hash: "218308181"
  name: cats-218308181
  namespace: cats-stubbed-functional
  resourceVersion: "3958038"
  selfLink: /apis/extensions/v1beta1/namespaces/cats-stubbed-functional/replicasets/cats-218308181
  uid: bcd1bacb-80dd-11e6-bfcc-0a8cd5340591
spec:
  replicas: 0
  selector:
    matchLabels:
      app: cats
      pod-template-hash: "218308181"
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cats
        pod-template-hash: "218308181"
    spec:
      containers:
      - env:
        - name: DEPLOYMENT_ENVIRONMENT
          value: cats-stubbed-functional
        - name: APP_NAME
          value: cats
        image: registry**obfuscated**/test/cats:v0.535.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /private/status
            port: 9077
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cats
        ports:
        - containerPort: 9077
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /private/status
            port: 9077
            scheme: HTTP
          initialDelaySeconds: 1
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  observedGeneration: 3
  replicas: 0

And here is the current failing deployment (although I believe it started failing at generation 204, version 0.612.0).

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "100"
    kubectl.kubernetes.io/last-applied-configuration: '{"kind":"Deployment","apiVersion":"extensions/v1beta1","metadata":{"name":"cats","creationTimestamp":null,"labels":{"app":"cats"}},"spec":{"replicas":1,"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"cats"}},"spec":{"containers":[{"name":"cats","image":"registry**obfuscated**/test/cats:v0.617.0","ports":[{"name":"http","containerPort":9077,"protocol":"TCP"}],"env":[{"name":"DEPLOYMENT_ENVIRONMENT","value":"cats-stubbed-functional"},{"name":"APP_NAME","value":"cats"}],"resources":{"limits":{"cpu":"1","memory":"1Gi"},"requests":{"cpu":"1","memory":"1Gi"}},"livenessProbe":{"httpGet":{"path":"/private/status","port":9077},"initialDelaySeconds":30,"timeoutSeconds":1},"readinessProbe":{"httpGet":{"path":"/private/status","port":9077},"initialDelaySeconds":1,"timeoutSeconds":1}}]}},"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}},"status":{}}'
  creationTimestamp: 2016-09-11T09:39:41Z
  generation: 209
  labels:
    app: cats
  name: cats
  namespace: cats-stubbed-functional
  resourceVersion: "12695928"
  selfLink: /apis/extensions/v1beta1/namespaces/cats-stubbed-functional/deployments/cats
  uid: a9ceb758-7803-11e6-9e34-0a8cd5340591
spec:
  replicas: 1
  rollbackTo: {}
  selector:
    matchLabels:
      app: cats
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: cats
    spec:
      containers:
      - env:
        - name: DEPLOYMENT_ENVIRONMENT
          value: cats-stubbed-functional
        - name: APP_NAME
          value: cats
        image: registry**obfuscated**/test/cats:v0.617.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /private/status
            port: 9077
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: cats
        ports:
        - containerPort: 9077
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /private/status
            port: 9077
            scheme: HTTP
          initialDelaySeconds: 1
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  observedGeneration: 203
  replicas: 1
  updatedReplicas: 1

@jsravn
Contributor

jsravn commented Oct 18, 2016

Hash collisions are definitely a problem. I wrote up a test and can get a collision within a couple hundred versions (only the version string changes). Here's pod_test.go:

package pod

import (
    "testing"
    "strings"
    "strconv"
    "encoding/json"
    "fmt"
    "k8s.io/kubernetes/pkg/api"
)

var podSpec string = `
{
    "metadata": {
        "creationTimestamp": null,
        "labels": {
            "app": "cats"
        }
    },
    "spec": {
        "containers": [
            {
                "name": "cats",
                "image": "registry/test/cats:v0.@@VERSION@@.0",
                "ports": [
                    {
                        "name": "http",
                        "containerPort": 9077,
                        "protocol": "TCP"
                    }
                ],
                "env": [
                    {
                        "name": "DEPLOYMENT_ENVIRONMENT",
                        "value": "cats-stubbed-functional"
                    },
                    {
                        "name": "APP_NAME",
                        "value": "cats"
                    }
                ],
                "resources": {
                    "limits": {
                        "cpu": "1",
                        "memory": "1Gi"
                    },
                    "requests": {
                        "cpu": "1",
                        "memory": "1Gi"
                    }
                },
                "livenessProbe": {
                    "httpGet": {
                        "path": "/private/status",
                        "port": 9077,
                        "scheme": "HTTP"
                    },
                    "initialDelaySeconds": 30,
                    "timeoutSeconds": 1,
                    "periodSeconds": 10,
                    "successThreshold": 1,
                    "failureThreshold": 3
                },
                "readinessProbe": {
                    "httpGet": {
                        "path": "/private/status",
                        "port": 9077,
                        "scheme": "HTTP"
                    },
                    "initialDelaySeconds": 1,
                    "timeoutSeconds": 1,
                    "periodSeconds": 10,
                    "successThreshold": 1,
                    "failureThreshold": 3
                },
                "terminationMessagePath": "/dev/termination-log",
                "imagePullPolicy": "IfNotPresent"
            }
        ],
        "restartPolicy": "Always",
        "terminationGracePeriodSeconds": 30,
        "dnsPolicy": "ClusterFirst",
        "securityContext": {}
    }
}
`

func TestPodTemplateSpecHash(t *testing.T) {
    seenHashes := make(map[uint32]int)

    for i := 0; i < 1000; i++ {
        specJson := strings.Replace(podSpec, "@@VERSION@@", strconv.Itoa(i), 1)
        spec := api.PodTemplateSpec{}
        json.Unmarshal([]byte(specJson), &spec)
        hash := GetPodTemplateSpecHash(spec)
        fmt.Printf("version: %d hash: %d\n", i, hash)
        if v, ok := seenHashes[hash]; ok {
            t.Fatalf("Hash collision, old: %d new: %d", v, i)
        }
        seenHashes[hash] = i
    }
}
/usr/local/go/bin/go test -v k8s.io/kubernetes/pkg/util/pod -run ^TestPodTemplateSpecHash$
<snip>
version: 119 hash: 2663716803
version: 120 hash: 1102452667
version: 121 hash: 1297618876
version: 122 hash: 1492785085
version: 123 hash: 1687951294
version: 124 hash: 1883117503
version: 125 hash: 2078283712
version: 126 hash: 2273449921
version: 127 hash: 2468616130
version: 128 hash: 2663782339
version: 129 hash: 2858948548
version: 130 hash: 1297684412
version: 131 hash: 1492850621
version: 132 hash: 1688016830
version: 133 hash: 1883183039
version: 134 hash: 2078349248
version: 135 hash: 2273515457
version: 136 hash: 2468681666
version: 137 hash: 2663847875
version: 138 hash: 2859014084
version: 139 hash: 3054180293
version: 140 hash: 1492916157
version: 141 hash: 1688082366
version: 142 hash: 1883248575
version: 143 hash: 2078414784
version: 144 hash: 2273580993
version: 145 hash: 2468747202
version: 146 hash: 2663913411
version: 147 hash: 2859079620
version: 148 hash: 3054245829
version: 149 hash: 3249412038
version: 150 hash: 1688147902
version: 151 hash: 1883314111
version: 152 hash: 2078480320
version: 153 hash: 2273646529
version: 154 hash: 2468812738
version: 155 hash: 2663978947
version: 156 hash: 2859145156
version: 157 hash: 3054311365
version: 158 hash: 3249477574
version: 159 hash: 3444643783
version: 160 hash: 1883379647
version: 161 hash: 2078545856
version: 162 hash: 2273712065
version: 163 hash: 2468878274
version: 164 hash: 2664044483
version: 165 hash: 2859210692
version: 166 hash: 3054376901
version: 167 hash: 3249543110
version: 168 hash: 3444709319
version: 169 hash: 3639875528
version: 170 hash: 2078611392
version: 171 hash: 2273777601
version: 172 hash: 2468943810
version: 173 hash: 2664110019
version: 174 hash: 2859276228
version: 175 hash: 3054442437
version: 176 hash: 3249608646
version: 177 hash: 3444774855
version: 178 hash: 3639941064
version: 179 hash: 3835107273
version: 180 hash: 2273843137
version: 181 hash: 2469009346
version: 182 hash: 2664175555
version: 183 hash: 2859341764
version: 184 hash: 3054507973
version: 185 hash: 3249674182
version: 186 hash: 3444840391
version: 187 hash: 3640006600
version: 188 hash: 3835172809
version: 189 hash: 4030339018
version: 190 hash: 2469074882
version: 191 hash: 2664241091
version: 192 hash: 2859407300
version: 193 hash: 3054573509
version: 194 hash: 3249739718
version: 195 hash: 3444905927
version: 196 hash: 3640072136
version: 197 hash: 3835238345
version: 198 hash: 4030404554
version: 199 hash: 4225570763
version: 200 hash: 907286458
version: 201 hash: 1102452667
    pod_test.go:98: Hash collision, old: 120 new: 201
exit status 1
FAIL    k8s.io/kubernetes/pkg/util/pod  0.043s

I'm guessing we can't improve the hash function at this point, to maintain backwards compatibility? We should at least handle the collision somehow - what do you suggest?

@jsravn
Contributor

jsravn commented Oct 18, 2016

Further, in the example I gave above, the hash output only has about 200 unique values, after which all newer versions collide with previous hashed values.

<snip>
    pod_test.go:98: Hash collision, old: 120 new: 201
    pod_test.go:98: Hash collision, old: 121 new: 202
    pod_test.go:98: Hash collision, old: 122 new: 203
    pod_test.go:98: Hash collision, old: 123 new: 204
    pod_test.go:98: Hash collision, old: 124 new: 205
    pod_test.go:98: Hash collision, old: 125 new: 206
    pod_test.go:98: Hash collision, old: 126 new: 207
    pod_test.go:98: Hash collision, old: 127 new: 208
    pod_test.go:98: Hash collision, old: 128 new: 209
    pod_test.go:98: Hash collision, old: 130 new: 211
    pod_test.go:98: Hash collision, old: 131 new: 212
    pod_test.go:98: Hash collision, old: 132 new: 213
    pod_test.go:98: Hash collision, old: 133 new: 214
    pod_test.go:98: Hash collision, old: 134 new: 215
    pod_test.go:98: Hash collision, old: 135 new: 216
    pod_test.go:98: Hash collision, old: 136 new: 217
    pod_test.go:98: Hash collision, old: 137 new: 218
    pod_test.go:98: Hash collision, old: 138 new: 219
    pod_test.go:98: Hash collision, old: 140 new: 221
    pod_test.go:98: Hash collision, old: 141 new: 222
    pod_test.go:98: Hash collision, old: 142 new: 223
    pod_test.go:98: Hash collision, old: 143 new: 224
    pod_test.go:98: Hash collision, old: 144 new: 225
    pod_test.go:98: Hash collision, old: 145 new: 226
    pod_test.go:98: Hash collision, old: 146 new: 227
    pod_test.go:98: Hash collision, old: 147 new: 228
    pod_test.go:98: Hash collision, old: 148 new: 229
    pod_test.go:98: Hash collision, old: 150 new: 231
    pod_test.go:98: Hash collision, old: 151 new: 232
    pod_test.go:98: Hash collision, old: 152 new: 233
    pod_test.go:98: Hash collision, old: 153 new: 234
    pod_test.go:98: Hash collision, old: 154 new: 235
    pod_test.go:98: Hash collision, old: 155 new: 236
    pod_test.go:98: Hash collision, old: 156 new: 237
    pod_test.go:98: Hash collision, old: 157 new: 238
    pod_test.go:98: Hash collision, old: 158 new: 239
    pod_test.go:98: Hash collision, old: 160 new: 241
    pod_test.go:98: Hash collision, old: 161 new: 242
    pod_test.go:98: Hash collision, old: 162 new: 243
    pod_test.go:98: Hash collision, old: 163 new: 244
    pod_test.go:98: Hash collision, old: 164 new: 245
    pod_test.go:98: Hash collision, old: 165 new: 246
    pod_test.go:98: Hash collision, old: 166 new: 247
    pod_test.go:98: Hash collision, old: 167 new: 248
    pod_test.go:98: Hash collision, old: 168 new: 249
    pod_test.go:98: Hash collision, old: 170 new: 251
    pod_test.go:98: Hash collision, old: 171 new: 252
    pod_test.go:98: Hash collision, old: 172 new: 253
    pod_test.go:98: Hash collision, old: 173 new: 254
    pod_test.go:98: Hash collision, old: 174 new: 255
    pod_test.go:98: Hash collision, old: 175 new: 256
<snip>

@jsravn
Contributor

jsravn commented Oct 18, 2016

I suppose the immediate workaround for users is to limit deployment revision history to small values (http://kubernetes.io/docs/user-guide/deployments/#revision-history-limit).

@0xmichalis
Contributor

@jsravn thanks for the test! I ran it as well and see the same results. @smarterclayton @bgrant0607 can you have a look? An immediate workaround is to use revisionHistoryLimit, yes. In reality most users won't care about keeping a history of 200 replica sets, but it is concerning that Adler breaks down that fast.

@bgrant0607
Member

cc @janetkuo @pwittrock

@bgrant0607 bgrant0607 added the sig/apps label and removed the team/cluster label Oct 18, 2016
@smarterclayton
Contributor

Adler was only used because we were originally overly concerned about hash performance. We need a reasonably (but not excessively) fast hash whose collision chance is far lower than 1 in 1e5, and preferably lower than 1 in 1e10.

@smarterclayton
Contributor

fnv would be an easy change
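(For reference, a minimal sketch of what that swap amounts to, using Go's standard hash/fnv package over a serialized pod template. Illustrative only; the helper name and input are made up, and the real controller hashes the PodTemplateSpec object, not a raw string.)

package main

import (
	"fmt"
	"hash/fnv"
)

// fnvPodTemplateHash shows the FNV-1a alternative to Adler-32: same 32-bit
// output, but far better distribution for small input changes.
// Hypothetical helper; the input stands in for the serialized pod template.
func fnvPodTemplateHash(serializedPodTemplate []byte) uint32 {
	h := fnv.New32a()
	h.Write(serializedPodTemplate)
	return h.Sum32()
}

func main() {
	fmt.Println(fnvPodTemplateHash([]byte("image: registry/test/cats:v0.617.0")))
	fmt.Println(fnvPodTemplateHash([]byte("image: registry/test/cats:v0.618.0")))
}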

@janetkuo janetkuo added the kind/bug label Oct 19, 2016
@saad-ali saad-ali modified the milestones: v1.6, 1.6 Nov 12, 2016
@0xmichalis
Contributor

Tested fnv with the unit test provided by @jsravn above and it didn't break for 10^3 up to 10^5 iterations (10^6 was timing out the test and I didn't bother testing further).

@0xmichalis
Contributor

Results from benchmarking

BenchmarkAdler-4   	   30000	     56445 ns/op
BenchmarkFnv-4     	   30000	     58152 ns/op
BenchmarkAdler-4   	  500000	     54725 ns/op
BenchmarkFnv-4     	  500000	     59724 ns/op

Fnv is a bit slower but much more stable.
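(A hypothetical sketch of how two such benchmarks could be set up with Go's testing package. It hashes a raw byte slice for simplicity, whereas the numbers above come from hashing full pod template objects, so absolute ns/op will differ.)

package hashbench

import (
	"hash/adler32"
	"hash/fnv"
	"testing"
)

// sample stands in for a serialized pod template.
var sample = []byte(`{"spec":{"containers":[{"name":"cats","image":"registry/test/cats:v0.617.0"}]}}`)

func BenchmarkAdler(b *testing.B) {
	for i := 0; i < b.N; i++ {
		adler32.Checksum(sample)
	}
}

func BenchmarkFnv(b *testing.B) {
	for i := 0; i < b.N; i++ {
		h := fnv.New32a()
		h.Write(sample)
		_ = h.Sum32()
	}
}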

@0xmichalis
Contributor

@smarterclayton so my understanding is that the migration of old replica sets will happen during the normal operation of the cluster, is that correct? We can migrate all replica sets with zero replicas in the background, and migrate running sets once a new rollout happens. We may not even need a separate queue. I am still not sure if we can notify admins somehow when all existing deployments have migrated.

@smarterclayton
Contributor

We could expose a metric that is "how many old replication controllers we still have"

@0xmichalis
Contributor

For 1.6, we moved Deployments into the apps API group[1] so we can change the defaults (revisionHistoryLimit is now set to 3 by default for newly created deployments), and we also made the cleanup policy run independently of a rollout (so a rollout that gets stuck won't block deletion of older replica sets)[2]. I've also opened an update to the Deployment proposal for moving away from hashing: kubernetes/community#384

Moving the milestone to 1.7

[1] #39683
[2] #40081

@0xmichalis 0xmichalis added the kind/api-change label Mar 4, 2017
@thockin
Member

thockin commented Mar 21, 2017

Changing revision history doesn't seem to alleviate hash collisions, right? Looking at #43449 - are we using a poor hash for this?

@0xmichalis
Contributor

Changing revision history doesn't seem to alleviate hash collisions, right?

Actually, it does. If all you have is 3 old replica sets, it's less likely that the controller will break you because of a hash collision. 200 seems to be the limit with adler.

Looking at #43449 - are we using a poor hash for this?

Yes. There is #38714 that changes us to fnv, but there is also kubernetes/community#384 that moves us away from hashing. I am going to open an alternative proposal to 384 that covers the transition to fnv, and then we need to decide which way to proceed.

@0xmichalis
Contributor

Opened kubernetes/community#477 as an alternative to kubernetes/community#384

k8s-github-robot pushed a commit that referenced this issue May 25, 2017
Automatic merge from submit-queue

Switch Deployments to new hashing algo w/ collision avoidance mechanism

Implements kubernetes/community#477

@kubernetes/sig-apps-api-reviews @kubernetes/sig-apps-pr-reviews 

Fixes #29735
Fixes #43948

```release-note
Deployments are updated to use (1) a more stable hashing algorithm (fnv) than the previous one (adler) and (2) a hashing collision avoidance mechanism that will ensure new rollouts will not block on hashing collisions anymore.
```
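(Roughly, the collision avoidance mechanism folds a per-deployment collision counter into the hash input, so that when a computed name collides with a different pod template the controller can bump the counter and retry with a new name instead of looping. The sketch below uses hypothetical names and encoding and is not the controller's exact implementation.)

package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// hashWithCollisionCount sketches the idea from the release note: hash the
// pod template plus a per-deployment collision counter, so a detected name
// collision can be resolved by incrementing the counter and retrying.
func hashWithCollisionCount(podTemplate []byte, collisionCount uint32) uint32 {
	h := fnv.New32a()
	h.Write(podTemplate)
	if collisionCount > 0 {
		buf := make([]byte, 4)
		binary.LittleEndian.PutUint32(buf, collisionCount)
		h.Write(buf)
	}
	return h.Sum32()
}

func main() {
	tmpl := []byte("image: registry/test/cats:v0.617.0")
	fmt.Println(hashWithCollisionCount(tmpl, 0))
	fmt.Println(hashWithCollisionCount(tmpl, 1)) // new name after a detected collision
}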
@brugz

brugz commented Jun 8, 2017

We saw this in a related configuration. Setting --record=true on the deployment [1] and setting the revision history limit [2] seemed to alleviate the stress. Is there a status on the hash collision bug?

[1] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[2] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#revision-history-limit

Thanks,
Brugz

@janetkuo
Member

janetkuo commented Jun 8, 2017

@brugz the fix will be included in the 1.7 release

perotinus pushed a commit to kubernetes-retired/cluster-registry that referenced this issue Sep 2, 2017