Rollback etcd server version to 3.1.11 due to #60589 #60891

shyamjvs · 2018-03-07T17:46:11Z

The dependencies were a bit complex (many things rely on the etcd version), and the version had since been updated to 3.2.16 on top of the original bump.
So I mostly had to revert the changes manually, case by case, which means errors are likely :)

/cc @wojtek-t @jpbetz

Downgrade default etcd server version to 3.1.11 due to #60589

(I'm not sure if we should instead remove release-notes of the original PRs)

wojtek-t · 2018-03-07T17:48:22Z

/approve no-issue

@jpbetz @smarterclayton @kubernetes/sig-api-machinery-bugs

jpbetz · 2018-03-07T17:59:14Z

I agree we should revert to etcd server 3.1 given that we don't understand why the performance regression occurred.
/lgtm

wojtek-t · 2018-03-07T18:40:53Z

@jdumars - I'm approving this PR for milestone to fix significant performance regression.
I hope you will be fine with that.

wojtek-t · 2018-03-07T18:41:51Z

@timothysc - why "do-not-merge" ?

timothysc · 2018-03-07T18:41:57Z

@wojtek-t @jpbetz - The failure conditions that are fixed in 3.2 are of far greater importance imo.

timothysc · 2018-03-07T18:45:23Z

There are a series of fixes in 3.2 that fix catastrophic failure conditions that we can't negate imo.

/cc @hongchaodeng @xiang90

timothysc · 2018-03-07T18:55:02Z

/cc @kubernetes/sig-cluster-lifecycle-bugs - PSA that this affects scale deployments but at the cost of other known fixes.

k8s-github-robot · 2018-03-07T18:56:07Z

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@jpbetz @shyamjvs

Pull Request Labels

sig/api-machinery sig/cluster-lifecycle sig/scalability: Pull Request will be escalated to these SIGs if needed.
priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
kind/bug: Fixes a bug discovered during the current release.

wojtek-t · 2018-03-08T11:44:18Z

cluster/gce/manifests/etcd.manifest

@@ -22,14 +22,14 @@
    "command": [
              "/bin/sh",
              "-c",
-              "if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://127.0.0.1:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1"
+              "if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://{{ hostname }}:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1"


Let's not revert this line - this was net improvement and we will need it again in the future.

> we will need it again in the future

To confirm - do we know if this advertise-client-urls change works with 3.1.11 currently?

Note that people generally do not upgrade etcd together with the Kubernetes version. Because the 1.9 manifest doesn't work with 3.2.16, the order has to be: "upgrade to 1.10 and only then upgrade etcd". So it has to work.
[And yes - it works]

That's interesting to know, thanks. Fixed it.
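
For illustration only, here is a minimal sketch of how the two manifest lines in the diff above differ once rendered. The only flag that changes is `--advertise-client-urls`. The `render` helper and the `hostname` value are hypothetical, and the real manifest is rendered by the cluster setup scripts rather than by this code; 2379 is the default etcd client port.

```python
import re


def render(template: str, values: dict) -> str:
    """Replace each `{{ name }}` placeholder with values[name] (hypothetical helper)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], template)


# Flag fragments copied from the two sides of the diff.
KEPT = (
    "--advertise-client-urls http://127.0.0.1:{{ port }} "
    "--listen-client-urls http://127.0.0.1:{{ port }}"
)
REVERTED = (
    "--advertise-client-urls http://{{ hostname }}:{{ port }} "
    "--listen-client-urls http://127.0.0.1:{{ port }}"
)

# Assumed example values, not taken from any real cluster.
values = {"hostname": "master-0", "port": "2379"}

kept_flags = render(KEPT, values)
reverted_flags = render(REVERTED, values)

print(kept_flags)
print(reverted_flags)
```

With the line kept as suggested, both the advertised and the listen URL point at 127.0.0.1, so the advertised address is one the server actually listens on.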

shyamjvs · 2018-03-08T13:05:37Z

The gce-large-performance pre-submit failed even before the build. It seems the docker daemon was unavailable:

I0308 11:44:33.287] make: Entering directory '/go/src/k8s.io/kubernetes'
I0308 11:44:33.288] +++ [0308 11:44:33] Verifying Prerequisites....
W0308 11:44:34.451] Can't connect to 'docker' daemon.  please fix and retry.
W0308 11:44:34.451] 
W0308 11:44:34.451] Possible causes:
W0308 11:44:34.452]   - Docker Daemon not started
W0308 11:44:34.452]     - Linux: confirm via your init system
W0308 11:44:34.452]     - macOS w/ docker-machine: run `docker-machine ls` and `docker-machine start <name>`
W0308 11:44:34.452]     - macOS w/ Docker for Mac: Check the menu bar and start the Docker application
W0308 11:44:34.453]   - DOCKER_HOST hasn't been set or is set incorrectly
W0308 11:44:34.453]     - Linux: domain socket is used, DOCKER_* should be unset. In Bash run `unset ${!DOCKER_*}`
W0308 11:44:34.453]     - macOS w/ docker-machine: run `eval "$(docker-machine env <name>)"`
W0308 11:44:34.453]     - macOS w/ Docker for Mac: domain socket is used, DOCKER_* should be unset. In Bash run `unset ${!DOCKER_*}`
W0308 11:44:34.454]   - Other things to check:
W0308 11:44:34.454]     - Linux: User isn't in 'docker' group.  Add and relogin.
W0308 11:44:34.454]       - Something like 'sudo usermod -a -G docker ${USER}'
W0308 11:44:34.454]       - RHEL7 bug and workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1119282#c8

cc @krzyzacy @BenTheElder - Any idea what's going wrong? I defined the pull-kubernetes-e2e-gce-100-performance job similar to pull-kubernetes-e2e-gce. Ref: kubernetes/test-infra#7168

shyamjvs · 2018-03-08T13:07:30Z

/test pull-kubernetes-e2e-gce-large-performance

shyamjvs · 2018-03-08T13:28:54Z

/test pull-kubernetes-e2e-gce-large-performance
retrying after above fix

jdumars · 2018-03-08T16:00:24Z

Thank you all for untangling this thorny issue. Please keep me posted by mentioning @jdumars as things progress. This is a great example of the strength of our community.

shyamjvs · 2018-03-08T16:33:03Z

So I tested this PR manually against a 2k-node cluster and things seem fine:

wrt PUT pod-status latency:

{
  "data": {
    "Perc50": 1.485,
    "Perc90": 7.894,
    "Perc99": 43.414
  },
  "unit": "ms",
  "labels": {
    "Count": "450498",
    "Resource": "pods",
    "Scope": "namespace",
    "Subresource": "status",
    "Verb": "PUT"
  }
}

and pod-startup latency:

INFO: perc50: 2.409912012s, perc90: 3.085908828s, perc99: 3.714040988s
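
As a rough sketch of how numbers like these could be checked mechanically, the snippet below parses the metric entry and the pod-startup line quoted above and compares the 99th percentiles against thresholds. The thresholds (1 s for API calls, 5 s for pod startup) are assumptions made for illustration, not values stated in this thread, and `parse_startup` is a hypothetical helper.

```python
import json
import re

# Assumed thresholds for illustration; they are not taken from this thread.
API_P99_LIMIT_MS = 1000.0
POD_STARTUP_P99_LIMIT_S = 5.0

# Metric entry as reported above.
metric = json.loads("""
{
  "data": {"Perc50": 1.485, "Perc90": 7.894, "Perc99": 43.414},
  "unit": "ms",
  "labels": {
    "Count": "450498",
    "Resource": "pods",
    "Scope": "namespace",
    "Subresource": "status",
    "Verb": "PUT"
  }
}
""")

startup_line = "INFO: perc50: 2.409912012s, perc90: 3.085908828s, perc99: 3.714040988s"


def parse_startup(line: str) -> dict:
    """Extract percentile values (in seconds) from a pod-startup log line."""
    return {name: float(value) for name, value in re.findall(r"(perc\d+): ([\d.]+)s", line)}


startup = parse_startup(startup_line)

# Only compare the API latency if it is reported in milliseconds.
api_ok = metric["unit"] == "ms" and metric["data"]["Perc99"] <= API_P99_LIMIT_MS
startup_ok = startup["perc99"] <= POD_STARTUP_P99_LIMIT_S

print(f"PUT pods/status p99 within assumed limit: {api_ok}")
print(f"pod startup p99 within assumed limit: {startup_ok}")
```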

jpbetz · 2018-03-08T17:53:52Z

cluster/addons/etcd-empty-dir-cleanup/etcd-empty-dir-cleanup.yaml

@@ -24,4 +24,4 @@ spec:
  dnsPolicy: Default
  containers:
  - name: etcd-empty-dir-cleanup
-    image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.10.0
+    image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.11.0


I don't see a 3.1.11.0 in gcr.io. The etcd version of etcd-empty-dir-cleanup is only for the version of etcdctl copied into the container image which hasn't changed between 3.1.10 and 3.1.11.

I see. In that case, do you want me to:

keep the tag at 3.1.10.0, or

change it to 3.1.11.0 (to sync with etcd version used in the makefile) and push a new image?

Sigh. To be consistent, let's publish 3.1.11.0..

Or we could just tag the 3.1.10.0 image with 3.1.11.0 as well 😁

I'd prefer the former, to avoid messing up the tag wrt the underlying etcd version actually used to build the image.

That said, I failed with the following error while running make push:

The push refers to a repository [staging-k8s.gcr.io/etcd-empty-dir-cleanup]
41ad6ade37a0: Pushing [==================================================>] 14.32MB/14.32MB
8b8608fe70b0: Pushing [==================================================>] 14.32MB/14.32MB
c5183829c43c: Layer already exists
read tcp [2a00:79e0:2:11:cdf:6d7d:727b:913e]:37568->[2a00:1450:400c:c06::52]:443: use of closed network connection

@krzyzacy @BenTheElder Is this somehow related to that docker auth issue? Leads on how to fix it?

I tried it. Getting:

ERROR: (gcloud.beta.auth.configure-docker) Error writing Docker configuration to disk: [Errno 2] No such file or directory: '/usr/local/google/home/shyamjvs/.docker/tmptYLal3'

Ok.. I think I fixed it by creating that dir. Let me try pushing now.

Still fails, but with:

unexpected EOF

I think that's the gcloud docker -- push vs docker push issue again - have you upgraded your workstation yet?

I finally managed to make it work by replacing docker with gcloud docker and using the right credentials allowing me to push.

@jpbetz - We're good to go here now.

krzyzacy · 2018-03-08T17:56:20Z

Oh, I forgot to mention that the default make build needs docker-in-docker. It seems that's resolved; thanks @dims for catching this :-)

jpbetz · 2018-03-08T19:17:21Z

/lgtm

k8s-ci-robot · 2018-03-08T19:17:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, shyamjvs, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~build/OWNERS~~ [wojtek-t]
~~cluster/OWNERS~~ [wojtek-t]
~~cmd/kubeadm/OWNERS~~ [wojtek-t]
~~hack/OWNERS~~ [wojtek-t]
~~staging/src/k8s.io/apiextensions-apiserver/OWNERS~~ [wojtek-t]
~~staging/src/k8s.io/sample-apiserver/OWNERS~~ [wojtek-t]
~~test/OWNERS~~ [shyamjvs,wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-github-robot · 2018-03-08T19:18:42Z

/test all [submit-queue is verifying that this PR is safe to merge]

jdumars · 2018-03-08T20:31:51Z

/test pull-kubernetes-e2e-gke
/test pull-kubernetes-e2e-gce-large-performance

shyamjvs · 2018-03-08T20:38:12Z

@jdumars pull-kubernetes-e2e-gce-large-performance runs tests against a 2k-node cluster. Triggering it right now will most likely fail, as I'm already running a 5k-node cluster manually in that project and we don't have enough quota to accommodate both :)

k8s-github-robot · 2018-03-08T20:45:46Z

Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-ci-robot · 2018-03-09T05:48:32Z

@shyamjvs: The following tests failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-e2e-gce	`ba6bb99`	link	`/test pull-kubernetes-e2e-gce`
pull-kubernetes-e2e-gke	`ba6bb99`	link	`/test pull-kubernetes-e2e-gke`
pull-kubernetes-e2e-gce-large-performance	`ba6bb99`	link	`/test pull-kubernetes-e2e-gce-large-performance`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot requested review from jpbetz and wojtek-t March 7, 2018 17:46

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 7, 2018

k8s-ci-robot assigned jpbetz Mar 7, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2018

shyamjvs added this to the v1.10 milestone Mar 7, 2018

shyamjvs added the status/approved-for-milestone label Mar 7, 2018

k8s-github-robot added the milestone/incomplete-labels label Mar 7, 2018

wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Mar 7, 2018

k8s-github-robot removed the milestone/incomplete-labels label Mar 7, 2018

timothysc added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Mar 7, 2018

k8s-ci-robot requested review from hongchaodeng and xiang90 March 7, 2018 18:45

k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Mar 7, 2018

wojtek-t reviewed Mar 8, 2018

shyamjvs added 2 commits March 8, 2018 13:07

Rollback etcd server version to 3.1.11 due to kubernetes#60589

21f5e69

[Test change - don't merge] Skip load test

ba6bb99

shyamjvs force-pushed the go-back-to-etcd-3.1.10 branch from 6930845 to ba6bb99 March 8, 2018 12:07

shyamjvs mentioned this pull request Mar 8, 2018

Use bazel build mode for scalability presubmits kubernetes/test-infra#7182

Merged

k8s-ci-robot closed this in kubernetes/test-infra#7182 Mar 8, 2018

wojtek-t reopened this Mar 8, 2018

timothysc removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Mar 8, 2018

shyamjvs mentioned this pull request Mar 8, 2018

Scheduler logs exploding in large clusters - Can crash master! #60933

Closed

jpbetz reviewed Mar 8, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2018

k8s-github-robot merged commit 56195fd into kubernetes:master Mar 8, 2018

shyamjvs deleted the go-back-to-etcd-3.1.10 branch March 8, 2018 21:13

shyamjvs mentioned this pull request Mar 9, 2018

Bump to etcd 3.1.12 to pick up critical fix #60998

Merged

timothysc mentioned this pull request Mar 9, 2018

SIG-scalability charter. kubernetes/community#1829

Closed

shyamjvs mentioned this pull request Apr 25, 2018

'PATCH node-status' latency slo violations #62064

Closed

joejulian mentioned this pull request Jul 2, 2018

kube-apiserver 1.10.[0-5] & 1.11.0 uses up all available cpu on arm64 #64649

Closed
