Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rollback etcd server version to 3.1.11 due to #60589 #60891

Merged
merged 2 commits into from
Mar 8, 2018

Conversation

shyamjvs
Copy link
Member

@shyamjvs shyamjvs commented Mar 7, 2018

Ref #60589 (comment)

The dependencies were a bit complex (so many things relying on it) + the version was updated to 3.2.16 on top of the original bump.
So I had to mostly make manual reverting changes on a case-by-case basis - so likely to have errors :)

/cc @wojtek-t @jpbetz

Downgrade default etcd server version to 3.1.11 due to #60589

(I'm not sure if we should instead remove release-notes of the original PRs)

@k8s-ci-robot k8s-ci-robot requested review from jpbetz and wojtek-t March 7, 2018 17:46
@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 7, 2018
@wojtek-t
Copy link
Member

wojtek-t commented Mar 7, 2018

/approve no-issue

@jpbetz @smarterclayton @kubernetes/sig-api-machinery-bugs

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 7, 2018
@jpbetz
Copy link
Contributor

jpbetz commented Mar 7, 2018

I agree we should revert to etcd server 3.1 given that we don't understand why the performance regression occurred.
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2018
@shyamjvs shyamjvs added this to the v1.10 milestone Mar 7, 2018
@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Mar 7, 2018
@wojtek-t
Copy link
Member

wojtek-t commented Mar 7, 2018

@jdumars - I'm approving this PR for milestone to fix significant performance regression.
I hope you will be fine with that.

@timothysc timothysc added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Mar 7, 2018
@wojtek-t
Copy link
Member

wojtek-t commented Mar 7, 2018

@timothysc - why "do-not-merge" ?

@timothysc
Copy link
Member

@wojtek-t @jpbetz - The failure conditions that are fixed in 3.2 are of far greater importance imo.

@timothysc
Copy link
Member

There are a series of fixes in 3.2 that fix catastrophic failure conditions that we can't negate imo.

/cc @hongchaodeng @xiang90

@timothysc
Copy link
Member

/cc @kubernetes/sig-cluster-lifecycle-bugs - PSA that this effects scale deployments but at the cost of other known fixes.

@k8s-ci-robot k8s-ci-robot added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Mar 7, 2018
@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@jpbetz @shyamjvs

Pull Request Labels
  • sig/api-machinery sig/cluster-lifecycle sig/scalability: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@@ -22,14 +22,14 @@
"command": [
"/bin/sh",
"-c",
"if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://127.0.0.1:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1"
"if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://{{ hostname }}:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not revert this line - this was net improvement and we will need it again in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will need it again in the future

To confirm - do we know if this listen-client-url change works with 3.1.11 currently?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that people are generally not doing etcd upgrades together with version upgrades. So because 1.9 manifest doesn't work with 3.2.16, the order has to be: "upgrade to 1.10 and only the upgrade etcd". So it has to work.
[And yes - it works]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting to know, thanks. Fixed it.

@shyamjvs shyamjvs force-pushed the go-back-to-etcd-3.1.10 branch from 6930845 to ba6bb99 Compare March 8, 2018 12:07
@shyamjvs
Copy link
Member Author

shyamjvs commented Mar 8, 2018

The gce-large-performance pre-submit failed even before build. Seems like docker daemon unavailable:

I0308 11:44:33.287] make: Entering directory '/go/src/k8s.io/kubernetes'
I0308 11:44:33.288] +++ [0308 11:44:33] Verifying Prerequisites....
W0308 11:44:34.451] Can't connect to 'docker' daemon.  please fix and retry.
W0308 11:44:34.451] 
W0308 11:44:34.451] Possible causes:
W0308 11:44:34.452]   - Docker Daemon not started
W0308 11:44:34.452]     - Linux: confirm via your init system
W0308 11:44:34.452]     - macOS w/ docker-machine: run `docker-machine ls` and `docker-machine start <name>`
W0308 11:44:34.452]     - macOS w/ Docker for Mac: Check the menu bar and start the Docker application
W0308 11:44:34.453]   - DOCKER_HOST hasn't been set or is set incorrectly
W0308 11:44:34.453]     - Linux: domain socket is used, DOCKER_* should be unset. In Bash run `unset ${!DOCKER_*}`
W0308 11:44:34.453]     - macOS w/ docker-machine: run `eval "$(docker-machine env <name>)"`
W0308 11:44:34.453]     - macOS w/ Docker for Mac: domain socket is used, DOCKER_* should be unset. In Bash run `unset ${!DOCKER_*}`
W0308 11:44:34.454]   - Other things to check:
W0308 11:44:34.454]     - Linux: User isn't in 'docker' group.  Add and relogin.
W0308 11:44:34.454]       - Something like 'sudo usermod -a -G docker ${USER}'
W0308 11:44:34.454]       - RHEL7 bug and workaround: https://bugzilla.redhat.com/show_bug.cgi?id=1119282#c8

cc @krzyzacy @BenTheElder - Any idea what's going wrong? I defined the pull-kubernetes-e2e-gce-100-performance job similar to pull-kubernetes-e2e-gce. Ref: kubernetes/test-infra#7168

@shyamjvs
Copy link
Member Author

shyamjvs commented Mar 8, 2018

/test pull-kubernetes-e2e-gce-large-performance

@shyamjvs
Copy link
Member Author

shyamjvs commented Mar 8, 2018

/test pull-kubernetes-e2e-gce-large-performance
retrying after above fix

@jdumars
Copy link
Member

jdumars commented Mar 8, 2018

Thank you all for untangling this thorny issue. Please keep me posted with @jdumars as things progress. This is a great example of the strength of our community.

@shyamjvs
Copy link
Member Author

shyamjvs commented Mar 8, 2018

So I tested this PR manually against a 2k-node cluster and things seem fine:

wrt PUT pod-status latency:

{
      "data": {
        "Perc50": 1.485,
        "Perc90": 7.894,
        "Perc99": 43.414
      },
      "unit": "ms",
      "labels": {
        "Count": "450498",
        "Resource": "pods",
        "Scope": "namespace",
        "Subresource": "status",
        "Verb": "PUT"
      }
    },

and pod-startup latency:

INFO: perc50: 2.409912012s, perc90: 3.085908828s, perc99: 3.714040988s

@timothysc timothysc removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Mar 8, 2018
@@ -24,4 +24,4 @@ spec:
dnsPolicy: Default
containers:
- name: etcd-empty-dir-cleanup
image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.10.0
image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.11.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a 3.1.11.0 in gcr.io. The etcd version of etcd-empty-dir-cleanup is only for the version of etcdctl copied into the container image which hasn't changed between 3.1.10 and 3.1.11.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case, do you want me to:

  • keep the tag at 3.1.10.0, or
  • change it to 3.1.11.0 (to sync with etcd version used in the makefile) and push a new image?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh. To be consistent, let's publish 3.1.11.0..

Copy link
Contributor

@jpbetz jpbetz Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of we could just tag the 3.1.10.0 image with 3.1.11.0 as well 😁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer the former, to avoid messing up the tag wrt the underlying etcd version actually used to build the image.

That said, I failed with the following error while running make push:

The push refers to a repository [staging-k8s.gcr.io/etcd-empty-dir-cleanup]
41ad6ade37a0: Pushing [==================================================>]  14.32MB/14.32MB
8b8608fe70b0: Pushing [==================================================>]  14.32MB/14.32MB
c5183829c43c: Layer already exists 
read tcp [2a00:79e0:2:11:cdf:6d7d:727b:913e]:37568->[2a00:1450:400c:c06::52]:443: use of closed network connection

@krzyzacy @BenTheElder Is this somehow related to that docker auth issue? Leads on how to fix it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it. Getting:

ERROR: (gcloud.beta.auth.configure-docker) Error writing Docker configuration to disk: [Errno 2] No such file or directory: '/usr/local/google/home/shyamjvs/.docker/tmptYLal3'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.. I think I fixed it by creating that dir. Let me try pushing now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still fails, but with:

unexpected EOF

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the gcloud docker -- push vs docker push issue again - have you upgraded your workstation yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally managed to make it work by replacing docker with gcloud docker and using the right credentials allowing me to push.

@jpbetz - We're good to go here now.

@krzyzacy
Copy link
Member

krzyzacy commented Mar 8, 2018

oh, forgot to mention default make build need docker-in-docker, seems that's resolved, thanks @dims for catching this :-)

@jpbetz
Copy link
Contributor

jpbetz commented Mar 8, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, shyamjvs, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@jdumars
Copy link
Member

jdumars commented Mar 8, 2018

/test pull-kubernetes-e2e-gke
/test pull-kubernetes-e2e-gce-large-performance

@shyamjvs
Copy link
Member Author

shyamjvs commented Mar 8, 2018

@jdumars pull-kubernetes-e2e-gce-large-performance runs tests against a 2k-node cluster. Triggering it right now will mostly fail as I'm already running a 5k-node cluster manually in that project and we don't have enough quota to accommodate both :)

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 56195fd into kubernetes:master Mar 8, 2018
@shyamjvs shyamjvs deleted the go-back-to-etcd-3.1.10 branch March 8, 2018 21:13
@k8s-ci-robot
Copy link
Contributor

@shyamjvs: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce ba6bb99 link /test pull-kubernetes-e2e-gce
pull-kubernetes-e2e-gke ba6bb99 link /test pull-kubernetes-e2e-gke
pull-kubernetes-e2e-gce-large-performance ba6bb99 link /test pull-kubernetes-e2e-gce-large-performance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants