-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rollback etcd server version to 3.1.11 due to #60589 #60891
Rollback etcd server version to 3.1.11 due to #60589 #60891
Conversation
/approve no-issue @jpbetz @smarterclayton @kubernetes/sig-api-machinery-bugs |
I agree we should revert to etcd server 3.1 given that we don't understand why the performance regression occurred. |
@jdumars - I'm approving this PR for milestone to fix significant performance regression. |
@timothysc - why "do-not-merge" ? |
There are a series of fixes in 3.2 that fix catastrophic failure conditions that we can't negate imo. |
/cc @kubernetes/sig-cluster-lifecycle-bugs - PSA that this effects scale deployments but at the cost of other known fixes. |
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process Pull Request Labels
|
cluster/gce/manifests/etcd.manifest
Outdated
@@ -22,14 +22,14 @@ | |||
"command": [ | |||
"/bin/sh", | |||
"-c", | |||
"if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://127.0.0.1:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1" | |||
"if [ -e /usr/local/bin/migrate-if-needed.sh ]; then /usr/local/bin/migrate-if-needed.sh 1>>/var/log/etcd{{ suffix }}.log 2>&1; fi; exec /usr/local/bin/etcd --name etcd-{{ hostname }} --listen-peer-urls {{ etcd_protocol }}://{{ host_ip }}:{{ server_port }} --initial-advertise-peer-urls {{ etcd_protocol }}://{{ hostname }}:{{ server_port }} --advertise-client-urls http://{{ hostname }}:{{ port }} --listen-client-urls http://127.0.0.1:{{ port }} {{ quota_bytes }} --data-dir /var/etcd/data{{ suffix }} --initial-cluster-state {{ cluster_state }} --initial-cluster {{ etcd_cluster }} {{ etcd_creds }} 1>>/var/log/etcd{{ suffix }}.log 2>&1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not revert this line - this was net improvement and we will need it again in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will need it again in the future
To confirm - do we know if this listen-client-url change works with 3.1.11 currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that people are generally not doing etcd upgrades together with version upgrades. So because 1.9 manifest doesn't work with 3.2.16, the order has to be: "upgrade to 1.10 and only the upgrade etcd". So it has to work.
[And yes - it works]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's interesting to know, thanks. Fixed it.
6930845
to
ba6bb99
Compare
The gce-large-performance pre-submit failed even before build. Seems like docker daemon unavailable:
cc @krzyzacy @BenTheElder - Any idea what's going wrong? I defined the |
/test pull-kubernetes-e2e-gce-large-performance |
/test pull-kubernetes-e2e-gce-large-performance |
Thank you all for untangling this thorny issue. Please keep me posted with @jdumars as things progress. This is a great example of the strength of our community. |
So I tested this PR manually against a 2k-node cluster and things seem fine: wrt
and pod-startup latency:
|
@@ -24,4 +24,4 @@ spec: | |||
dnsPolicy: Default | |||
containers: | |||
- name: etcd-empty-dir-cleanup | |||
image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.10.0 | |||
image: k8s.gcr.io/etcd-empty-dir-cleanup:3.1.11.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a 3.1.11.0 in gcr.io. The etcd version of etcd-empty-dir-cleanup is only for the version of etcdctl copied into the container image which hasn't changed between 3.1.10 and 3.1.11.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In that case, do you want me to:
- keep the tag at 3.1.10.0, or
- change it to 3.1.11.0 (to sync with etcd version used in the makefile) and push a new image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sigh. To be consistent, let's publish 3.1.11.0..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of we could just tag the 3.1.10.0 image with 3.1.11.0 as well 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the former, to avoid messing up the tag wrt the underlying etcd version actually used to build the image.
That said, I failed with the following error while running make push
:
The push refers to a repository [staging-k8s.gcr.io/etcd-empty-dir-cleanup]
41ad6ade37a0: Pushing [==================================================>] 14.32MB/14.32MB
8b8608fe70b0: Pushing [==================================================>] 14.32MB/14.32MB
c5183829c43c: Layer already exists
read tcp [2a00:79e0:2:11:cdf:6d7d:727b:913e]:37568->[2a00:1450:400c:c06::52]:443: use of closed network connection
@krzyzacy @BenTheElder Is this somehow related to that docker auth issue? Leads on how to fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried it. Getting:
ERROR: (gcloud.beta.auth.configure-docker) Error writing Docker configuration to disk: [Errno 2] No such file or directory: '/usr/local/google/home/shyamjvs/.docker/tmptYLal3'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.. I think I fixed it by creating that dir. Let me try pushing now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still fails, but with:
unexpected EOF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the gcloud docker -- push vs docker push issue again - have you upgraded your workstation yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally managed to make it work by replacing docker
with gcloud docker
and using the right credentials allowing me to push.
@jpbetz - We're good to go here now.
oh, forgot to mention default make build need docker-in-docker, seems that's resolved, thanks @dims for catching this :-) |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jpbetz, shyamjvs, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
/test pull-kubernetes-e2e-gke |
@jdumars pull-kubernetes-e2e-gce-large-performance runs tests against a 2k-node cluster. Triggering it right now will mostly fail as I'm already running a 5k-node cluster manually in that project and we don't have enough quota to accommodate both :) |
Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions here. |
@shyamjvs: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Ref #60589 (comment)
The dependencies were a bit complex (so many things relying on it) + the version was updated to 3.2.16 on top of the original bump.
So I had to mostly make manual reverting changes on a case-by-case basis - so likely to have errors :)
/cc @wojtek-t @jpbetz
(I'm not sure if we should instead remove release-notes of the original PRs)