-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubeadm: reduce the backoff time of AddMember for etcd #104134
kubeadm: reduce the backoff time of AddMember for etcd #104134
Conversation
Hi @ihgann. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ok-to-test
/priority backlog
/triage accepted
thanks for the PR @ihgann
i will have a look tomorrow.
looks like a flake |
/assign @neolit123 |
I think this should be fine, and won't break Cluster API in the same way as it did prior to the |
@@ -81,6 +81,7 @@ require ( | |||
github.com/stretchr/testify v1.7.0 | |||
github.com/vishvananda/netlink v1.1.0 | |||
github.com/vmware/govmomi v0.20.3 | |||
go.etcd.io/etcd/api/v3 v3.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, i cannot approve changes in the go.mod.
once we are done with the review here we can assign one of the maintainers for go.mod changes.
/retitle kubeadm: reduce the backoff time of AddMember for etcd @ihgann , under:
instead of |
AddMember
ok, this SGTM. please squash the commits to one and it can be merged. |
This change optimizes the kubeadm/etcd `AddMember` client-side function by stopping early in the backoff loop when a peer conflict is found (indicating the member has already been added to the etcd cluster). In this situation, the function will stop early and relay a call to `ListMembers` to fetch the current list of members to return. With this optimization, front-loading a `ListMembers` call is no longer necessary, as this functionally returns the equivalent response. This helps reduce the amount of time taken in situational cases where an initial client request to add a member is accepted by the server, but fails client-side. This situation is possible situationally, such as if network latency causes the request to timeout after it was sent and accepted by the cluster. In this situation, the following loop would occur and fail with an `ErrPeerURLExist` response, and would be stuck until the backoff timeout was met (roughly ~2min30sec currently). Testing Done: * Manual testing with an etcd cluster. Initial "AddMember` call was successful, and the etcd manifest file was identical to prior version of these files. Subsequent calls to add the same member succeeded immediately (retaining idempotency), and the resulting manifest file remains identical to previous version as well. The difference, this time, is the call finished ~2min25sec faster in an identical test in the environment tested with.
bcbcece
to
c8431f4
Compare
/lgtm thanks @ihgann |
/assign @liggitt |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ihgann, liggitt, neolit123 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
This change optimizes the kubeadm/etcd
AddMember
client-side function by stopping early in the backoff loop when a peer conflict is found (indicating the member has already been added to the etcd cluster). In this situation, the function will stop early and relay a call toListMembers
to fetch the current list of members to return. With this optimization, front-loading aListMembers
call is no longer necessary, as this functionally returns the equivalent response.This helps reduce the amount of time taken in situational cases where an initial client request to add a member is accepted by the server, but fails client-side.
This situation is possible situationally, such as if network latency causes the request to timeout after it was sent and accepted by the cluster. In this situation, the following loop would occur and fail with an
ErrPeerURLExist
response, and would be stuck until the backoff timeout was met (roughly ~2min30sec currently).Which issue(s) this PR fixes:
I have not opened an issue publicly for this, but the issue is described above.
Special notes for your reviewer:
n/a
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: