Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubeadm: reduce the backoff time of AddMember for etcd #104134

Conversation

ihgann
Copy link
Contributor

@ihgann ihgann commented Aug 4, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

This change optimizes the kubeadm/etcd AddMember client-side function by stopping early in the backoff loop when a peer conflict is found (indicating the member has already been added to the etcd cluster). In this situation, the function will stop early and relay a call to ListMembers to fetch the current list of members to return. With this optimization, front-loading a ListMembers call is no longer necessary, as this functionally returns the equivalent response.

This helps reduce the amount of time taken in situational cases where an initial client request to add a member is accepted by the server, but fails client-side.

This situation is possible situationally, such as if network latency causes the request to timeout after it was sent and accepted by the cluster. In this situation, the following loop would occur and fail with an ErrPeerURLExist response, and would be stuck until the backoff timeout was met (roughly ~2min30sec currently).

Which issue(s) this PR fixes:

I have not opened an issue publicly for this, but the issue is described above.

Special notes for your reviewer:

n/a

Does this PR introduce a user-facing change?

kubeadm: When adding an etcd peer to an existing cluster, if an error is returned indicating the peer has already been added, this is accepted and a ListMembers call is used instead to return the existing cluster. This helps diminish the exponential backoff when the first AddMember call times out, while still retaining a similar performance when the peer had already been added from a previous call.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 4, 2021
@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 4, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @ihgann. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 4, 2021
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test
/priority backlog
/triage accepted

thanks for the PR @ihgann
i will have a look tomorrow.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. area/dependency Issues or PRs related to dependency changes and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 4, 2021
@randomvariable
Copy link
Member

looks like a flake
/retest

@dims
Copy link
Member

dims commented Aug 5, 2021

/assign @neolit123

@randomvariable
Copy link
Member

I think this should be fine, and won't break Cluster API in the same way as it did prior to the ListMembers code being added. it should be caught by conformance main branch testing in any case.

@@ -81,6 +81,7 @@ require (
github.com/stretchr/testify v1.7.0
github.com/vishvananda/netlink v1.1.0
github.com/vmware/govmomi v0.20.3
go.etcd.io/etcd/api/v3 v3.5.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, i cannot approve changes in the go.mod.
once we are done with the review here we can assign one of the maintainers for go.mod changes.

@neolit123
Copy link
Member

/retitle kubeadm: reduce the backoff time of AddMember for etcd

@ihgann , under:

Does this PR introduce a user-facing change?

instead of None, could provide a one/two sentence summary of what problem we are trying to fix in this PR and how it would benefit them. should be prefixed with kubeadm: ....

@k8s-ci-robot k8s-ci-robot changed the title Reduce backoff time in etcd AddMember kubeadm: reduce the backoff time of AddMember for etcd Aug 5, 2021
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Aug 5, 2021
@neolit123
Copy link
Member

ok, this SGTM. please squash the commits to one and it can be merged.

This change optimizes the kubeadm/etcd `AddMember` client-side function
by stopping early in the backoff loop when a peer conflict is found
(indicating the member has already been added to the etcd cluster). In
this situation, the function will stop early and relay a call to
`ListMembers` to fetch the current list of members to return. With this
optimization, front-loading a `ListMembers` call is no longer necessary,
as this functionally returns the equivalent response.

This helps reduce the amount of time taken in situational cases where an
initial client request to add a member is accepted by the server, but
fails client-side.

This situation is possible situationally, such as if network latency
causes the request to timeout after it was sent and accepted by the
cluster. In this situation, the following loop would occur and fail with
an `ErrPeerURLExist` response, and would be stuck until the backoff
timeout was met (roughly ~2min30sec currently).

Testing Done:

* Manual testing with an etcd cluster. Initial "AddMember` call was
  successful, and the etcd manifest file was identical to prior version
  of these files. Subsequent calls to add the same member succeeded
  immediately (retaining idempotency), and the resulting manifest file
  remains identical to previous version as well. The difference, this
  time, is the call finished ~2min25sec faster in an identical test in
  the environment tested with.
@ihgann ihgann force-pushed the topic/ganni/optimize-kubeadm-etcd-member-add-2 branch from bcbcece to c8431f4 Compare August 5, 2021 20:12
@neolit123
Copy link
Member

/lgtm
/approve

thanks @ihgann
if for some reason we see problems for users in the wild after this change, i will give you ping.
(hopefully unlikely)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 5, 2021
@neolit123
Copy link
Member

/assign @liggitt
for the go.mod change (we are already vendoring go.etcd.io/etcd/api/v3)

@liggitt
Copy link
Member

liggitt commented Aug 5, 2021

/approve
for dep change

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ihgann, liggitt, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 5, 2021
@k8s-ci-robot k8s-ci-robot merged commit de4e500 into kubernetes:master Aug 5, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/dependency Issues or PRs related to dependency changes area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants