-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubeadm: Make etcd member removal idempotent #117724
kubeadm: Make etcd member removal idempotent #117724
Conversation
/sig cluster-lifecycle |
/cc @neolit123 |
I mistakenly had a double import (one already existed under an alias). Because it's a small change, I'll just amend my commit, and force-push. |
eb85ba6
to
65e306d
Compare
Pod terminated unexpectedly. /retest pull-kubernetes-dependencies |
@dlipovetsky: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-kubernetes-dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
/hold
thanks, @dlipovetsky the change seems ok, but needs more eyes because this is a sensitive area and we don't have sufficient tests for all cases.
@@ -283,7 +285,7 @@ func (c *Client) GetMemberID(peerURL string) (uint64, error) { | |||
return member.GetID(), nil | |||
} | |||
} | |||
return 0, nil | |||
return 0, ErrNoMemberIDForPeerURL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems valid, but i wonder if it will break something. are we running GetMemberID() in more places where we expect no error? if yes, we should probably adapt all code around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review! This is a great question.
The only other call site is https://github.com/kubernetes/kubernetes/blob/65e306d7d85dcfcbb9e3f5e7373f5469cc4437c7/cmd/kubeadm/app/phases/etcd/local.go#L180.
I've mentally walked this code.
Previous to this change, if GetMemberID
did not find a member ID for the peer, it returned 0
. Afterward, MemberPromote
logged the message [etcd] Member 0 already promoted.
The function continued to run, possibly returning an error later from the call to WaitForClusterAvailable
.
With this change, if GetMemberID
does not find the member ID, it returns an error, and that error is immediately returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks valid.
Only the learner mode will use it.
kubernetes/cmd/kubeadm/app/phases/etcd/local.go
Lines 175 to 184 in 28247d5
if features.Enabled(cfg.FeatureGates, features.EtcdLearnerMode) { | |
learnerID, err := etcdClient.GetMemberID(etcdPeerAddress) | |
if err != nil { | |
return err | |
} | |
err = etcdClient.MemberPromote(learnerID) | |
if err != nil { | |
return err | |
} | |
} |
The change is OK if the GetMemberID returns the error, it is not valid and should return err. Or the promote will fail bellow.
I noticed that we don't have tests that exercise the etcd client. I think we could address this with an integration test, i.e. run 3 etcd processes. I think that should work in prow... |
do you mean the etcd client in etcd? we only use kinder for testing kubeadm binary calls, with the exception of some CLI integration tests in k/k. basing tests on kinder is a req since we'd ideally want to keep all test jobs generated and in the same place. not sure how we can tell the etcd client in kubeadm to do any arbitrary action foo. IMO ideally we should have some kubeadm etcd code unit tests with fake etcd server / client. not sure to what extent etcd code supports that. |
👍 A fake etcd server seems doable. Update: I took a look. I don't think a fake etcd server is doable after all. I would like to use a fake etcd client, but currently we instantiate the real etcd client every time we use it, so there's no way to replace it with a fake client. To do that, I'd need to refactor a bit. I'll take a closer look. |
if errors.Is(etcdutil.ErrNoMemberIDForPeerURL, err) { | ||
klog.V(5).Infof("[etcd] member was already removed, because no member id exists for peer %s", etcdPeerAddress) | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RemoveMember
will handle ErrNoMemberIDForPeerURL
later in the function.
Can we pass through so that the log of removing an etcd member will be printed and for file delete, not-found is a successful state? I don't have a strong opinion on this, but I believe it would make the log more readable for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass through so that the log of removing an etcd member will be printed and for file delete, not-found is a successful state?
Thanks for the review @pacoxu!
I'm sorry, I can't understand what you're asking here. Could you please rephrase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is inside func RemoveStackedEtcdMemberFromCluster
.
With the current change, the log (if verbose is less than 5) will show only [etcd] get the member id from peer: %s
and return.
I prefer to change the log level of line 122 higher so that users can know why. @neolit123 @SataQiu WDYT?
Sorry. My last comment was wrong. If there is an error here, the remove member will use an empty ID which is odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for rephrasing!
I prefer to change the log level of line 122 higher so that users can know why.
I chose log level 5 for two reasons. First, because it's not an error, so I don't want to alert the user unless they've asked for increased verbosity. Second, because all log messages that indicate a problem (i.e. start with Failed
) use level 5.
Of course I can reduce the log level, if you feel strongly. Let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
overall
@neolit123 I created a fake etcd client and wrote a unit test of I could add this to my PR, but I'm not sure I should. WDYT? |
new PR for these tests seems better |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dlipovetsky, neolit123, pacoxu, SataQiu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
52c3f55
to
dccd157
Compare
I squashed the fixup. |
Looks like #117792 will merge first. That adds a unit test for |
dccd157
to
20d93a1
Compare
I rebased on master to get the unit test added in #117792. Reviewers: The code you reviewed is in the first commit, and the rebase did not change it. I added a fixup that updates the unit test. I'll squash the fixup once I get an /lgtm. |
Failure appears unrelated:
/test pull-kubernetes-e2e-kind-ipv6 |
LGTM label has been added. Git tree hash: 313e4bfca9e561239fe827832eca3fde1adaaff2
|
If the etcd member is not found, then it has already been removed, and kubeadm reset should immediately complete the 'remove-etcd-member' phase. Previously, the phase would complete only once the exponential-backoff retry expired, up to 3 minutes duration. This commit also fixes a semantic error in etcd.GetMemberID. Previously, the function returned 0 if no member was found, but 0 is not a valid member ID.
20d93a1
to
5fd5768
Compare
I've squashed the fixup commit. |
@neolit123 I think this has gotten reviews you asked for, and it includes new unit tests from #117792. Are you ok to cancel the /hold? |
/hold cancel |
@dlipovetsky can you please cherry pick this to 1.24 as well ? |
What type of PR is this?
/kind bug
What this PR does / why we need it:
With this change, if the etcd member is not found, then it has already been removed, and
kubeadm reset
immediately completes theremove-etcd-member
phase.Previously, the phase would complete only once the exponential-backoff retry expired, up to 3 minutes duration. Below is what this looks like; note that
kubeadm reset
fails to delete the member ID0
, but eventually completes theremove-etcd-member
phase, and then the remaining phases:This change also fixes a semantic bug in
etcd.GetMemberID
. The function now returns an error if no member ID is found for the peer. Previously, the function returned 0 if no member was found, but 0 is not a valid member ID.Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This change does not change the existing behavior of the
remove-etcd-member
phase. It only shortens the amount of time thatkubeadm reset
spends in that phase when the etcd member ID is not found for the peer.Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: