-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only rotate certificates in the background #58930
Only rotate certificates in the background #58930
Conversation
@kubernetes/sig-cluster-lifecycle-pr-reviews @mikedanese @liggitt @deads2k In openshift we are using node bootstrapping and cert rotation (client and server certs). As I was going to enable static pods for master configuration, I realized that the foreground pause for bootstrapping caused a pointless and long delay (because This removes the foreground rotation but preserves all the error handling behavior, which then means that a master node configured for both bootstrapping AND static pods won't have an arbitrarily long wait before static pods start in some cases (before the client eventually times out and errors). It changes no behavior observable to an outsider except the following:
For 2, since the only difference is error type (ECONNREFUSED vs a TLS specific error) I don't think there's a meaningful impact or a regression in outcome. The underlying intent I think has always been for bootstrapping to be backgroundable (certainly we treat the kubelet server startup as a bunch of background loops rather than as a single process). This could be a backport candidate for at least 1.9 if we want to make self hosting less annoying to 1.9 users. |
The certificate manager originally had a "block on startup" rotation behavior to ensure at least one rotation happened on startup. However, since rotation may not succeed within the first time window the code was changed to simply print the error rather than return it. This meant that the blocking rotation has no purpose - it cannot cause the kubelet to fail, and it *does* block the kubelet from starting static pods before the api server becomes available. The current block behavior causes a bootstrapped kubelet that is also set to run static pods to wait several minutes before actually launching the static pods, which means self-hosted masters using static pods have a pointless delay on startup. Since blocking rotation has no benefit and can't actually fail startup, this commit removes the blocking behavior and simplifies the code at the same time. The goroutine for rotation now completely owns the deadline, the shouldRotate() method is removed, and the method that sets rotationDeadline now returns it. We also explicitly guard against a negative sleep interval and omit the message. Should have no impact on bootstrapping except the removal of a long delay on startup before static pods start. Also add a guard condition where if the current cert in the store is expired, we fall back to the bootstrap cert initially (we use the bootstrap cert to communicate with the server). This is consistent with when we don't have a cert yet.
ffab1fd
to
44493de
Compare
@timothysc this looks like the minimal thing I needed to get static pods working with bootstrapping. I think it's generally relevant for everyone. |
// current certificate should be rotated, 80%+/-10% of the expiration of the | ||
// certificate. | ||
func (m *manager) setRotationDeadline() { | ||
func (m *manager) nextRotationDeadline() time.Time { | ||
// forceRotation is not protected by locks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why? comment could use some justification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was probably being overzealous, the only thing protected by certAccessLock is... cert. Before, there was a set of local vars that were called in two different goroutines, now everything is in a single goroutine. I can remove the comment.
change seems reasonable. it might cause some spurious authorization errors early in kubelet boot but that doesn't seem like an issue. |
And we will kill ourselves after X min if we can't get a client cert (crash loop) which gives time for static pods to init |
seems reasonable to me. @jbeda @mattmoyer - thoughts? |
LGTM |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mikedanese, smarterclayton The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
Change the Kubelet to not block until the first certs have rotated (we didn't act on it anyway) and fall back to the bootstrap cert if the most recent rotated cert is expired on startup.
The certificate manager originally had a "block on startup" rotation behavior to ensure at least one rotation happened on startup. However, since rotation may not succeed within the first time window the code was changed to simply print the error rather than return it. This meant that the blocking rotation has no purpose - it cannot cause the kubelet to fail, and it does block the kubelet from starting static pods before the api server becomes available.
The current block behavior causes a bootstrapped kubelet that is also set to run static pods to wait several minutes before actually launching the static pods, which means self-hosted masters using static pods have a pointless delay on startup.
Since blocking rotation has no benefit and can't actually fail startup, this commit removes the blocking behavior and simplifies the code at the same time. The goroutine for rotation now completely owns the deadline, the shouldRotate() method is removed, and the method that sets rotationDeadline now returns it. We also explicitly guard against a negative sleep interval and omit the message.
Should have no impact on bootstrapping except the removal of a long delay on startup before static pods start.
The other change is that an expired certificate from the cert manager is not considered a valid cert, which triggers an immediate rotation. This causes the cert manager to fall back to the original bootstrap certificate until a new certificate is issued. This allows the bootstrap certificate on masters to be "higher powered" and allow the node to function prior to initial approval, which means someone configuring the masters with a pre-generated client cert can be guaranteed that the kubelet will be able to communicate to report self-hosted static pod status, even if the first client rotation hasn't happened. This makes master self-hosting more predictable for static configuration environments.