Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only rotate certificates in the background #58930

Merged

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Jan 28, 2018

Change the Kubelet to not block until the first certs have rotated (we didn't act on it anyway) and fall back to the bootstrap cert if the most recent rotated cert is expired on startup.

The certificate manager originally had a "block on startup" rotation behavior to ensure at least one rotation happened on startup. However, since rotation may not succeed within the first time window the code was changed to simply print the error rather than return it. This meant that the blocking rotation has no purpose - it cannot cause the kubelet to fail, and it does block the kubelet from starting static pods before the api server becomes available.

The current block behavior causes a bootstrapped kubelet that is also set to run static pods to wait several minutes before actually launching the static pods, which means self-hosted masters using static pods have a pointless delay on startup.

Since blocking rotation has no benefit and can't actually fail startup, this commit removes the blocking behavior and simplifies the code at the same time. The goroutine for rotation now completely owns the deadline, the shouldRotate() method is removed, and the method that sets rotationDeadline now returns it. We also explicitly guard against a negative sleep interval and omit the message.

Should have no impact on bootstrapping except the removal of a long delay on startup before static pods start.

The other change is that an expired certificate from the cert manager is not considered a valid cert, which triggers an immediate rotation. This causes the cert manager to fall back to the original bootstrap certificate until a new certificate is issued. This allows the bootstrap certificate on masters to be "higher powered" and allow the node to function prior to initial approval, which means someone configuring the masters with a pre-generated client cert can be guaranteed that the kubelet will be able to communicate to report self-hosted static pod status, even if the first client rotation hasn't happened. This makes master self-hosting more predictable for static configuration environments.

When using client or server certificate rotation, the Kubelet will no longer wait until the initial rotation succeeds or fails before starting static pods.  This makes running self-hosted masters with rotation more predictable.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 28, 2018
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 28, 2018
@smarterclayton
Copy link
Contributor Author

smarterclayton commented Jan 28, 2018

@kubernetes/sig-cluster-lifecycle-pr-reviews @mikedanese @liggitt @deads2k

In openshift we are using node bootstrapping and cert rotation (client and server certs). As I was going to enable static pods for master configuration, I realized that the foreground pause for bootstrapping caused a pointless and long delay (because Start() on cert managers won't cause the Kubelet to fail to startup, just get printed to logs).

This removes the foreground rotation but preserves all the error handling behavior, which then means that a master node configured for both bootstrapping AND static pods won't have an arbitrarily long wait before static pods start in some cases (before the client eventually times out and errors). It changes no behavior observable to an outsider except the following:

  1. static pods can now start before client cert bootstrapping finishes (good)
  2. the kubelet app server can start before a server cert bootstrapping finishes (different error to clients, server returns an error instead of returning connection refused)

For 2, since the only difference is error type (ECONNREFUSED vs a TLS specific error) I don't think there's a meaningful impact or a regression in outcome. The underlying intent I think has always been for bootstrapping to be backgroundable (certainly we treat the kubelet server startup as a bunch of background loops rather than as a single process).

This could be a backport candidate for at least 1.9 if we want to make self hosting less annoying to 1.9 users.

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 28, 2018
The certificate manager originally had a "block on startup" rotation
behavior to ensure at least one rotation happened on startup. However,
since rotation may not succeed within the first time window the code was
changed to simply print the error rather than return it. This meant that
the blocking rotation has no purpose - it cannot cause the kubelet to
fail, and it *does* block the kubelet from starting static pods before
the api server becomes available.

The current block behavior causes a bootstrapped kubelet that is also
set to run static pods to wait several minutes before actually launching
the static pods, which means self-hosted masters using static pods have
a pointless delay on startup.

Since blocking rotation has no benefit and can't actually fail startup,
this commit removes the blocking behavior and simplifies the code at the
same time. The goroutine for rotation now completely owns the deadline,
the shouldRotate() method is removed, and the method that sets
rotationDeadline now returns it. We also explicitly guard against a
negative sleep interval and omit the message.

Should have no impact on bootstrapping except the removal of a long
delay on startup before static pods start.

Also add a guard condition where if the current cert in the store is
expired, we fall back to the bootstrap cert initially (we use the
bootstrap cert to communicate with the server). This is consistent with
when we don't have a cert yet.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 28, 2018
@smarterclayton
Copy link
Contributor Author

@timothysc this looks like the minimal thing I needed to get static pods working with bootstrapping. I think it's generally relevant for everyone.

@mikedanese mikedanese self-assigned this Jan 30, 2018
// current certificate should be rotated, 80%+/-10% of the expiration of the
// certificate.
func (m *manager) setRotationDeadline() {
func (m *manager) nextRotationDeadline() time.Time {
// forceRotation is not protected by locks
Copy link
Member

@mikedanese mikedanese Jan 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? comment could use some justification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was probably being overzealous, the only thing protected by certAccessLock is... cert. Before, there was a set of local vars that were called in two different goroutines, now everything is in a single goroutine. I can remove the comment.

@mikedanese
Copy link
Member

change seems reasonable. it might cause some spurious authorization errors early in kubelet boot but that doesn't seem like an issue.

@smarterclayton
Copy link
Contributor Author

And we will kill ourselves after X min if we can't get a client cert (crash loop) which gives time for static pods to init

@timothysc
Copy link
Member

seems reasonable to me. @jbeda @mattmoyer - thoughts?

@mikedanese
Copy link
Member

LGTM
/approve

@mikedanese
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mikedanese, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants