Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap how long the kubelet waits when it has no client cert #59316

Merged
merged 2 commits into from
Feb 21, 2018

Conversation

smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Feb 4, 2018

If we go a certain amount of time without being able to create a client
cert and we have no current client cert from the store, exit. This
prevents a corrupted local copy of the cert from leaving the Kubelet in a
zombie state forever. Exiting allows a config loop outside the Kubelet
to clean up the file or the bootstrap client cert to get another client
cert.

Five minutes is a totally arbitary timeout, judged to give enough time for really slow static pods to boot.

@mikedanese

Set an upper bound (5 minutes) on how long the Kubelet will wait before exiting when the client cert from disk is missing or invalid. This prevents the Kubelet from waiting forever without attempting to bootstrap a new client credentials.

If we go a certain amount of time without being able to create a client
cert and we have no current client cert from the store, exit. This
prevents a corrupted local copy of the cert from leaving the Kubelet in a
zombie state forever. Exiting allows a config loop outside the Kubelet
to clean up the file or the bootstrap client cert to get another client
cert.
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 4, 2018
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 4, 2018
@smarterclayton
Copy link
Contributor Author

One more change I need to add here - when we don't rotate up front (#58930), there's a race condition where the new transport is used before the old cert is loaded. It only happens when the cert on disk is expired. Two options:

  1. Store should only ever return valid certs (but then we'd have to fall back the original cert in the transport code). Would be a lot simpler in the end, but a bigger short term refactor.
  2. Manager should set the bootstrap cert into the store if the current store cert is invalid. Easiest in the short term.

I'm leaning towards 2 simply because it does the least violence to the current rotation code. In 1.11 I'd like to refactor it down to be much simpler.

Otherwise, the certificate store will return nil the first time a store
cert is accessed. When background rotation is being used, prevents the
client from being nil.
@smarterclayton
Copy link
Contributor Author

/retest

@mikedanese mikedanese self-assigned this Feb 4, 2018
@smarterclayton
Copy link
Contributor Author

@liggitt any opinions?

@liggitt
Copy link
Member

liggitt commented Feb 7, 2018

no strong opinion. possibly relatedly, got a report of kubelet creating 1 distinct CSR per minute... I didn't expect that to happen (expected the kubelet to reuse key and/or wait longer). Is this going to make that situation worse?

@smarterclayton
Copy link
Contributor Author

I haven't seen that, but I would say no. I think that has to be a crash looping "unable to write to disk" kubelet.

@mikedanese
Copy link
Member

I've been meaning to set aside some time to get back into this. I'll review today.

if clientCertificateManager.ServerHealthy() {
glog.Fatalf("The currently active client certificate has expired and the server is responsive, exiting.")

if exitAfter > 0 {
Copy link
Member

@mikedanese mikedanese Feb 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the use of sentinel value 0 here. Can we validate exitAfter > 0 early in updateTransport, and assume it's > 0 when we get here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exitAfter was optional before. The alternative would be to refactor this entire loop to have two different code paths. 0 means "don't exit", what structure would you prefer?


if exitAfter > 0 {
now := time.Now()
if curr == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same without the if nesting?

if !clientCertificateManager.ServerHealthy() {
  glog.Errorf("server down, trying again")
  return
}
if curr != nil {
  lastCertAvailable = now
}
if now.After(lastCertAvailable.Add(exitAfter)) {
  glog.Fatalf("It has been %s since a valid client cert was found and the server is responsive, exiting.", exitAfter)
}
if now.After(curr.Leaf.NotAfter) {
  glog.Fatalf("The currently active client certificate has expired and the server is responsive, exiting.")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100%, although it may just be better if this is easier to reason about than if the exact message is precise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat hesitant to change the logic to look simpler but give worse messages. I agree this particular loop is ugly.

@mikedanese
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 20, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mikedanese, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton smarterclayton added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 21, 2018
@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 58716, 59977, 59316, 59884, 60117). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 2bbaf43 into kubernetes:master Feb 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants