-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cap how long the kubelet waits when it has no client cert #59316
Cap how long the kubelet waits when it has no client cert #59316
Conversation
If we go a certain amount of time without being able to create a client cert and we have no current client cert from the store, exit. This prevents a corrupted local copy of the cert from leaving the Kubelet in a zombie state forever. Exiting allows a config loop outside the Kubelet to clean up the file or the bootstrap client cert to get another client cert.
One more change I need to add here - when we don't rotate up front (#58930), there's a race condition where the new transport is used before the old cert is loaded. It only happens when the cert on disk is expired. Two options:
I'm leaning towards 2 simply because it does the least violence to the current rotation code. In 1.11 I'd like to refactor it down to be much simpler. |
Otherwise, the certificate store will return nil the first time a store cert is accessed. When background rotation is being used, prevents the client from being nil.
/retest |
@liggitt any opinions? |
no strong opinion. possibly relatedly, got a report of kubelet creating 1 distinct CSR per minute... I didn't expect that to happen (expected the kubelet to reuse key and/or wait longer). Is this going to make that situation worse? |
I haven't seen that, but I would say no. I think that has to be a crash looping "unable to write to disk" kubelet. |
I've been meaning to set aside some time to get back into this. I'll review today. |
if clientCertificateManager.ServerHealthy() { | ||
glog.Fatalf("The currently active client certificate has expired and the server is responsive, exiting.") | ||
|
||
if exitAfter > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the use of sentinel value 0 here. Can we validate exitAfter > 0 early in updateTransport, and assume it's > 0 when we get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exitAfter was optional before. The alternative would be to refactor this entire loop to have two different code paths. 0 means "don't exit", what structure would you prefer?
|
||
if exitAfter > 0 { | ||
now := time.Now() | ||
if curr == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same without the if nesting?
if !clientCertificateManager.ServerHealthy() {
glog.Errorf("server down, trying again")
return
}
if curr != nil {
lastCertAvailable = now
}
if now.After(lastCertAvailable.Add(exitAfter)) {
glog.Fatalf("It has been %s since a valid client cert was found and the server is responsive, exiting.", exitAfter)
}
if now.After(curr.Leaf.NotAfter) {
glog.Fatalf("The currently active client certificate has expired and the server is responsive, exiting.")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100%, although it may just be better if this is easier to reason about than if the exact message is precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat hesitant to change the logic to look simpler but give worse messages. I agree this particular loop is ugly.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mikedanese, smarterclayton The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
Automatic merge from submit-queue (batch tested with PRs 58716, 59977, 59316, 59884, 60117). If you want to cherry-pick this change to another branch, please follow the instructions here. |
If we go a certain amount of time without being able to create a client
cert and we have no current client cert from the store, exit. This
prevents a corrupted local copy of the cert from leaving the Kubelet in a
zombie state forever. Exiting allows a config loop outside the Kubelet
to clean up the file or the bootstrap client cert to get another client
cert.
Five minutes is a totally arbitary timeout, judged to give enough time for really slow static pods to boot.
@mikedanese