forked from cert-manager/cert-manager
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request cert-manager#4757 from irbekrm/exponential_backoff…
…_design Design for exponential backoff
- Loading branch information
Showing
1 changed file
with
231 additions
and
0 deletions.
There are no files selected for viewing
231 changes: 231 additions & 0 deletions
231
design/20220118.certificate-issuance-exponential-backoff.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,231 @@ | ||
--- | ||
title: Exponential Backoff for Certificate Issuance | ||
authors: | ||
- "@irbekrm" | ||
reviewers: | ||
- @jetstack/team-cert-manager | ||
approvers: | ||
- @jetstack/team-cert-manager | ||
editor: "@irbekrm" | ||
creation-date: 2022-01-18 | ||
last-updated: 2022-01-19 | ||
status: implementable | ||
--- | ||
|
||
# Exponential Backoff for Certificate Issuance | ||
|
||
## Table of Contents | ||
|
||
<!-- toc --> | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
* [Goals](#goals) | ||
* [Non-Goals](#non-goals) | ||
* [Must Not](#must-not) | ||
- [Proposal](#proposal) | ||
* [API Changes](#api-changes) | ||
* [Examples](#examples) | ||
- [Issuance fails then succeeds](#Issuance-fails-then-succeeds) | ||
- [Manually triggered reissuance succeeds](#Manually-triggered-reissuance-succeeds) | ||
- [Manually triggered reissuance fails](#Manually-triggered-reissuance-fails) | ||
- [Example certificate statuses](#example-certificate-statuses) | ||
* [Upgrading](#upgrading) | ||
* [Test Plan](#test-plan) | ||
* [Assumptions](#assumptions) | ||
* [Feature Gate](#feature-gate) | ||
* [Misc](#misc) | ||
<!-- /toc --> | ||
|
||
## Summary | ||
|
||
Implement a way to apply exponential backoff when retrying failed certificate | ||
issuances. Issuance in this context refers to the period of time during which | ||
the `Issuing` condition on a `Certificate` is set to true and for which a new | ||
set of issuance-specific resources (`CertificateRequest`s, `Order`s, `Challenge`s etc) is | ||
created. | ||
|
||
## Motivation | ||
|
||
Currently failed issuances are retried once an hour without a backoff or time limit. This means that 1) continuous failures in large installations can overwhelm external services 2) rate limits can be easily hit in case of longer lasting issuance problems (see [Let'sEncrypt rate limts](https://letsencrypt.org/docs/rate-limits/)) | ||
|
||
### Goals | ||
|
||
- Ensure that retrying failed issuances does not overwhelm external services and is less likely to hit rate limits by adding exponentially increasing delays between issuance retries | ||
|
||
- Ensure that when the backoff is being applied, users have a way to find out when the issuance will be next retried and that the backoff mechanism does not introduce extra complexity for debugging and fits in with the already existing 1 hour backoffs | ||
|
||
- Ensure that [`cmctl renew`](https://cert-manager.io/docs/usage/cmctl/#renew) can still be used to manually force an immediate issuance attempt, so that in cases where the issuance was failing due to a setup error (i.e DNS setup) and a user believes that they have fixed it, they have a way to verify the fix without waiting up to 32h | ||
|
||
### Non-Goals | ||
|
||
- Introduce backoff period that is shorter than the current static 1 hour backoff period to allow for issuance of short lived certs to be retried as that is a separate concern from backing off exponentially and is complex enough to be worked on separately | ||
|
||
- Make the backoff period configurable as this would add a lot of extra complexity. For context Kubernetes pod crashloopbackoff period is _not_ configurable (although it is a very demanded feature [k/k#57291](https://github.com/kubernetes/kubernetes/issues/57291)) | ||
|
||
- Make it possible to reset the backoff period (However, it would be possible to force re-issuance to be retried immediately using `cmctl renew` and, if that succeeded, the backoff would be reset) | ||
|
||
### Must Not | ||
|
||
- Cause all `Certificate`s whose issuances are currently failing to be re-issued at once after cert-manager controller restart | ||
|
||
- Cause all `Certificate`s whose issuances are currently failing to be re-issued at once after upgrading to a cert-manager version that implements exponential backoff | ||
|
||
|
||
## Proposal | ||
|
||
Exponential backoff will be implemented by exponentially increasing the delays between a failed issuance ([`Issuing` condition set to false in `certificates-issuing` controller](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L341)) and a new issuance ([`Issuing` condition set to true in `certificates-trigger` controller](https://github.com/jetstack/cert-manager/blob/d5503c2ed2df272ec1bd94ebd223408fad29df1f/pkg/controller/certificates/trigger/trigger_controller.go#L184)). From a user perspective, this will correspond to the delay between a `CertificateRequest` having failed and new `CertificateRequest`s being created. | ||
|
||
A new `IssuanceAttempts` status field will be added to `Certificate` that will be used to record the number of consecutive failed issuances. | ||
Similarly to [`status.LastFailureTime`](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L385-L391), `status.IssuanceAttempts` field will only be set for a `Certificate` whose issuance is currently failing and will be removed after a successful issuance. | ||
|
||
`IssuanceAttempts` will be set by [`certificates-issuing` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/issuing) after a failed issuance by either bumping the already existing value by 1 or setting it to 1 (first failure). In case of a succeeded issuance, `certificates-issuing` controller will ensure that `status.IssuanceAttempts` is not set. | ||
|
||
The delay till the next issuance will then be calculated by [`certificates-trigger` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger) using the formula `if status.LastFailureTime != nil then next_issuance_attempt_time = status.LastFailureTime + time.Hour x 2 ^ (status.IssuanceAttempts- 1)` (binary exponential- so the sequence will be 1h, 2h, 4h, 8h etc). This ensures that the first delay is 1 hour from the last failure time which is the current behaviour. In case of continuous failures, the delay should keep increasing up to a maximum backoff period of 32h, after which it should be retried every 32h whilst the failures persist. | ||
|
||
### API changes | ||
|
||
A new `IssuanceAttempts` field will be added to `Certificate`'s status. | ||
|
||
``` | ||
type CertificateStatus { | ||
// EXISTING FIELDS | ||
// ... | ||
// NEW FIELDS | ||
// IssuanceAttempts represents the number of consecutive failed issuances. | ||
// This field is used to calculate the backoff period after which issuance will be attempted again. | ||
IssuanceAttempts int `json:issuanceAttempts,omitempty` | ||
} | ||
``` | ||
|
||
### Examples | ||
|
||
Large part of the these examples show what is already the _current_ behaviour, the only changes are the parts where `IssuanceAttempts` field is being managed and where the delay between issuances is calculated with an exponential backoff. | ||
|
||
#### Issuance fails then succeeds: | ||
|
||
1. A `CertificateRequest` fails. This is the 3rd failed issuance in a row | ||
|
||
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the [`Issuing` condition](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L480-L495) to false ([here-ish](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L326-L351)) | ||
|
||
3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be status.LastFailureTime + 2h ^ (3 - 1), so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161)) | ||
|
||
4. In 4 hours, `Certificate` gets reconciled again and `certificates-trigger` controller sets the `Issuing` condition to true. This time the `CertificateRequest` succeeds. | ||
|
||
5. `certificates-issuing` controller reconciles the `Certificate` and the succeeded `CertificateRequest` and removes the `status.IssuanceAttempts` as well as `status.LastFailureTime` and `Issuing` condition | ||
|
||
6. `certificates-trigger` controller determines that backoff is not needed and re-queues `Certificate` to be renewed based on `status.RenewalTime` | ||
|
||
#### Manually triggered reissuance succeeds | ||
|
||
1. A `CertificateRequest` fails. This is the 3rd failed issuance in a row | ||
|
||
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false | ||
|
||
3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (3 - 1)`, so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161)) | ||
|
||
4. User fixes the reason for failure (i.e some networking setup) and runs `cmctl renew <certificate-name>` to force immediate re-issuance, which [adds `Issuing` condition to the `Certificate`](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/cmd/ctl/pkg/renew/renew.go#L203) thus signalling the other controllers that issuance is in progress and bypassing the `certificates-issuing` controller's [check for whether a backoff is needed](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger/trigger_controller.go#L158-L163) | ||
|
||
5. A new `CertificateRequest` is created and succeeds | ||
|
||
6. `certificates-issuing` controller reconciles the `Certificate` and the succeeded`CertificateRequest` and removes the `status.IssuanceAttempts` as well as `status.LastFailureTime` and `Issuing` condition | ||
|
||
7. `certificates-trigger` controller parses the `Certificate`, determines that backoff is not needed and requeues `Certificate` to be renewed based on `status.RenewalTime` | ||
|
||
#### Manually triggered reissuance fails | ||
|
||
1. A `CertificateRequest` fails. This is the 3rd failed issuance in a row. | ||
|
||
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false | ||
|
||
3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (3 - 1)`, so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161)) | ||
|
||
4. User thinks that they have fixed the failure (i.e some networking setup) and runs `cmctl renew <certificate-name>` to force immediate re-issuance, which [adds `Issuing` condition to the `Certificate`](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/cmd/ctl/pkg/renew/renew.go#L203) thus signalling the other controllers that issuance is in progress and bypassing the `certificates-issuing` controller's [check for whether a backoff is needed](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger/trigger_controller.go#L158-L163) | ||
|
||
5. A new `CertificateRequest` is created and fails again | ||
|
||
6. `certificates-issuing` controller reconciles the `Certificate` and the failed `CertificateRequest`, bumps `status.IssuanceAttempts` to 4, sets the `Issuing` condition to false and sets `status.LastFailureTime` to now | ||
|
||
7. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (4 - 1)`, so roughly in 8 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 8 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161)) | ||
|
||
|
||
#### Example certificate statuses | ||
|
||
(These examples are based on what the statuses already look like after a failed/succeeded issuance. The only change is the `issuanceAttempts` field) | ||
|
||
1. A `Certificate` where issuance has failed 3 times in a row: | ||
``` | ||
Status: | ||
Conditions: | ||
- LastTransitionTime: <timestamp> | ||
Message: <message> | ||
ObservedGeneration: 1 | ||
Reason: Ready | ||
Status: "True" | ||
Type: Ready | ||
- LastTransitionTime: <timestamp> | ||
Message: <message> | ||
ObservedGeneration: 1 | ||
Reason: Failed | ||
Status: "False" | ||
Type: Issuing # Issuing condition remains set, but false after a failed issuance | ||
NotAfter: <timestamp> | ||
NotBefore: <timestamp> | ||
RenewalTime: <timestamp> | ||
IssuanceAttempts: 3 | ||
LastFailureTime: <timestamp> # Last failed issuance (i.e when a `CertificateRequest` failed) | ||
Revision: 19 | ||
Events: | ||
``` | ||
|
||
2. A `Certificate` where the latest issuance succeeded and no issuances are being attempted now: | ||
``` | ||
Status: | ||
Conditions: | ||
- LastTransitionTime: <timestamp> | ||
Message: <message> | ||
ObservedGeneration: 1 | ||
Reason: Ready | ||
Status: "True" | ||
Type: Ready | ||
NotAfter: <timestamp> | ||
NotBefore: <timestamp> | ||
RenewalTime: <timestamp> | ||
Revision: 20 | ||
Events: | ||
``` | ||
|
||
### Test Plan | ||
|
||
The example flows described in [Examples](#Examples) and [Upgrading](#Upgrading) will be tested via integration tests ([similar to the current integration tests for certificates](https://github.com/jetstack/cert-manager/tree/master/test/integration/certificates)) | ||
|
||
### Upgrading | ||
|
||
Upgrading to a cert-manager version that implements exponential backoff or downgrading to one that does not, should not require any extra steps or cause unnecessary re-issuances. | ||
To ensure that `Certificate`s whose issuance is currently failing don't get renewed all at once after upgrading to cert-manager version that implements exponential backoff, `certificates-trigger` controller should fall back to 1 hour delay for all `Certificate`s that have `status.LastFailureTime` set, but don't have the `status.IssuanceAttempts` set. | ||
## Assumptions | ||
|
||
- Although work on this was prompted by wanting to limit calls to ACME servers, users of other types of issuers will benefit from it | ||
|
||
- 1h is an acceptable initial delay between issuances (keeping this to 1h ensures that at least for the first retry attempt, we keep the current behaviour, however perhaps for short lived certs it would be useful to start with a shorter initial delay?) | ||
|
||
- 32h is an acceptable maximum delay between issuances | ||
|
||
- In case of exponential backoff being applied, controller logs will be sufficient for users trying to debug this and understand when the next issuance will be attempted | ||
|
||
- Applying exponential backoff in cases where issuance fails due to a denied `CertificateRequest` should not be treated differently to other failures (so exponential backoff should be applied). Currently they are treated the same and retried after 1 hour, so this is consistent with the existing behaviour | ||
|
||
## Feature Gate | ||
|
||
The current assumption is that exponential backoff would _not_ be placed behind a feature gate, however this should be considered. | ||
|
||
Some reasons for putting it behind a feature gate: | ||
|
||
- Would the API fields added for this feature change? | ||
- Would the delay periods (min, max and how they are calculated) change and could this change be breaking? | ||
|
||
If the feature gate was to be implemented, it would mean adding a new flag to controller binary (i.e `--enable-exponential-backoff`) and adding some if-statements to `certificates-issuing` and `certificates-trigger` controllers to not add the `status.IssuanceAttempts` field and not parse it, unless the feature gate is enabled. (Question: would `IssuanceAttempts` still be added to the v1 `Certificate` API?) | ||
|
||
## Misc | ||
|
||
- Slack conversation about storing the current delay in memory vs using a status field https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1642178582273300 |