From 184a16da10a7762c4aeec11eb42c2088c1b0360b Mon Sep 17 00:00:00 2001
From: irbekrm
Date: Tue, 18 Jan 2022 09:09:58 +0000
Subject: [PATCH 1/5] Design for exponential backoff
Signed-off-by: irbekrm
---
...ertificate-issuance-exponential-backoff.md | 231 ++++++++++++++++++
1 file changed, 231 insertions(+)
create mode 100644 design/20220118.certificate-issuance-exponential-backoff.md
diff --git a/design/20220118.certificate-issuance-exponential-backoff.md b/design/20220118.certificate-issuance-exponential-backoff.md
new file mode 100644
index 00000000000..b1ccb69ac96
--- /dev/null
+++ b/design/20220118.certificate-issuance-exponential-backoff.md
@@ -0,0 +1,231 @@
+---
+title: Exponential Backoff for Certificate Issuance
+authors:
+ - "@irbekrm"
+reviewers:
+ - @jetstack/team-cert-manager
+approvers:
+ - @jetstack/team-cert-manager
+editor: "@irbekrm"
+creation-date: 2022-01-18
+last-updated: 2022-01-19
+status: implementable
+---
+
+# Exponential Backoff for Certificate Issuance
+
+## Table of Contents
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+ * [Goals](#goals)
+ * [Non-Goals](#non-goals)
+ * [Must Not](#must-not)
+- [Proposal](#proposal)
+ * [API Changes](#api-changes)
+ * [Examples](#examples)
+ - [Issuance fails then succeeds](#Issuance-fails-then-succeeds)
+ - [Manually triggered reissuance succeeds](#Manually-triggered-reissuance-succeeds)
+ - [Manually triggered reissuance fails](#Manually-triggered-reissuance-fails)
+ - [Example certificate statuses](#example-certificate-statuses)
+ * [Upgrading](#upgrading)
+ * [Test Plan](#test-plan)
+ * [Assumptions](#assumptions)
+ * [Feature Gate](#feature-gate)
+ * [Misc](#misc)
+
+
+## Summary
+
+Implement a way to apply exponential backoff when retrying failed certificate
+issuances. Issuance in this context refers to the period of time during which
+the `Issuing` condition on a `Certificate` is set to true and for which a new
+set of issuance-specific resources (`CertificateRequest`s, `Order`s, `Challenge`s etc) is
+created.
+
+## Motivation
+
+Currently failed issuances are retried once an hour without a backoff or time limit. This means that 1) continuous failures in large installations can overwhelm external services 2) rate limits can be easily hit in case of longer lasting issuance problems (see [Let'sEncrypt rate limts](https://letsencrypt.org/docs/rate-limits/))
+
+### Goals
+
+- Ensure that retrying failed issuances does not overwhelm external services and is less likely to hit rate limits by adding exponentially increasing delays between issuance retries
+
+- Ensure that when the backoff is being applied, users have a way to find out when the issuance will be next retried and that the backoff mechanism does not introduce extra complexity for debugging and fits in with the already existing 1 hour backoffs
+
+- Ensure that [`cmctl renew`](https://cert-manager.io/docs/usage/cmctl/#renew) can still be used to manually force an immediate issuance attempt, so that in cases where the issuance was failing due to a setup error (i.e DNS setup) and a user believes that they have fixed it, they have a way to verify the fix without waiting up to 32h
+
+### Non-Goals
+
+- Make the backoff period configurable as this would add a lot of extra complexity. For context Kubernetes pod crashloopbackoff period is _not_ configurable (although it is a very demanded feature [k/k#57291](https://github.com/kubernetes/kubernetes/issues/57291))
+
+- Make it possible to reset the backoff period (However, it would be possible to force re-issuance to be retried immediately using `cmctl renew` and, if that succeeded, the backoff would be reset)
+
+### Must Not
+
+- Cause all `Certificate`s whose issuances are currently failing to be re-issued at once after cert-manager controller restart
+
+- Cause all `Certificate`s whose issuances are currently failing to be re-issued at once after upgrading to a cert-manager version that implements exponential backoff
+
+
+## Proposal
+
+Exponential backoff will be implemented by exponentially increasing the delays between a failed issuance ([`Issuing` condition set to false in `certificates-issuing` controller](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L341)) and a new issuance ([`Issuing` condition set to true in `certificates-trigger` controller](https://github.com/jetstack/cert-manager/blob/d5503c2ed2df272ec1bd94ebd223408fad29df1f/pkg/controller/certificates/trigger/trigger_controller.go#L184)). From a user perspective, this will correspond to the delay between a `CertificateRequest` having failed and new `CertificateRequest`s being created.
+
+A new `IssuanceAttempts` status field will be added to `Certificate` that will be used to record the number of failed issuances for this [revision](https://github.com/jetstack/cert-manager/blob/5019aaacfcaa4da2cba32218fdf583eede00f224/pkg/apis/certmanager/v1/types_certificate.go#L411-L426).
+Similarly to [`status.LastFailureTime`](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L385-L391), `status.IssuanceAttempts` field will only be set for a `Certificate` whose issuance is currently failing and will be removed after a successful issuance.
+
+`IssuanceAttempts` will be set by [`certificates-issuing` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/issuing) after a failed issuance by either bumping the already existing value by 1 or setting it to 1 (first failure). In case of a succeeded issuance, `certificates-issuing` controller will ensure that `status.IssuanceAttempts` is not set.
+
+The delay till the next issuance will then be calculated by [`certificates-trigger` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger) using the formula `if status.LastFailureTime != nil then next_issuance_attempt_time = status.LastFailureTime + time.Hour x 2 ^ (status.IssuanceAttempts- 1)` (binary exponential- so the sequence will be 1h, 2h, 4h, 8h etc). This ensures that the first delay is 1 hour from the last failure time which is the current behaviour. In case of continuous failures, the delay should keep increasing up to a maximum backoff period of 32h, after which it should be retried every 32h whilst the failures persist.
+`certificates-trigger` controller should also throw an event on the `Certificate` with a message that includes the calculated delay period.
+
+### API changes
+
+A new `IssuanceAttempts` field will be added to `Certificate`'s status.
+
+```
+type CertificateStatus {
+ // EXISTING FIELDS
+ // ...
+ // NEW FIELDS
+
+ // IssuanceAttempts represents the number of times certificate issuance has been attempted and failed during this revision.
+ // This field is used to calculate the backoff period after which issuance will be attempted again.
+ IssuanceAttempts int `json:issuanceAttempts,omitempty`
+}
+```
+
+### Examples
+
+Large part of the these examples show what is already the _current_ behaviour, the only changes are the parts where `IssuanceAttempts` field is being managed and where the delay between issuances is calculated with an exponential backoff.
+
+#### Issuance fails then succeeds:
+
+1. A `CertificateRequest` with [revision number](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L409-L424) 20 fails. This is the 3rd failed issuance for this revision
+
+2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the [`Issuing` condition](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L480-L495) to false ([here-ish](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L326-L351))
+
+3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be status.LastFailureTime + 2h ^ (3 - 1), so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
+
+4. In 4 hours, `Certificate` gets reconciled again and `certificates-trigger` controller sets the `Issuing` condition to true. This time the `CertificateRequest` succeeds.
+
+5. `certificates-issuing` controller reconciles the `Certificate` and the succeeded `CertificateRequest` and removes the `status.IssuanceAttempts` as well as `status.LastFailureTime` and `Issuing` condition
+
+6. `certificates-trigger` controller determines that backoff is not needed and re-queues `Certificate` to be renewed based on `status.RenewalTime`
+
+#### Manually triggered reissuance succeeds
+
+1. A `CertificateRequest` with revision 20 fails. This is the 3rd failed issuance for revision 20
+
+2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false
+
+3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (3 - 1)`, so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
+
+4. User fixes the reason for failure (i.e some networking setup) and runs `cmctl renew ` to force immediate re-issuance, which [adds `Issuing` condition to the `Certificate`](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/cmd/ctl/pkg/renew/renew.go#L203) thus signalling the other controllers that issuance is in progress and bypassing the `certificates-issuing` controller's [check for whether a backoff is needed](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger/trigger_controller.go#L158-L163)
+
+5. A new `CertificateRequest` is created and succeeds
+
+6. `certificates-issuing` controller reconciles the `Certificate` and the succeeded`CertificateRequest` and removes the `status.IssuanceAttempts` as well as `status.LastFailureTime` and `Issuing` condition
+
+7. `certificates-trigger` controller parses the `Certificate`, determines that backoff is not needed and requeues `Certificate` to be renewed based on `status.RenewalTime`
+
+#### Manually triggered reissuance fails
+
+1. A `CertificateRequest` with revision 20 fails. This is the 3rd failed issuance for revision 20
+
+2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false
+
+3. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (3 - 1)`, so roughly in 4 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 4 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
+
+4. User thinks that they have fixed the failure (i.e some networking setup) and runs `cmctl renew ` to force immediate re-issuance, which [adds `Issuing` condition to the `Certificate`](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/cmd/ctl/pkg/renew/renew.go#L203) thus signalling the other controllers that issuance is in progress and bypassing the `certificates-issuing` controller's [check for whether a backoff is needed](https://github.com/jetstack/cert-manager/blob/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger/trigger_controller.go#L158-L163)
+
+5. A new `CertificateRequest` is created and fails again
+
+6. `certificates-issuing` controller reconciles the `Certificate` and the failed `CertificateRequest`, bumps `status.IssuanceAttempts` to 4, sets the `Issuing` condition to false and sets `status.LastFailureTime` to now
+
+7. `certificates-trigger` controller parses the `Certificate` with the false `Issuing` condition, calculates the backoff period (in this case it will be `status.LastFailureTime + 2h ^ (4 - 1)`, so roughly in 8 hours) [here-ish](https://github.com/jetstack/cert-manager/blob/8dc603e7f5ef64288478b2e7a769a5415ae54ab0/pkg/controller/certificates/trigger/trigger_controller.go#L201) and enqueues the `Certificate` to be reconciled in 8 hours ([here](https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/trigger/trigger_controller.go#L161))
+
+
+#### Example certificate statuses
+
+(These examples are based on what the statuses already look like after a failed/succeeded issuance. The only change is the `issuanceAttempts` field)
+
+1. A `Certificate` where issuance has failed 3 times for revision 20:
+```
+Status:
+ Conditions:
+ - LastTransitionTime:
+ Message:
+ ObservedGeneration: 1
+ Reason: Ready
+ Status: "True"
+ Type: Ready
+ - LastTransitionTime:
+ Message:
+ ObservedGeneration: 1
+ Reason: Failed
+ Status: "False"
+ Type: Issuing # Issuing condition remains set, but false after a failed issuance
+ NotAfter:
+ NotBefore:
+ RenewalTime:
+ IssuanceAttempts: 3
+ LastFailureTime: # Last failed issuance (i.e when a `CertificateRequest` failed)
+ Revision: 19 # The latest succeeded revision. Currently we are attempting revision 20
+Events:
+ Type Reason Age From Message
+ ---- ------ ---- ---- -------
+ Warning IssuanceFailed 4s cert-manager Certificate issuance has failed. It will be retried in 4 hours.
+ ```
+
+2. A `Certificate` where the latest issuance for revision 20 succeeded and no issuances are being attempted now:
+```
+Status:
+ Conditions:
+ - LastTransitionTime:
+ Message:
+ ObservedGeneration: 1
+ Reason: Ready
+ Status: "True"
+ Type: Ready
+ NotAfter:
+ NotBefore:
+ RenewalTime:
+ Revision: 20
+Events:
+```
+
+### Test Plan
+
+The example flows described in [Examples](#Examples) and [Upgrading](#Upgrading) will be tested via integration tests ([similar to the current integration tests for certificates](https://github.com/jetstack/cert-manager/tree/master/test/integration/certificates))
+
+### Upgrading
+
+Upgrading to a cert-manager version that implements exponential backoff or downgrading to one that does not, should not require any extra steps or cause unnecessary re-issuances.
+To ensure that `Certificate`s whose issuance is currently failing don't get renewed all at once after upgrading to cert-manager version that implements exponential backoff, `certificates-trigger` controller should fall back to 1 hour delay for all `Certificate`s that have `status.LastFailureTime` set, but don't have the `status.IssuanceAttempts` set.
+## Assumptions
+
+- Although work on this was prompted by wanting to limit calls to ACME servers, users of other types of issuers will benefit from it
+
+- 1h is an acceptable initial delay between issuances (keeping this to 1h ensures that at least for the first retry attempt, we keep the current behaviour, however perhaps for short lived certs it would be useful to start with a shorter initial delay?)
+
+- 32h is an acceptable maximum delay between issuances
+
+- In case of exponential backoff being applied, an event on the `Certificate` as well as controller logs will be sufficient for users trying to debug this
+
+## Feature Gate
+
+The current assumption is that exponential backoff would _not_ be placed behind a feature gate, however this should be considered.
+
+Some reasons for putting it behind a feature gate:
+
+- Would the API fields added for this feature change?
+- Would the delay periods (min, max and how they are calculated) change and could this change be breaking?
+
+If the feature gate was to be implemented, it would mean adding a new flag to controller binary (i.e `--enable-exponential-backoff`) and adding some if-statements to `certificates-issuing` and `certificates-trigger` controllers to not add the `status.IssuanceAttempts` field and not parse it, unless the feature gate is enabled. (Question: would `IssuanceAttempts` still be added to the v1 `Certificate` API?)
+
+## Misc
+
+- Slack conversation about storing the current delay in memory vs using a status field https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1642178582273300
From 93cf3f7a2aa9d6dc9e316ba75f6cde8e090b64a2 Mon Sep 17 00:00:00 2001
From: irbekrm
Date: Wed, 26 Jan 2022 16:39:49 +0000
Subject: [PATCH 2/5] Changing revision number should not reset backoff period
There is no particular reason why it should and applying backoff to all consecutive failed issuances makes this easier to reason about
Signed-off-by: irbekrm
---
...8.certificate-issuance-exponential-backoff.md | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/design/20220118.certificate-issuance-exponential-backoff.md b/design/20220118.certificate-issuance-exponential-backoff.md
index b1ccb69ac96..d63a2f8e653 100644
--- a/design/20220118.certificate-issuance-exponential-backoff.md
+++ b/design/20220118.certificate-issuance-exponential-backoff.md
@@ -73,7 +73,7 @@ Currently failed issuances are retried once an hour without a backoff or time li
Exponential backoff will be implemented by exponentially increasing the delays between a failed issuance ([`Issuing` condition set to false in `certificates-issuing` controller](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L341)) and a new issuance ([`Issuing` condition set to true in `certificates-trigger` controller](https://github.com/jetstack/cert-manager/blob/d5503c2ed2df272ec1bd94ebd223408fad29df1f/pkg/controller/certificates/trigger/trigger_controller.go#L184)). From a user perspective, this will correspond to the delay between a `CertificateRequest` having failed and new `CertificateRequest`s being created.
-A new `IssuanceAttempts` status field will be added to `Certificate` that will be used to record the number of failed issuances for this [revision](https://github.com/jetstack/cert-manager/blob/5019aaacfcaa4da2cba32218fdf583eede00f224/pkg/apis/certmanager/v1/types_certificate.go#L411-L426).
+A new `IssuanceAttempts` status field will be added to `Certificate` that will be used to record the number of consecutive failed issuances.
Similarly to [`status.LastFailureTime`](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L385-L391), `status.IssuanceAttempts` field will only be set for a `Certificate` whose issuance is currently failing and will be removed after a successful issuance.
`IssuanceAttempts` will be set by [`certificates-issuing` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/issuing) after a failed issuance by either bumping the already existing value by 1 or setting it to 1 (first failure). In case of a succeeded issuance, `certificates-issuing` controller will ensure that `status.IssuanceAttempts` is not set.
@@ -91,7 +91,7 @@ type CertificateStatus {
// ...
// NEW FIELDS
- // IssuanceAttempts represents the number of times certificate issuance has been attempted and failed during this revision.
+ // IssuanceAttempts represents the number of consecutive failed issuances.
// This field is used to calculate the backoff period after which issuance will be attempted again.
IssuanceAttempts int `json:issuanceAttempts,omitempty`
}
@@ -103,7 +103,7 @@ Large part of the these examples show what is already the _current_ behaviour, t
#### Issuance fails then succeeds:
-1. A `CertificateRequest` with [revision number](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L409-L424) 20 fails. This is the 3rd failed issuance for this revision
+1. A `CertificateRequest` fails. This is the 3rd failed issuance in a row
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the [`Issuing` condition](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/apis/certmanager/v1/types_certificate.go#L480-L495) to false ([here-ish](https://github.com/jetstack/cert-manager/blob/196d0011ca46037186a826365bcd6316d9b9462a/pkg/controller/certificates/issuing/issuing_controller.go#L326-L351))
@@ -117,7 +117,7 @@ Large part of the these examples show what is already the _current_ behaviour, t
#### Manually triggered reissuance succeeds
-1. A `CertificateRequest` with revision 20 fails. This is the 3rd failed issuance for revision 20
+1. A `CertificateRequest` fails. This is the 3rd failed issuance in a row
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false
@@ -133,7 +133,7 @@ Large part of the these examples show what is already the _current_ behaviour, t
#### Manually triggered reissuance fails
-1. A `CertificateRequest` with revision 20 fails. This is the 3rd failed issuance for revision 20
+1. A `CertificateRequest` fails. This is the 3rd failed issuance in a row.
2. `certificates-issuing` controller reconciles the failed `CertificateRequest`, bumps the `status.IssuanceAttempts` by 1 as well as updating the `status.LastFailureTime` to the time when `CertificateRequest` failed and setting the `Issuing` condition to false
@@ -152,7 +152,7 @@ Large part of the these examples show what is already the _current_ behaviour, t
(These examples are based on what the statuses already look like after a failed/succeeded issuance. The only change is the `issuanceAttempts` field)
-1. A `Certificate` where issuance has failed 3 times for revision 20:
+1. A `Certificate` where issuance has failed 3 times in a row:
```
Status:
Conditions:
@@ -173,14 +173,14 @@ Status:
RenewalTime:
IssuanceAttempts: 3
LastFailureTime: # Last failed issuance (i.e when a `CertificateRequest` failed)
- Revision: 19 # The latest succeeded revision. Currently we are attempting revision 20
+ Revision: 19
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning IssuanceFailed 4s cert-manager Certificate issuance has failed. It will be retried in 4 hours.
```
-2. A `Certificate` where the latest issuance for revision 20 succeeded and no issuances are being attempted now:
+2. A `Certificate` where the latest issuance succeeded and no issuances are being attempted now:
```
Status:
Conditions:
From 476a22b674c99651a0e1942bbae424565212e194 Mon Sep 17 00:00:00 2001
From: irbekrm
Date: Wed, 26 Jan 2022 16:44:07 +0000
Subject: [PATCH 3/5] Remove reference to event that shows when the next
issuance will be attempted
The event did not seem that useful. Manually triggered (and failed reissuance attempt can result in multiple events on a Certificate each showing different renewal timestamp which seems confusing. Also events are short lived, so not that useful to someone debugging few hours after failure
Signed-off-by: irbekrm
---
design/20220118.certificate-issuance-exponential-backoff.md | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
diff --git a/design/20220118.certificate-issuance-exponential-backoff.md b/design/20220118.certificate-issuance-exponential-backoff.md
index d63a2f8e653..4efa01852d7 100644
--- a/design/20220118.certificate-issuance-exponential-backoff.md
+++ b/design/20220118.certificate-issuance-exponential-backoff.md
@@ -79,7 +79,6 @@ Similarly to [`status.LastFailureTime`](https://github.com/jetstack/cert-manager
`IssuanceAttempts` will be set by [`certificates-issuing` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/issuing) after a failed issuance by either bumping the already existing value by 1 or setting it to 1 (first failure). In case of a succeeded issuance, `certificates-issuing` controller will ensure that `status.IssuanceAttempts` is not set.
The delay till the next issuance will then be calculated by [`certificates-trigger` controller](https://github.com/jetstack/cert-manager/tree/ce1424162ea4f363bdb7aa4f201432ec63da1145/pkg/controller/certificates/trigger) using the formula `if status.LastFailureTime != nil then next_issuance_attempt_time = status.LastFailureTime + time.Hour x 2 ^ (status.IssuanceAttempts- 1)` (binary exponential- so the sequence will be 1h, 2h, 4h, 8h etc). This ensures that the first delay is 1 hour from the last failure time which is the current behaviour. In case of continuous failures, the delay should keep increasing up to a maximum backoff period of 32h, after which it should be retried every 32h whilst the failures persist.
-`certificates-trigger` controller should also throw an event on the `Certificate` with a message that includes the calculated delay period.
### API changes
@@ -175,9 +174,6 @@ Status:
LastFailureTime: # Last failed issuance (i.e when a `CertificateRequest` failed)
Revision: 19
Events:
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Warning IssuanceFailed 4s cert-manager Certificate issuance has failed. It will be retried in 4 hours.
```
2. A `Certificate` where the latest issuance succeeded and no issuances are being attempted now:
@@ -213,7 +209,7 @@ To ensure that `Certificate`s whose issuance is currently failing don't get rene
- 32h is an acceptable maximum delay between issuances
-- In case of exponential backoff being applied, an event on the `Certificate` as well as controller logs will be sufficient for users trying to debug this
+- In case of exponential backoff being applied, controller logs will be sufficient for users trying to debug this and understand when the next issuance will be attempted
## Feature Gate
From ae37a6ecbb960e1fa407e09184253f75e6ce79c5 Mon Sep 17 00:00:00 2001
From: irbekrm
Date: Wed, 26 Jan 2022 16:48:34 +0000
Subject: [PATCH 4/5] Adds a note about denied CertificateRequests
Signed-off-by: irbekrm
---
design/20220118.certificate-issuance-exponential-backoff.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/design/20220118.certificate-issuance-exponential-backoff.md b/design/20220118.certificate-issuance-exponential-backoff.md
index 4efa01852d7..6f55623900c 100644
--- a/design/20220118.certificate-issuance-exponential-backoff.md
+++ b/design/20220118.certificate-issuance-exponential-backoff.md
@@ -211,6 +211,8 @@ To ensure that `Certificate`s whose issuance is currently failing don't get rene
- In case of exponential backoff being applied, controller logs will be sufficient for users trying to debug this and understand when the next issuance will be attempted
+- Applying exponential backoff in cases where issuance fails due to a denied `CertificateRequest` should not be treated differently to other failures (so exponential backoff should be applied). Currently they are treated the same and retried after 1 hour, so this is consistent with the existing behaviour
+
## Feature Gate
The current assumption is that exponential backoff would _not_ be placed behind a feature gate, however this should be considered.
From d7ccd6dd511bca72f90defd0482910e0bef2ff51 Mon Sep 17 00:00:00 2001
From: irbekrm
Date: Mon, 7 Feb 2022 11:00:48 +0000
Subject: [PATCH 5/5] Reducing initial backoff period is a non-goal
Signed-off-by: irbekrm
---
design/20220118.certificate-issuance-exponential-backoff.md | 2 ++
1 file changed, 2 insertions(+)
diff --git a/design/20220118.certificate-issuance-exponential-backoff.md b/design/20220118.certificate-issuance-exponential-backoff.md
index 6f55623900c..aebb8a68907 100644
--- a/design/20220118.certificate-issuance-exponential-backoff.md
+++ b/design/20220118.certificate-issuance-exponential-backoff.md
@@ -58,6 +58,8 @@ Currently failed issuances are retried once an hour without a backoff or time li
### Non-Goals
+- Introduce backoff period that is shorter than the current static 1 hour backoff period to allow for issuance of short lived certs to be retried as that is a separate concern from backing off exponentially and is complex enough to be worked on separately
+
- Make the backoff period configurable as this would add a lot of extra complexity. For context Kubernetes pod crashloopbackoff period is _not_ configurable (although it is a very demanded feature [k/k#57291](https://github.com/kubernetes/kubernetes/issues/57291))
- Make it possible to reset the backoff period (However, it would be possible to force re-issuance to be retried immediately using `cmctl renew` and, if that succeeded, the backoff would be reset)