-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Always retry connection reset error in webhook #53947
Always retry connection reset error in webhook #53947
Conversation
b4ebbf4
to
40161ed
Compare
@@ -112,6 +113,11 @@ func WithExponentialBackoff(initialBackoff time.Duration, webhookFn func() error | |||
if _, shouldRetry := apierrors.SuggestsClientDelay(err); shouldRetry { | |||
return false, nil | |||
} | |||
// "Connection reset by peer" is usually a transient error, but it's not | |||
// retried by the rest client because the request is not GET. | |||
if net.IsConnectionReset(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this move up to the first if-clause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move IsConnectionReset and also IsInternalError up. Both are unconditional, as the others. Then for all other errors we call SuggestsClientDelay
. Now in the current shape it looks like the order of the tests matter. But it doesn't.
I feel that this is kind of special case, requiring a separate explanation.
This is not a delay, but an outage
…On Oct 16, 2017 10:07, "Dr. Stefan Schimanski" ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In staging/src/k8s.io/apiserver/pkg/util/webhook/webhook.go
<#53947 (comment)>
:
> @@ -112,6 +113,11 @@ func WithExponentialBackoff(initialBackoff time.Duration, webhookFn func() error
if _, shouldRetry := apierrors.SuggestsClientDelay(err); shouldRetry {
return false, nil
}
+ // "Connection reset by peer" is usually a transient error, but it's not
+ // retried by the rest client because the request is not GET.
+ if net.IsConnectionReset(err) {
shouldn't this move up to the first if-clause?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#53947 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABrNWZaFqGkWOsSuEbHX_29TwS55pp68ks5ssw67gaJpZM4P5xku>
.
|
Signed-off-by: Mik Vyatskov <vmik@google.com>
40161ed
to
59bacba
Compare
@sttts PTAL I removed the |
Which other webhook does this apply to? Admission? Authorization? /cc @liggitt |
All of them, yes. But I think this change makes sense for all webhooks |
/retest |
@@ -104,17 +105,14 @@ func WithExponentialBackoff(initialBackoff time.Duration, webhookFn func() error | |||
var err error | |||
wait.ExponentialBackoff(backoff, func() (bool, error) { | |||
err = webhookFn() | |||
// these errors indicate a need to retry an authentication check | |||
if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you remove IsServerTimeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's already checked in SuggestsClientDelay
. I can return it, if you want the check to be more explicit
// these errors indicate a need to retry an authentication check | ||
if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) { | ||
// these errors indicate a transient error that should be retried. | ||
if net.IsConnectionReset(err) || apierrors.IsInternalError(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't client-go already turn connection errors into internal server errors?
kubernetes/staging/src/k8s.io/client-go/rest/request.go
Lines 652 to 667 in 4a6fec7
if err != nil { | |
// "Connection reset by peer" is usually a transient error. | |
// Thus in case of "GET" operations, we simply retry it. | |
// We are not automatically retrying "write" operations, as | |
// they are not idempotent. | |
if !net.IsConnectionReset(err) || r.verb != "GET" { | |
return err | |
} | |
// For the purpose of retry, we set the artificial "retry-after" response. | |
// TODO: Should we clean the original response if it exists? | |
resp = &http.Response{ | |
StatusCode: http.StatusInternalServerError, | |
Header: http.Header{"Retry-After": []string{"1"}}, | |
Body: ioutil.NopCloser(bytes.NewReader([]byte{})), | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client-go only retries on GET calls because it doesn't want to replay non-idempotent calls. this code should not assume all webhook calls are idempotent, either. seems like it would be better to be able to indicate a request is idempotent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if it's a GET
, but e.g. request with audit logs is POST
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right... this shouldn't assume all webhook calls are idempotent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing audit logs is not an idempotent call, but it's still better to write audit logs twice than loose, esp. considering that audit logs have ids in it, that allow deduplication
seems like it would be better to be able to indicate a request is idempotent
Is there really a need to do so? We still cannot guarantee that a request hasn't been processed if e.g. a timeout is returned. Additionally, for all known to me webhooks (auth[nz], audit) I see no benefit in not retrying connection resets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, if we really care about a request being idempotent in this case, we shouldn't unconditionally retry e.g. 5XX. However, nobody complained about this code before, so I think it's safe to assume that retrying is more important than idempotence, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, if we really care about a request being idempotent in this case, we shouldn't unconditionally retry e.g. 5XX. However, nobody complained about this code before, so I think it's safe to assume that retrying is more important than idempotence, WDYT?
Hmm, that's probably true
At this point, would it be easier to list the cases where we don't want to retry? |
I'm not sure that is the right strategy: IMO it's better to be conservative. I.e. if some new not anticipated problem occurs, it's better to be as vocal about it as possible. For example, backend suddenly starts to reply with 418. It's better to notice this in monitoring and fix earlier, than retry each request hoping backend will stop acting as a teapot. |
I am happy with the current shape, but lack the knowledge about the webhook context to forsee any side-effects. @liggitt any doubt this can go in? |
I think this is ok |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: crassirostris, sttts Associated issue: 52909 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
@crassirostris: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 53958, 53947). If you want to cherry-pick this change to another branch, please follow the instructions here. |
Fixes #52909
Audit logging uses webhook to send events to the backend and currently even a little blip in networking can cause several hundreds of events to be lost. This PR adds an additional check, that is similar to the one in the rest package, but ignores the fact that the request is not GET and always retries "Connection reset by peers" error.