Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always retry connection reset error in webhook #53947

Merged

Conversation

crassirostris
Copy link

@crassirostris crassirostris commented Oct 15, 2017

Fixes #52909

Audit logging uses webhook to send events to the backend and currently even a little blip in networking can cause several hundreds of events to be lost. This PR adds an additional check, that is similar to the one in the rest package, but ignores the fact that the request is not GET and always retries "Connection reset by peers" error.

Webhook always retries connection reset error.

@crassirostris crassirostris added area/audit kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Oct 15, 2017
@crassirostris crassirostris added this to the v1.9 milestone Oct 15, 2017
@crassirostris crassirostris requested a review from sttts October 15, 2017 14:58
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 15, 2017
@crassirostris crassirostris force-pushed the retry-webhook-net-errors branch from b4ebbf4 to 40161ed Compare October 15, 2017 14:59
@@ -112,6 +113,11 @@ func WithExponentialBackoff(initialBackoff time.Duration, webhookFn func() error
if _, shouldRetry := apierrors.SuggestsClientDelay(err); shouldRetry {
return false, nil
}
// "Connection reset by peer" is usually a transient error, but it's not
// retried by the rest client because the request is not GET.
if net.IsConnectionReset(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this move up to the first if-clause?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move IsConnectionReset and also IsInternalError up. Both are unconditional, as the others. Then for all other errors we call SuggestsClientDelay. Now in the current shape it looks like the order of the tests matter. But it doesn't.

@crassirostris
Copy link
Author

crassirostris commented Oct 16, 2017 via email

Signed-off-by: Mik Vyatskov <vmik@google.com>
@crassirostris crassirostris force-pushed the retry-webhook-net-errors branch from 40161ed to 59bacba Compare October 16, 2017 10:26
@crassirostris
Copy link
Author

@sttts PTAL

I removed the IsServerTimeout check, b/c it's included in the SuggestsClientDelay function

@sttts
Copy link
Contributor

sttts commented Oct 16, 2017

Which other webhook does this apply to? Admission? Authorization?

/cc @liggitt

@k8s-ci-robot k8s-ci-robot requested a review from liggitt October 16, 2017 10:41
@crassirostris
Copy link
Author

crassirostris commented Oct 16, 2017

Which other webhook does this apply to? Admission? Authorization?

All of them, yes. But I think this change makes sense for all webhooks

@crassirostris
Copy link
Author

/retest

@@ -104,17 +105,14 @@ func WithExponentialBackoff(initialBackoff time.Duration, webhookFn func() error
var err error
wait.ExponentialBackoff(backoff, func() (bool, error) {
err = webhookFn()
// these errors indicate a need to retry an authentication check
if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove IsServerTimeout?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's already checked in SuggestsClientDelay. I can return it, if you want the check to be more explicit

// these errors indicate a need to retry an authentication check
if apierrors.IsServerTimeout(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) {
// these errors indicate a transient error that should be retried.
if net.IsConnectionReset(err) || apierrors.IsInternalError(err) || apierrors.IsTimeout(err) || apierrors.IsTooManyRequests(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't client-go already turn connection errors into internal server errors?

if err != nil {
// "Connection reset by peer" is usually a transient error.
// Thus in case of "GET" operations, we simply retry it.
// We are not automatically retrying "write" operations, as
// they are not idempotent.
if !net.IsConnectionReset(err) || r.verb != "GET" {
return err
}
// For the purpose of retry, we set the artificial "retry-after" response.
// TODO: Should we clean the original response if it exists?
resp = &http.Response{
StatusCode: http.StatusInternalServerError,
Header: http.Header{"Retry-After": []string{"1"}},
Body: ioutil.NopCloser(bytes.NewReader([]byte{})),
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client-go only retries on GET calls because it doesn't want to replay non-idempotent calls. this code should not assume all webhook calls are idempotent, either. seems like it would be better to be able to indicate a request is idempotent

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if it's a GET, but e.g. request with audit logs is POST

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right... this shouldn't assume all webhook calls are idempotent

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing audit logs is not an idempotent call, but it's still better to write audit logs twice than loose, esp. considering that audit logs have ids in it, that allow deduplication

seems like it would be better to be able to indicate a request is idempotent

Is there really a need to do so? We still cannot guarantee that a request hasn't been processed if e.g. a timeout is returned. Additionally, for all known to me webhooks (auth[nz], audit) I see no benefit in not retrying connection resets

Copy link
Author

@crassirostris crassirostris Oct 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if we really care about a request being idempotent in this case, we shouldn't unconditionally retry e.g. 5XX. However, nobody complained about this code before, so I think it's safe to assume that retrying is more important than idempotence, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, if we really care about a request being idempotent in this case, we shouldn't unconditionally retry e.g. 5XX. However, nobody complained about this code before, so I think it's safe to assume that retrying is more important than idempotence, WDYT?

Hmm, that's probably true

@crassirostris
Copy link
Author

@sttts @liggitt Do you want me to change something?

@ericchiang
Copy link
Contributor

ericchiang commented Oct 16, 2017

At this point, would it be easier to list the cases where we don't want to retry?

@crassirostris
Copy link
Author

@ericchiang

At this point, would it be easier to list the cases where we don't want to retry?

I'm not sure that is the right strategy: IMO it's better to be conservative. I.e. if some new not anticipated problem occurs, it's better to be as vocal about it as possible. For example, backend suddenly starts to reply with 418. It's better to notice this in monitoring and fix earlier, than retry each request hoping backend will stop acting as a teapot.

@sttts
Copy link
Contributor

sttts commented Oct 17, 2017

I am happy with the current shape, but lack the knowledge about the webhook context to forsee any side-effects. @liggitt any doubt this can go in?

@liggitt
Copy link
Member

liggitt commented Oct 17, 2017

I think this is ok

@sttts
Copy link
Contributor

sttts commented Oct 18, 2017

/lgtm
/approve no-issue

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 18, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crassirostris, sttts

Associated issue: 52909

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2017
@crassirostris crassirostris reopened this Oct 18, 2017
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Oct 18, 2017

@crassirostris: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-gpu 59bacba link /test pull-kubernetes-e2e-gce-gpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 53958, 53947). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 14a1a15 into kubernetes:master Oct 18, 2017
k8s-github-robot pushed a commit that referenced this pull request Oct 19, 2017
…-#53947-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #53947

Cherry pick of #53947 on release-1.8.

#53947: Always retry network connection error in webhook

```release-note
Webhook always retries connection reset error.
```
@erictune erictune added area/admission-control and removed area/admission-control approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 26, 2017
@akotlar akotlar mentioned this pull request Dec 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/admission-control area/audit cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants