Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporating feedback on 119341 #120087

Merged

Conversation

divyasri537
Copy link
Contributor

@divyasri537 divyasri537 commented Aug 21, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

Incorporating feedback on PR #119341

Which issue(s) this PR fixes:

Fixes #119341

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Incorporating feedback on PR #119341

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 21, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @divyasri537. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Aug 21, 2023
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 21, 2023
@divyasri537 divyasri537 force-pushed the incorporating-feedback-PR-119341 branch from a415565 to e250232 Compare August 21, 2023 15:52
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 21, 2023
@divyasri537
Copy link
Contributor Author

/assign @liggitt

@@ -169,8 +169,7 @@ func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attrib
switch err := err.(type) {
case *webhookutil.ErrCallingWebhook:
if ctx.Err() == context.Canceled {
klog.Warningf("Context Canceled when calling webhook %v", hook.Name)
return err
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to break here either, we probably want to flow through to the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added break statement here, so the metric counter in next steps wont pick the canceled context.

@@ -169,8 +169,7 @@ func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attrib
switch err := err.(type) {
case *webhookutil.ErrCallingWebhook:
if ctx.Err() == context.Canceled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you look into checking err instead of ctx.Err?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I checked and added the comments in the old PR #119341

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The err that is from callAttrMutatingHook method [ref] based on the provided ctx . The err.Reason returns customized error response that picks up the ctx error so I don’t think anything wrong by checking this way ctx.Err() == context.Canceled as it is similar to checking the err obj.

Also The err.Reason returns
%!(EXTRA string=failed to call webhook: Post "https://127.0.0.1:21375/validate?timeout=2s": context canceled) which is a customized error and may break if someone modifies the error message so I think its better to go with checking ctx.Err().

@@ -174,8 +174,7 @@ func (d *validatingDispatcher) Dispatch(ctx context.Context, attr admission.Attr
switch err := err.(type) {
case *webhookutil.ErrCallingWebhook:
if ctx.Err() == context.Canceled {
klog.Warningf("Context Canceled when calling webhook %v", hook.Name)
return
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing with break here

@ivelichkovich
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 21, 2023
@divyasri537
Copy link
Contributor Author

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 21, 2023
if !ignoreClientCallFailures {
rejected = true
admissionmetrics.Metrics.ObserveWebhookRejection(ctx, hook.Name, "validating", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionCallingWebhookError, int(err.Status.ErrStatus.Code))
if !errors.Is(err.Reason, context.Canceled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here add comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment, it doesn't only apply to failopen metrics

klog.Warningf("Context Canceled when calling webhook %v", hook.Name)
errCh <- apierrors.NewInternalError(err)
return
}
if ignoreClientCallFailures {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so here just doing

errCh <- apierrors.NewInternalError(err)
return

is not necessarily the same behavior. The if ignoreClientCallFailures block uses utilruntime.HandlerError while the other block does the errCh <- apierrors.NewInternalError(err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets see what liggitt thinks here.

Copy link
Contributor

@ivelichkovich ivelichkovich Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably makes sense to always do errCh <- apierrors.NewInternalError(err) for context cancelled but will defer to @liggitt

@@ -202,6 +201,10 @@ func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attrib
}

if callErr, ok := err.(*webhookutil.ErrCallingWebhook); ok {
if errors.Is(callErr.Reason, context.Canceled) {
klog.Warningf("Context Canceled when calling webhook %v", hook.Name)
return apierrors.NewInternalError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, not necessarily the same behavior. The blocks for ignoreClientCallFailures and outside that block handle the error differently

Copy link
Contributor Author

@divyasri537 divyasri537 Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utilruntime.HandleError(callErr) returns "failed calling webhook "eks-aws-auth-configmap-validation-webhook.amazonaws.com": failed to call webhook: Post "https://127.0.0.1:21375/mutate?timeout=2s": context canceled" as we wanted to put customized log msg for failopen context canceled so keeping this change and return apierrors.NewInternalError(err).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So @liggitt I think if context is cancelled it makes sense to just return this error. Previous behavior was it would continue if ignoreClientCallFailures { but we can probably just return and bail out

@alexzielenski
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 22, 2023
@@ -169,6 +169,7 @@ func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attrib
if err != nil {
switch err := err.(type) {
case *webhookutil.ErrCallingWebhook:
// Ignore context canceled from failopen metric
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't only apply to failopen metric

rejected = true
admissionmetrics.Metrics.ObserveWebhookRejection(ctx, hook.Name, "admit", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionCallingWebhookError, int(err.Status.ErrStatus.Code))
}
admissionmetrics.Metrics.ObserveWebhook(ctx, hook.Name, time.Since(t), rejected, versionedAttr.Attributes, "admit", int(err.Status.ErrStatus.Code))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skipping the webhook rejection metric for cancelled requests seems ok, but do we really want to lose visibility to the fact that the webhook was called at all and the latency we hit for canceled requests?

I was hoping we could revert back to the state at 5e5b302 and narrowly skip the ObserveWebhookRejection and ObserveWebhookFailOpen/addFailedOpenAnnotation bits when the reason was a context cancel

something like this:

$ git diff -w 5e5b3029f3bbfc93c3569f07ad300a5c6057fc58 HEAD -- staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/dispatcher.go
diff --git a/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/dispatcher.go b/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/dispatcher.go
index c1d1ca6ff6b..61b17932df1 100644
--- a/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/dispatcher.go
+++ b/staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/dispatcher.go
@@ -20,6 +20,7 @@ package mutating
 
 import (
        "context"
+       "errors"
        "fmt"
        "time"
 
@@ -164,14 +165,18 @@ func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attrib
                annotator := newWebhookAnnotator(versionedAttr, round, i, hook.Name, invocation.Webhook.GetConfigurationName())
                changed, err := a.callAttrMutatingHook(ctx, hook, invocation, versionedAttr, annotator, o, round, i)
                ignoreClientCallFailures := hook.FailurePolicy != nil && *hook.FailurePolicy == admissionregistrationv1.Ignore
+
                rejected := false
                if err != nil {
                        switch err := err.(type) {
                        case *webhookutil.ErrCallingWebhook:
                                if !ignoreClientCallFailures {
                                        rejected = true
+                                       // Ignore context cancelled from webhook metrics
+                                       if !errors.Is(err.Reason, context.Canceled) {
                                                admissionmetrics.Metrics.ObserveWebhookRejection(ctx, hook.Name, "admit", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionCallingWebhookError, int(err.Status.ErrStatus.Code))
                                        }
+                               }
                                admissionmetrics.Metrics.ObserveWebhook(ctx, hook.Name, time.Since(t), rejected, versionedAttr.Attributes, "admit", int(err.Status.ErrStatus.Code))
                        case *webhookutil.ErrWebhookRejection:
                                rejected = true
@@ -199,9 +204,14 @@ func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attrib
 
                if callErr, ok := err.(*webhookutil.ErrCallingWebhook); ok {
                        if ignoreClientCallFailures {
+                               // Ignore context cancelled from webhook metrics
+                               if errors.Is(callErr.Reason, context.Canceled) {
+                                       klog.Warningf("Context canceled when calling webhook %v", hook.Name)
+                               } else {
                                        klog.Warningf("Failed calling webhook, failing open %v: %v", hook.Name, callErr)
                                        admissionmetrics.Metrics.ObserveWebhookFailOpen(ctx, hook.Name, "admit")
                                        annotator.addFailedOpenAnnotation()
+                               }
 
                                utilruntime.HandleError(callErr)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same comments apply to the validating dispatcher)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some feedback though that we wanted the math to add up in the metrics correctly. #119341 (comment) so that total = success + failure. So we were looking at skipping the metrics or adding a new metric that counts cancelled. #119341 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we're going to be backporting this, we should err on the side of fewer changes, imo. narrowly skipping the misleading rejection/failopen bits for cancelled requests and changing ~nothing else w.r.t. return values makes sense for a backport

if we want to add a new metric or label or label value to distinguish cancellation as a follow up in 1.29+ only, that also seems ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense

@divyasri537 divyasri537 force-pushed the incorporating-feedback-PR-119341 branch from 27fa66e to e9459e3 Compare August 23, 2023 19:17
@ivelichkovich
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 23, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3cf1a9f2fd98141cbdeb8027e961d9700c3666ac

@liggitt
Copy link
Member

liggitt commented Aug 23, 2023

/lgtm
/approve

@liggitt
Copy link
Member

liggitt commented Aug 23, 2023

/hold

please squash down to a single commit before merge

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: divyasri537, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Aug 23, 2023
@divyasri537 divyasri537 force-pushed the incorporating-feedback-PR-119341 branch from e9459e3 to 67d35ce Compare August 23, 2023 19:51
@divyasri537
Copy link
Contributor Author

/hold

please squash down to a single commit before merge

@liggitt Squashed to one commit.

@divyasri537 divyasri537 changed the title Incorporating feedback on PR #119341 Incorporating feedback on 119341 Aug 23, 2023
@divyasri537 divyasri537 force-pushed the incorporating-feedback-PR-119341 branch from 67d35ce to 24877f9 Compare August 23, 2023 20:24
@liggitt
Copy link
Member

liggitt commented Aug 23, 2023

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 23, 2023
@divyasri537
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 8290c4c into kubernetes:master Aug 23, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Aug 23, 2023
k8s-ci-robot added a commit that referenced this pull request Aug 25, 2023
…#120087-upstream-release-1.28

Automated cherry pick of #120087: Incorporating feedback on 119341
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants