Monitoring gaps in apiserver extension mechanisms #117167

@shyamjvs

Description

What would you like to be added?

Following up from this discussion - #116420 (comment)

Several components sit in the critical path of the k8s API today: the apiserver (core), authn/authz webhooks, mutating/validating/conversion webhooks, etcd, and extension apiservers (maybe more that I'm missing). Some of them have metrics tracking request/error counts and latencies, but a bunch of them don't seem to. Here's what I found so far (will update as we learn more):

  • Authentication webhook
    • request/error counts (xref)
    • request latencies (xref)
  • Authorization webhook
    • request/error counts (#117211)
    • request latencies (#117211)
  • Admission webhooks (mutating/validating)
    • request/error counts (xref)
    • request latencies (xref)
  • CRD Conversion webhook
    • request/error counts (#118292)
    • request latencies (#118292)
  • Etcd
    • request/error counts (#117222)
    • request latencies (xref)
  • Extension apiserver
    • request/error counts
    • request latencies
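
To make the two metric families in the list above concrete, here is a minimal sketch of wrapping a call to an external component so that every call records a request/error count and a latency observation. It is an in-memory illustration only, not the apiserver's actual metrics code: the `CallMetrics` class, the bucket bounds, and the `authz_webhook` component name are all hypothetical.

```python
import time
from collections import Counter, defaultdict

# Hypothetical latency bucket upper bounds in seconds (not taken from Kubernetes).
BUCKETS = (0.005, 0.025, 0.1, 0.5, 1.0, 2.5, float("inf"))


class CallMetrics:
    """In-memory request/error counters and a latency histogram per component."""

    def __init__(self):
        # (component, result) -> number of calls
        self.requests = Counter()
        # component -> per-bucket observation counts (non-cumulative)
        self.latency = defaultdict(lambda: [0] * len(BUCKETS))

    def observe(self, component, result, seconds):
        self.requests[(component, result)] += 1
        # Record the observation in the first bucket whose bound covers it.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.latency[component][i] += 1
                break

    def instrument(self, component, call, clock=time.monotonic):
        """Run call(), record its outcome and duration, and re-raise any error."""
        start = clock()
        try:
            value = call()
        except Exception:
            self.observe(component, "error", clock() - start)
            raise
        self.observe(component, "success", clock() - start)
        return value


# Usage: wrap a (fake) authorization webhook call.
metrics = CallMetrics()
decision = metrics.instrument("authz_webhook", lambda: "allow")
```

Recording the error case inside the wrapper, before re-raising, means failed calls still contribute to both the error count and the latency distribution.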

Finally, with respect to the apiserver itself, we measure request/error counts and these flavors of latency metrics today:

  • e2e latency (capturing customer experience)
  • SLI-based latency (measuring cloud-provider QoS)

We discussed adding a third flavor of latency metrics that measures only the apiserver "core" latency (not including any external callback/webhook mechanisms) here. We don't yet have consensus that the benefits of doing so outweigh the additional metric churn, so we can revisit this later.
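
As a rough illustration of what a "core" latency would measure, the sketch below subtracts the time spent in external calls from a request's end-to-end duration. The `RequestTiming` type, the component names, and the numbers are hypothetical, and the arithmetic assumes the external calls run sequentially without overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class RequestTiming:
    """Timing for one API request (hypothetical structure for illustration)."""

    total_seconds: float
    # component name -> seconds spent waiting on that external component
    external_seconds: dict = field(default_factory=dict)

    def core_seconds(self):
        # Assumes external calls are sequential and non-overlapping, so their
        # durations can simply be summed and subtracted from the total.
        external = sum(self.external_seconds.values())
        return max(0.0, self.total_seconds - external)


timing = RequestTiming(
    total_seconds=0.5,
    external_seconds={"admission_webhook": 0.25, "etcd": 0.125},
)
core = timing.core_seconds()  # 0.5 - (0.25 + 0.125) = 0.125
```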

Why is this needed?

Metrics at component/dependency level allow us to:

  • track each component's performance in isolation
  • set internal (non-customer-facing) SLOs for teams owning those components
  • narrow down the root-cause for API errors/latencies easily
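
For example, once each component exposes request/error counts, checking them against an internal SLO is a small computation. The sketch below is illustrative only: the counts, the component names, and the 5% threshold are made up.

```python
def error_ratio(success, error):
    """Fraction of calls that failed; 0.0 when there were no calls."""
    total = success + error
    return error / total if total else 0.0


def slo_violations(counts, max_error_ratio):
    """Return names of components whose error ratio exceeds the threshold."""
    return sorted(
        name
        for name, (success, error) in counts.items()
        if error_ratio(success, error) > max_error_ratio
    )


# Hypothetical (success, error) counts per component.
counts = {
    "authn_webhook": (990, 10),
    "conversion_webhook": (900, 100),
    "etcd": (1000, 0),
}
violations = slo_violations(counts, max_error_ratio=0.05)
```

With these numbers only `conversion_webhook` (10% errors) exceeds the threshold, which points directly at the component to investigate.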

/sig api-machinery
/sig auth
/sig scalability
/kind feature
/help

Activity

added on Apr 8, 2023:

  • kind/feature: Categorizes issue or PR as related to a new feature.

k8s-ci-robot (Contributor) commented on Apr 8, 2023

@shyamjvs:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


added on Apr 8, 2023:

  • sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.

added on Apr 8, 2023:

  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

shyamjvs (Member, Author) commented on Apr 8, 2023

cc @wojtek-t @lavalamp

Also please correct me if I misread any of the code.

HirazawaUi (Contributor) commented on Apr 8, 2023

/assign
I think I can work on the Authorization webhook section.

my-git9 (Member) commented on Apr 8, 2023

/assign
I'd like to work on the CRD Conversion webhook section.

Metadata

Labels

  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
  • sig/instrumentation: Categorizes an issue or PR as relevant to SIG Instrumentation.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Projects

  • Status: In Progress

    Monitoring gaps in apiserver extension mechanisms · Issue #117167 · kubernetes/kubernetes