
Estimate width of the request based on watchers count in P&F #103539

Merged: 2 commits merged into kubernetes:master from pf_estimate_watch_width on Oct 12, 2021

Conversation

wojtek-t (Member) commented Jul 7, 2021:

/kind feature
/priority important-soon
/sig api-machinery

NONE

/assign @tkashem

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 7, 2021
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jul 7, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from 3031130 to aff7c62 on July 7, 2021 at 10:55
@k8s-ci-robot added the needs-rebase label on Jul 7, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from aff7c62 to 9183d04 on July 8, 2021 at 06:17
@k8s-ci-robot added the size/L label and removed the size/XL and needs-rebase labels on Jul 8, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from 9183d04 to 1686697 on July 8, 2021 at 07:19
@k8s-ci-robot added the size/XL label and removed the size/L label on Jul 8, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch 4 times, most recently from 57ef6ba to 41f609d, on July 8, 2021 at 10:14
@wojtek-t changed the title from "[WIP] Estimate width of the request based on watchers count in P&F" to "Estimate width of the request based on watchers count in P&F" on Jul 8, 2021
@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Oct 12, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from 7487bbf to c5a77d8 on October 12, 2021 at 07:54
@k8s-ci-robot removed the lgtm label on Oct 12, 2021
wojtek-t (Member, Author) left a comment:

@MikeSpreitzer - thanks for the review.
I applied your comments and I'm lifting the hold, but it now needs re-lgtm.

/hold cancel

Comment on lines 74 to 85
finalSeats := uint(math.Ceil(float64(watchCount) / watchesPerSeat))
additionalLatency := eventAdditionalDuration

// TODO: Remove this condition after we tune the algorithm better.
// Technically, there is an overhead connected to processing an event after
// the request finishes even if there is a small number of watches.
// However, until we tune the estimation we want to stay on the safe side
// and avoid introducing additional latency for almost every single request.
if watchCount < watchesPerSeat {
	finalSeats = 0
	additionalLatency = time.Duration(0)
}
wojtek-t (Member, Author) replied:

done
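For illustration, a minimal standalone sketch of the estimator logic in the snippet quoted above. The constant values used here (watchesPerSeat = 10, eventAdditionalDuration = 5ms) are invented for the example and are not the values chosen in this PR:

package main

import (
	"fmt"
	"math"
	"time"
)

// Hypothetical tuning constants, not the ones used in the PR.
const (
	watchesPerSeat          = 10.0
	eventAdditionalDuration = 5 * time.Millisecond
)

// estimateFinalWork mirrors the snippet above: the number of final seats
// scales with the number of watchers that must be notified, and a request
// fanning out to fewer watchers than one seat's worth gets no extra seats
// or additional latency.
func estimateFinalWork(watchCount int) (finalSeats uint, additionalLatency time.Duration) {
	finalSeats = uint(math.Ceil(float64(watchCount) / watchesPerSeat))
	additionalLatency = eventAdditionalDuration
	if float64(watchCount) < watchesPerSeat {
		finalSeats = 0
		additionalLatency = 0
	}
	return finalSeats, additionalLatency
}

func main() {
	for _, n := range []int{3, 10, 25, 400} {
		seats, latency := estimateFinalWork(n)
		fmt.Printf("watchCount=%d -> finalSeats=%d additionalLatency=%v\n", n, seats, latency)
	}
}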

@@ -125,13 +126,18 @@ func (qs *queueSet) completeWorkEstimate(we *fcrequest.WorkEstimate) completedWorkEstimate
	return completedWorkEstimate{
		WorkEstimate: *we,
		totalWork:    qs.computeTotalWork(we),
		finalWork:    qs.computeFinalWork(we),
wojtek-t (Member, Author) replied:

done

@k8s-ci-robot removed the do-not-merge/hold label on Oct 12, 2021
	return SeatsTimesDuration(float64(we.InitialSeats), qs.estimatedServiceDuration)
}

func (qs *queueSet) computeFinalWork(we *fcrequest.WorkEstimate) SeatSeconds {
MikeSpreitzer (Member) commented:

This could be a method of fcrequest.WorkEstimate.
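A rough sketch of that suggestion, assuming the final-phase work is just the estimate's FinalSeats held for its AdditionalLatency; the body of computeFinalWork is not shown in this hunk, so the field names and formula here are assumptions:

// FinalWork computes the final-phase work from the estimate's own fields,
// so it would no longer need to be a queueSet method. Sketch only; it also
// assumes SeatsTimesDuration and SeatSeconds are visible from this package.
func (we *WorkEstimate) FinalWork() SeatSeconds {
	return SeatsTimesDuration(float64(we.FinalSeats), we.AdditionalLatency)
}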

MikeSpreitzer (Member) left a comment:

/LGTM

@k8s-ci-robot added the lgtm label on Oct 12, 2021
@k8s-ci-robot merged commit af2ed25 into kubernetes:master on Oct 12, 2021
@k8s-ci-robot added this to the v1.23 milestone on Oct 12, 2021
MikeSpreitzer (Member) commented Oct 13, 2021:

I am looking at https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1448021022367289344 but am not familiar with how these tests work.

Also, BTW, why does the kube-apiserver log have two different forms of httplog lines for watches? One example is

I1012 21:07:32.046122      11 httplog.go:124] "HTTP" verb="GET" URI="/api/v1/namespaces/test-q781vv-1/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Daccount-53-11&resourceVersion=351459&timeout=7m13s&timeoutSeconds=433&watch=true" latency="1.721721704s" userAgent="kubemark/v1.23.0 (linux/amd64) kubernetes/af2ed25" audit-ID="efb7b8c4-d43b-477c-80e8-fd009ef4aabb" srcIP="10.64.5.26:35890" apf_pl="exempt" apf_fs="exempt" apf_fd="" apf_init_latency="296.835µs" resp=200

and another example is

I1012 21:07:31.983202      11 httplog.go:124] "HTTP" verb="WATCH" URI="/apis/storage.k8s.io/v1/csidrivers?allowWatchBookmarks=true&resourceVersion=327520&timeout=5m29s&timeoutSeconds=329&watch=true" latency="5m29.000606811s" userAgent="kubemark/v1.23.0 (linux/amd64) kubernetes/af2ed25" audit-ID="6aaa6416-4309-4097-a9c9-b6158845ba41" srcIP="10.64.7.46:55858" apf_pl="exempt" apf_fs="exempt" apf_fd="" apf_init_latency="303.958µs" resp=200

?

MikeSpreitzer (Member) commented:

The complaint seems to be (with added whitespace for readability):

[measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheusSimple error: top latency metric: there should be no high-latency requests, but: [
got: &{Resource:pods Subresource: Verb:POST Scope:resource Latency:perc50: 175.750111ms,
           perc90: 1.243873873s, perc99: 3.048313253s Count:18570 SlowCount:2401};
expected perc99 <= 1s

got: &{Resource:pods Subresource:binding Verb:POST Scope:resource Latency:perc50: 179.31098ms,
           perc90: 738.232044ms, perc99: 2.895361216s Count:18578 SlowCount:1343};
expected perc99 <= 1s

got: &{Resource:subjectaccessreviews Subresource: Verb:POST Scope:resource Latency:perc50: 35.03937ms,
          perc90: 144.285714ms, perc99: 1.554999999s Count:179 SlowCount:3};
expected perc99 <= 1s

got: &{Resource:events Subresource: Verb:PATCH Scope:resource Latency:perc50: 45.394736ms,
          perc90: 184.264705ms, perc99: 1.2575s Count:901 SlowCount:10};
expected perc99 <= 1s]]

I looked in the kube-apiserver log, and indeed there are some 3+ second latencies for POST to pods, almost entirely due to the APF filter. Examples:

I1012 20:42:18.811476 11 httplog.go:124] "HTTP" verb="POST" URI="/api/v1/namespaces/test-dm2gcw-5/pods" latency="3.019339532s" userAgent="kube-controller-manager/v1.23.0 (linux/amd64) kubernetes/af2ed25/system:serviceaccount:kube-system:job-controller" audit-ID="26ece8e6-7844-493b-a7fd-9a2d8bf29908" srcIP="[::1]:46178" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_fd="test-dm2gcw-5" fl_priorityandfairness="3.017415416s" resp=201

I1012 20:42:18.818386 11 httplog.go:124] "HTTP" verb="POST" URI="/api/v1/namespaces/test-dm2gcw-5/pods" latency="3.01605656s" userAgent="kube-controller-manager/v1.23.0 (linux/amd64) kubernetes/af2ed25/system:serviceaccount:kube-system:job-controller" audit-ID="97a9b2a1-9bb3-461f-9003-55316faa470d" srcIP="[::1]:46178" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_fd="test-dm2gcw-5" fl_priorityandfairness="3.014098725s" resp=201

MikeSpreitzer (Member) commented:

Looking at https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1448021022367289344/artifacts/MetricsForE2E_load_2021-10-12T21:02:06Z.json , I see

      {
        "metric": {
          "__name__": "apiserver_flowcontrol_priority_level_request_count_watermarks_bucket",
          "le": "0.5",
          "mark": "high",
          "phase": "executing",
          "priority_level": "workload-high"
        },
        "value": [
          0,
          "1722454"
        ]
      },

and all the higher buckets have the same value --- i.e., all the observations were less than or equal to 0.5. In other words, the workload-high priority level never used more than half of its allocated concurrency according to this metric. Now, to double-check what that metric really reports nowadays...

MikeSpreitzer (Member) commented:

Looking at the apiserver_flowcontrol_request_wait_duration_seconds_bucket metric for execute=true, flow_schema=kube-system-service-accounts, priority_level=workload-high, I see 128615 <= 1s, 130489 <= 2s, and 131226 <= 5s (and it tops out there).

wojtek-t (Member, Author) commented:

Yeah - I think it's the culprit.
Let me disable it by dropping the main part of the PR, and we will work on re-enabling it independently.

MikeSpreitzer (Member) commented:

How about just tweaking up the constants?
I am finding lots of things to improve in the metrics.

wojtek-t (Member, Author) commented:

> How about just tweaking up the constants?

We should do that, but it's more important to fix scalability tests and tweaking will take a bit of time.

I opened #105647 to disable that logic.

MikeSpreitzer (Member) commented:

I approved that PR.

I checked the apiserver_flowcontrol_priority_level_request_count_watermarks histogram and it accumulates observations of the fraction (number of executing requests) / (concurrency limit) --- which we know is now not so meaningful, dividing a number of requests by a number of seats.

MikeSpreitzer (Member) commented:

If we replace that histogram with one that accumulates observations of (number of seats occupied) / (concurrency limit) then we remove the reason I agreed to sink extra latency into QueueSet rather than just do it in the filter --- namely, the ability to have metrics that distinguish the two phases of execution. But it's more complicated, now that the extra latency can occupy more seats. BTW, I wonder if we goofed up something basic by making the virtual world track a different amount of work than what happens in the real world --- (total duration) times (seats occupied).

wojtek-t (Member, Author) commented:

> I checked the apiserver_flowcontrol_priority_level_request_count_watermarks histogram and it accumulates observations of the fraction (number of executing requests) / (concurrency limit) --- which we know is now not so meaningful, dividing a number of requests by a number of seats.

@MikeSpreitzer - do we believe this metric was useful even before? I'm not entirely sure how I would be supposed to use this metric.

> I agreed to sink extra latency into QueueSet rather than just do it in the filter --- namely, the ability to have metrics that distinguish the two phases of execution.

Thinking about that - I'm wondering what metrics would be useful. Things that I might be interested in:

  • in the last N minutes - how much time (per priority-level) was consumed by which phase
  • percentage of time (per priority-level) when some requests are waiting even though we're not (at least theoretically) occupying all seats with that PL

WDYT?

> BTW, I wonder if we goofed up something basic by making the virtual world track a different amount of work than what happens in the real world --- (total duration) times (seats occupied).

Can you clarify? In the real world we are doing both the "initial work" and the "final work". We're accommodating for that in the virtual world too (i.e. the r(t) function is progressing according to the sum, right?).

MikeSpreitzer (Member) commented:

Before generalizing width, the apiserver_flowcontrol_priority_level_request_count_watermarks histogram showed the extrema of occupancy. Observations of 1 showed that all seats were occupied. The apiserver_flowcontrol_priority_level_request_count_samples histogram showed not necessarily the extrema but a survey (evenly weighted over time). The watermarks histogram adds extrema that were missed in the sampling. I think they are meaningful even today; we just have to understand the denominator in the quotients. What we learned in this case is that the number of executing requests got as high as half of the concurrency limit.

MikeSpreitzer (Member) commented:

I favor adding metrics like the apiserver_flowcontrol_priority_level_request_count samples and watermarks histograms but for seats instead of requests. At least for the executing phase. We do not have a configured limit on seats in a queue, so there is no denominator to use for that.

I favor also adding metrics about the two-phase structure. One direct metric of interest would be a histogram of observations of the dead space fraction of each request --- that is, (maxSeats - initialSeats) * initialDuration / (maxSeats * totalDuration).
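For example (numbers invented purely for illustration): with initialSeats = 1, finalSeats = maxSeats = 10, initialDuration = 100ms and finalDuration = 5ms (so totalDuration = 105ms), that fraction is (10 - 1) * 100 / (10 * 105) ≈ 0.86, i.e. roughly 86% of the reserved seat-time is spent holding seats only in anticipation of the larger final width.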

We have the apiserver_flowcontrol_request_execution_seconds histogram already. That means there is a counter metric apiserver_flowcontrol_request_execution_seconds_sum that accumulates the sum of these times. I do not recall exactly which time is being observed right now, but we probably should split it by a label with values "initial" and "final". In PromQL, we can take the difference over the last N minutes and compare those differences to see the relative magnitude of the two phases.

We already have the apiserver_flowcontrol_request_wait_duration_seconds histogram showing the distribution of waiting times. Waiting only happens because of seat occupancy. Adding the dead space fraction histogram would describe that seat occupancy, but only independently. With what I found in the testgrid infrastructure, namely just one metrics scrape at the end, that leaves correlation questions unanswered. But in real life there is no "end"; operators have to monitor on-line using windowed queries. That can show correlation, coarsely.

If we wanted the fine-grained correlation you mention above, we would have to add tracking in queueSet of how many seats are occupied at the moment only in anticipation of the larger later width. And I am not sure that what we want to measure is simply the boolean condition of whether there is at least one request waiting due to anticipatory seat occupancy. I think it matters how many there are. And what we are really trying to get at is latency, so we may want to think harder.

But more fundamentally, I think we may have made a wrong turn by accounting work differently in the real and virtual worlds. In the virtual world, the code currently says that the work in a request is initialSeats * initialDuration + finalSeats * finalDuration. But in the real world, max(initialSeats, finalSeats) are occupied for all of initialDuration + finalDuration. In the original fair queuing technique, and in our work until we added the second phase of execution, the real and virtual worlds accounted the same amount of work to each request. The difference was just that in the virtual world the executions are interleaved in a hyper-fine way.
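To make that difference concrete, a small illustrative calculation in Go; the numbers are invented and reuse the example above:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Invented example values.
	initialSeats, finalSeats := 1.0, 10.0
	initialDuration := 100 * time.Millisecond
	finalDuration := 5 * time.Millisecond

	// Virtual-world accounting as described above:
	// initialSeats*initialDuration + finalSeats*finalDuration.
	virtualWork := initialSeats*initialDuration.Seconds() + finalSeats*finalDuration.Seconds()

	// Real-world occupancy: max(initialSeats, finalSeats) seats are held
	// for the whole of initialDuration + finalDuration.
	maxSeats := initialSeats
	if finalSeats > maxSeats {
		maxSeats = finalSeats
	}
	realWork := maxSeats * (initialDuration + finalDuration).Seconds()

	// Prints: virtual work = 0.150 seat-seconds, real occupancy = 1.050 seat-seconds
	fmt.Printf("virtual work = %.3f seat-seconds, real occupancy = %.3f seat-seconds\n",
		virtualWork, realWork)
}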

@wojtek-t mentioned this pull request on Oct 14, 2021
wojtek-t (Member, Author) commented:

@MikeSpreitzer - let's move the discussion to: #105683

Labels: approved, area/apiserver, cncf-cla: yes, kind/feature, lgtm, priority/important-soon, release-note-none, sig/api-machinery, size/L, triage/accepted
Milestone: v1.23
Participants: 9