
Estimate width of the request based on watchers count in P&F #103539

Merged: 2 commits merged into kubernetes:master from pf_estimate_watch_width on Oct 12, 2021

Conversation

wojtek-t (Member) commented Jul 7, 2021:

/kind feature
/priority important-soon
/sig api-machinery

NONE

/assign @tkashem

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 7, 2021
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jul 7, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from 3031130 to aff7c62 on July 7, 2021 at 10:55
@k8s-ci-robot added the needs-rebase label on Jul 7, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from aff7c62 to 9183d04 on July 8, 2021 at 06:17
@k8s-ci-robot added the size/L label and removed the size/XL and needs-rebase labels on Jul 8, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from 9183d04 to 1686697 on July 8, 2021 at 07:19
@k8s-ci-robot added the size/XL label and removed the size/L label on Jul 8, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch 4 times, most recently from 57ef6ba to 41f609d, on July 8, 2021 at 10:14
@wojtek-t changed the title from "[WIP] Estimate width of the request based on watchers count in P&F" to "Estimate width of the request based on watchers count in P&F" on Jul 8, 2021
@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Oct 12, 2021
@wojtek-t force-pushed the pf_estimate_watch_width branch from 7487bbf to c5a77d8 on October 12, 2021 at 07:54
@k8s-ci-robot removed the lgtm label on Oct 12, 2021
wojtek-t (Member, Author) left a comment:

@MikeSpreitzer - thanks for the review.
I applied your comments and I'm lifting the hold, but it now needs re-lgtm.

/hold cancel

Comment on lines 74 to 85
finalSeats := uint(math.Ceil(float64(watchCount) / watchesPerSeat))
additionalLatency := eventAdditionalDuration

// TODO: Remove this condition after we tune the algorithm better.
// Technically, there is an overhead connected to processing an event after
// the request finishes even if there is a small number of watches.
// However, until we tune the estimation we want to stay on the safe side
// and avoid introducing additional latency for almost every single request.
if watchCount < watchesPerSeat {
	finalSeats = 0
	additionalLatency = time.Duration(0)
}
wojtek-t (Member, Author) replied:

done
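For illustration, a minimal standalone sketch of the estimator logic in the snippet quoted above. The constant values used here (watchesPerSeat = 10, eventAdditionalDuration = 5ms) are invented for the example and are not the values chosen in this PR:

package main

import (
	"fmt"
	"math"
	"time"
)

// Hypothetical tuning constants, not the ones used in the PR.
const (
	watchesPerSeat          = 10.0
	eventAdditionalDuration = 5 * time.Millisecond
)

// estimateFinalWork mirrors the snippet above: the number of final seats
// scales with the number of watchers that must be notified, and a request
// fanning out to fewer watchers than one seat's worth gets no extra seats
// or additional latency.
func estimateFinalWork(watchCount int) (finalSeats uint, additionalLatency time.Duration) {
	finalSeats = uint(math.Ceil(float64(watchCount) / watchesPerSeat))
	additionalLatency = eventAdditionalDuration
	if float64(watchCount) < watchesPerSeat {
		finalSeats = 0
		additionalLatency = 0
	}
	return finalSeats, additionalLatency
}

func main() {
	for _, n := range []int{3, 10, 25, 400} {
		seats, latency := estimateFinalWork(n)
		fmt.Printf("watchCount=%d -> finalSeats=%d additionalLatency=%v\n", n, seats, latency)
	}
}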

@@ -125,13 +126,18 @@ func (qs *queueSet) completeWorkEstimate(we *fcrequest.WorkEstimate) completedWorkEstimate
	return completedWorkEstimate{
		WorkEstimate: *we,
		totalWork:    qs.computeTotalWork(we),
		finalWork:    qs.computeFinalWork(we),
wojtek-t (Member, Author) replied:

done

@k8s-ci-robot removed the do-not-merge/hold label on Oct 12, 2021
	return SeatsTimesDuration(float64(we.InitialSeats), qs.estimatedServiceDuration)
}

func (qs *queueSet) computeFinalWork(we *fcrequest.WorkEstimate) SeatSeconds {
MikeSpreitzer (Member) commented:

This could be a method of fcrequest.WorkEstimate.
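A rough sketch of that suggestion, assuming the final-phase work is just the estimate's FinalSeats held for its AdditionalLatency; the body of computeFinalWork is not shown in this hunk, so the field names and formula here are assumptions:

// FinalWork computes the final-phase work from the estimate's own fields,
// so it would no longer need to be a queueSet method. Sketch only; it also
// assumes SeatsTimesDuration and SeatSeconds are visible from this package.
func (we *WorkEstimate) FinalWork() SeatSeconds {
	return SeatsTimesDuration(float64(we.FinalSeats), we.AdditionalLatency)
}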

MikeSpreitzer (Member) left a comment:

/LGTM

@k8s-ci-robot added the lgtm label on Oct 12, 2021
@k8s-ci-robot merged commit af2ed25 into kubernetes:master on Oct 12, 2021
@k8s-ci-robot added this to the v1.23 milestone on Oct 12, 2021
MikeSpreitzer (Member) commented Oct 13, 2021:

I am looking at https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1448021022367289344 but am not familiar with how these tests work.

Also, BTW, why does the kube-apiserver log have two different forms of httplog lines for watches? One example is

I1012 21:07:32.046122      11 httplog.go:124] "HTTP" verb="GET" URI="/api/v1/namespaces/test-q781vv-1/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Daccount-53-11&resourceVersion=351459&timeout=7m13s&timeoutSeconds=433&watch=true" latency="1.721721704s" userAgent="kubemark/v1.23.0 (linux/amd64) kubernetes/af2ed25" audit-ID="efb7b8c4-d43b-477c-80e8-fd009ef4aabb" srcIP="10.64.5.26:35890" apf_pl="exempt" apf_fs="exempt" apf_fd="" apf_init_latency="296.835µs" resp=200

and another example is

I1012 21:07:31.983202      11 httplog.go:124] "HTTP" verb="WATCH" URI="/apis/storage.k8s.io/v1/csidrivers?allowWatchBookmarks=true&resourceVersion=327520&timeout=5m29s&timeoutSeconds=329&watch=true" latency="5m29.000606811s" userAgent="kubemark/v1.23.0 (linux/amd64) kubernetes/af2ed25" audit-ID="6aaa6416-4309-4097-a9c9-b6158845ba41" srcIP="10.64.7.46:55858" apf_pl="exempt" apf_fs="exempt" apf_fd="" apf_init_latency="303.958µs" resp=200

?

MikeSpreitzer (Member) commented:

The complaint seems to be (with added whitespace for readability):

[measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheusSimple error: top latency metric: there should be no high-latency requests, but: [
got: &{Resource:pods Subresource: Verb:POST Scope:resource Latency:perc50: 175.750111ms,
           perc90: 1.243873873s, perc99: 3.048313253s Count:18570 SlowCount:2401};
expected perc99 <= 1s

got: &{Resource:pods Subresource:binding Verb:POST Scope:resource Latency:perc50: 179.31098ms,
           perc90: 738.232044ms, perc99: 2.895361216s Count:18578 SlowCount:1343};
expected perc99 <= 1s

got: &{Resource:subjectaccessreviews Subresource: Verb:POST Scope:resource Latency:perc50: 35.03937ms,
          perc90: 144.285714ms, perc99: 1.554999999s Count:179 SlowCount:3};
expected perc99 <= 1s

got: &{Resource:events Subresource: Verb:PATCH Scope:resource Latency:perc50: 45.394736ms,
          perc90: 184.264705ms, perc99: 1.2575s Count:901 SlowCount:10};
expected perc99 <= 1s]]

I looked in the kube-apiserver log, and indeed there are some 3+ second latencies for POST to pods, almost entirely due to the APF filter. Examples:

I1012 20:42:18.811476 11 httplog.go:124] "HTTP" verb="POST" URI="/api/v1/namespaces/test-dm2gcw-5/pods" latency="3.019339532s" userAgent="kube-controller-manager/v1.23.0 (linux/amd64) kubernetes/af2ed25/system:serviceaccount:kube-system:job-controller" audit-ID="26ece8e6-7844-493b-a7fd-9a2d8bf29908" srcIP="[::1]:46178" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_fd="test-dm2gcw-5" fl_priorityandfairness="3.017415416s" resp=201

I1012 20:42:18.818386 11 httplog.go:124] "HTTP" verb="POST" URI="/api/v1/namespaces/test-dm2gcw-5/pods" latency="3.01605656s" userAgent="kube-controller-manager/v1.23.0 (linux/amd64) kubernetes/af2ed25/system:serviceaccount:kube-system:job-controller" audit-ID="97a9b2a1-9bb3-461f-9003-55316faa470d" srcIP="[::1]:46178" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_fd="test-dm2gcw-5" fl_priorityandfairness="3.014098725s" resp=201

MikeSpreitzer (Member) commented:

Looking at https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1448021022367289344/artifacts/MetricsForE2E_load_2021-10-12T21:02:06Z.json , I see

      {
        "metric": {
          "__name__": "apiserver_flowcontrol_priority_level_request_count_watermarks_bucket",
          "le": "0.5",
          "mark": "high",
          "phase": "executing",
          "priority_level": "workload-high"
        },
        "value": [
          0,
          "1722454"
        ]
      },

and all the higher buckets have the same value --- i.e., all the observations were less than or equal to 0.5. In other words, the workload-high priority level never used more than half of its allocated concurrency according to this metric. Now, to double-check what that metric really reports nowadays...

MikeSpreitzer (Member) commented:

Looking at the apiserver_flowcontrol_request_wait_duration_seconds_bucket metric for execute=true, flow_schema=kube-system-service-accounts, priority_level=workload-high, I see 128615 <= 1s, 130489 <= 2s, and 131226 <= 5s (and it tops out there).

wojtek-t (Member, Author) commented:

Yeah - I think it's the culprit.
Let me disable it by dropping the main part of the PR, and we will work on re-enabling it independently.

MikeSpreitzer (Member) commented:

How about just tweaking up the constants?
I am finding lots of things to improve in the metrics.

wojtek-t (Member, Author) commented:

> How about just tweaking up the constants?

We should do that, but it's more important to fix scalability tests and tweaking will take a bit of time.

I opened #105647 to disable that logic.

MikeSpreitzer (Member) commented:

I approved that PR.

I checked the apiserver_flowcontrol_priority_level_request_count_watermarks histogram and it accumulates observations of the fraction (number of executing requests) / (concurrency limit) --- which we know is now not so meaningful, dividing a number of requests by a number of seats.

MikeSpreitzer (Member) commented:

If we replace that histogram with one that accumulates observations of (number of seats occupied) / (concurrency limit) then we remove the reason I agreed to sink extra latency into QueueSet rather than just do it in the filter --- namely, the ability to have metrics that distinguish the two phases of execution. But it's more complicated, now that the extra latency can occupy more seats. BTW, I wonder if we goofed up something basic by making the virtual world track a different amount of work than what happens in the real world --- (total duration) times (seats occupied).

wojtek-t (Member, Author) commented:

> I checked the apiserver_flowcontrol_priority_level_request_count_watermarks histogram and it accumulates observations of the fraction (number of executing requests) / (concurrency limit) --- which we know is now not so meaningful, dividing a number of requests by a number of seats.

@MikeSpreitzer - do we believe this metric was useful even before? I'm not entirely sure how I would be supposed to use this metric.

> I agreed to sink extra latency into QueueSet rather than just do it in the filter --- namely, the ability to have metrics that distinguish the two phases of execution.

Thinking about that - I'm wondering what metrics would be useful. Things that I might be interested in:

  • in the last N minutes - how much time (per priority-level) was consumed by which phase
  • percentage of time (per priority-level) when some requests are waiting even though we're not (at least theoretically) occupying all seats with that PL

WDYT?

> BTW, I wonder if we goofed up something basic by making the virtual world track a different amount of work than what happens in the real world --- (total duration) times (seats occupied).

Can you clarify? In the real world we are doing both the "initial work" and the "final work". We're accommodating for that in the virtual world too (i.e. the r(t) function is progressing according to the sum, right?).

MikeSpreitzer (Member) commented:

Before generalizing width, the apiserver_flowcontrol_priority_level_request_count_watermarks histogram showed the extrema of occupancy. Observations of 1 showed that all seats were occupied. The apiserver_flowcontrol_priority_level_request_count_samples histogram showed not necessarily the extrema but a survey (evenly weighted over time). The watermarks histogram adds extrema that were missed in the sampling. I think they are meaningful even today; we just have to understand the denominator in the quotients. What we learned in this case is that the number of executing requests got as high as half of the concurrency limit.

MikeSpreitzer (Member) commented:

I favor adding metrics like the apiserver_flowcontrol_priority_level_request_count samples and watermarks histograms but for seats instead of requests. At least for the executing phase. We do not have a configured limit on seats in a queue, so there is no denominator to use for that.

I favor also adding metrics about the two-phase structure. One direct metric of interest would be a histogram of observations of the dead space fraction of each request --- that is, (maxSeats - initialSeats) * initialDuration / (maxSeats * totalDuration).
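For example (numbers invented purely for illustration): with initialSeats = 1, finalSeats = maxSeats = 10, initialDuration = 100ms and finalDuration = 5ms (so totalDuration = 105ms), that fraction is (10 - 1) * 100 / (10 * 105) ≈ 0.86, i.e. roughly 86% of the reserved seat-time is spent holding seats only in anticipation of the larger final width.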

We have the apiserver_flowcontrol_request_execution_seconds histogram already. That means there is a counter metric apiserver_flowcontrol_request_execution_seconds_sum that accumulates the sum of these times. I do not recall exactly which time is being observed right now, but we probably should split it by a label with values "initial" and "final". In PromQL, we can take the difference over the last N minutes and compare those differences to see the relative magnitude of the two phases.

We already have the apiserver_flowcontrol_request_wait_duration_seconds histogram showing the distribution of waiting times. Waiting only happens because of seat occupancy. Adding the dead space fraction histogram would describe that seat occupancy, but only independently. With what I found in the testgrid infrastructure, namely just one metrics scrape at the end, that leaves correlation questions unanswered. But in real life there is no "end"; operators have to monitor on-line using windowed queries. That can show correlation, coarsely.

If we wanted the fine-grained correlation you mention above, we would have to add tracking in queueSet of how many seats are occupied at the moment only in anticipation of the larger later width. And I am not sure that what we want to measure is simply the boolean condition of whether there is at least one request waiting due to anticipatory seat occupancy. I think it matters how many there are. And what we are really trying to get at is latency, so we may want to think harder.

But more fundamentally, I think we may have made a wrong turn by accounting work differently in the real and virtual worlds. In the virtual world, the code currently says that the work in a request is initialSeats * initialDuration + finalSeats * finalDuration. But in the real world, max(initialSeats, finalSeats) are occupied for all of initialDuration + finalDuration. In the original fair queuing technique, and in our work until we added the second phase of execution, the real and virtual worlds accounted the same amount of work to each request. The difference was just that in the virtual world the executions are interleaved in a hyper-fine way.
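To make that difference concrete, a small illustrative calculation in Go; the numbers are invented and reuse the example above:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Invented example values.
	initialSeats, finalSeats := 1.0, 10.0
	initialDuration := 100 * time.Millisecond
	finalDuration := 5 * time.Millisecond

	// Virtual-world accounting as described above:
	// initialSeats*initialDuration + finalSeats*finalDuration.
	virtualWork := initialSeats*initialDuration.Seconds() + finalSeats*finalDuration.Seconds()

	// Real-world occupancy: max(initialSeats, finalSeats) seats are held
	// for the whole of initialDuration + finalDuration.
	maxSeats := initialSeats
	if finalSeats > maxSeats {
		maxSeats = finalSeats
	}
	realWork := maxSeats * (initialDuration + finalDuration).Seconds()

	// Prints: virtual work = 0.150 seat-seconds, real occupancy = 1.050 seat-seconds
	fmt.Printf("virtual work = %.3f seat-seconds, real occupancy = %.3f seat-seconds\n",
		virtualWork, realWork)
}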

@wojtek-t mentioned this pull request on Oct 14, 2021
wojtek-t (Member, Author) commented:

@MikeSpreitzer - let's move the discussion to: #105683

Labels: approved, area/apiserver, cncf-cla: yes, kind/feature, lgtm, priority/important-soon, release-note-none, sig/api-machinery, size/L, triage/accepted
Milestone: v1.23
Participants: 9