Estimate width of the request based on watchers count in P&F #103539
Conversation
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
The branch was force-pushed several times during review: 3031130 → aff7c62 → 9183d04 → 1686697, 57ef6ba → 41f609d, 7487bbf → c5a77d8.
@MikeSpreitzer - thanks for the review.
I applied your comments and I'm unholding, but it now needs re-lgtm
/hold cancel
finalSeats := uint(math.Ceil(float64(watchCount) / watchesPerSeat))
additionalLatency := eventAdditionalDuration

// TODO: Remove this condition after we tune the algorithm better.
// Technically, there is an overhead connected to processing an event after
// the request finishes even if there is a small number of watches.
// However, until we tune the estimation we want to stay on the safe side
// and avoid introducing additional latency for almost every single request.
if watchCount < watchesPerSeat {
	finalSeats = 0
	additionalLatency = time.Duration(0)
}
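For context, here is a minimal, runnable sketch of the estimate above. The constant values used here (watchesPerSeat = 10, eventAdditionalDuration = 5ms) are illustrative assumptions, not necessarily the tuned production defaults.

```go
// Sketch of the watch-count-based final-phase estimate, with assumed constants.
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	watchesPerSeat          = 10.0                 // assumed: watchers served per seat
	eventAdditionalDuration = 5 * time.Millisecond // assumed: latency attributed to event fan-out
)

// estimateFinalPhase returns how many seats to keep occupied after the request
// finishes, and the extra latency charged for event delivery, based on how many
// watchers observe the mutated object.
func estimateFinalPhase(watchCount int) (finalSeats uint, additionalLatency time.Duration) {
	finalSeats = uint(math.Ceil(float64(watchCount) / watchesPerSeat))
	additionalLatency = eventAdditionalDuration

	// Below one seat's worth of watchers, skip the final phase entirely so
	// cheap requests do not pay the extra latency (mirrors the TODO above).
	if float64(watchCount) < watchesPerSeat {
		finalSeats = 0
		additionalLatency = 0
	}
	return finalSeats, additionalLatency
}

func main() {
	for _, wc := range []int{0, 5, 10, 42, 500} {
		seats, lat := estimateFinalPhase(wc)
		fmt.Printf("watchCount=%d -> finalSeats=%d additionalLatency=%v\n", wc, seats, lat)
	}
}
```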
done
@@ -125,13 +126,18 @@ func (qs *queueSet) completeWorkEstimate(we *fcrequest.WorkEstimate) completedWo
	return completedWorkEstimate{
		WorkEstimate: *we,
		totalWork:    qs.computeTotalWork(we),
		finalWork:    qs.computeFinalWork(we),
done
	return SeatsTimesDuration(float64(we.InitialSeats), qs.estimatedServiceDuration)
}

func (qs *queueSet) computeFinalWork(we *fcrequest.WorkEstimate) SeatSeconds {
This could be a method of fcrequest.WorkEstimate.
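As a rough illustration of that suggestion, here is a self-contained sketch in which the final-phase work is computed by the estimate itself, and the queue-set-owned service duration is passed in only where the initial phase needs it. The types below (workEstimate, seatSeconds, seatsTimesDuration) are simplified stand-ins, not the actual fcrequest/queueset API.

```go
// Sketch: final-phase work as a method on the work estimate (assumed shapes).
package main

import (
	"fmt"
	"time"
)

// seatSeconds is a stand-in for the queueset's SeatSeconds unit of work.
type seatSeconds float64

func seatsTimesDuration(seats float64, d time.Duration) seatSeconds {
	return seatSeconds(seats * d.Seconds())
}

type workEstimate struct {
	InitialSeats      uint
	FinalSeats        uint
	AdditionalLatency time.Duration
}

// finalWork depends only on fields of the estimate itself, which is why it
// could live on the estimate rather than on the queue set.
func (we workEstimate) finalWork() seatSeconds {
	return seatsTimesDuration(float64(we.FinalSeats), we.AdditionalLatency)
}

// totalWork adds the initial phase (seats held for the estimated service
// duration, which belongs to the queue set) to the final phase.
func (we workEstimate) totalWork(estimatedServiceDuration time.Duration) seatSeconds {
	initial := seatsTimesDuration(float64(we.InitialSeats), estimatedServiceDuration)
	return initial + we.finalWork()
}

func main() {
	we := workEstimate{InitialSeats: 1, FinalSeats: 5, AdditionalLatency: 5 * time.Millisecond}
	fmt.Println("final:", we.finalWork(), "total:", we.totalWork(50*time.Millisecond))
}
```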
/LGTM
I'm afraid this PR broke kubemark-500:
I am looking at https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1448021022367289344 but am not familiar with how these tests work. Also, BTW, why does the kube-apiserver log have two different forms of httplog lines for watches? One example is
and another example is
?
The complaint seems to be (with added whitespace for readability):
I looked in the kube-apiserver log, and indeed there are some 3+ second latencies for POST to pods, almost entirely due to the APF filter. Examples:

I1012 20:42:18.811476 11 httplog.go:124] "HTTP" verb="POST" URI="/api/v1/namespaces/test-dm2gcw-5/pods" latency="3.019339532s" userAgent="kube-controller-manager/v1.23.0 (linux/amd64) kubernetes/af2ed25/system:serviceaccount:kube-system:job-controller" audit-ID="26ece8e6-7844-493b-a7fd-9a2d8bf29908" srcIP="[::1]:46178" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_fd="test-dm2gcw-5" fl_priorityandfairness="3.017415416s" resp=201
I1012 20:42:18.818386 11 httplog.go:124] "HTTP" verb="POST" URI="/api/v1/namespaces/test-dm2gcw-5/pods" latency="3.01605656s" userAgent="kube-controller-manager/v1.23.0 (linux/amd64) kubernetes/af2ed25/system:serviceaccount:kube-system:job-controller" audit-ID="97a9b2a1-9bb3-461f-9003-55316faa470d" srcIP="[::1]:46178" apf_pl="workload-high" apf_fs="kube-system-service-accounts" apf_fd="test-dm2gcw-5" fl_priorityandfairness="3.014098725s" resp=201
Looking at https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-500-gce/1448021022367289344/artifacts/MetricsForE2E_load_2021-10-12T21:02:06Z.json , I see
and all the higher buckets have the same value --- i.e., all the observations were less than or equal to 0.5. In other words, the workload-high priority level never used more than half of its allocated concurrency according to this metric. Now, to double-check what that metric really reports nowadays...
Looking at the apiserver_flowcontrol_request_wait_duration_seconds_bucket metric for execute=true, flow_schema=kube-system-service-accounts, priority_level=workload-high, I see 128615 <= 1s, 130489 <= 2s, and 131226 <= 5s (and it tops out there).
Yeah - I think it's the culprit.
How about just tweaking up the constants?
We should do that, but it's more important to fix scalability tests, and tweaking will take a bit of time. I opened #105647 to disable that logic.
I approved that PR. I checked the
If we replace that histogram with one that accumulates observations of
@MikeSpreitzer - do we believe this metric was useful even before? I'm not entirely sure how I'm supposed to use this metric.
Thinking about that - I'm wondering what metrics would be useful. Things that I might be interested in:
Can you clarify? In the real world we are doing both the "initial work" and the "final work". We're accounting for that in the virtual world too (i.e. the r(t) function is progressing according to the sum, right?)
Before generalizing width, the apiserver_flowcontrol_priority_level_request_count_watermarks histogram showed the extrema of occupancy. Observations of 1 showed that all seats were occupied. The apiserver_flowcontrol_priority_level_request_count_samples histogram showed not necessarily the extrema but a survey (evenly weighted over time). The watermarks histogram adds extrema that were missed in the sampling. I think they are meaningful even today; we just have to understand the denominator in the quotients. What we learned in this case is that the number of executing requests got as high as half of the concurrency limit.
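To make the samples-vs-watermarks distinction concrete, here is a toy sketch. The names and structure are hypothetical and not the actual metrics plumbing; it only illustrates why a watermark can catch an extremum that falls between evenly spaced samples, with the concurrency limit as the denominator.

```go
// Toy occupancy tracker: periodic samples plus a high watermark, both
// normalized by the concurrency limit. Hypothetical, for illustration only.
package main

import "fmt"

type occupancyTracker struct {
	concurrencyLimit int
	executing        int       // requests (or seats) currently executing
	highWatermark    int       // highest value of executing seen so far
	samples          []float64 // periodic observations of executing/limit
}

func (t *occupancyTracker) add(delta int) {
	t.executing += delta
	if t.executing > t.highWatermark {
		t.highWatermark = t.executing
	}
}

// sample records an evenly spaced observation of the occupancy fraction; the
// watermark catches extrema that occur between samples.
func (t *occupancyTracker) sample() {
	t.samples = append(t.samples, float64(t.executing)/float64(t.concurrencyLimit))
}

func (t *occupancyTracker) watermarkFraction() float64 {
	return float64(t.highWatermark) / float64(t.concurrencyLimit)
}

func main() {
	t := &occupancyTracker{concurrencyLimit: 40}
	t.add(10)
	t.sample() // 0.25
	t.add(10)  // momentary peak of 20 ...
	t.add(-5)  // ... drops again before the next sample
	t.sample() // 0.375
	fmt.Println(t.samples, t.watermarkFraction()) // [0.25 0.375] 0.5
}
```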
I favor adding metrics like the apiserver_flowcontrol_priority_level_request_count samples and watermarks histograms but for seats instead of requests, at least for the executing phase. We do not have a configured limit on seats in a queue, so there is no denominator to use for that.

I favor also adding metrics about the two-phase structure. One direct metric of interest would be a histogram of observations of the dead space fraction of each request --- that is,

We have the apiserver_flowcontrol_request_execution_seconds histogram already. That means there is a counter metric apiserver_flowcontrol_request_execution_seconds_sum that accumulates the sum of these times. I do not recall exactly which time is being observed right now, but we probably should split it by a label with values "initial" and "final". In PromQL, we can take the difference over the last N minutes and compare those differences to see the relative magnitude of the two phases.

We already have the apiserver_flowcontrol_request_wait_duration_seconds histogram showing the distribution of waiting times. Waiting only happens because of seat occupancy. Adding the dead space fraction histogram would describe that seat occupancy, independently. With what I found in the testgrid infrastructure, namely just one metrics scrape at the end, that leaves correlation questions unanswered. But in real life there is no "end"; operators have to monitor on-line using windowed queries. That can show correlation, coarsely.

If we wanted the fine-grained correlation you mention above, we would have to add tracking in queueSet of how many seats are occupied at the moment only in anticipation of the larger later width. And I am not sure that what we want to measure is simply the boolean condition of whether there is at least one request waiting due to anticipatory seat occupancy; I think it matters how many there are. And what we are really trying to get at is latency, so we may want to think harder.

But more fundamentally, I think we may have made a wrong turn by accounting work differently in the real and virtual worlds. In the virtual world, the code currently says that the work in a request is
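The exact formula for the dead space fraction is not shown in the comment above, so the following is only one plausible reading for illustration: the share of a request's total seat-seconds that is spent holding seats after the request's own handler has finished. Names and parameters are assumptions.

```go
// Sketch of a per-request "dead space fraction" under the assumption that it
// means final-phase seat-seconds over total seat-seconds. Illustrative only.
package main

import (
	"fmt"
	"time"
)

func deadSpaceFraction(initialSeats, finalSeats uint, execDuration, additionalLatency time.Duration) float64 {
	initialWork := float64(initialSeats) * execDuration.Seconds()
	finalWork := float64(finalSeats) * additionalLatency.Seconds()
	if initialWork+finalWork == 0 {
		return 0
	}
	return finalWork / (initialWork + finalWork)
}

func main() {
	// A cheap request with many watchers: 1 seat for 5ms of execution,
	// then 10 seats held for an extra 5ms after it finishes.
	fmt.Println(deadSpaceFraction(1, 10, 5*time.Millisecond, 5*time.Millisecond)) // ~0.91
}
```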
@MikeSpreitzer - let's move the discussion to: #105683 |
/kind feature
/priority important-soon
/sig api-machinery
/assign @tkashem