Add scheduler plugin execution duration metric. #84522
Conversation
@ahg-g force-pushed from aa4f6a8 to 1372a16
/assign @ahg-g
// metricRecorder records framework metrics in a separate goroutine to avoid overhead in the critical path.
type realMetricsRecorder struct {
unit tests?
The behavior is tested in framework_test.go, where we verify that metrics are recorded successfully. Or are you asking for specific test cases for this struct only?
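For readers following along, here is a minimal sketch of the kind of asynchronous recorder under discussion: samples are pushed onto a buffered channel from the scheduling path, and a background goroutine flushes them into a histogram. The names, buffer size, and flush interval below are assumptions for illustration, not the PR's exact code (the real scheduler goes through k8s.io/component-base/metrics rather than using client_golang directly).

```go
package recorder

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// frameworkMetric is one observation waiting to be flushed.
type frameworkMetric struct {
	metric      *prometheus.HistogramVec
	labelValues []string
	value       float64
}

// metricsRecorder buffers observations so the scheduling critical path only
// pays for a non-blocking channel send.
type metricsRecorder struct {
	bufferCh chan *frameworkMetric
	stopCh   chan struct{}
	interval time.Duration
}

func newMetricsRecorder(bufferSize int, interval time.Duration) *metricsRecorder {
	r := &metricsRecorder{
		bufferCh: make(chan *frameworkMetric, bufferSize),
		stopCh:   make(chan struct{}),
		interval: interval,
	}
	go r.run()
	return r
}

// observe enqueues a sample; if the buffer is full the sample is dropped
// instead of blocking the caller.
func (r *metricsRecorder) observe(m *frameworkMetric) {
	select {
	case r.bufferCh <- m:
	default:
	}
}

// run periodically drains the buffer in a separate goroutine.
func (r *metricsRecorder) run() {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-r.stopCh:
			return
		case <-ticker.C:
			r.flush()
		}
	}
}

// flush records everything currently buffered, then returns.
func (r *metricsRecorder) flush() {
	for {
		select {
		case m := <-r.bufferCh:
			m.metric.WithLabelValues(m.labelValues...).Observe(m.value)
		default:
			return
		}
	}
}
```

The non-blocking send in observe is the design choice that keeps the critical path cheap: under load, samples are dropped rather than stalling a scheduling cycle.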
Here are the benchmark numbers I am getting. The first one is likely one of the most common workload scenarios, and we observe a roughly 20% perf drop.
With the change:
To reduce the overhead further, we can sample reading the time and pushing to the metricsRecorder channel: we either measure all plugins in a given Run* invocation, or we do nothing.
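A rough sketch of that sampling idea, with all names and the sampling period chosen for illustration rather than taken from the PR: decide once at the start of a scheduling cycle whether to record, and let every Run* wrapper in that cycle honor the same decision, so an unsampled cycle pays for neither the time reads nor the channel push.

```go
package sampling

import (
	"math/rand"
	"time"
)

// Record plugin durations in roughly 1 out of every N scheduling cycles.
// The value 15 mirrors the "sampling every 15 scheduling cycles" variant
// discussed later in this thread, but is illustrative here.
const metricsSamplePeriod = 15

// shouldRecordMetrics is evaluated once at the start of a scheduling cycle.
func shouldRecordMetrics() bool {
	return rand.Intn(metricsSamplePeriod) == 0
}

// runPlugin shows how a Run* wrapper could honor the per-cycle decision:
// when the cycle is not sampled, neither time.Now() nor the recorder push
// happens.
func runPlugin(record bool, plugin func()) {
	if !record {
		plugin()
		return
	}
	start := time.Now()
	plugin()
	observeDuration(time.Since(start))
}

// observeDuration is a hypothetical stand-in for pushing the sample onto
// the metricsRecorder channel.
func observeDuration(d time.Duration) {
	_ = d
}
```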
I updated to the latest master and here are the new results:
Master:
With the change:
I am still seeing better results with this change for the first test case, which is strange. Now I trust your results more than mine. Can you please take the latest commit with sampling and run it again?
With the latest patch:
This is much better; the overhead is negligible.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahg-g, liu-cong
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
Address comments
Sample metrics
Use rand.Intn and some cleanup
/lgtm
/retest
/retest
/retest
What type of PR is this?
/kind feature
What this PR does / why we need it:
Helps us understand feature adoption as well as the performance of each plugin.
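To make the plugin-level visibility concrete, the metric is a duration histogram labeled per plugin and extension point. Below is a minimal sketch of that shape, using client_golang so it stays self-contained; the real scheduler registers it through k8s.io/component-base/metrics, and the exact names and buckets may differ from what is shown here.

```go
package schedulermetrics

import "github.com/prometheus/client_golang/prometheus"

// PluginExecutionDuration is a sketch of the new histogram: one time series
// per (plugin, extension point, status) combination.
var PluginExecutionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Subsystem: "scheduler",
		Name:      "plugin_execution_duration_seconds",
		Help:      "Duration for running a plugin at a specific extension point.",
		// Plugin calls are usually well under a millisecond, so start the
		// buckets very small (assumed values).
		Buckets: prometheus.ExponentialBuckets(0.00001, 1.5, 20),
	},
	[]string{"plugin", "extension_point", "status"},
)

func init() {
	prometheus.MustRegister(PluginExecutionDuration)
}

// A Run* wrapper would then record roughly like:
//   PluginExecutionDuration.WithLabelValues(pluginName, "Filter", statusCode).
//       Observe(durationSeconds)
```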
Which issue(s) this PR fixes:
Fixes #83416
Special notes for your reviewer:
I compared benchmark results for
a) master branch,
b) master branch with extension point metrics removed,
c) this change and
d) this change with sampling every 15 scheduling cycles.
a) vs b): b) improves slightly over a) in many cases, but the difference is almost negligible given the noise.
c) vs d): Sampling improves performance in some cases, but not all.
It's surprising that c) is consistently better than a)/b) in test cases #1, 3, and 8, while worse in #2, 4, 5, and 6, and very close in #7. I don't know how these numbers map to real use cases; given the value of getting visibility into each plugin, I prefer to get this change out and gather feedback from users.
Benchmark Result (ns/op)
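As an illustration of what these ns/op comparisons measure, here is a self-contained toy benchmark (not the scheduler framework benchmark that produced the numbers above; every name in it is made up) comparing a call that is timed and pushed to a buffered channel against the same call made directly:

```go
package overhead

import (
	"testing"
	"time"
)

// work stands in for a plugin call.
func work() {}

// BenchmarkTimedAndRecorded measures a call wrapped with a time read and a
// non-blocking channel send, mimicking the instrumented path.
func BenchmarkTimedAndRecorded(b *testing.B) {
	samples := make(chan time.Duration, 1000)
	go func() {
		for range samples {
			// Background consumer, like the metrics recorder goroutine.
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		start := time.Now()
		work()
		select {
		case samples <- time.Since(start):
		default: // drop when the buffer is full rather than block
		}
	}
}

// BenchmarkDirect measures the uninstrumented call for comparison.
func BenchmarkDirect(b *testing.B) {
	for i := 0; i < b.N; i++ {
		work()
	}
}
```

Running it with go test -bench=. gives a rough feel for the fixed per-call cost of the two time reads and the non-blocking send.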
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: