
Add scheduler plugin execution duration metric. #84522

Merged: 1 commit merged into kubernetes:master on Nov 12, 2019

Conversation

@liu-cong (Contributor) commented Oct 29, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:
Helps us understand feature adoption, as well as the performance of each plugin.

Which issue(s) this PR fixes:

Fixes #83416

Special notes for your reviewer:
I compared benchmark results for:
a) the master branch,
b) the master branch with extension point metrics removed,
c) this change, and
d) this change with sampling every 15 scheduling cycles.

a) vs b): b) improves slightly over a) in many cases, but the difference is almost negligible given the noise.
c) vs d): Sampling improves performance in some cases, but not all.

It's surprising that c) is consistently better than a)/b) in test cases #1, #3, and #8, while worse in #2, #4, #5, and #6 and very close in #7. I don't know how these numbers map to real use cases, but given the value of visibility into each plugin, I prefer to get this change out and gather feedback from users.

Benchmark results (ns/op):

| Test # | Test | Master | No metric | Periodic flush | Periodic flush with 1/15 sample |
| ------ | ---- | ------ | --------- | -------------- | ------------------------------- |
| 1 | BenchmarkSchedulingV/5000Nodes/1000Pods-12 | 9237142 | 9229039 | 7761435 | 6754970 |
| 2 | BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12 | 9114743 | 8970976 | 10910383 | 9660827 |
| 3 | BenchmarkSchedulingSecrets/5000Nodes/1000Pods-12 | 8656598 | 10362103 | 8220864 | 7031266 |
| 4 | BenchmarkSchedulingInTreePVs/5000Nodes/1000Pods-12 | 14089110 | 13623460 | 15478250 | 14093959 |
| 5 | BenchmarkSchedulingMigratedInTreePVs/5000Nodes/1000Pods-12 | 12615123 | 12240668 | 14704521 | 13194460 |
| 6 | BenchmarkSchedulingCSIPVs/5000Nodes/1000Pods-12 | 11570522 | 11356818 | 13410923 | 11998329 |
| 7 | BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-12 | 24290660 | 25244090 | 24926082 | 25813202 |
| 8 | BenchmarkSchedulingNodeAffinity/5000Nodes/1000Pods-12 | 19159768 | 18998646 | 17467853 | 18523843 |

Does this PR introduce a user-facing change?:

Add plugin_execution_duration_seconds metric for scheduler framework plugins.
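For context, a minimal sketch of how a histogram of this shape could be registered and observed with the Prometheus client_golang library. The subsystem, buckets, and label names here are assumptions for illustration, not necessarily the PR's actual registration code.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch of the new histogram; the buckets and label set are assumptions.
var pluginExecutionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Subsystem: "scheduler",
		Name:      "plugin_execution_duration_seconds",
		Help:      "Duration for running a plugin at a specific extension point.",
		Buckets:   prometheus.ExponentialBuckets(0.00001, 1.5, 20),
	},
	[]string{"plugin", "extension_point", "status"},
)

func main() {
	prometheus.MustRegister(pluginExecutionDuration)

	start := time.Now()
	// ... run a plugin at some extension point ...
	pluginExecutionDuration.
		WithLabelValues("NodeAffinity", "Filter", "Success").
		Observe(time.Since(start).Seconds())
}
```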

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the size/M, do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, cncf-cla: yes, needs-kind, needs-sig, and needs-priority labels on Oct 29, 2019
@k8s-ci-robot k8s-ci-robot requested review from hex108 and k82cn October 29, 2019 14:58
@liu-cong (Contributor, Author) commented:

@ahg-g
Added a latency metric for the Filter and Score plugins for perf testing. I couldn't run the perf tests locally for some reason. While I am trying to fix that, do you mind checking out this diff and running it on your machine?

@k8s-ci-robot added the sig/scheduling label and removed the needs-sig label on Oct 29, 2019
@liu-cong force-pushed the plugin branch 3 times, most recently from aa4f6a8 to 1372a16 on November 11, 2019 14:10
@k8s-ci-robot added the size/L, release-note, and kind/feature labels and removed the size/M, do-not-merge/release-note-label-needed, and needs-kind labels on Nov 11, 2019
@liu-cong (Contributor, Author) commented:

/assign @ahg-g

@liu-cong liu-cong marked this pull request as ready for review November 11, 2019 14:37
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Nov 11, 2019
pkg/scheduler/framework/v1alpha1/framework.go (outdated review thread, resolved)
pkg/scheduler/framework/v1alpha1/metrics_recorder.go (outdated review thread, resolved)
```go
}

// metricRecorder records framework metrics in a separate goroutine to avoid overhead in the critical path.
type realMetricsRecorder struct {
```
Review comment from a Member on the hunk above:

unit tests?

Reply from the author (@liu-cong):

The behavior is tested in framework_test.go, as we verify that metrics are recorded successfully. Or are you asking about specific test cases for this struct only?
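To make the design in the quoted snippet concrete, here is a self-contained sketch of a channel-based recorder along those lines; the names (pluginSample, metricsRecorder, newMetricsRecorder) are illustrative assumptions, not the PR's actual identifiers.

```go
package main

import (
	"fmt"
	"time"
)

// pluginSample is one observed plugin execution (illustrative).
type pluginSample struct {
	plugin         string
	extensionPoint string
	duration       time.Duration
}

// metricsRecorder buffers samples in a channel and flushes them from a
// dedicated goroutine, keeping metric bookkeeping off the scheduling hot path.
type metricsRecorder struct {
	samples chan pluginSample
}

func newMetricsRecorder(bufferSize int) *metricsRecorder {
	r := &metricsRecorder{samples: make(chan pluginSample, bufferSize)}
	go r.run()
	return r
}

// observe enqueues a sample without ever blocking the caller; if the buffer
// is full, the sample is dropped rather than stalling a scheduling cycle.
func (r *metricsRecorder) observe(s pluginSample) {
	select {
	case r.samples <- s:
	default: // buffer full: drop the sample
	}
}

// run drains the channel; a real recorder would call a Prometheus
// histogram's Observe here instead of printing.
func (r *metricsRecorder) run() {
	for s := range r.samples {
		fmt.Printf("%s/%s: %v\n", s.extensionPoint, s.plugin, s.duration)
	}
}

func main() {
	rec := newMetricsRecorder(100)
	rec.observe(pluginSample{"NodeAffinity", "Filter", 42 * time.Microsecond})
	time.Sleep(10 * time.Millisecond) // give the goroutine time to flush
}
```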

@ahg-g (Member) commented Nov 11, 2019

Here are the benchmark numbers I am getting. The first test is likely one of the most common workload scenarios, and we observe a roughly 20% perf drop:

Without the change:

BenchmarkSchedulingV/5000Nodes/1000Pods-12          1000           2237820 ns/op
BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12                    1000           5520113 ns/op
BenchmarkSchedulingSecrets/5000Nodes/1000Pods-12                            1000           2506520 ns/op
BenchmarkSchedulingInTreePVs/5000Nodes/1000Pods-12                          1000           6115268 ns/op
BenchmarkSchedulingMigratedInTreePVs/5000Nodes/1000Pods-12                  1000           6025317 ns/op
BenchmarkSchedulingCSIPVs/5000Nodes/1000Pods-12                             1000           6197104 ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-12                        1000           7251515 ns/op
BenchmarkSchedulingNodeAffinity/5000Nodes/1000Pods-12                       1000           3317381 ns/op

With the change:

BenchmarkSchedulingV/5000Nodes/1000Pods-12          1000           2724880 ns/op
BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12                    1000           5885653 ns/op
BenchmarkSchedulingSecrets/5000Nodes/1000Pods-12                            1000           2858365 ns/op
BenchmarkSchedulingInTreePVs/5000Nodes/1000Pods-12                          1000           6258707 ns/op
BenchmarkSchedulingMigratedInTreePVs/5000Nodes/1000Pods-12                  1000           6306560 ns/op
BenchmarkSchedulingCSIPVs/5000Nodes/1000Pods-12                             1000           6346709 ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-12                        1000           7600392 ns/op
BenchmarkSchedulingNodeAffinity/5000Nodes/1000Pods-12                       1000           3556736 ns/op

To reduce the overhead further, we can sample reading the time and pushing to the metricsRecorder channel: either we measure all plugins in a specific Run* invocation, or we do nothing.
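As an illustration of that sampling idea, a minimal Go sketch assuming a hypothetical 1-in-15 rate; the names here (durationSampleRate, shouldRecordCycleMetrics) are invented for the example, though the final commit message below does mention using rand.Intn.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Roughly 1 in 15 scheduling cycles records plugin durations.
const durationSampleRate = 15

// Decided once at the start of a scheduling cycle, so a single Run*
// invocation either measures every plugin or skips measurement entirely.
func shouldRecordCycleMetrics() bool {
	return rand.Intn(durationSampleRate) == 0
}

func main() {
	for cycle := 0; cycle < 5; cycle++ {
		record := shouldRecordCycleMetrics()
		start := time.Now()
		// ... run all Filter/Score plugins for this cycle ...
		if record {
			fmt.Printf("cycle %d: plugins took %v\n", cycle, time.Since(start))
		}
	}
}
```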

@liu-cong (Contributor, Author) commented:

I updated to the latest master and here are the new results:

make test-integration WHAT=./test/integration/scheduler_perf KUBE_TEST_VMODULE="''" KUBE_TEST_ARGS="-alsologtostderr=false -logtostderr=false -run=^$$ -bench=BenchmarkScheduling/5000Nodes/1000Pods -cpuprofile ~/cpu-master.out" |grep Benchmark

Master:
BenchmarkSchedulingV/5000Nodes/1000Pods-12 1000 4900572 ns/op
BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12 1000 5321680 ns/op
BenchmarkSchedulingSecrets/5000Nodes/1000Pods-12 1000 4943341 ns/op
BenchmarkSchedulingInTreePVs/5000Nodes/1000Pods-12 1000 4078854 ns/op
BenchmarkSchedulingMigratedInTreePVs/5000Nodes/1000Pods-12 1000 4590515 ns/op
BenchmarkSchedulingCSIPVs/5000Nodes/1000Pods-12 1000 5134987 ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-12 1000 12269336 ns/op
BenchmarkSchedulingNodeAffinity/5000Nodes/1000Pods-12 1000 6553250 ns/op

With the change:
BenchmarkSchedulingV/5000Nodes/1000Pods-12 1000 4119963 ns/op
BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12 1000 6134407 ns/op
BenchmarkSchedulingSecrets/5000Nodes/1000Pods-12 1000 4521283 ns/op
BenchmarkSchedulingInTreePVs/5000Nodes/1000Pods-12 1000 4110121 ns/op
BenchmarkSchedulingMigratedInTreePVs/5000Nodes/1000Pods-12 1000 4280762 ns/op
BenchmarkSchedulingCSIPVs/5000Nodes/1000Pods-12 1000 3930512 ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-12 1000 12414046 ns/op
BenchmarkSchedulingNodeAffinity/5000Nodes/1000Pods-12 1000 6739703 ns/op

I am still seeing better results with this change for the first test case, which is strange. At this point I trust your results more than mine... @ahg-g, can you please take the latest commit with sampling and run it again?

@ahg-g (Member) commented Nov 11, 2019

With the latest patch:

BenchmarkSchedulingV/5000Nodes/1000Pods-12          1000           2417580 ns/op
BenchmarkSchedulingPodAntiAffinity/5000Nodes/1000Pods-12                    1000           5496006 ns/op
BenchmarkSchedulingSecrets/5000Nodes/1000Pods-12                            1000           2535986 ns/op
BenchmarkSchedulingInTreePVs/5000Nodes/1000Pods-12                          1000           6165100 ns/op
BenchmarkSchedulingMigratedInTreePVs/5000Nodes/1000Pods-12                  1000           6099270 ns/op
BenchmarkSchedulingCSIPVs/5000Nodes/1000Pods-12                             1000           6150523 ns/op
BenchmarkSchedulingPodAffinity/5000Nodes/1000Pods-12                        1000           7309418 ns/op
BenchmarkSchedulingNodeAffinity/5000Nodes/1000Pods-12                       1000           3344050 ns/op

@k8s-ci-robot added the size/XL label and removed the size/L label on Nov 12, 2019
@ahg-g (Member) left a review:

This is much better; the overhead is negligible.

pkg/scheduler/framework/v1alpha1/framework.go (outdated review thread, resolved)
@ahg-g (Member) commented Nov 12, 2019

/approve
please squash

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Nov 12, 2019
@ahg-g (Member) commented Nov 12, 2019

/retest

Commits:
- Address comments
- Sample metrics
- Use rand.Intn and some cleanup
@ahg-g (Member) commented Nov 12, 2019

/lgtm

@k8s-ci-robot added the lgtm label on Nov 12, 2019
@liu-cong (Contributor, Author) commented:

/retest

(2 similar /retest comments followed)

@k8s-ci-robot k8s-ci-robot merged commit ce11622 into kubernetes:master Nov 12, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 12, 2019
Linked issue closed by this PR: Add metrics to track requests and durations for the scheduler framework (#83416)