Add scheduler optimization options, short circuit all predicates if one predicate fails #56926
Conversation
/assign @davidopp @jayunit100
/ok-to-test
/test pull-kubernetes-unit
The test failed. I think I might need help: when I short circuit all predicates if one predicate fails, what else do I need?
/retest
How can I fix it?
/cc @kubernetes/sig-scheduling-pr-reviews
pkg/apis/componentconfig/types.go (Outdated)
@@ -111,6 +111,9 @@ type KubeSchedulerConfiguration struct {
 	// Indicate the "all topologies" set for empty topologyKey when it's used for PreferredDuringScheduling pod anti-affinity.
 	// DEPRECATED: This is no longer used.
 	FailureDomains string
+
+	// EnablePodFitsOnNodeOptimization enables short-circuiting all predicates if one predicate fails
+	EnablePodFitsOnNodeOptimization bool
This is a bad name ... it should be more specific, like QuickFailPredicate or something else.
I don't think we need an option for this. This should be the default behavior, in my opinion. It doesn't make sense to check all the other predicates when one fails. We are adding logic to order predicates, and the default ordering is to run predicates with lower overhead first. The whole ordering of predicates makes sense only if we stop checking predicates when one fails; otherwise, the ordering is not useful.
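To make the trade-off concrete, here is a minimal, self-contained sketch of the short-circuit loop being discussed. The types and names are made up for illustration; the real podFitsOnNode signature carries scheduler metadata, caches, and a queue.

```go
package main

import "fmt"

// FailureReason describes why a predicate rejected a node (hypothetical type).
type FailureReason string

// Predicate is a simplified stand-in for the scheduler's fit-predicate signature.
type Predicate func(pod, node string) (bool, FailureReason)

// checkPredicates runs the ordered predicates against a node. When
// alwaysCheckAllPredicates is false it stops at the first failure, which is
// exactly what makes predicate ordering pay off.
func checkPredicates(pod, node string, preds []Predicate, alwaysCheckAllPredicates bool) (bool, []FailureReason) {
	var reasons []FailureReason
	for _, p := range preds {
		if fit, reason := p(pod, node); !fit {
			reasons = append(reasons, reason)
			if !alwaysCheckAllPredicates {
				break // short circuit: later, more expensive predicates are skipped
			}
		}
	}
	return len(reasons) == 0, reasons
}

func main() {
	cheap := func(pod, node string) (bool, FailureReason) { return false, "insufficient CPU" }
	expensive := func(pod, node string) (bool, FailureReason) { return false, "anti-affinity violated" }

	fit, reasons := checkPredicates("mypod", "node1", []Predicate{cheap, expensive}, false)
	fmt.Println(fit, reasons) // false [insufficient CPU]; the expensive predicate never ran
}
```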
UPDATE: reading another thread, we may actually need a flag for those customers who want to find out all the predicates that have failed for a node. So, let's rename this flag to something like AlwaysCheckAllPredicates. The default value should be false. Please add a comment in front of the flag to mention that enabling this option may hurt scheduler performance.
Totally agree, this should take predicate ordering into consideration.
cc @wgliang It's https://github.com/kubernetes/community/pull/1152/files
Sorry, my fault. I should have said that predicate ordering will improve performance even more. In fact, we have already done this ordering in our own scheduler: for example, we evaluate cheap checks such as CPU core count and memory size before complex information such as disk and network. That way, if an early predicate fails, we avoid the more expensive calculations later.
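For illustration, an ordering of that kind might look like the sketch below; PodFitsCPU and PodFitsMemory are made-up names standing in for the cheap resource checks:

```go
// Cheap numeric checks run first, so a short-circuited failure avoids the
// costlier predicates further down the list.
var orderedPredicateNames = []string{
	"PodFitsCPU",            // cheap: compare requested vs. allocatable cores
	"PodFitsMemory",         // cheap: compare requested vs. allocatable bytes
	"NoDiskConflict",        // pricier: inspect the volumes already on the node
	"MatchInterPodAffinity", // priciest: scan pods across topology domains
}
```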
@resouer You are right, I will rename it.
+1 for the default behaviour of setting AlwaysCheckAllPredicates to false.
This is related to what @bsalamat discussed recently. But I am not very sure an extra option is the solution ... or, at least, fast fail should be the default behavior. WDYT, Bobby?
@resouer Instead of modifying the current default behavior, I wanted to add a higher-performance option, albeit one that loses some failure information.
@@ -917,7 +923,7 @@ func selectVictimsOnNode(
 	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
 	reprievePod := func(p *v1.Pod) bool {
 		addPod(p)
-		fits, _, _ := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, nil, queue)
+		fits, _, _ := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, nil, queue, false)
Should it be true?
It has been modified.
/retest
Who can help me solve the problem or tell me the reason for the test failure?
/retest
@@ -474,6 +475,11 @@ func podFitsOnNode(
 			if !fit {
 				// eCache is available and valid, and predicates result is unfit, record the fail reasons
 				failedPredicates = append(failedPredicates, reasons...)
+				// since alwaysCheckAllPredicates has not been set, the predicate evaluation is short
+				// circuited and there are chances of other predicates failing as well.
Actually, I wanted this to be logged using glog.V(5) or higher, not as a comment in the code; the reason being that debugging becomes easier when someone is going through the scheduler logs.
Since this will be executed for each node, a lot of log information will be printed out. Are you sure that is friendly?
I think that's a trade-off we will have, but putting the logging at the highest verbosity level might reduce the impact.
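A sketch of what that logging could look like inside the short-circuit branch of podFitsOnNode, assuming glog is imported in the file and an alwaysCheckAllPredicates parameter is in scope:

```go
if !fit {
	// eCache is available and valid, and the predicate result is unfit;
	// record the failure reasons.
	failedPredicates = append(failedPredicates, reasons...)
	if !alwaysCheckAllPredicates {
		// V(5) keeps this quiet by default but visible when debugging.
		glog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, " +
			"the predicate evaluation is short circuited and there are chances " +
			"of other predicates failing as well.")
		break
	}
}
```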
@wgliang - can you add your release-note between `` ?
@@ -377,6 +378,17 @@ func TestGenericScheduler(t *testing.T) {
 			expectsErr: true,
 			wErr:       fmt.Errorf("persistentvolumeclaim \"existingPVC\" is being deleted"),
 		},
+		{
+			// alwaysCheckAllPredicates is true
Sorry, but this is not useful. You should probably add a couple of predicates and create a pod that fails all predicates, and then check that all of the predicates are exercised when alwaysCheckAllPredicates is true.
@bsalamat Thank you very much for your guidance, I have already done it.
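Building on the checkPredicates sketch earlier in this thread, a test along the suggested lines might look like this; the fake predicates and the test name are made up for illustration:

```go
import "testing"

func TestAlwaysCheckAllPredicates(t *testing.T) {
	// Two fake predicates that reject every node for different reasons.
	failCPU := func(pod, node string) (bool, FailureReason) { return false, "insufficient CPU" }
	failMem := func(pod, node string) (bool, FailureReason) { return false, "insufficient memory" }

	// With alwaysCheckAllPredicates set to true, both failures must be reported.
	_, reasons := checkPredicates("mypod", "node1", []Predicate{failCPU, failMem}, true)
	if len(reasons) != 2 {
		t.Errorf("expected both predicates to be exercised, got %v", reasons)
	}
}
```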
/test pull-kubernetes-unit
I'm ready.
/lgtm Thanks, @wgliang!
/assign @brendandburns @eparis @thockin @zmerlynn
/approve
Add scheduler optimization options, short circuit all predicates if one predicate fails
@bsalamat There was a conflicting commit; it has been rebased. Please help remove the do-not-merge label and merge the PR. Thanks!
/lgtm |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bsalamat, k82cn, ravisantoshgudimetla, thockin, wgliang
Associated issue: #56889
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing /approve in a comment.
/test all
[submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
Add scheduler optimization options, short circuit all predicates if one predicate fails
Signed-off-by: Wang Guoliang iamwgliang@gmail.com
What this PR does / why we need it:
Short circuit all predicates if one predicate fails.
I think we can add a switch to control it. Some scenarios do not need to know all the causes of failure and can get a great performance improvement; if you need to fully understand the reasons for failure and can accept the current performance, you can keep the current logic. This switch should be exposed to the user.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #56889 and #48186
Special notes for your reviewer:
@davidopp
Release note: