
bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it #128307

Merged

Conversation

NoicFank
Contributor

@NoicFank NoicFank commented Oct 24, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Bug: preemption picks the wrong victim node, one with a higher-priority pod on it.

For example (please run the test "pod can be made schedulable on minHighestPriority node" to verify):

When podA with veryHighPriority fails scheduling, it tries to preempt other pods. Currently there are two nodes that meet the preemption requirements:

  • node1: with two victims, pod1(highPriority) & pod2(lowPriority); preempting pod1 does not violate a PDB, but preempting pod2 does.
  • node2: with one victim, pod3(midPriority); preempting pod3 violates a PDB.

Both node1 & node2 have one pod whose preemption violates a PDB. A node with the minimum highest-priority victim is then picked; currently the highest-priority victim on:

  • node1 is highPriority(pod1)
  • node2 is midPriority(pod3)

So preemption should clearly pick node2, but it currently selects the wrong node, node1, and pod1 with highPriority is preempted instead of pod3 with midPriority.
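
To make the failure concrete, here is a minimal, self-contained sketch of the "minimum highest-priority victim" comparison, with toy types and made-up priority values (1000/500/100 for high/mid/low). It is not the scheduler's actual code, only an illustration of why the comparison goes wrong when node1's victims are not kept sorted by priority:

package main

import "fmt"

// victim is a toy stand-in for a preemption victim; the names and
// priority values here are hypothetical.
type victim struct {
	name     string
	priority int32
}

// pickNode mimics only the "minimum highest-priority victim" rule: it reads
// victims[0] as the highest-priority victim on each node, which is only
// valid when every slice is sorted by priority from high to low.
func pickNode(nodesToVictims map[string][]victim) string {
	best := ""
	bestPrio := int32(1<<31 - 1)
	for node, victims := range nodesToVictims {
		highest := victims[0].priority
		if highest < bestPrio {
			best, bestPrio = node, highest
		}
	}
	return best
}

func main() {
	// node1's victims ended up ordered "PDB-violating first" rather than
	// "priority first", so pod2 (low) shadows pod1 (high).
	unsorted := map[string][]victim{
		"node1": {{"pod2", 100}, {"pod1", 1000}},
		"node2": {{"pod3", 500}},
	}
	fmt.Println(pickNode(unsorted)) // node1: the wrong node

	sorted := map[string][]victim{
		"node1": {{"pod1", 1000}, {"pod2", 100}},
		"node2": {{"pod3", 500}},
	}
	fmt.Println(pickNode(sorted)) // node2: the expected node
}

This is why the fix re-establishes the high-to-low priority ordering of the victims before the nodes are compared.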

Which issue(s) this PR fixes:

Fixes #NONE

Special notes for your reviewer:

@Huang-Wei @kerthcet please review, thanks a lot. Does this bug need to be cherry-picked to lower k8s versions?

Does this PR introduce a user-facing change?

Fixed a suboptimal scheduler preemption behavior where potential preemption victims were violating Pod Disruption Budgets.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Oct 24, 2024

linux-foundation-easycla bot commented Oct 24, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: NoicFank / name: Dingzhu Lurong (68f7a7c)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 24, 2024
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2024
@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch from c1fdbbb to dadba22 on October 24, 2024 06:47
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 24, 2024
@NoicFank
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 24, 2024
@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch from dadba22 to a458682 on October 24, 2024 06:55
@kerthcet
Member

/assign

@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch 2 times, most recently from 5d42abd to 52de4e4 on October 24, 2024 08:56
@NoicFank
Contributor Author

/retest

@sftim
Contributor

sftim commented Oct 27, 2024

Could a cluster administrator notice this fix has been made? If so, we should changelog it.

Member

@Huang-Wei Huang-Wei left a comment

Thanks for catching this. Really appreciate the fix, and definitely the UT!

@@ -225,6 +227,9 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
return nil, 0, framework.AsStatus(err)
}
}

// Sort victims after reprieving pods to keep the victims sorted by pod priority from high to low.
sort.Slice(victims, func(i, j int) bool { return util.MoreImportantPod(victims[i], victims[j]) })
Member

Logic is fine; however, we can potentially optimize this to avoid unnecessary sorting:

  1. Thorough fix to remove the sorting: in filterPodsWithPDBViolation(), return a custom struct that wraps PodInfo and carries an "isPDBViolated" flag so as to return a single slice; then we don't need two for loops, and there is no need to re-sort at the end of this function.
  2. Only do the sorting when len(violatingVictims) != 0 && len(nonViolatingVictims) != 0 (see the sketch after this comment).

cc @alculquicondor @sanposhiho
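
For reference, a minimal sketch of option 2 above, using toy types (the real code sorts framework.PodInfo victims with util.MoreImportantPod). The sort is skipped whenever only one of the two groups contributed victims, since each group is already in priority order:

package main

import (
	"fmt"
	"sort"
)

// victim is a toy stand-in; names and priorities are hypothetical.
type victim struct {
	name     string
	priority int32
}

// sortVictimsIfNeeded restores the "priority high to low" invariant, but
// only when both the PDB-violating and the non-violating group produced
// victims (roughly len(violatingVictims) != 0 && len(nonViolatingVictims) != 0);
// otherwise the slice is already sorted and the sort is skipped.
func sortVictimsIfNeeded(victims []victim, violatingCount, nonViolatingCount int) {
	if violatingCount == 0 || nonViolatingCount == 0 {
		return
	}
	sort.Slice(victims, func(i, j int) bool {
		return victims[i].priority > victims[j].priority
	})
}

func main() {
	// node1 from the PR description: pod2 (low, violating) was appended
	// before pod1 (high, non-violating).
	victims := []victim{{"pod2", 100}, {"pod1", 1000}}
	sortVictimsIfNeeded(victims, 1, 1)
	fmt.Println(victims) // [{pod1 1000} {pod2 100}]
}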

Contributor Author

OK, updated. Removed the additional sorting and reduced time complexity from O(n log n) to O(n).

Please review, thanks. Feel free to comment if there are any better ways.
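
For reference, one way a single-pass O(n) approach can work (purely an illustration, not necessarily the change that was actually pushed in this PR) is to merge the two already-priority-sorted groups instead of re-sorting the combined slice:

package main

import "fmt"

// victim is a toy stand-in; names and priorities are hypothetical.
type victim struct {
	name     string
	priority int32
}

// mergeByPriority merges two slices that are each already sorted by
// priority (high to low) into one slice with the same ordering, in
// O(len(a)+len(b)) time, so no sort.Slice call is needed afterwards.
func mergeByPriority(a, b []victim) []victim {
	out := make([]victim, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if a[i].priority >= b[j].priority {
			out = append(out, a[i])
			i++
		} else {
			out = append(out, b[j])
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	nonViolating := []victim{{"pod1", 1000}} // eviction keeps PDBs intact
	violating := []victim{{"pod2", 100}}     // eviction violates a PDB
	fmt.Println(mergeByPriority(nonViolating, violating)) // [{pod1 1000} {pod2 100}]
}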

Member

Maybe another sort is not a bad idea for the cherry-pick? What do you guys think? And we still need to loop twice, no?

Contributor Author

@NoicFank NoicFank Oct 29, 2024

Maybe another sort is not a bad idea for the cherry-pick? What do you guys think? And we still need to loop twice, no?

Thanks for the review~ I'm OK with both implementations. The current impl in this MR is more elegant: it drops the sorting and uses a single for loop instead, which reduces time complexity from O(n log n) to O(n). But adding another sort is less intrusive to the code, which helps with understanding and cherry-picking...

Looking forward to more good ideas or advice~

Member

The current impl doesn't qualify as a cherry-pickable fix. Actually, it doesn't quite follow proposal 1 in my previous comment.

In proposal 1, I suggested only changing filterPodsWithPDBViolation() to return one slice, something like this:

diff --git a/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go b/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
index f1c58de9389..a255aef74a8 100644
--- a/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
+++ b/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
@@ -135,6 +135,11 @@ func (pl *DefaultPreemption) CandidatesToVictimsMap(candidates []preemption.Cand
 	return m
 }
 
+type potentialVictim struct {
+	podInfo     *framework.PodInfo
+	pdbViolated bool
+}
+
 // SelectVictimsOnNode finds minimum set of pods on the given node that should be preempted in order to make enough room
 // for "pod" to be scheduled.
 func (pl *DefaultPreemption) SelectVictimsOnNode(
@@ -195,7 +200,7 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
 	// Try to reprieve as many pods as possible. We first try to reprieve the PDB
 	// violating victims and then other non-violating ones. In both cases, we start
 	// from the highest priority victims.
-	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
+	finalVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
 	reprievePod := func(pi *framework.PodInfo) (bool, error) {
 		if err := addPod(pi); err != nil {
 			return false, err
@@ -212,19 +217,15 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
 		}
 		return fits, nil
 	}
-	for _, p := range violatingVictims {
-		if fits, err := reprievePod(p); err != nil {
+
+	for _, v := range finalVictims {
+		if fits, err := reprievePod(v.podInfo); err != nil {
 			return nil, 0, framework.AsStatus(err)
-		} else if !fits {
+		} else if !fits && v.pdbViolated {
 			numViolatingVictim++
 		}
 	}
-	// Now we try to reprieve non-violating victims.
-	for _, p := range nonViolatingVictims {
-		if _, err := reprievePod(p); err != nil {
-			return nil, 0, framework.AsStatus(err)
-		}
-	}
+
 	return victims, numViolatingVictim, framework.NewStatus(framework.Success)
 }
 
@@ -287,12 +288,13 @@ func podTerminatingByPreemption(p *v1.Pod) bool {
 // preempted.
 // This function is stable and does not change the order of received pods. So, if it
 // receives a sorted list, grouping will preserve the order of the input list.
-func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) (violatingPodInfos, nonViolatingPodInfos []*framework.PodInfo) {
+func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) []*potentialVictim {
 	pdbsAllowed := make([]int32, len(pdbs))
 	for i, pdb := range pdbs {
 		pdbsAllowed[i] = pdb.Status.DisruptionsAllowed
 	}
 
+	var potentialVictims []*potentialVictim
 	for _, podInfo := range podInfos {
 		pod := podInfo.Pod
 		pdbForPodIsViolated := false
@@ -326,13 +328,15 @@ func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.Po
 				}
 			}
 		}
+		v := &potentialVictim{
+			podInfo: podInfo,
+		}
 		if pdbForPodIsViolated {
-			violatingPodInfos = append(violatingPodInfos, podInfo)
-		} else {
-			nonViolatingPodInfos = append(nonViolatingPodInfos, podInfo)
+			v.pdbViolated = true
 		}
+		potentialVictims = append(potentialVictims, v)
 	}
-	return violatingPodInfos, nonViolatingPodInfos
+	return potentialVictims
 }
 
 func getPDBLister(informerFactory informers.SharedInformerFactory) policylisters.PodDisruptionBudgetLister {

Contributor Author

@Huang-Wei
Ah, I got you. But I found this impl may reprieve non-violating pods ahead of the violating ones.

For example, there are two suitable victims, podA(midPriority, nonViolatingPDB) & podB(lowPriority, violatingPDB), for scheduling podC(highPriority). Evicting either of them can make podC schedulable.

Clearly, we should reprieve podB, whose eviction violates a PDB, and preempt podA instead. But if we put podA & podB into one slice (finalVictims), we will reprieve podA first. Please correct me if I'm misunderstanding.
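
To spell out that concern with a toy model (hypothetical pods, a single CPU dimension, and a simplified fit check; none of this is the scheduler's real logic): starting from "everything evicted", pods are added back in the given order as long as the preemptor still fits, and whatever cannot be added back stays a victim.

package main

import "fmt"

// pod is a toy stand-in; names, priorities and CPU sizes are made up.
type pod struct {
	name        string
	priority    int32
	cpu         int64
	pdbViolated bool // would evicting this pod violate a PDB?
}

// reprieveInOrder mimics the reprieve loop: with all candidates initially
// evicted, try to add them back in the given order; a pod is reprieved if
// the preemptor still fits afterwards, otherwise it stays a victim.
func reprieveInOrder(order []pod, free, preemptorCPU int64) []pod {
	var victims []pod
	for _, p := range order {
		if free-p.cpu >= preemptorCPU {
			free -= p.cpu // reprieved: the pod keeps its resources
		} else {
			victims = append(victims, p)
		}
	}
	return victims
}

func main() {
	podA := pod{"podA", 500, 1, false} // midPriority, eviction does not violate a PDB
	podB := pod{"podB", 100, 1, true}  // lowPriority, eviction violates a PDB
	free, need := int64(2), int64(1)   // both evicted -> 2 CPU free; podC needs 1

	// Upstream order: PDB-violating victims are reprieved first.
	fmt.Println(reprieveInOrder([]pod{podB, podA}, free, need)) // [podA], 0 PDB violations

	// Single priority-ordered slice: podA (mid) is reprieved first.
	fmt.Println(reprieveInOrder([]pod{podA, podB}, free, need)) // [podB], 1 PDB violation
}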

Member

The current impl in this MR is more elegant: it drops the sorting and uses a single for loop instead, which reduces time complexity from O(n log n) to O(n)

But the n is different: one refers to the final victims, the other to the potential victims.

Member

@Huang-Wei Huang-Wei Oct 29, 2024

Yup, the above snippet was a rough idea to exercise proposal 1. You may need to tweak it a bit so that it still always favors the nodes with non-violating victims only.

Contributor Author

The current impl in this MR is more elegant: it drops the sorting and uses a single for loop instead, which reduces time complexity from O(n log n) to O(n)

But the n is different: one refers to the final victims, the other to the potential victims.

Yes, correct. Sorting the final victims is O(n1 log n1), while the single loop goes over all potential victims, O(n2), with n1 <= n2.

So sorting seems like a good way to implement this.

Member

Yup, let's go with sorting as a cherry-pickable fix, and re-evaluate a potential follow-up in the main branch.

pkg/scheduler/testing/wrappers.go (outdated review thread, resolved)
@Huang-Wei
Member

Does this bug need to be cherry-picked to lower k8s versions?

Yes, it does, as the behavior is not aligned with the code comments (L417):

// pickOneNodeForPreemption chooses one node among the given nodes.
// It assumes pods in each map entry are ordered by decreasing priority.
// If the scoreFuns is not empty, It picks a node based on score scoreFuns returns.
// If the scoreFuns is empty,
// It picks a node based on the following criteria:
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, node with the latest start time of all highest priority victims is picked.
// 6. If there are still ties, the first such node is picked (sort of randomly).
// The 'minNodes1' and 'minNodes2' are being reused here to save the memory
// allocation and garbage collection time.
func pickOneNodeForPreemption(logger klog.Logger, nodesToVictims map[string]*extenderv1.Victims, scoreFuncs []func(node string) int64) string {
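
Since this comment relies on the "ordered by decreasing priority" invariant, a unit test for the fix could assert it directly with a small helper along these lines (hypothetical helper, not code from this PR):

package main

import "fmt"

// isSortedByDecreasingPriority reports whether a victims' priority list is
// in non-increasing order, i.e. the invariant pickOneNodeForPreemption's
// doc comment assumes. Hypothetical test helper, not part of this PR.
func isSortedByDecreasingPriority(priorities []int32) bool {
	for i := 1; i < len(priorities); i++ {
		if priorities[i] > priorities[i-1] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isSortedByDecreasingPriority([]int32{100, 1000})) // false: node1 before the fix
	fmt.Println(isSortedByDecreasingPriority([]int32{1000, 100})) // true: node1 after the fix
}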

@Huang-Wei
Member

Could a cluster administrator notice this fix has been made? If so, we should changelog it.

Yes, we should add a changelog, something like:

Fixed a suboptimal scheduler preemption behavior where potential preemption victims were violating Pod Disruption Budgets.

@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch 2 times, most recently from 90df3b3 to a5875d4 on October 28, 2024 07:51
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, NoicFank

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2024
…ority pod on it.

Introducing PDBs to preemption disrupted the ordering of pods in the victims list,
which could lead to picking the wrong victim node, one with a higher-priority pod on it.
@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch from bda645e to 68f7a7c on October 29, 2024 11:51
@NoicFank
Contributor Author

/retest

@Huang-Wei
Member

/lgtm

@NoicFank could you help cherry-pick it to all supported lower versions?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c481f3d6dfec532e1997d2e340984bde24f05b34

@Huang-Wei
Member

/retest

@AxeZhan
Member

AxeZhan commented Oct 30, 2024

Of course! I cherry-picked this fix into the release branches from 1.27 to 1.31. Do we need to backport to k8s versions lower than 1.27?

Nope. Also 1.27 is no longer supported, so we can close #128437 here.

@NoicFank
Contributor Author

Of course! I cherry-picked this fix into the release branches from 1.27 to 1.31. Do we need to backport to k8s versions lower than 1.27?

Nope. Also 1.27 is no longer supported, so we can close #128437 here.

OK, thanks for the reply. Already closed.

k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…8307-upstream-release-1.31

Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…8307-upstream-release-1.29

Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…8307-upstream-release-1.30

Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
6 participants