
bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it #128307

Merged

Conversation

NoicFank
Contributor

@NoicFank NoicFank commented Oct 24, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

Bug: preemption picks the wrong victim node, one with a higher-priority pod on it.

For example (please run the test "pod can be made schedulable on minHighestPriority node" to verify):

When podA with veryHighPriority fails scheduling, it tries to preempt other pods. Currently there are two nodes that meet the preemption requirements:

  • node1: with two victims, pod1(highPriority) & pod2(lowPriority); preempting pod1 does not violate a PDB, but preempting pod2 does.
  • node2: with one victim, pod3(midPriority); preempting pod3 violates a PDB.

Both node1 & node2 have one pod whose preemption violates a PDB. A node with the minimum highest-priority victim is then picked; currently the highest-priority victim on:

  • node1 is highPriority(pod1)
  • node2 is midPriority(pod3)

So preemption should clearly pick node2, but it currently selects the wrong node, node1, and pod1 with highPriority is preempted instead of pod3 with midPriority.
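
To make the failure concrete, here is a minimal, self-contained sketch of the "minimum highest-priority victim" comparison, with toy types and made-up priority values (1000/500/100 for high/mid/low). It is not the scheduler's actual code, only an illustration of why the comparison goes wrong when node1's victims are not kept sorted by priority:

package main

import "fmt"

// victim is a toy stand-in for a preemption victim; the names and
// priority values here are hypothetical.
type victim struct {
	name     string
	priority int32
}

// pickNode mimics only the "minimum highest-priority victim" rule: it reads
// victims[0] as the highest-priority victim on each node, which is only
// valid when every slice is sorted by priority from high to low.
func pickNode(nodesToVictims map[string][]victim) string {
	best := ""
	bestPrio := int32(1<<31 - 1)
	for node, victims := range nodesToVictims {
		highest := victims[0].priority
		if highest < bestPrio {
			best, bestPrio = node, highest
		}
	}
	return best
}

func main() {
	// node1's victims ended up ordered "PDB-violating first" rather than
	// "priority first", so pod2 (low) shadows pod1 (high).
	unsorted := map[string][]victim{
		"node1": {{"pod2", 100}, {"pod1", 1000}},
		"node2": {{"pod3", 500}},
	}
	fmt.Println(pickNode(unsorted)) // node1: the wrong node

	sorted := map[string][]victim{
		"node1": {{"pod1", 1000}, {"pod2", 100}},
		"node2": {{"pod3", 500}},
	}
	fmt.Println(pickNode(sorted)) // node2: the expected node
}

This is why the fix re-establishes the high-to-low priority ordering of the victims before the nodes are compared.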

Which issue(s) this PR fixes:

Fixes #NONE

Special notes for your reviewer:

@Huang-Wei @kerthcet please review, thanks a lot. Does this bug need to be cherry-picked to lower k8s versions?

Does this PR introduce a user-facing change?

Fixed a suboptimal scheduler preemption behavior where potential preemption victims were violating Pod Disruption Budgets.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Oct 24, 2024

linux-foundation-easycla bot commented Oct 24, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: NoicFank / name: Dingzhu Lurong (68f7a7c)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 24, 2024
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2024
@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch from c1fdbbb to dadba22 on October 24, 2024 06:47
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 24, 2024
@NoicFank
Contributor Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 24, 2024
@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch from dadba22 to a458682 on October 24, 2024 06:55
@kerthcet
Member

/assign

@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch 2 times, most recently from 5d42abd to 52de4e4 on October 24, 2024 08:56
@NoicFank
Contributor Author

/retest

@sftim
Contributor

sftim commented Oct 27, 2024

Could a cluster administrator notice this fix has been made? If so, we should changelog it.

Member

@Huang-Wei Huang-Wei left a comment

Thanks for catching this. Really appreciate the fix, and definitely the UT!

@@ -225,6 +227,9 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
return nil, 0, framework.AsStatus(err)
}
}

// Sort victims after reprieving pods to keep the victims sorted by pod priority from high to low.
sort.Slice(victims, func(i, j int) bool { return util.MoreImportantPod(victims[i], victims[j]) })
Member

Logic is fine; however, we can potentially optimize this to avoid unnecessary sorting:

  1. Thorough fix to remove the sorting: in filterPodsWithPDBViolation(), return a custom struct that wraps PodInfo and carries an "isPDBViolated" flag so as to return a single slice; then we don't need two for loops, and there is no need to re-sort at the end of this function.
  2. Only do the sorting when len(violatingVictims) != 0 && len(nonViolatingVictims) != 0 (see the sketch after this comment).

cc @alculquicondor @sanposhiho
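
For reference, a minimal sketch of option 2 above, using toy types (the real code sorts framework.PodInfo victims with util.MoreImportantPod). The sort is skipped whenever only one of the two groups contributed victims, since each group is already in priority order:

package main

import (
	"fmt"
	"sort"
)

// victim is a toy stand-in; names and priorities are hypothetical.
type victim struct {
	name     string
	priority int32
}

// sortVictimsIfNeeded restores the "priority high to low" invariant, but
// only when both the PDB-violating and the non-violating group produced
// victims (roughly len(violatingVictims) != 0 && len(nonViolatingVictims) != 0);
// otherwise the slice is already sorted and the sort is skipped.
func sortVictimsIfNeeded(victims []victim, violatingCount, nonViolatingCount int) {
	if violatingCount == 0 || nonViolatingCount == 0 {
		return
	}
	sort.Slice(victims, func(i, j int) bool {
		return victims[i].priority > victims[j].priority
	})
}

func main() {
	// node1 from the PR description: pod2 (low, violating) was appended
	// before pod1 (high, non-violating).
	victims := []victim{{"pod2", 100}, {"pod1", 1000}}
	sortVictimsIfNeeded(victims, 1, 1)
	fmt.Println(victims) // [{pod1 1000} {pod2 100}]
}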

Contributor Author

OK, updated. Removed the additional sorting and reduced time complexity from O(n log n) to O(n).

Please review, thanks. Feel free to comment if there are any better ways.
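
For reference, one way a single-pass O(n) approach can work (purely an illustration, not necessarily the change that was actually pushed in this PR) is to merge the two already-priority-sorted groups instead of re-sorting the combined slice:

package main

import "fmt"

// victim is a toy stand-in; names and priorities are hypothetical.
type victim struct {
	name     string
	priority int32
}

// mergeByPriority merges two slices that are each already sorted by
// priority (high to low) into one slice with the same ordering, in
// O(len(a)+len(b)) time, so no sort.Slice call is needed afterwards.
func mergeByPriority(a, b []victim) []victim {
	out := make([]victim, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if a[i].priority >= b[j].priority {
			out = append(out, a[i])
			i++
		} else {
			out = append(out, b[j])
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	nonViolating := []victim{{"pod1", 1000}} // eviction keeps PDBs intact
	violating := []victim{{"pod2", 100}}     // eviction violates a PDB
	fmt.Println(mergeByPriority(nonViolating, violating)) // [{pod1 1000} {pod2 100}]
}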

Member

Maybe another sort is not a bad idea for the cherry-pick? What do you guys think? And we still need to loop twice, no?

Contributor Author

@NoicFank NoicFank Oct 29, 2024

Maybe another sort is not a bad idea for the cherry-pick? What do you guys think? And we still need to loop twice, no?

Thanks for the review~ I'm OK with both implementations. The current impl in this MR is more elegant: it drops the sorting and uses a single for loop instead, which reduces time complexity from O(n log n) to O(n). But adding another sort is less intrusive to the code, which helps with understanding and cherry-picking...

Looking forward to more good ideas or advice~

Member

The current impl doesn't qualify as a cherry-pickable fix. Actually, it doesn't quite follow proposal 1 in my previous comment.

In proposal 1, I suggested only changing filterPodsWithPDBViolation() to return one slice, something like this:

diff --git a/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go b/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
index f1c58de9389..a255aef74a8 100644
--- a/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
+++ b/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
@@ -135,6 +135,11 @@ func (pl *DefaultPreemption) CandidatesToVictimsMap(candidates []preemption.Cand
 	return m
 }
 
+type potentialVictim struct {
+	podInfo     *framework.PodInfo
+	pdbViolated bool
+}
+
 // SelectVictimsOnNode finds minimum set of pods on the given node that should be preempted in order to make enough room
 // for "pod" to be scheduled.
 func (pl *DefaultPreemption) SelectVictimsOnNode(
@@ -195,7 +200,7 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
 	// Try to reprieve as many pods as possible. We first try to reprieve the PDB
 	// violating victims and then other non-violating ones. In both cases, we start
 	// from the highest priority victims.
-	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
+	finalVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
 	reprievePod := func(pi *framework.PodInfo) (bool, error) {
 		if err := addPod(pi); err != nil {
 			return false, err
@@ -212,19 +217,15 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
 		}
 		return fits, nil
 	}
-	for _, p := range violatingVictims {
-		if fits, err := reprievePod(p); err != nil {
+
+	for _, v := range finalVictims {
+		if fits, err := reprievePod(v.podInfo); err != nil {
 			return nil, 0, framework.AsStatus(err)
-		} else if !fits {
+		} else if !fits && v.pdbViolated {
 			numViolatingVictim++
 		}
 	}
-	// Now we try to reprieve non-violating victims.
-	for _, p := range nonViolatingVictims {
-		if _, err := reprievePod(p); err != nil {
-			return nil, 0, framework.AsStatus(err)
-		}
-	}
+
 	return victims, numViolatingVictim, framework.NewStatus(framework.Success)
 }
 
@@ -287,12 +288,13 @@ func podTerminatingByPreemption(p *v1.Pod) bool {
 // preempted.
 // This function is stable and does not change the order of received pods. So, if it
 // receives a sorted list, grouping will preserve the order of the input list.
-func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) (violatingPodInfos, nonViolatingPodInfos []*framework.PodInfo) {
+func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) []*potentialVictim {
 	pdbsAllowed := make([]int32, len(pdbs))
 	for i, pdb := range pdbs {
 		pdbsAllowed[i] = pdb.Status.DisruptionsAllowed
 	}
 
+	var potentialVictims []*potentialVictim
 	for _, podInfo := range podInfos {
 		pod := podInfo.Pod
 		pdbForPodIsViolated := false
@@ -326,13 +328,15 @@ func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.Po
 				}
 			}
 		}
+		v := &potentialVictim{
+			podInfo: podInfo,
+		}
 		if pdbForPodIsViolated {
-			violatingPodInfos = append(violatingPodInfos, podInfo)
-		} else {
-			nonViolatingPodInfos = append(nonViolatingPodInfos, podInfo)
+			v.pdbViolated = true
 		}
+		potentialVictims = append(potentialVictims, v)
 	}
-	return violatingPodInfos, nonViolatingPodInfos
+	return potentialVictims
 }
 
 func getPDBLister(informerFactory informers.SharedInformerFactory) policylisters.PodDisruptionBudgetLister {

Contributor Author

@Huang-Wei
Ah, I got you. But I found this impl may reprieve non-violating pods ahead of the violating ones.

For example, there are two suitable victims, podA(midPriority, nonViolatingPDB) & podB(lowPriority, violatingPDB), for scheduling podC(highPriority). Evicting either of them can make podC schedulable.

Clearly, we should reprieve podB, whose eviction violates a PDB, and preempt podA instead. But if we put podA & podB into one slice (finalVictims), we will reprieve podA first. Please correct me if I'm misunderstanding.
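
To spell out that concern with a toy model (hypothetical pods, a single CPU dimension, and a simplified fit check; none of this is the scheduler's real logic): starting from "everything evicted", pods are added back in the given order as long as the preemptor still fits, and whatever cannot be added back stays a victim.

package main

import "fmt"

// pod is a toy stand-in; names, priorities and CPU sizes are made up.
type pod struct {
	name        string
	priority    int32
	cpu         int64
	pdbViolated bool // would evicting this pod violate a PDB?
}

// reprieveInOrder mimics the reprieve loop: with all candidates initially
// evicted, try to add them back in the given order; a pod is reprieved if
// the preemptor still fits afterwards, otherwise it stays a victim.
func reprieveInOrder(order []pod, free, preemptorCPU int64) []pod {
	var victims []pod
	for _, p := range order {
		if free-p.cpu >= preemptorCPU {
			free -= p.cpu // reprieved: the pod keeps its resources
		} else {
			victims = append(victims, p)
		}
	}
	return victims
}

func main() {
	podA := pod{"podA", 500, 1, false} // midPriority, eviction does not violate a PDB
	podB := pod{"podB", 100, 1, true}  // lowPriority, eviction violates a PDB
	free, need := int64(2), int64(1)   // both evicted -> 2 CPU free; podC needs 1

	// Upstream order: PDB-violating victims are reprieved first.
	fmt.Println(reprieveInOrder([]pod{podB, podA}, free, need)) // [podA], 0 PDB violations

	// Single priority-ordered slice: podA (mid) is reprieved first.
	fmt.Println(reprieveInOrder([]pod{podA, podB}, free, need)) // [podB], 1 PDB violation
}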

Member

The current impl in this MR is more elegant: it drops the sorting and uses a single for loop instead, which reduces time complexity from O(n log n) to O(n)

But the n is different: one refers to the final victims, the other to the potential victims.

Member

@Huang-Wei Huang-Wei Oct 29, 2024

Yup, the above snippet was a rough idea to exercise proposal 1. You may need to tweak it a bit so that it still always favors the nodes with non-violating victims only.

Contributor Author

The current impl in this MR is more elegant: it drops the sorting and uses a single for loop instead, which reduces time complexity from O(n log n) to O(n)

But the n is different: one refers to the final victims, the other to the potential victims.

Yes, correct. Sorting the final victims is O(n1 log n1), while the single loop goes over all potential victims, O(n2), with n1 <= n2.

So sorting seems like a good way to implement this.

Member

Yup, let's go with sorting as a cherry-pickable fix, and re-evaluate a potential follow-up in the main branch.

pkg/scheduler/testing/wrappers.go (outdated review thread, resolved)
@Huang-Wei
Member

Does this bug need to be cherry-picked to lower k8s versions?

Yes, it does, as the behavior is not aligned with the code comments (L417):

// pickOneNodeForPreemption chooses one node among the given nodes.
// It assumes pods in each map entry are ordered by decreasing priority.
// If the scoreFuns is not empty, It picks a node based on score scoreFuns returns.
// If the scoreFuns is empty,
// It picks a node based on the following criteria:
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, node with the latest start time of all highest priority victims is picked.
// 6. If there are still ties, the first such node is picked (sort of randomly).
// The 'minNodes1' and 'minNodes2' are being reused here to save the memory
// allocation and garbage collection time.
func pickOneNodeForPreemption(logger klog.Logger, nodesToVictims map[string]*extenderv1.Victims, scoreFuncs []func(node string) int64) string {
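
Since this comment relies on the "ordered by decreasing priority" invariant, a unit test for the fix could assert it directly with a small helper along these lines (hypothetical helper, not code from this PR):

package main

import "fmt"

// isSortedByDecreasingPriority reports whether a victims' priority list is
// in non-increasing order, i.e. the invariant pickOneNodeForPreemption's
// doc comment assumes. Hypothetical test helper, not part of this PR.
func isSortedByDecreasingPriority(priorities []int32) bool {
	for i := 1; i < len(priorities); i++ {
		if priorities[i] > priorities[i-1] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isSortedByDecreasingPriority([]int32{100, 1000})) // false: node1 before the fix
	fmt.Println(isSortedByDecreasingPriority([]int32{1000, 100})) // true: node1 after the fix
}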

@Huang-Wei
Member

Could a cluster administrator notice this fix has been made? If so, we should changelog it.

Yes, we should add a changelog, something like:

Fixed a suboptimal scheduler preemption behavior where potential preemption victims were violating Pod Disruption Budgets.

@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch 2 times, most recently from 90df3b3 to a5875d4 on October 28, 2024 07:51
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, NoicFank

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2024
…ority pod on it.

Introducing PDBs to preemption disrupted the ordering of pods in the victims list,
which could lead to picking the wrong victim node, one with a higher-priority pod on it.
@NoicFank NoicFank force-pushed the bugfix-scheduler-preemption branch from bda645e to 68f7a7c on October 29, 2024 11:51
@NoicFank
Contributor Author

/retest

@Huang-Wei
Member

/lgtm

@NoicFank could you help cherry-pick it to all supported lower versions?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c481f3d6dfec532e1997d2e340984bde24f05b34

@Huang-Wei
Member

/retest

@AxeZhan
Member

AxeZhan commented Oct 30, 2024

Of course! I cherry-picked this fix into the release branches from 1.27 to 1.31. Do we need to backport to k8s versions lower than 1.27?

Nope. Also 1.27 is no longer supported, so we can close #128437 here.

@NoicFank
Contributor Author

Of course! I cherry-picked this fix into the release branches from 1.27 to 1.31. Do we need to backport to k8s versions lower than 1.27?

Nope. Also 1.27 is no longer supported, so we can close #128437 here.

OK, thanks for the reply. Already closed.

k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…8307-upstream-release-1.31

Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…8307-upstream-release-1.29

Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
k8s-ci-robot added a commit that referenced this pull request Nov 12, 2024
…8307-upstream-release-1.30

Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
6 participants