bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it #128307
Conversation
force-pushed from c1fdbbb to dadba22
/triage accepted
force-pushed from dadba22 to a458682
/assign
force-pushed from 5d42abd to 52de4e4
/retest
Could a cluster administrator notice that this fix has been made? If so, we should add a changelog entry.
Thanks for the catch. Really appreciate the fix, and definitely the UT!
@@ -225,6 +227,9 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
 			return nil, 0, framework.AsStatus(err)
 		}
 	}
+
+	// Sort victims after reprieving pods to keep the victims sorted by pod priority from high to low.
+	sort.Slice(victims, func(i, j int) bool { return util.MoreImportantPod(victims[i], victims[j]) })
Logic is fine; however, we can potentially optimize this to avoid unnecessary sorting:
- Thorough fix to remove the sorting: in filterPodsWithPDBViolation(), return a custom struct that wraps PodInfo and carries an "isPDBViolated" flag so as to return a single slice; hence we don't need two for loops, and hence no need to re-sort at the end of this function.
- Only do the sorting when len(violatingVictims) != 0 && len(nonViolatingVictims) != 0 (a sketch of this option follows below).
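A minimal sketch of the second option under simplified assumptions: fakePod, moreImportant, and the example pods are hypothetical stand-ins for the scheduler's framework.PodInfo and util.MoreImportantPod, and this is not the change that was merged.

// Illustrative sketch of the "conditional sort" option only, not the merged
// change: re-sort the combined victims only when both groups contributed,
// because two slices that are each already sorted by priority can only
// interleave when both are non-empty.
package main

import (
	"fmt"
	"sort"
)

type fakePod struct {
	name     string
	priority int32
}

func moreImportant(a, b fakePod) bool { return a.priority > b.priority }

func main() {
	// Each group is already sorted by priority, highest first.
	violatingVictims := []fakePod{{"pod-b", 800}, {"pod-d", 100}}
	nonViolatingVictims := []fakePod{{"pod-a", 500}}

	victims := append(append([]fakePod{}, violatingVictims...), nonViolatingVictims...)

	// Only re-sort when the two groups can actually interleave.
	if len(violatingVictims) != 0 && len(nonViolatingVictims) != 0 {
		sort.Slice(victims, func(i, j int) bool { return moreImportant(victims[i], victims[j]) })
	}

	fmt.Println(victims) // [{pod-b 800} {pod-a 500} {pod-d 100}]
}

The guard works because each group is already sorted on its own, so concatenating them can only break the overall ordering when both groups are non-empty.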
OK, updated. Removed the additional sorting and reduced the time complexity from O(n log n) to O(n).
Please review, thanks. Feel free to comment if there are better ways.
Maybe adding another sort is not a bad idea for the cherry-pick? What do you guys think? And we still need to loop twice, no?
Maybe adding another sort is not a bad idea for the cherry-pick? What do you guys think? And we still need to loop twice, no?

Thanks for the review~ I'm OK with both impls. The current impl in this PR is more elegant: it drops the sorting and uses one for loop instead, which reduces the time complexity from O(n log n) to O(n). But adding another sort touches less code, which makes it easier to understand and cherry-pick...
Looking forward to more ideas or advice~
The current impl doesn't qualify as a cherry-pickable fix. Actually, it doesn't quite follow proposal 1 in my previous comment.
In proposal 1, I suggested only changing filterPodsWithPDBViolation() to return one slice, something like this:
diff --git a/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go b/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
index f1c58de9389..a255aef74a8 100644
--- a/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
+++ b/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
@@ -135,6 +135,11 @@ func (pl *DefaultPreemption) CandidatesToVictimsMap(candidates []preemption.Cand
return m
}
+type potentialVictim struct {
+ podInfo *framework.PodInfo
+ pdbViolated bool
+}
+
// SelectVictimsOnNode finds minimum set of pods on the given node that should be preempted in order to make enough room
// for "pod" to be scheduled.
func (pl *DefaultPreemption) SelectVictimsOnNode(
@@ -195,7 +200,7 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
// Try to reprieve as many pods as possible. We first try to reprieve the PDB
// violating victims and then other non-violating ones. In both cases, we start
// from the highest priority victims.
- violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
+ finalVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
reprievePod := func(pi *framework.PodInfo) (bool, error) {
if err := addPod(pi); err != nil {
return false, err
@@ -212,19 +217,15 @@ func (pl *DefaultPreemption) SelectVictimsOnNode(
}
return fits, nil
}
- for _, p := range violatingVictims {
- if fits, err := reprievePod(p); err != nil {
+
+ for _, v := range finalVictims {
+ if fits, err := reprievePod(v.podInfo); err != nil {
return nil, 0, framework.AsStatus(err)
- } else if !fits {
+ } else if !fits && v.pdbViolated {
numViolatingVictim++
}
}
- // Now we try to reprieve non-violating victims.
- for _, p := range nonViolatingVictims {
- if _, err := reprievePod(p); err != nil {
- return nil, 0, framework.AsStatus(err)
- }
- }
+
return victims, numViolatingVictim, framework.NewStatus(framework.Success)
}
@@ -287,12 +288,13 @@ func podTerminatingByPreemption(p *v1.Pod) bool {
// preempted.
// This function is stable and does not change the order of received pods. So, if it
// receives a sorted list, grouping will preserve the order of the input list.
-func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) (violatingPodInfos, nonViolatingPodInfos []*framework.PodInfo) {
+func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.PodDisruptionBudget) []*potentialVictim {
pdbsAllowed := make([]int32, len(pdbs))
for i, pdb := range pdbs {
pdbsAllowed[i] = pdb.Status.DisruptionsAllowed
}
+ var potentialVictims []*potentialVictim
for _, podInfo := range podInfos {
pod := podInfo.Pod
pdbForPodIsViolated := false
@@ -326,13 +328,15 @@ func filterPodsWithPDBViolation(podInfos []*framework.PodInfo, pdbs []*policy.Po
}
}
}
+ v := &potentialVictim{
+ podInfo: podInfo,
+ }
if pdbForPodIsViolated {
- violatingPodInfos = append(violatingPodInfos, podInfo)
- } else {
- nonViolatingPodInfos = append(nonViolatingPodInfos, podInfo)
+ v.pdbViolated = true
}
+ potentialVictims = append(potentialVictims, v)
}
- return violatingPodInfos, nonViolatingPodInfos
+ return potentialVictims
}
func getPDBLister(informerFactory informers.SharedInformerFactory) policylisters.PodDisruptionBudgetLister {
@Huang-Wei
Ahh, I get you. But I found this impl may reprieve non-violating pods instead of the violating ones.
For example, there are two suitable victims, podA (midPriority, non-violating PDB) and podB (lowPriority, violating PDB), when scheduling podC (highPriority). Evicting either of them makes podC schedulable.
Clearly, we should reprieve podB, whose eviction would violate its PDB, and preempt podA instead. But if we put podA and podB into one slice (finalVictims), we will reprieve podA first. Please correct me if I'm misunderstanding.
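A toy simulation of that scenario (not the real scheduler code); it assumes a capacity model where keeping any single victim evicted is enough for podC to fit, and the pod names and priorities are only illustrative.

// Toy simulation of the reprieve order. With one merged slice ordered purely
// by priority, podA (non-violating) is reprieved first, so podB
// (PDB-violating) ends up as the victim.
package main

import "fmt"

type victim struct {
	name        string
	priority    int32
	pdbViolated bool
}

func main() {
	// Merged slice sorted by priority, highest first: podA comes before podB.
	merged := []victim{
		{"podA", 500, false}, // mid priority, non-violating
		{"podB", 100, true},  // low priority, PDB-violating
	}

	evicted := len(merged)                      // both start out evicted
	fits := func() bool { return evicted >= 1 } // podC fits while at least one stays evicted

	var finalVictims []victim
	for _, v := range merged {
		evicted-- // tentatively add the pod back (reprieve attempt)
		if !fits() {
			evicted++ // still needed as a victim
			finalVictims = append(finalVictims, v)
		}
	}
	fmt.Println(finalVictims) // [{podB 100 true}] -- the PDB-violating pod is preempted
}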
The current impl in this PR is more elegant: it drops the sorting and uses one for loop instead, which reduces the time complexity from O(n log n) to O(n)

But the n is different: one refers to the final victims, the other to the potential victims.
Yup, the above snippet was a rough idea to exercise proposal 1. You may need to tweak it a bit to adapt to the case of always favoring the nodes with only non-violating victims.
The current impl in this PR is more elegant: it drops the sorting and uses one for loop instead, which reduces the time complexity from O(n log n) to O(n)

But the n is different: one refers to the final victims, the other to the potential victims.

Yes, correct. Sorting the final victims costs O(n1 log n1), while the single loop goes over all potential victims in O(n2), and n1 <= n2.
Sorting seems like a good way to implement it.
Yup, let's go with sorting as a cherry-pickable fix, and re-evaluate a potential follow-up in the main branch.
Yes, it needs to, as the behavior is not aligned with the code comments (L417): kubernetes/pkg/scheduler/framework/preemption/preemption.go, lines 411 to 424 in 1148e5e
Yes, we should add a changelog, something like:
force-pushed from 90df3b3 to a5875d4
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kerthcet, NoicFank
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
…ority pod on it. Introducing PDB into preemption disrupted the ordering of pods in the victims list, which could lead to picking the wrong victim node with a higher-priority pod on it.
force-pushed from bda645e to 68f7a7c
/retest
/lgtm
@NoicFank could you help cherry-pick it to all supported lower versions?
LGTM label has been added. Git tree hash: c481f3d6dfec532e1997d2e340984bde24f05b34
/retest
Of course! I'll cherry-pick this fix into the release branches from 1.27 to 1.31. Do we need to backport to k8s versions lower than 1.27?
Nope. Also 1.27 is no longer supported, so we can close #128437 here.
OK, thanks for the reply. Already closed.
…8307-upstream-release-1.31 Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
…8307-upstream-release-1.29 Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
…8307-upstream-release-1.30 Automated cherry pick of #128307: bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
What type of PR is this?
/kind bug
What this PR does / why we need it:
BUG: preemption picks wrong victim node with higher priority pod on it.
For example (please run the test "pod can be made schedulable on minHighestPriority node" to verify): when podA with veryHighPriority fails to schedule, it tries to preempt other pods. Currently there are two nodes that meet the preemption requirements:
Both node1 & node2 have one pod that would violate a PDB if preempted. Then the node with the minimum highest-priority victim is picked; currently the highest-priority victim on each node is:
Clearly, preemption should pick node2. But currently it selects the wrong node, node1, so pod1 with highPriority is preempted instead of pod3 with midPriority.
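A rough, self-contained illustration of the selection criterion involved; it is not the real pickOneNodeForPreemption, and pod2 plus the numeric priorities are made up for the example. It only shows why an unsorted victim list makes victims[0] misrepresent a node's highest-priority victim.

// Rough illustration: the "minimum highest-priority victim" comparison trusts
// victims[0] to be each node's highest-priority victim. If a node's victim
// list is unsorted, victims[0] may be a low-priority pod and the wrong node
// wins the comparison.
package main

import "fmt"

type pod struct {
	name     string
	priority int32
}

// pickNodeByMinHighestVictim returns the node whose victims[0] has the lowest
// priority, mimicking the criterion discussed above.
func pickNodeByMinHighestVictim(victimsPerNode map[string][]pod) string {
	best := ""
	bestPrio := int32(1<<31 - 1)
	for node, victims := range victimsPerNode {
		if len(victims) == 0 {
			continue
		}
		if victims[0].priority < bestPrio {
			best, bestPrio = node, victims[0].priority
		}
	}
	return best
}

func main() {
	// node1's victims are NOT sorted: pod1 (highPriority) hides behind pod2.
	unsorted := map[string][]pod{
		"node1": {{"pod2", 10}, {"pod1", 1000}}, // real highest-priority victim: pod1 (1000)
		"node2": {{"pod3", 500}},                // real highest-priority victim: pod3 (500)
	}
	fmt.Println(pickNodeByMinHighestVictim(unsorted)) // node1 -- wrong, pod1 gets preempted

	// With each node's victims kept sorted high -> low, node2 is preferred.
	sorted := map[string][]pod{
		"node1": {{"pod1", 1000}, {"pod2", 10}},
		"node2": {{"pod3", 500}},
	}
	fmt.Println(pickNodeByMinHighestVictim(sorted)) // node2 -- correct, pod3 (midPriority) is preempted
}

With the unsorted input, node1 looks cheaper than it really is and wins, which matches the wrongly preempted pod1; once the victims are kept sorted from high to low priority, node2 is picked as expected.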
Which issue(s) this PR fixes:
Fixes #NONE
Special notes for your reviewer:
@Huang-Wei @kerthcet please review, thanks a lot. And this bug may need a cherry-pick to lower k8s versions?
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: