-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix 'Schedulercache is corrupted' error #55262
Conversation
cc @kubernetes/sig-scheduling-bugs @kubernetes/sig-scheduling-pr-reviews |
/retest |
/assign @wojtek-t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm - minor comment.
t.Fatalf("AddPod failed: %v", err) | ||
} | ||
} | ||
cache.cleanupAssumedPods(now.Add(2 * ttl)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you up the ttl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is simulating passing of time to ensure the added pod doesn't expire. was copy/paste from previous testcase, though, and not really relevant to this one, so I can remove it
Removing label |
e747e16
to
a9b08fa
Compare
updated to remove unrelated ttl bit from unit test |
CRD Flake |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your change looks good to me, I just wonder if Scheduler resource accounting remains valid in such scenario where a pod is assumed to a different node that it is bound to.
} | ||
for _, podToUpdate := range tt.podsToUpdate { | ||
if err := cache.UpdatePod(podToUpdate[0], podToUpdate[1]); err != nil { | ||
t.Fatalf("AddPod failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/AddPod/UpdatePod/
a9b08fa
to
5048838
Compare
fixed gofmt error and test message, re-tagging |
It was just the podStates' version of the pod that wasn't getting updated. The removePod(currState.pod)/addPod(pod) dance corrects the per-node accounting already (and the unit test I added demonstrates the per-node accounting is corrected): kubernetes/plugin/pkg/scheduler/schedulercache/cache.go Lines 234 to 241 in 454074d
|
@liggitt looks like there's still a gofmt issue on |
Thanks, @liggitt! /lgtm |
5048838
to
a366e6c
Compare
actually fixed gofmt error |
reapplying LGTM |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bsalamat, cblecker, liggitt, timothysc Associated issue: 50916 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
uhhh..
That's an odd unit test failure. Let's retry. |
unit test failing: #55276 |
fingers crossed |
TestCRD flake |
…2-upstream-release-1.8 Automated cherry pick of #55262
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
Fixes #50916
If an Assume()ed pod is Add()ed with a different nodeName, the podStates view of the pod is not corrected to reflect the actual nodeName. On the next Update(), the scheduler observes the mismatch and process exits.