Fix a scheduler preemption issue where the victim isn't properly patched, leading to preemption not functioning as expected #126644

Huang-Wei · 2024-08-12T22:42:18Z

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

The victim pod's status was incorrectly patched, which blocks its subsequent deletion, so preemption doesn't work. It's a typo regression introduced in v1.29 in #121103.

Which issue(s) this PR fixes:

Reported by #126643

Special notes for your reviewer:

I didn't include a test because 1) it's an obvious typo, and 2) our unit and integration tests don't enforce API validation, so the victim would be deleted immediately after being patched, which makes the status hard to verify.

Does this PR introduce a user-facing change?

Fix a 1.29 scheduler preemption regression where the victim pod was not deleted due to incorrect status patching. This issue occurred when the preemptor and victim pods had different QoS classes in their status, causing the preemption to fail entirely.
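For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of how a status patch built from the wrong pod can be rejected. It is not the scheduler's code: the `Pod` and `PodStatus` types, `applyStatusPatch`, `markVictimBuggy`, and `markVictimFixed` are all hypothetical, and the sketch assumes that the typo built the victim's new status from the preemptor's status and that the API server rejects a status update that changes the QoS class.

```go
package main

import (
	"errors"
	"fmt"
)

// PodStatus is a hypothetical, heavily simplified stand-in for a pod's status.
type PodStatus struct {
	QOSClass   string
	Conditions []string
}

// Pod is a hypothetical, simplified pod: just a name and a status.
type Pod struct {
	Name   string
	Status PodStatus
}

// copyStatus returns an independent copy so the source pod is never mutated.
func copyStatus(s PodStatus) PodStatus {
	return PodStatus{
		QOSClass:   s.QOSClass,
		Conditions: append([]string(nil), s.Conditions...),
	}
}

// applyStatusPatch mimics, as an assumption, server-side validation that
// rejects any status update which changes the pod's QoS class.
func applyStatusPatch(target *Pod, newStatus PodStatus) error {
	if target.Status.QOSClass != newStatus.QOSClass {
		return errors.New("status.qosClass: field is immutable")
	}
	target.Status = newStatus
	return nil
}

// markVictimBuggy starts from the preemptor's status (the wrong pod),
// so the patch carries the preemptor's QoS class onto the victim.
func markVictimBuggy(preemptor, victim *Pod) error {
	newStatus := copyStatus(preemptor.Status)
	newStatus.Conditions = append(newStatus.Conditions, "DisruptionTarget")
	return applyStatusPatch(victim, newStatus)
}

// markVictimFixed starts from the victim's own status, so only the
// condition list changes.
func markVictimFixed(victim *Pod) error {
	newStatus := copyStatus(victim.Status)
	newStatus.Conditions = append(newStatus.Conditions, "DisruptionTarget")
	return applyStatusPatch(victim, newStatus)
}

func main() {
	preemptor := &Pod{Name: "preemptor", Status: PodStatus{QOSClass: "Guaranteed"}}
	victim := &Pod{Name: "victim", Status: PodStatus{QOSClass: "BestEffort"}}

	fmt.Println("buggy:", markVictimBuggy(preemptor, victim))
	fmt.Println("fixed:", markVictimFixed(victim))
}
```

In this sketch the buggy path fails only when the two pods have different QoS classes, which matches the condition described in the release note. When the classes happen to match, the patch is accepted even though it was built from the wrong pod.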

k8s-ci-robot · 2024-08-12T22:42:22Z

Please note that we're already in Test Freeze for the release-1.31 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.31.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Aug 12 17:37:05 UTC 2024.

k8s-ci-robot · 2024-08-12T22:42:27Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Huang-Wei · 2024-08-12T22:44:15Z

cc @alculquicondor @mimowo

xiazhan

Thanks for quick fix.

k8s-ci-robot · 2024-08-12T22:58:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, xiazhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [Huang-Wei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo · 2024-08-13T05:18:12Z

/lgtm
Thank you for the fix!

k8s-ci-robot · 2024-08-13T05:18:19Z

LGTM label has been added.

Git tree hash: 58e34b39a055e45292ac3fd14cb9a6ceff9a0fcd

alculquicondor · 2024-08-13T17:00:09Z

Are bugfixes candidates for test freeze?

alculquicondor · 2024-08-13T17:01:57Z

I suppose we have to wait for the next set of patch releases.

Huang-Wei · 2024-08-13T20:24:45Z

> I suppose we have to wait for the next set of patch releases.

Yup, it's not a regression introduced in 1.31. Let's wait for the code freeze to lift and then cherry-pick it back to 1.29 through 1.31.

alculquicondor · 2024-08-14T13:50:05Z

The freeze is over :)

alculquicondor · 2024-08-14T13:51:36Z

In the release notes:

> leading to preemption not functioning as expected

Can you be more specific? What is not functioning as expected? Does the preemption not occur at all, or does the status get wiped out in an unpredictable way?

Huang-Wei · 2024-08-14T18:43:00Z

> Can you be more specific? What is not functioning as expected? Does the preemption not occur at all, or does the status get wiped out in an unpredictable way?

Preemption does not occur at all: the faulty patch operation returns an Error, which aborts the whole scheduling cycle. Reworded the release notes.
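To illustrate why a rejected patch means no preemption at all, here is a small sketch of a patch-then-delete loop. It is an assumption-laden model, not the scheduler's implementation: the `cluster` type and the `evictVictims` function are hypothetical, and the only behavior taken from this thread is that a patch error is returned before the victim is deleted.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a hypothetical in-memory stand-in for the API server.
type cluster struct {
	pods        map[string]bool // pod name -> still exists
	rejectPatch bool            // simulates validation rejecting the status patch
}

// patchStatus fails when the simulated validation rejects the patch.
func (c *cluster) patchStatus(name string) error {
	if c.rejectPatch {
		return errors.New("status patch rejected")
	}
	return nil
}

// deletePod removes the pod from the simulated cluster.
func (c *cluster) deletePod(name string) {
	delete(c.pods, name)
}

// evictVictims patches each victim and then deletes it. A patch error is
// returned immediately, so the deletion for that victim never runs.
func evictVictims(c *cluster, victims []string) error {
	for _, v := range victims {
		if err := c.patchStatus(v); err != nil {
			return fmt.Errorf("patching victim %q: %w", v, err)
		}
		c.deletePod(v)
	}
	return nil
}

func main() {
	c := &cluster{pods: map[string]bool{"victim": true}, rejectPatch: true}

	err := evictVictims(c, []string{"victim"})
	fmt.Println("error:", err)
	fmt.Println("victim still exists:", c.pods["victim"])
}
```

With `rejectPatch` set, the call returns an error and the victim remains in the map, so the preemptor never gets the room it needs. Clearing the flag lets the same call delete the victim.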

alculquicondor · 2024-08-14T18:44:16Z

That sounds like a major problem, can you prepare cherry-picks?

alculquicondor · 2024-08-14T19:00:44Z

We also need a fix for 1.28 https://github.com/kubernetes/kubernetes/blob/release-1.28/pkg/scheduler/framework/preemption/preemption.go#L365, which hasn't reached EoL, and was broken by #121379

Huang-Wei · 2024-08-14T19:00:52Z

> That sounds like a major problem, can you prepare cherry-picks?

Yup, creating now.

Huang-Wei · 2024-08-14T19:01:41Z

> We also need a fix for 1.28 release-1.28/pkg/scheduler/framework/preemption/preemption.go#L365, which hasn't reached EoL, and was broken by #121379

Oops, let me create one for 1.28.

fix a scheduler preemption issue that victim is not patched properly

f6a11da

k8s-ci-robot added the cncf-cla: yes, do-not-merge/needs-sig, and needs-triage labels Aug 12, 2024

k8s-ci-robot added the needs-priority label Aug 12, 2024

k8s-ci-robot requested review from damemi and denkensk August 12, 2024 22:43

xiazhan approved these changes Aug 12, 2024

k8s-ci-robot assigned mimowo Aug 13, 2024

k8s-ci-robot added the lgtm label Aug 13, 2024

k8s-ci-robot merged commit 03e8154 into kubernetes:master Aug 14, 2024
14 checks passed

k8s-ci-robot added this to the v1.32 milestone Aug 14, 2024

Huang-Wei deleted the fix-preemption branch August 14, 2024 18:26

alculquicondor mentioned this pull request Aug 14, 2024

Use SSA for DisruptionTarget Pod conditions #122294

Closed

Huang-Wei mentioned this pull request Aug 14, 2024

Automated cherry pick of #126644: fix a scheduler preemption issue that victim is not patched #126695

Merged

k8s-ci-robot added a commit that referenced this pull request Aug 15, 2024

Merge pull request #126691 from Huang-Wei/automated-cherry-pick-of-#1…

60a402c

…26644-upstream-release-1.31 Automated cherry pick of #126644: fix a scheduler preemption issue that victim is not patched

k8s-ci-robot added a commit that referenced this pull request Aug 15, 2024

Merge pull request #126693 from Huang-Wei/automated-cherry-pick-of-#1…

cd0ea55

…26644-upstream-release-1.30 Automated cherry pick of #126644: fix a scheduler preemption issue that victim is not patched

k8s-ci-robot added a commit that referenced this pull request Aug 15, 2024

Merge pull request #126694 from Huang-Wei/automated-cherry-pick-of-#1…

32f2b29

…26644-upstream-release-1.29 Automated cherry pick of #126644: fix a scheduler preemption issue that victim is not patched

k8s-ci-robot added a commit that referenced this pull request Aug 15, 2024

Merge pull request #126695 from Huang-Wei/automated-cherry-pick-of-#1…

9f79836

…26644-upstream-release-1.28 Automated cherry pick of #126644: fix a scheduler preemption issue that victim is not patched

This was referenced Aug 15, 2024

Retriable and non-retriable Pod failures for Jobs kubernetes/enhancements#3329

Closed

kube-scheduler updates pod status mistakenly during preemption #126643

Closed
