[scheduler] absent key in NodeToStatusMap implies UnschedulableAndUnresolvable #125197

Merged
merged 2 commits into from
Jun 3, 2024

Conversation

gabesaba
Contributor

@gabesaba gabesaba commented May 29, 2024

What type of PR is this?

/kind bug
/kind regression
/kind api-change

What this PR does / why we need it:

#119779 fixed a bug but caused a performance regression (#124709), observed at 5k nodes. A performance fix, #124714, was merged and gave a modest improvement. We still observe reduced throughput when running a test with 15k nodes and 60k daemonset pods:

baseline (pre #119779): ~470 pods/s
current (with #124714): ~70 pods/s
more perf engineering: ~300 pods/s
this change: ~460 pods/s

This fix attempts to bring us back to baseline performance. We revert #124714, and part of #119779. We implement option 2 proposed here. While there are two unaddressed O(n) operations (1, 2), these haven't revealed themselves as performance problems in the wild. To keep this diff as small as possible for cherry-pick, we will defer the fix of those to a future minor version. This future change will require a breaking change to the NodeToStatusMap type, to allow better than O(n), or at least really fast O(n), representation of many nodes with the same status.

pair @mskrocki
/assign @alculquicondor, @liggitt, @Huang-Wei
/sig scheduling

Which issue(s) this PR fixes:

Fixes #124709

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixes a scheduling performance regression when many nodes exist and prefilter returns 1-2 nodes (e.g. daemonset scheduling)

ACTION REQUIRED: For developers of out-of-tree PostFilter plugins, note that the semantics of NodeToStatusMap are changing: A node with an absent value in the NodeToStatusMap should be interpreted as having an UnschedulableAndUnresolvable status
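
For out-of-tree plugin authors, the new lookup semantics could be sketched roughly as follows. This is a minimal illustration, not the scheduler's code: `Code`, `Status`, and `NodeToStatusMap` are simplified local stand-ins for the framework types, and `statusOrDefault` is a hypothetical helper name.

```go
package main

import "fmt"

// Code is a simplified stand-in for the scheduler framework's status code.
type Code int

const (
	Success Code = iota
	Unschedulable
	UnschedulableAndUnresolvable
)

// Status is a simplified stand-in for framework.Status.
type Status struct {
	Code   Code
	Reason string
}

// NodeToStatusMap is a simplified stand-in for framework.NodeToStatusMap.
type NodeToStatusMap map[string]*Status

// statusOrDefault is a hypothetical helper: it returns the recorded status for
// nodeName, or an UnschedulableAndUnresolvable status when the key is absent.
func statusOrDefault(m NodeToStatusMap, nodeName string) *Status {
	if s, ok := m[nodeName]; ok {
		return s
	}
	return &Status{Code: UnschedulableAndUnresolvable, Reason: "node absent from NodeToStatusMap"}
}

func main() {
	m := NodeToStatusMap{
		"node-a": {Code: Unschedulable, Reason: "insufficient cpu"},
	}
	// node-b has no entry, so it is treated as UnschedulableAndUnresolvable.
	for _, name := range []string{"node-a", "node-b"} {
		s := statusOrDefault(m, name)
		fmt.Printf("%s: code=%d reason=%q\n", name, s.Code, s.Reason)
	}
}
```

A plugin that previously treated a missing key as "no information" would, under this reading, need to route every lookup through a default like the one above.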

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/regression Categorizes issue or PR as related to a regression from a prior release. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels May 29, 2024
@k8s-ci-robot k8s-ci-robot requested review from damemi and kerthcet May 29, 2024 16:19
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 29, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2024
@k8s-ci-robot
Contributor

Hi @gabesaba. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@gabesaba
Contributor Author

/assign @alculquicondor
/assign @liggitt
/assign @Huang-Wei

@liggitt
Member

liggitt commented May 29, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2024
@alculquicondor
Member

cc @sanposhiho @chengjoey

@alculquicondor
Member

/release-note-edit

Improved scheduling performance when many nodes, and prefilter returns 1-2 nodes (e.g. daemonset)

ACTION REQUIRED: For developers of out-of-tree PostFilter plugins, note that the semantics of NodeToStatusMap are changing: A node with an absent value in the NodeToStatusMap should be interpreted as having an UnschedulableAndUnresolvable status

@k8s-ci-robot
Contributor

@alculquicondor: /release-note-edit must be used with a release note block.

@alculquicondor
Member

/release-note-edit

Improved scheduling performance when many nodes, and prefilter returns 1-2 nodes (e.g. daemonset)

ACTION REQUIRED: For developers of out-of-tree PostFilter plugins, note that the semantics of NodeToStatusMap are changing: A node with an absent value in the NodeToStatusMap should be interpreted as having an UnschedulableAndUnresolvable status

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 29, 2024
@liggitt
Member

liggitt commented May 29, 2024

looks like it needs hack/update-gofmt.sh run:

diff ./pkg/scheduler/schedule_one_test.go.orig ./pkg/scheduler/schedule_one_test.go
--- ./pkg/scheduler/schedule_one_test.go.orig
+++ ./pkg/scheduler/schedule_one_test.go
@@ -2452,8 +2452,7 @@
 				Pod:         st.MakePod().Name("test-prefilter").UID("test-prefilter").Obj(),
 				NumAllNodes: 2,
 				Diagnosis: framework.Diagnosis{
-					NodeToStatusMap: framework.NodeToStatusMap{
-					},
+					NodeToStatusMap: framework.NodeToStatusMap{},
 				},
 			},
 		},

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.


diagnosis := framework.Diagnosis{
NodeToStatusMap: make(framework.NodeToStatusMap, len(allNodes)),
return nil, diagnosis, err
}
// Run "prefilter" plugins.
preRes, s := fwk.RunPreFilterPlugins(ctx, state, pod)
Member

@AxeZhan AxeZhan May 31, 2024

I can't comment on an unchanged line.
So based on our implementation, in L460, we should only set status for all nodes if the status returned by fwk.RunPreFilterPlugins(ctx, state, pod) has Unschedulable code, right? Which can only happen in a scheduler with specific out-of-tree plugins.

In fact, I think we only need to list all nodes and update diagnosis.NodeToStatusMap when RunPreFilterPlugins returns an Unschedulable status.

Contributor Author

Yes, this sounds right to me. However, I omitted it from this change to keep the diff small, since updating the tests produces a 30-line (+7, -23) diff. The intention is to clean it up in a PR which won't be cherry-picked.

Do you prefer I include it in this change, or do you think the original plan makes sense?

Member

Hmmm, as we're going to cherry-pick this to recent releases, I agree that this should be as simple as possible (provided it can bring back our previous performance).
I think we can leave a comment here, and leave it for a follow up.
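
The follow-up discussed in this thread might look roughly like the sketch below. It is only an illustration under stated assumptions: `Code` is a simplified stand-in for the framework status code, and `nodeStatuses` and its `listNodes` callback are hypothetical names, not scheduler functions. The idea is that the O(n) node listing and map fill happen only when PreFilter returns Unschedulable, since an empty map already reads as UnschedulableAndUnresolvable for every node.

```go
package main

import "fmt"

// Code is a simplified stand-in for the scheduler framework's status code.
type Code int

const (
	Success Code = iota
	Unschedulable
	UnschedulableAndUnresolvable
)

// nodeStatuses is a hypothetical helper. It records explicit per-node entries
// only when PreFilter rejected the pod with Unschedulable. In every other case
// the map stays empty and absent keys carry the default meaning.
func nodeStatuses(preFilter Code, listNodes func() []string) map[string]Code {
	statuses := map[string]Code{}
	if preFilter != Unschedulable {
		return statuses // no node listing or map fill needed
	}
	for _, name := range listNodes() {
		statuses[name] = Unschedulable
	}
	return statuses
}

func main() {
	listCalls := 0
	listNodes := func() []string {
		listCalls++
		return []string{"node-a", "node-b", "node-c"}
	}

	// Rejected as unresolvable: the map stays empty and nodes are never listed.
	unresolvable := nodeStatuses(UnschedulableAndUnresolvable, listNodes)
	fmt.Println(len(unresolvable), listCalls)

	// Rejected as Unschedulable: every node gets an explicit entry.
	unschedulable := nodeStatuses(Unschedulable, listNodes)
	fmt.Println(len(unschedulable), listCalls)
}
```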

nodeStatuses[nodeName] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "Preemption is not helpful for scheduling")
continue
}
potentialNodes = append(potentialNodes, node)
Member

Forgive me if this is a dumb question:
Since we only add nodes that are in framework.NodeToStatusMap to potentialNodes, why don't we iterate framework.NodeToStatusMap directly instead of iterating allNodes and checking whether each node is in framework.NodeToStatusMap?

Contributor Author

On the contrary, your comment is absolutely right :)

If we decide to not fill in the map (depending on resolution of #125197 (comment)), I will implement this

Member

I'm ok filling up the map for this cherry-pickable patch, but we should follow this suggestion for 1.31.

Member

@AxeZhan AxeZhan Jun 1, 2024

So, in this PR, we are still filling up the map to minimize changes and facilitate cherry-picking.
However, in version 1.31, we will change the behavior so that we no longer fill up the map with UnschedulableAndUnresolvable status. Am I understanding correctly?

Contributor Author

Correct. The UnschedulableAndUnresolvable statuses in the NodeToStatusMap returned by nodesWherePreemptionMayHelp are only used for the error message (the number of nodes and the "Preemption is not helpful for scheduling" reason). I think we can pipe this information in a more efficient way.

Member

Right, sgtm.
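
The suggestion agreed on in this thread for 1.31, iterating the status map directly rather than walking all nodes, could be sketched as below. This is a hedged illustration rather than the eventual implementation: `Code` is a simplified stand-in for the framework status code and `candidatesForPreemption` is a hypothetical name. The cost is proportional to the number of map entries, not to the cluster size, because nodes absent from the map are already treated as UnschedulableAndUnresolvable.

```go
package main

import (
	"fmt"
	"sort"
)

// Code is a simplified stand-in for the scheduler framework's status code.
type Code int

const (
	Success Code = iota
	Unschedulable
	UnschedulableAndUnresolvable
)

// candidatesForPreemption is a hypothetical helper. It walks only the entries
// present in the map and skips nodes where preemption cannot help.
func candidatesForPreemption(statuses map[string]Code) []string {
	candidates := make([]string, 0, len(statuses))
	for name, code := range statuses {
		if code == UnschedulableAndUnresolvable {
			continue
		}
		candidates = append(candidates, name)
	}
	// Map iteration order is unspecified in Go, so sort for stable output.
	sort.Strings(candidates)
	return candidates
}

func main() {
	statuses := map[string]Code{
		"node-a": Unschedulable,
		"node-b": UnschedulableAndUnresolvable,
	}
	fmt.Println(candidatesForPreemption(statuses))
}
```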

@gabesaba
Contributor Author

gabesaba commented Jun 3, 2024

/retest

Member

@alculquicondor alculquicondor left a comment

/lgtm
/approve

Please prepare cherry-picks for all supported versions

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 3, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 539b029ec9b38a8a86a8a5e323e17bee4b648401

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, gabesaba

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2024
@k8s-ci-robot k8s-ci-robot merged commit 8bd36c6 into kubernetes:master Jun 3, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 3, 2024
@gabesaba gabesaba deleted the prefilter_perf branch June 3, 2024 15:06
k8s-ci-robot added a commit that referenced this pull request Jun 5, 2024
…5197-upstream-release-1.30

Cherry pick of #125197: [scheduler] absent key in NodeToStatusMap implies UnschedulableAndUnresolvable
k8s-ci-robot added a commit that referenced this pull request Jun 5, 2024
Cherry pick of #125197: [scheduler] absent key in NodeToStatusMap implies UnschedulableAndUnresolvable
k8s-ci-robot added a commit that referenced this pull request Jun 5, 2024
Cherry pick of #125197: [scheduler] absent key in NodeToStatusMap implies UnschedulableAndUnresolvable
k8s-ci-robot added a commit that referenced this pull request Jun 5, 2024
Cherry pick of #125197: [scheduler] absent key in NodeToStatusMap implies UnschedulableAndUnresolvable
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Throughput degradation scheduling daemonset pods
9 participants