Throughput degradation scheduling daemonset pods #124709
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/sig scheduling
Before #119779, there was this line, which also had a map access. So we can quite confidently say that the culprit is just the map insert at kubernetes/pkg/scheduler/schedule_one.go line 492 (at 44bd04c).
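For context, here is a minimal self-contained sketch of the pattern behind that line (simplified types, not the actual k/k code): when a DaemonSet pod's PreFilter result selects a single node, every other node gets an UnschedulableAndUnresolvable status inserted into the diagnosis map, so each pod pays for roughly N-1 map inserts and status allocations.

package main

import "fmt"

// status is a simplified stand-in for framework.Status; the real type lives
// in k8s.io/kubernetes/pkg/scheduler/framework.
type status struct{ code, msg string }

func main() {
	const numNodes = 15000
	allNodes := make([]string, numNodes)
	for i := range allNodes {
		allNodes[i] = fmt.Sprintf("node-%d", i)
	}

	// A DaemonSet pod's node affinity matches exactly one node, so the
	// PreFilter result contains a single node name.
	preFiltered := map[string]struct{}{"node-42": {}}

	// The pattern at issue: one map insert (and one status allocation) for
	// every node that is not in the PreFilter result. For a DaemonSet pod
	// this is numNodes-1 inserts per scheduling attempt.
	nodeToStatus := make(map[string]*status)
	for _, name := range allNodes {
		if _, ok := preFiltered[name]; !ok {
			nodeToStatus[name] = &status{
				code: "UnschedulableAndUnresolvable",
				msg:  "node is filtered out by the prefilter result",
			}
		}
	}
	fmt.Printf("statuses recorded for %d of %d nodes\n", len(nodeToStatus), numNodes)
}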
@sanposhiho, @AxeZhan I have two alternatives:
Option 2 also solves #124705. However, external PostFilter plugins would need to be aware of the change, although I'm not aware of any in the wild that implements behavior similar to preemption. We can leave an ACTION REQUIRED note. But was this path working in 1.28 and older? Regardless of the decision, I think we should accompany the fix with an integration test for preemption by daemonset pods.
Can we try (1) first, ask Google for a scalability test again, and then consider (2) if (1) doesn't improve the throughput enough? (2) could break existing PostFilter plugins. Given that PostFilter is not only for preemption, there may be custom PostFilter plugins in the wild that rely on the status map containing UnschedulableAndUnresolvable nodes, so it'd be best if we could avoid this breaking change.
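To illustrate the compatibility concern with (2), here is a sketch of a hypothetical out-of-tree PostFilter plugin written against the v1.30-era framework interface (CustomPostFilter and its logic are invented for illustration; only the interface shape comes from the scheduler framework). If UnschedulableAndUnresolvable nodes were no longer recorded in the status map, a plugin like this would silently see a different view of the cluster.

package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// CustomPostFilter is a hypothetical out-of-tree plugin that inspects the
// node-to-status map passed to PostFilter.
type CustomPostFilter struct{}

var _ framework.PostFilterPlugin = &CustomPostFilter{}

func (pl *CustomPostFilter) Name() string { return "CustomPostFilter" }

func (pl *CustomPostFilter) PostFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod,
	filteredNodeStatusMap framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {
	candidates := 0
	for _, status := range filteredNodeStatusMap {
		// Nodes rejected as UnschedulableAndUnresolvable cannot be helped by
		// preemption-like actions, so this plugin skips them. If such nodes
		// stopped appearing in the map (option 2), this filtering would
		// silently change behavior.
		if status.Code() == framework.UnschedulableAndUnresolvable {
			continue
		}
		candidates++
	}
	if candidates == 0 {
		return nil, framework.NewStatus(framework.Unschedulable, "no candidate nodes")
	}
	// ... act on the candidate nodes (e.g. pick victims to preempt) ...
	return nil, framework.NewStatus(framework.Unschedulable, "postfilter did not make the pod schedulable")
}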
Since (1) is a really small fix, I'm also thinking we can run tests after #124714, and then consider (2) based on the test results.
Are these "integration test" for performance or correctness? |
I was asking for a correctness test.
I wrote a benchmark test. The performance without case (1) is:
The performance with case (1) is:
(1) may improve throughput.
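The benchmark itself isn't reproduced in this thread. As a rough illustration of the kind of measurement, a hypothetical Go micro-benchmark comparing the hot loop with and without the per-node map insert could look like the following (names and numbers are invented, not the benchmark referenced above):

package scheduler_test

import (
	"fmt"
	"testing"
)

type status struct{ code, msg string }

// buildNodeToStatus mimics the hot loop: one map insert per node that is
// not in the PreFilter result.
func buildNodeToStatus(nodes []string, preFiltered map[string]struct{}) map[string]*status {
	m := make(map[string]*status)
	for _, name := range nodes {
		if _, ok := preFiltered[name]; !ok {
			m[name] = &status{code: "UnschedulableAndUnresolvable"}
		}
	}
	return m
}

func BenchmarkNodeToStatusInsert(b *testing.B) {
	const numNodes = 15000
	nodes := make([]string, numNodes)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}
	preFiltered := map[string]struct{}{"node-42": {}}

	b.Run("with-map-insert", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = buildNodeToStatus(nodes, preFiltered)
		}
	})
	b.Run("without-map-insert", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			// Option (1): only count the rejected nodes instead of recording
			// a status for each one, avoiding numNodes-1 map inserts per pod.
			count := 0
			for _, name := range nodes {
				if _, ok := preFiltered[name]; !ok {
					count++
				}
			}
			_ = count
		}
	})
}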
I guess we can have a different tracking issue for an integration benchmark. This issue itself is fixed.
/close
@alculquicondor: Closing this issue.
/reopen
Still observing poor performance in large clusters post #124714. I'm investigating further improvements.
@gabesaba: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@alculquicondor: Reopened this issue.
As I asked here, can we add a test case to scheduler_perf to reproduce this degradation? Also, more generally, would it be possible to copy Google's internal scalability tests to k/k somehow? (I already talked to Aldo about this before.) There have been multiple times that Google found a scalability issue with their internal tests.
- improve memory locality fetching node name
- avoid set membership check
- avoid allocating duplicate status (potentially unsafe)

Primary remaining hotspot is mapassign_faststr.
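A rough sketch of what those micro-optimizations could look like on the simplified loop from earlier (my own illustration under those assumptions, not the actual patch): cache the node name in a local variable, replace the set lookup with a direct comparison when the PreFilter result has a single node, pre-size the map, and share one pre-allocated status instead of allocating one per node. Even then, the map assignment itself (mapassign_faststr in the profile) remains the dominant cost.

package main

import "fmt"

type status struct{ code, msg string }

func main() {
	const numNodes = 15000
	allNodes := make([]string, numNodes)
	for i := range allNodes {
		allNodes[i] = fmt.Sprintf("node-%d", i)
	}

	// Shared, pre-allocated status: avoids one allocation per rejected node.
	// "Potentially unsafe" because callers must not mutate the shared value.
	sharedStatus := &status{
		code: "UnschedulableAndUnresolvable",
		msg:  "node is filtered out by the prefilter result",
	}

	// Pre-size the map so it is not grown repeatedly while inserting
	// ~numNodes-1 entries.
	nodeToStatus := make(map[string]*status, numNodes)

	for _, name := range allNodes { // local copy avoids repeated field access
		if name == "node-42" { // single-node PreFilter result: direct compare instead of a set lookup
			continue
		}
		// The remaining hotspot: mapassign_faststr, one call per rejected node.
		nodeToStatus[name] = sharedStatus
	}
	fmt.Printf("recorded %d statuses\n", len(nodeToStatus))
}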
Currently, scheduler_perf does not seem to be able to reproduce this scenario. Should we first improve scheduler_perf to make it behave more like a DaemonSet, for example by labeling each node with a specific label or specifying a dedicated node for each pod, so that the effect can be reproduced in k/k?
We plan to do it as a follow-up. For now, we urgently need to get throughput back to baseline.
Yep, I didn't mean to block the patch PR on the scheduler_perf improvement. Doing this as a follow-up is totally fine.
We can improve the scheduler_perf engine if needed. But I'm wondering if we could reproduce the degradation with just a simple test case that creates 5k Pods whose NodeAffinity matches only one Node among many Nodes (@alculquicondor, how many Nodes did you have in the scalability test?). I believe the current scheduler_perf is expressive enough to define such a test case.
We had 15k nodes with 6 DaemonSets, so 90k Pods.
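For reference, this is the kind of per-pod node affinity such a test case would exercise. The sketch below builds a pod the way the DaemonSet controller pins pods to nodes (a required node affinity on the metadata.name field); the pod/node names and image are placeholders, and this is not the scheduler_perf configuration itself.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// daemonSetLikePod returns a pod whose required node affinity matches exactly
// one node, so every other node is rejected at PreFilter time.
func daemonSetLikePod(name, nodeName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			Affinity: &v1.Affinity{
				NodeAffinity: &v1.NodeAffinity{
					RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
						NodeSelectorTerms: []v1.NodeSelectorTerm{{
							MatchFields: []v1.NodeSelectorRequirement{{
								Key:      "metadata.name",
								Operator: v1.NodeSelectorOpIn,
								Values:   []string{nodeName},
							}},
						}},
					},
				},
			},
			Containers: []v1.Container{{Name: "pause", Image: "registry.k8s.io/pause:3.9"}},
		},
	}
}

func main() {
	pod := daemonSetLikePod("ds-pod-0", "node-42")
	fmt.Println(pod.Name, pod.Spec.Affinity.NodeAffinity != nil)
}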
I could reproduce the degradation with a new test case (15k nodes / 30k pods) in scheduler_perf #125293 (WIP) and confirmed that #125197 solves the degradation.

Master:
{
"data": {
"Average": 120.80130209885886,
"Perc50": 128.88163639395216,
"Perc90": 155.10790546767575,
"Perc95": 161.00505378763333,
"Perc99": 169.00350513269643
},
"unit": "pods/s",
"labels": {
"Metric": "SchedulingThroughput",
"Name": "SchedulingNodeAffinity/15000Nodes/namespace-2",
"extension_point": "not applicable",
"plugin": "not applicable",
"result": "not applicable"
}
},

After #125197:
{
"data": {
"Average": 552.6016212924262,
"Perc50": 474.34585036045274,
"Perc90": 877.1160813824991,
"Perc95": 1121.327881099065,
"Perc99": 1457.5364518479603
},
"unit": "pods/s",
"labels": {
"Metric": "SchedulingThroughput",
"Name": "SchedulingNodeAffinity/15000Nodes/namespace-2",
"extension_point": "not applicable",
"plugin": "not applicable",
"result": "not applicable"
}
},
/reopen
I think we still need to implement option (2) and some follow-ups of #125197. @gabesaba, will you continue working on this?
@AxeZhan: Reopened this issue.
#125197 is option (2). But yes, there are some follow-ups.
Can we close this? Any remaining tasks here?
The follow-up is tracked in #125345, so I think it's fine to close this.
@AxeZhan: Closing this issue.
What happened?
#119779 added some map queries and creations that add non-negligible latency.
The pprof reveals the following lines inside findNodesThatFitPod as too expensive:
kubernetes/pkg/scheduler/schedule_one.go, line 492 (at 44bd04c)
kubernetes/pkg/scheduler/schedule_one.go, line 489 (at 44bd04c)
What did you expect to happen?
Scheduler to keep a throughput of 300 pods/s
How can we reproduce it (as minimally and precisely as possible)?
Schedule 5k DaemonSet pods, giving 300 QPS to the scheduler.
Anything else we need to know?
No response
Kubernetes version
v1.30