
Throughput degradation scheduling daemonset pods #124709

Closed
alculquicondor opened this issue May 6, 2024 · 26 comments · Fixed by #125197
Labels: kind/bug, kind/regression, needs-triage, sig/scheduling

Comments

@alculquicondor
Member

alculquicondor commented May 6, 2024

What happened?

#119779 added some map lookups and insertions that introduce non-negligible latency.

The pprof reveals the following lines inside findNodesThatFitPod as too expensive:

[pprof screenshot: hot lines inside findNodesThatFitPod]

What did you expect to happen?

The scheduler to sustain a throughput of 300 pods/s.

How can we reproduce it (as minimally and precisely as possible)?

Schedule 5k DaemonSet pods while giving 300 QPS to the scheduler.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
v1.30

Cloud provider

GKE

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 6, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alculquicondor
Member Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2024
@alculquicondor
Member Author

Before #119779, there was this line:
https://github.com/sanposhiho/kubernetes/blob/5ab23179473b621e911511b03693234b60a38033/pkg/scheduler/schedule_one.go#L489

which also performed a map access. So we can quite confidently say that the culprit is just this map insert:

    diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")

@alculquicondor
Member Author

@sanposhiho, @AxeZhan I have two alternatives:

  1. The obvious one is to preallocate the map to its expected size. I suspect this might be enough, but it is still somewhat wasteful.
  2. Change the preemption logic to assume that if a Node is not in the status map, it's UnschedulableAndUnresolvable.
    if m[name].Code() == framework.UnschedulableAndUnresolvable {

Option 2 also solves #124705. However, external PostFilter plugins would need to be aware of the change, though I'm not aware of any in the wild that implement behavior similar to preemption. We can leave an ACTION REQUIRED note. But was this path working in 1.28 and older? (A rough sketch of both options follows below.)

Regardless of the decision, I think we should accompany the fix with an integration test for preemption by daemonset pods.
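
For concreteness, here is a rough, hypothetical sketch of both options (simplified stand-in types, not the real framework code or an actual patch):

    // Hypothetical sketch only; simplified stand-in types, not the real
    // framework types or the eventual patch.
    package sketch

    type Code int

    const UnschedulableAndUnresolvable Code = 2

    type Status struct{ code Code }

    // Option 2: interpret a missing (nil) entry in the status map as
    // UnschedulableAndUnresolvable, so prefilter-rejected nodes never need
    // to be inserted into the map at all.
    func (s *Status) Code() Code {
        if s == nil {
            return UnschedulableAndUnresolvable
        }
        return s.code
    }

    // Option 1: preallocate the status map to the number of nodes so that
    // repeated inserts while iterating over thousands of nodes do not
    // trigger incremental map growth.
    func newNodeToStatusMap(numNodes int) map[string]*Status {
        return make(map[string]*Status, numNodes)
    }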

@sanposhiho
Member

Can we try (1) first, ask Google for a scalability test again, and then consider (2) if (1) doesn't improve the throughput enough?

(2) could break existing PostFilter plugins. Since PostFilter is not only for preemption, there may be custom PostFilter plugins in the wild that rely on the status map containing the UnschedulableAndUnresolvable nodes, so it would be best if we could avoid this breaking change.

@AxeZhan
Member

AxeZhan commented May 7, 2024

Since (1) is a really small fix, I'm also thinking we can run the tests after #124714, and then consider (2) based on the test results.

Regardless of the decision, I think we should accompany the fix with an integration test for preemption by daemonset pods.

Is this "integration test" for performance or correctness?

@alculquicondor
Member Author

I was asking for a correctness test.

@chengjoey
Contributor

I wrote a benchmark test called Benchmark_findNodesThatFitPod

The performance without change (1) is:

4927125125 ~ 5118241875 ns/op

The performance with change (1) is:

3865283834 ~ 3972690125 ns/op

So (1) may improve throughput.
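
For illustration only, this is not Benchmark_findNodesThatFitPod; it is a minimal standalone microbenchmark of the effect that change (1) targets, assuming a 15k-entry key space (save as a _test.go file and run with go test -bench .):

    package sketch

    import (
        "fmt"
        "testing"
    )

    const numNodes = 15000

    // keys simulates one status-map key per node in a large cluster.
    var keys = func() []string {
        k := make([]string, numNodes)
        for i := range k {
            k[i] = fmt.Sprintf("node-%05d", i)
        }
        return k
    }()

    // BenchmarkMapInsertGrow inserts into an unsized map, forcing incremental
    // growth while iterating over all nodes (roughly the pattern before (1)).
    func BenchmarkMapInsertGrow(b *testing.B) {
        for i := 0; i < b.N; i++ {
            m := map[string]struct{}{}
            for _, k := range keys {
                m[k] = struct{}{}
            }
        }
    }

    // BenchmarkMapInsertPrealloc sizes the map up front (the pattern with (1)).
    func BenchmarkMapInsertPrealloc(b *testing.B) {
        for i := 0; i < b.N; i++ {
            m := make(map[string]struct{}, numNodes)
            for _, k := range keys {
                m[k] = struct{}{}
            }
        }
    }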

@alculquicondor
Member Author

I guess we can have a different tracking issue for an integration benchmark.

This issue itself is fixed.

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

I guess we can have a different tracking issue for an integration benchmark.

This issue itself is fixed.

/close


@gabesaba
Contributor

/reopen
/assign

Still observing poor performance in large clusters post #124714. I'm investigating further improvements.

@k8s-ci-robot
Contributor

@gabesaba: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/assign

Still observing poor performance in large clusters post #124714. I'm investigating further improvements.


@alculquicondor
Member Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this May 23, 2024
@k8s-ci-robot
Contributor

@alculquicondor: Reopened this issue.

In response to this:

/reopen


@sanposhiho
Member

@gabesaba @chengjoey

As I asked here, can we add a test case to scheduler_perf to reproduce this degradation?

Also, more generally, would it be possible to copy Google's internal scalability tests to k/k somehow? (I already talked to Aldo about this before.) There have been multiple times that Google found a scalability issue with their internal tests.
Integrating them into scheduler_perf would be best; we may need to improve scheduler_perf if its capabilities aren't sufficient to incorporate them, though.

gabesaba added a commit to gabesaba/kubernetes that referenced this issue May 31, 2024
- improve memory locality fetching node name
- avoid set membership check
- avoid allocating duplicate status (potentially unsafe)

primary remaining hotspot is mapassign_faststr
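
A hypothetical illustration of the "avoid allocating duplicate status" bullet (simplified stand-in types, not the actual commit): a single shared status value is reused for every node rejected by the prefilter result instead of allocating one per node, which is "potentially unsafe" because callers must never mutate the shared value.

    // Hypothetical illustration only; simplified stand-in types.
    package sketch

    type Status struct{ reasons []string }

    // sharedPrefilterStatus is allocated once and reused for every node that
    // the prefilter result filters out, instead of allocating a new Status
    // per node. "Potentially unsafe": callers must never mutate this value.
    var sharedPrefilterStatus = &Status{
        reasons: []string{"node is filtered out by the prefilter result"},
    }

    func markFilteredNodes(nodeNames []string, statuses map[string]*Status) {
        for _, name := range nodeNames {
            statuses[name] = sharedPrefilterStatus // no per-node allocation
        }
    }
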
@chengjoey
Contributor

@gabesaba @chengjoey

As I asked here, can we add a test case to scheduler_perf to reproduce this degradation?

Also, more generally, would it be possible to copy Google’s internal scalability tests to k/k somehow? (already talked to Aldo before) There’ve been multiple times that Google found a scalability issue with their internal tests. Integrating them into scheduler_perf would be the best; we may need to improve scheduler_perf if its capability isn't enough to incorporate, though.

Currently, scheduler_perf does not seem to be able to reproduce this scenario. Should we first improve scheduler_perf so it can mimic a DaemonSet in k/k, for example by labeling each node with a specific label or pinning each pod to a dedicated node?

@alculquicondor
Member Author

alculquicondor commented May 31, 2024

As I asked #124728 (comment), can we add a test case to scheduler_perf to reproduce this degradation?

We plan to do it as a follow-up. For now, we urgently need to get this back to baseline.

@sanposhiho
Member

sanposhiho commented May 31, 2024

Yep, I didn't mean to block the patch PR on the scheduler_perf improvement. Doing this as a follow-up is totally fine.

Currently, scheduler_perf does not seem to be able to reproduce the current scenario. Should we first improve scheduler_perf to make it similar to daemonset, such as labeling each node with a specific label or specifying a dedicated node for the pod to achieve the k/k effect?

We can improve the scheduler_perf engine if needed. But I'm wondering if we could reproduce the degradation with just a simple test case that creates 5k Pods whose NodeAffinity matches only one Node among many Nodes (<- @alculquicondor how many Nodes did you have in the scalability test?). I believe the current scheduler_perf is expressive enough to define such a test case.
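
For reference, here is a sketch of the per-node affinity such test pods would carry. It is the same matchFields-on-metadata.name shape the DaemonSet controller generates for its pods; the function name and the node name passed in are illustrative.

    package sketch

    import (
        v1 "k8s.io/api/core/v1"
    )

    // nodeAffinityFor pins a pod to a single node by name, matching the shape
    // the DaemonSet controller generates for its pods. The caller supplies a
    // placeholder node name such as "node-0001".
    func nodeAffinityFor(nodeName string) *v1.Affinity {
        return &v1.Affinity{
            NodeAffinity: &v1.NodeAffinity{
                RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
                    NodeSelectorTerms: []v1.NodeSelectorTerm{{
                        MatchFields: []v1.NodeSelectorRequirement{{
                            Key:      "metadata.name",
                            Operator: v1.NodeSelectorOpIn,
                            Values:   []string{nodeName},
                        }},
                    }},
                },
            },
        }
    }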

@alculquicondor
Member Author

We had 15k nodes with 6 DaemonSets, so 90k Pods.

@sanposhiho
Member

sanposhiho commented Jun 3, 2024

I could reproduce the degradation with a new test case (15k nodes / 30k pods) in scheduler_perf #125293 (WIP) and confirmed #125197 solves the degradation.


Master

    {
      "data": {
        "Average": 120.80130209885886,
        "Perc50": 128.88163639395216,
        "Perc90": 155.10790546767575,
        "Perc95": 161.00505378763333,
        "Perc99": 169.00350513269643
      },
      "unit": "pods/s",
      "labels": {
        "Metric": "SchedulingThroughput",
        "Name": "SchedulingNodeAffinity/15000Nodes/namespace-2",
        "extension_point": "not applicable",
        "plugin": "not applicable",
        "result": "not applicable"
      }
    },

After #125197

    {
      "data": {
        "Average": 552.6016212924262,
        "Perc50": 474.34585036045274,
        "Perc90": 877.1160813824991,
        "Perc95": 1121.327881099065,
        "Perc99": 1457.5364518479603
      },
      "unit": "pods/s",
      "labels": {
        "Metric": "SchedulingThroughput",
        "Name": "SchedulingNodeAffinity/15000Nodes/namespace-2",
        "extension_point": "not applicable",
        "plugin": "not applicable",
        "result": "not applicable"
      }
    },

@AxeZhan
Member

AxeZhan commented Jun 3, 2024

/reopen

I think we still need to implement option 2 and some follow-ups of #125197.

@gabesaba Will you continue working on this?

@k8s-ci-robot k8s-ci-robot reopened this Jun 3, 2024
@k8s-ci-robot
Contributor

@AxeZhan: Reopened this issue.

In response to this:

/reopen

I think we still need to implement option 2 and some follow-ups of #125197.

@gabesaba Will you continue working on this?


@alculquicondor
Member Author

#125197 is option 2

But yes, there are some follow-ups.

@sanposhiho
Member

Can we close this? Is there any remaining task here?

@AxeZhan
Member

AxeZhan commented Jun 21, 2024

The follow-up is tracked in #125345, so I think it's fine to close this.
/close

@k8s-ci-robot
Contributor

@AxeZhan: Closing this issue.

In response to this:

The follow-up is tracked in #125345, so I think it's fine to close this.
/close


@liggitt liggitt added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Oct 21, 2024