
Failing ci-benchmark-scheduler-perf-master tests for PreemptionAsync and Unschedulable tests #128221

Open
macsko opened this issue Oct 21, 2024 · 14 comments · Fixed by #128262

Labels: kind/flake, needs-triage, sig/scheduling

macsko (Member) commented Oct 21, 2024

Which jobs are failing?

ci-benchmark-scheduler-perf-master

Which tests are failing?

PreemptionAsync and Unschedulable test cases

Since when has it been failing?

17th Oct 2024

Testgrid link

https://testgrid.k8s.io/sig-scalability-benchmarks#scheduler-perf

Reason for failure (if possible)

The PreemptionAsync test is failing because of a context deadline error:

    scheduler_perf.go:1427: FATAL ERROR: op 3: error in waiting for pods to get scheduled: at least pod namespace-3/pod-4vfnq is not scheduled: context deadline exceeded
--- FAIL: BenchmarkPerfScheduling/PreemptionAsync/5000Node

Caused by #127829

The Unschedulable test is failing because the configured threshold is too high:

    scheduler_perf.go:1298: ERROR: op 2: BenchmarkPerfScheduling/Unschedulable/5kNodes/10kPods/namespace-2: expected SchedulingThroughput Average to be higher: got 289.204988, want 400.000000
--- FAIL: BenchmarkPerfScheduling/Unschedulable/5kNodes/10kPods

Caused by #128153
Given those changes, the threshold should be even lower; a value around 270-280 seems right at the moment.
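
For context, the failing assertion above boils down to comparing the measured average scheduling throughput against a configured threshold. A minimal sketch of that idea (a hypothetical helper, not the actual scheduler_perf code):

    // Hypothetical sketch, not the actual scheduler_perf code: the gist of the
    // check that produces the "expected SchedulingThroughput Average to be
    // higher" error above. samples holds per-interval scheduling rates (pods/s)
    // collected while the measured pods are being scheduled.
    package main

    import "fmt"

    func checkThroughput(samples []float64, threshold float64) error {
        if len(samples) == 0 {
            return fmt.Errorf("no throughput samples collected")
        }
        var sum float64
        for _, s := range samples {
            sum += s
        }
        avg := sum / float64(len(samples))
        if avg < threshold {
            // Mirrors the failure above: the observed average (~289 pods/s)
            // is below the configured threshold (400 pods/s).
            return fmt.Errorf("expected SchedulingThroughput Average to be higher: got %f, want %f", avg, threshold)
        }
        return nil
    }

    func main() {
        fmt.Println(checkThroughput([]float64{289.2, 300.1, 275.8}, 400))
    }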

Anything else we need to know?

You should ignore the first row in Testgrid (k8s.io/kubernetes/test/integration/scheduler_perf.scheduler_perf), as it doesn't affect the results, and check only the second one (ci-benchmark-scheduler-perf-master.Overall).

Relevant SIG(s)

/sig scheduling

macsko added the kind/failing-test label Oct 21, 2024
k8s-ci-robot added the sig/scheduling and needs-triage labels Oct 21, 2024
k8s-ci-robot (Contributor) commented

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

macsko (Member, Author) commented Oct 21, 2024

/assign @dom4ha

dom4ha (Member) commented Oct 22, 2024

Thanks @macsko, working on #128262 to tune these params.

sanposhiho (Member) commented

/reopen
/remove-kind failing-test
/kind flake

Still looks like a flake?
https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/ci-benchmark-scheduler-perf-master?buildId=1846949494873133056

k8s-ci-robot added the kind/flake label and removed the kind/failing-test label Oct 26, 2024
k8s-ci-robot reopened this Oct 26, 2024
k8s-ci-robot (Contributor) commented

@sanposhiho: Reopened this issue.

In response to this:

/reopen
/remove-kind failing-test
/kind flake

Still looks like a flake?
https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/ci-benchmark-scheduler-perf-master?buildId=1846949494873133056

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

sanposhiho (Member) commented

Unrelated to the flake, though: in the logs from PreemptionAsync I don't see any preemption success messages. Rather, I see only the following:

I1026 11:25:33.658425   31249 schedule_one.go:1056] "Unable to schedule pod; no fit; waiting" pod="namespace-2/pod-h-glg8x" err="0/5000 nodes are available: 5000 Insufficient cpu. preemption: 0/5000 nodes are available: 5000 Insufficient cpu."

Have we made a mistake somewhere when creating this test case? I'll look into it.

sanposhiho (Member) commented

The nodes only have 4 CPU, while the high-priority pods request 9 CPU... 😓
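
A tiny illustration of why no preemption can ever succeed with these numbers (the 4/9 CPU values come from this comment; everything else is hypothetical, not the exact test templates):

    // A request larger than a node's allocatable CPU can never fit, so
    // preemption cannot help: evicting every victim still leaves only 4 CPU.
    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/resource"
    )

    func main() {
        nodeAllocatableCPU := resource.MustParse("4") // node size mentioned above
        podRequestCPU := resource.MustParse("9")      // high-priority pod request

        // This is exactly the situation behind the "Insufficient cpu.
        // preemption: ... Insufficient cpu." lines in the log above.
        if podRequestCPU.Cmp(nodeAllocatableCPU) > 0 {
            fmt.Println("pod can never fit on this node, even after preempting all victims")
        }
    }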

dom4ha (Member) commented Oct 28, 2024

Thanks for spotting it. I must have messed something up at some point. Initially I was modifying the PreemptionBasic test, which uses config/templates/pod-high-priority.yaml, and there the preemption was indeed working...

I will take a closer look at this test again.

dom4ha (Member) commented Oct 28, 2024

In general, the performance tests are flaky. I checked a few failures, and in all cases a few tests were failing due to

expected SchedulingThroughput Average to be higher: got X, want Y

scheduler_perf.go:1298: ERROR: op 2: BenchmarkPerfScheduling/Unschedulable/5kNodes/10kPods/namespace-2: expected SchedulingThroughput Average to be higher: got 246.885834, want 250.000000
scheduler_perf.go:1298: ERROR: op 3: BenchmarkPerfScheduling/PreemptionAsync/5000Nodes/namespace-3: expected SchedulingThroughput Average to be higher: got 103.320968, want 120.000000
scheduler_perf.go:1298: ERROR: op 2: BenchmarkPerfScheduling/Unschedulable/5kNodes/10kPods/namespace-2: expected SchedulingThroughput Average to be higher: got 246.885834, want 250.000000

I can adjust the limits to reduce the flakiness, but I'm afraid the test throughput variation will remain quite substantial, so the test will periodically fail anyway.

sanposhiho (Member) commented

I can adjust the limits to reduce the flakiness, but I'm afraid the test throughput variation will remain quite substantial, so the test will periodically fail anyway.

Yeah, a threshold that's too conservative wouldn't be great.

So, are the results from those two tests more scattered than those from the other tests?
If not, we can just lower the thresholds, at least for now: even after lowering them, they'd only be as conservative as the other tests' thresholds, which could be acceptable.
But if they are, then I think the best approach is to find out why those tests are more scattered than the others and try to make the results more stable.

macsko (Member, Author) commented Oct 29, 2024

These tests aren't more scattered than the others. Unschedulable (yellow = average):
[screenshot from 2024-10-29 09:07: Unschedulable scheduling throughput across recent runs]
I think a threshold around 200 should be good.

PreemptionAsync:
[screenshot from 2024-10-29 09:09: PreemptionAsync scheduling throughput across recent runs]
Scheduling throughput increased after #128348, and I think the current threshold is too low, so we should increase it after gathering more data.
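
One rough way to derive such thresholds from recent runs (a sketch with made-up numbers, not real CI data): take the mean of the per-run averages and subtract a couple of standard deviations, so normal run-to-run variation doesn't trip the check.

    // Made-up sample numbers, for illustration only (not real CI data).
    package main

    import (
        "fmt"
        "math"
    )

    // suggestThreshold returns mean - sigmas*stddev of the observed averages.
    func suggestThreshold(observed []float64, sigmas float64) float64 {
        var sum float64
        for _, v := range observed {
            sum += v
        }
        mean := sum / float64(len(observed))

        var variance float64
        for _, v := range observed {
            variance += (v - mean) * (v - mean)
        }
        stddev := math.Sqrt(variance / float64(len(observed)))

        return mean - sigmas*stddev
    }

    func main() {
        // Hypothetical recent Unschedulable averages, roughly in the range quoted in this thread.
        runs := []float64{246.9, 289.2, 270.5, 281.0, 255.3}
        fmt.Printf("suggested threshold: %.0f pods/s\n", suggestThreshold(runs, 2))
    }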

dom4ha (Member) commented Oct 29, 2024

These tests aren't more scattered than the others. Unschedulable (yellow = average)

Exactly. I actually gave a wrong example; what I meant is that other tests are also flaky (not only PreemptionAsync and Unschedulable):

scheduler_perf.go:1298: ERROR: op 2: BenchmarkPerfScheduling/SchedulingWithNodeInclusionPolicy/5000Nodes/namespace-2: expected SchedulingThroughput Average to be higher: got 49.791495, want 68.000000
scheduler_perf.go:1298: ERROR: op 3: BenchmarkPerfScheduling/PreemptionAsync/5000Nodes/namespace-3: expected SchedulingThroughput Average to be higher: got 91.335472, want 120.000000

I will gather some numbers from the most recent runs to see whether/how the thresholds should be adjusted.

Scheduling throughput increased after #128348, and I think the current threshold is too low, so we should increase it after gathering more data.

The performance difference is quite surprising to me. I'd expect the fixed test to run slower (it goes through the full preemption process). Note that the Unschedulable test is in fact similar to the broken test, but the throughput is much higher (to be precise, the throughput is comparable, but the churn rate is much higher: 100/s vs 5/s).

The main difference between them is scheduling high-priority pods in the churn. This makes me think that the Unschedulable test is actually not doing what I thought it would do: the churn pods end up at the back of the queue, so in the end we process far fewer of them than expected.

I changed it to use high-priority pods instead, so that they really take scheduler time at the defined rate. The throughput should become comparable now: #128427
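
For illustration, a simplified, stand-alone sketch of the priority-first ordering that the scheduler's default queue sort applies (not the scheduler's actual code), showing why default-priority churn pods stay behind the already-queued measured pods while high-priority churn pods jump ahead:

    // Simplified sketch of priority-first queue ordering, in the spirit of the
    // scheduler's default queue sort (priority first, then arrival time).
    package main

    import (
        "fmt"
        "sort"
    )

    type queuedPod struct {
        name     string
        priority int32
        arrival  int64 // lower means the pod entered the queue earlier
    }

    // less orders pods by priority first, then by arrival time.
    func less(a, b queuedPod) bool {
        if a.priority != b.priority {
            return a.priority > b.priority // higher priority is popped first
        }
        return a.arrival < b.arrival
    }

    func main() {
        // Hypothetical pods: the churn pods arrive after the measured pod.
        pods := []queuedPod{
            {name: "measured-pod", priority: 0, arrival: 1},
            {name: "churn-pod-default-priority", priority: 0, arrival: 2}, // before the change
            {name: "churn-pod-high-priority", priority: 100, arrival: 3},  // after #128427
        }
        sort.Slice(pods, func(i, j int) bool { return less(pods[i], pods[j]) })
        // Default-priority churn pods stay behind the measured pods and add little
        // load; high-priority churn pods genuinely compete for scheduler time.
        for _, p := range pods {
            fmt.Println(p.name)
        }
    }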

sanposhiho (Member) commented

The performance difference is quite surprising to me.

I guess the reason is that, in the previous test case, all the periodically created high-priority Pods were unschedulable, piled up in the queue, and resulted in additional load for the scheduler.

dom4ha (Member) commented Oct 30, 2024

I guess the reason is that, in the previous test case, all the periodically created high-priority Pods were unschedulable, piled up in the queue, and resulted in additional load for the scheduler.

I looked deeper into this phenomenon and noticed that the time spent processing unschedulable pods depends on the number of already scheduled pods (initialPods). Apparently, in PostFilter, the preemption plugin goes through all the pods even when the Pod is unschedulable on the Node itself.

So the preemption scenario is faster: the preemption plugin finds candidates very quickly, and the time to preempt them (sending API calls synchronously) is actually smaller than the time spent going through all the remaining candidates.

In the context of #126858, in cases where there are thousands of pods running, making API calls asynchronously for the unschedulable pods may not bring the expected improvements, as going through all pod candidates may be more expensive than the API call itself. I can imagine high-priority unschedulable pods blocking scheduling for some time.
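
A back-of-the-envelope illustration of that trade-off (all numbers are guesses, purely for illustration, not measurements):

    // All numbers below are assumptions chosen only to illustrate the scaling,
    // not measured values from the benchmark.
    package main

    import "fmt"

    func main() {
        const (
            nodes       = 5000  // cluster size in the benchmark
            podsPerNode = 10    // assumed density of already-running pods
            perPodCost  = 2e-6  // assumed seconds to evaluate one victim candidate
            apiCallCost = 50e-3 // assumed seconds for one preemption API call
        )

        dryRunCost := float64(nodes*podsPerNode) * perPodCost
        fmt.Printf("candidate scan ~%.0f ms vs one API call ~%.0f ms\n",
            dryRunCost*1000, apiCallCost*1000)
        // With numbers in this ballpark the candidate scan alone costs ~100 ms,
        // so making the API call asynchronous saves relatively little when the
        // pod turns out to be unschedulable anyway.
    }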
