Failing ci-benchmark-scheduler-perf-master tests for PreemptionAsync and Unschedulable tests #128221
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/assign @dom4ha

/reopen
Still looks like a flake?
@sanposhiho: Reopened this issue.
Unrelated to the flake though: in the logs during PreemptionAsync, I don't see any preemption success messages. Rather, I see only the following:
Have we made a mistake somewhere in this test case? I'll look into it.
The Node only has 4 CPU, while the high-priority pods request 9 CPU... 😓
Thanks for spotting it. I must have messed up at some point. Initially I was modifying the PreemptionBasic test, which uses config/templates/pod-high-priority.yaml, and the preemption was indeed working... I will look closer into this test again.
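For reference, here is a minimal sketch of what a high-priority pod template along the lines of config/templates/pod-high-priority.yaml could look like; the priority class name and resource values are illustrative assumptions, not the actual contents of the repository template:

```yaml
# Illustrative sketch only: priorityClassName and resource values are assumptions,
# not copied from config/templates/pod-high-priority.yaml.
apiVersion: v1
kind: Pod
metadata:
  generateName: pod-high-priority-
spec:
  priorityClassName: high-priority   # assumed class name for the example
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "9"       # a request larger than the 4-CPU node capacity makes the pod
                         # unschedulable, which is the mistake discussed above
```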
In general, the performance tests are flaky. I checked a few failures, and in all cases there were a few tests failing due to:
I can adjust the limits to reduce the flakiness level, but I'm afraid that there will be quite substantial test throughput variation, which will periodically fail this test anyway. |
Yeah, a too-conservative threshold wouldn't be great. So, are the results from those two tests more scattered than the others?
These tests aren't more scattered than the others.
Unschedulable (yellow = average): [throughput graph]
PreemptionAsync: [throughput graph]
Exactly. I actually gave a wrong example; I meant that other tests are also flaky (not only PreemptionAsync and Unschedulable):
I will gather some numbers for the most recent runs to see whether/how the thresholds should be adjusted.
The performance difference is quite surprising to me. I'd expect the fixed test to be slower (it goes through the full preemption process). The main difference between the two tests is scheduling high-priority pods in churn. I changed the churn to create high-priority pods instead, so that they really take scheduler time at the defined rate. The throughput should become comparable now: #128427
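For context, a rough sketch of what putting high-priority pods into the churn could look like in the scheduler_perf workload config; the field names and values below are written from memory of the config format and are assumptions rather than the exact change in #128427:

```yaml
# Rough sketch of a churn op that creates high-priority pods at a fixed rate;
# field names and values are assumptions, not the exact change from #128427.
- opcode: churn
  mode: create
  templatePaths:
    - config/templates/pod-high-priority.yaml
  intervalMilliseconds: 500   # create one pod from the template every 500 ms
  number: 1
```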
I guess the reason is that, in the previous test case, all the periodically created high-priority Pods were unschedulable, piled up in the queue, and resulted in additional load for the scheduler.
I looked deeper into this phenomenon and noticed that the time spent processing unschedulable pods depends on the number of already scheduled pods (initialPods). Apparently, in PostFilter, the preemption plugin goes through all the pods even though the Pod is unschedulable on the Node itself. So the preemption scenario is faster, as the preemption plugin finds candidates very quickly, and the time to preempt them (sending API calls synchronously) is actually smaller than going through all the remaining candidates.

In the context of #126858, in cases where there are thousands of pods running, making API calls asynchronously for the unschedulable pods may not bring the expected improvements, as going through all pod candidates may be more expensive than the API call itself. I can imagine how high-priority unschedulable pods can block scheduling for some time.
Which jobs are failing?
ci-benchmark-scheduler-perf-master
Which tests are failing?
PreemptionAsync and Unschedulable test cases
Since when has it been failing?
17th Oct 2024
Testgrid link
https://testgrid.k8s.io/sig-scalability-benchmarks#scheduler-perf
Reason for failure (if possible)
The PreemptionAsync test is failing because of a context deadline error:
Caused by #127829
The Unschedulable test is failing because the configured threshold is too high:
Caused by #128153
After making these changes, the threshold should be lowered further; a value around 270-280 seems right at the moment.
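For illustration, adjusting it would be a small change in the workload definition; the snippet below is a hypothetical sketch of how such a per-workload throughput threshold (in pods/s) might be expressed, and the field name and placement are assumptions rather than the actual config schema:

```yaml
# Hypothetical sketch: workload name, params, and the threshold field/placement
# are assumptions, not the real performance-config.yaml entry.
workloads:
  - name: 5000Nodes
    labels: [performance]
    params:
      initNodes: 5000
    threshold: 270   # minimum acceptable scheduling throughput (pods/s)
```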
Anything else we need to know?
You should ignore the first row in testgrid (k8s.io/kubernetes/test/integration/scheduler_perf.scheduler_perf), as it doesn't affect the results, and check only the second one (ci-benchmark-scheduler-perf-master.Overall).
Relevant SIG(s)
/sig scheduling