
integration flake: TestMultiScheduler is flaky #22848

Closed
wojtek-t opened this issue Mar 11, 2016 · 10 comments · Fixed by #23717
Assignees
mml
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@wojtek-t
Member

=== RUN   TestMultiScheduler
--- FAIL: TestMultiScheduler (15.06s)
    scheduler_test.go:370: Test MultiScheduler: pod-with-no-annotation Pod scheduled
    scheduler_test.go:377: Test MultiScheduler: pod-with-annotation-fits-default Pod scheduled
    scheduler_test.go:384: Test MultiScheduler: pod-with-annotation-fits-foo Pod not scheduled
    scheduler_test.go:408: Test MultiScheduler: pod-with-annotation-fits-foo Pod scheduled
    scheduler_test.go:439: Test MultiScheduler: pod-with-no-annotation2 Pod got scheduled, <nil>
    scheduler_test.go:447: Test MultiScheduler: pod-with-annotation-fits-default2 Pod scheduled

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/22835/kubernetes-pull-test-unit-integration/18111/build-log.txt
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/22835/kubernetes-pull-test-unit-integration/18111/?debugUI=CLOUD

@davidopp

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. team/control-plane kind/flake Categorizes issue or PR as related to a flaky test. labels Mar 11, 2016
@davidopp
Member

@wojtek-t I assume #22835 isn't actually a fix for this, just a way to get faster logs?

@wojtek-t
Member Author

@davidopp Actually #22835 is only where the test failed. To be honest, I didn't have time to look into it at all.

@mml mml self-assigned this Mar 11, 2016
@davidopp
Member

@mml any progress on this?

@ixdy
Member

ixdy commented Mar 31, 2016

http://pr-test.k8s.io/23415/kubernetes-pull-test-unit-integration/19839/ is another failure.

scheduler_test.go:370: Test MultiScheduler: pod-with-no-annotation Pod scheduled
scheduler_test.go:377: Test MultiScheduler: pod-with-annotation-fits-default Pod scheduled
scheduler_test.go:384: Test MultiScheduler: pod-with-annotation-fits-foo Pod not scheduled
scheduler_test.go:408: Test MultiScheduler: pod-with-annotation-fits-foo Pod scheduled
scheduler_test.go:439: Test MultiScheduler: pod-with-no-annotation2 Pod got scheduled, <nil>
scheduler_test.go:447: Test MultiScheduler: pod-with-annotation-fits-default2 Pod scheduled

@mml
Contributor

mml commented Mar 31, 2016

Only this one is an actual error.

    scheduler_test.go:439: Test MultiScheduler: pod-with-no-annotation2 Pod got scheduled, <nil>

And... the logging logic doesn't make much sense. It logs "err", but it only gets there if err was nil.
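For reference, this is roughly the shape of that check (a sketch only; podScheduled, cs, and pod are illustrative names, not the exact test source):

    // The pod is expected to stay Pending because "its" scheduler was stopped,
    // so a nil error from the wait is the failure case.
    err := wait.Poll(time.Second, 5*time.Second, podScheduled(cs, pod.Namespace, pod.Name))
    if err == nil {
        // We only reach this branch when err is nil, so logging err can only
        // ever print "<nil>" — which is the unhelpful message seen above.
        t.Errorf("Test MultiScheduler: %s Pod got scheduled, %v", pod.Name, err)
    }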

I can't see any other logs besides the junit logs, which don't provide any extra info. Am I missing something?

@mml
Contributor

mml commented Mar 31, 2016

There is a concurrency bug in the test. We signal the scheduler to stop by closing a channel but we don't synchronize at this point to verify that the scheduler has stopped. Indeed, since scheduleOne blocks on NextPod in the absence of any work, we could sit around a very long time, schedule one more thing, and only then get the signal to exit.

We need to synchronize here, but once we do that, we may need to prevent the deadlock that occurs while scheduleOne blocks forever waiting for NextPod().

I don't know that this is the bug, but the bug is there and it fits the symptom.
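A minimal sketch of the race described above, assuming a wait.Until-style scheduling loop (simplified; nextPod and schedule are illustrative, not the real scheduler code):

    stop := make(chan struct{})
    go wait.Until(func() {
        pod := nextPod() // blocks until a pod shows up — possibly long after stop is closed
        schedule(pod)    // so one more pod can still get scheduled here
    }, 0, stop)

    // ... test steps 1-7 ...
    close(stop) // signals the loop to stop, but gives no guarantee it has actually exited

    // The test then creates pod-with-no-annotation2 and asserts it is NOT
    // scheduled — which loses the race if the old goroutine picks it up first.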

@mml
Contributor

mml commented Mar 31, 2016

This is similar to the bug I fixed with #22727. The Until() function seems susceptible to this by design. Not only does it provide no way to verify that the goroutine has been stopped, it makes it hard to avoid deadlock because the provided f may block forever.

We should probably redo this interface to avoid the problem, although in practice I bet it only pops up in tests. @davidopp I see two choices: invest in fixing this now, or comment out all the tests that come after the assumption "now we've stopped this scheduler" and fix it later. WDYT?
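One possible shape for the synchronization (a sketch only, not what #22727 does): have the caller wait for the loop goroutine to exit, which also makes the deadlock risk explicit:

    done := make(chan struct{})
    go func() {
        defer close(done)
        wait.Until(scheduleOne, 0, stop)
    }()

    close(stop)
    <-done // only returns once the scheduling loop has really exited...
    // ...and only if scheduleOne itself returns; if it blocks forever in
    // NextPod(), this wait deadlocks — exactly the tension described above.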

@davidopp
Member

When you say

    comment out all the tests that come after the assumption "now we've stopped this scheduler"

Are you talking about everything after step 7?

@mml
Contributor

mml commented Mar 31, 2016

@davidopp Yes.

@davidopp
Member

I guess it is fine to comment out everything after step 7 for now, if you don't see any easy alternative. The key is that we test that the right scheduler is scheduling the pod, and while steps 8/9 do that, earlier parts of the test also do it and I think they're adequate.

Please file an issue related to the problem you described, so we can fix it eventually.
