
integration flake: TestMultiScheduler is flaky #22848

Closed
wojtek-t opened this issue Mar 11, 2016 · 10 comments · Fixed by #23717
Assignees
mml
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@wojtek-t
Member

=== RUN   TestMultiScheduler
--- FAIL: TestMultiScheduler (15.06s)
    scheduler_test.go:370: Test MultiScheduler: pod-with-no-annotation Pod scheduled
    scheduler_test.go:377: Test MultiScheduler: pod-with-annotation-fits-default Pod scheduled
    scheduler_test.go:384: Test MultiScheduler: pod-with-annotation-fits-foo Pod not scheduled
    scheduler_test.go:408: Test MultiScheduler: pod-with-annotation-fits-foo Pod scheduled
    scheduler_test.go:439: Test MultiScheduler: pod-with-no-annotation2 Pod got scheduled, <nil>
    scheduler_test.go:447: Test MultiScheduler: pod-with-annotation-fits-default2 Pod scheduled

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/22835/kubernetes-pull-test-unit-integration/18111/build-log.txt
https://pantheon.corp.google.com/storage/browser/kubernetes-jenkins/pr-logs/pull/22835/kubernetes-pull-test-unit-integration/18111/?debugUI=CLOUD

@davidopp

@wojtek-t wojtek-t added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. team/control-plane kind/flake Categorizes issue or PR as related to a flaky test. labels Mar 11, 2016
@davidopp
Member

@wojtek-t I assume #22835 isn't actually a fix for this, just a way to get faster logs?

@wojtek-t
Member Author

@davidopp Actually #22835 is only where the test failed. To be honest, I didn't have time to look into it at all.

@mml mml self-assigned this Mar 11, 2016
@davidopp
Member

@mml any progress on this?

@ixdy
Member

ixdy commented Mar 31, 2016

http://pr-test.k8s.io/23415/kubernetes-pull-test-unit-integration/19839/ is another failure.

scheduler_test.go:370: Test MultiScheduler: pod-with-no-annotation Pod scheduled
scheduler_test.go:377: Test MultiScheduler: pod-with-annotation-fits-default Pod scheduled
scheduler_test.go:384: Test MultiScheduler: pod-with-annotation-fits-foo Pod not scheduled
scheduler_test.go:408: Test MultiScheduler: pod-with-annotation-fits-foo Pod scheduled
scheduler_test.go:439: Test MultiScheduler: pod-with-no-annotation2 Pod got scheduled, <nil>
scheduler_test.go:447: Test MultiScheduler: pod-with-annotation-fits-default2 Pod scheduled

@mml
Contributor

mml commented Mar 31, 2016

Only this one is an actual error.

    scheduler_test.go:439: Test MultiScheduler: pod-with-no-annotation2 Pod got scheduled, <nil>

And... the logging logic doesn't make much sense. It logs "err", but it only gets there if err was nil.
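For reference, this is roughly the shape of that check (a sketch only; podScheduled, cs, and pod are illustrative names, not the exact test source):

    // The pod is expected to stay Pending because "its" scheduler was stopped,
    // so a nil error from the wait is the failure case.
    err := wait.Poll(time.Second, 5*time.Second, podScheduled(cs, pod.Namespace, pod.Name))
    if err == nil {
        // We only reach this branch when err is nil, so logging err can only
        // ever print "<nil>" — which is the unhelpful message seen above.
        t.Errorf("Test MultiScheduler: %s Pod got scheduled, %v", pod.Name, err)
    }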

I can't see any other logs besides the junit logs, which don't provide any extra info. Am I missing something?

@mml
Contributor

mml commented Mar 31, 2016

There is a concurrency bug in the test. We signal the scheduler to stop by closing a channel but we don't synchronize at this point to verify that the scheduler has stopped. Indeed, since scheduleOne blocks on NextPod in the absence of any work, we could sit around a very long time, schedule one more thing, and only then get the signal to exit.

We need to synchronize here, but once we do that, we may need to prevent the deadlock that occurs while scheduleOne blocks forever waiting for NextPod().

I don't know that this is the bug, but the bug is there and it fits the symptom.
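A minimal sketch of the race described above, assuming a wait.Until-style scheduling loop (simplified; nextPod and schedule are illustrative, not the real scheduler code):

    stop := make(chan struct{})
    go wait.Until(func() {
        pod := nextPod() // blocks until a pod shows up — possibly long after stop is closed
        schedule(pod)    // so one more pod can still get scheduled here
    }, 0, stop)

    // ... test steps 1-7 ...
    close(stop) // signals the loop to stop, but gives no guarantee it has actually exited

    // The test then creates pod-with-no-annotation2 and asserts it is NOT
    // scheduled — which loses the race if the old goroutine picks it up first.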

@mml
Contributor

mml commented Mar 31, 2016

This is similar to the bug I fixed with #22727. The Until() function seems susceptible to this by design. Not only does it provide no way to verify that the goroutine has been stopped, it makes it hard to avoid deadlock because the provided f may block forever.

We should probably redo this interface to avoid the problem, although in practice I bet it only pops up in tests. @davidopp I see two choices: invest in fixing this now, or comment out all the tests that come after the assumption "now we've stopped this scheduler" and fix it later. WDYT?
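One possible shape for the synchronization (a sketch only, not what #22727 does): have the caller wait for the loop goroutine to exit, which also makes the deadlock risk explicit:

    done := make(chan struct{})
    go func() {
        defer close(done)
        wait.Until(scheduleOne, 0, stop)
    }()

    close(stop)
    <-done // only returns once the scheduling loop has really exited...
    // ...and only if scheduleOne itself returns; if it blocks forever in
    // NextPod(), this wait deadlocks — exactly the tension described above.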

@davidopp
Member

When you say

    comment out all the tests that come after the assumption "now we've stopped this scheduler"

Are you talking about everything after step 7?

@mml
Contributor

mml commented Mar 31, 2016

@davidopp Yes.

@davidopp
Member

I guess it is fine to comment out everything after step 7 for now, if you don't see any easy alternative. The key is that we test that the right scheduler is scheduling the pod, and while steps 8/9 do that, earlier parts of the test also do it and I think they're adequate.

Please file an issue related to the problem you described, so we can fix it eventually.
