Broken test: [k8s.io] SchedulerPredicates [Serial] validates MaxPods limit number of pods that are allowed to run [Slow] #24262

Closed
lavalamp opened this issue Apr 14, 2016 · 22 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@lavalamp
Member

This isn't a flake; the test is just broken. Starting from here:

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-serial/1064/

https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-serial/1064/

14:41:13 
14:41:13 • Failure [289.101 seconds]
14:41:13 [k8s.io] SchedulerPredicates [Serial]
14:41:13 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:420
14:41:13   validates MaxPods limit number of pods that are allowed to run [Slow] [It]
14:41:13   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduler_predicates.go:266
14:41:13 
14:41:13   Not scheduled Pods: []api.Pod(nil)
14:41:13   Expected
14:41:13       <int>: 0
14:41:13   to equal
14:41:13       <int>: 1
14:41:13 
14:41:13   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduler_predicates.go:107

I guess maybe something dies while it is scheduling everything?

@lavalamp lavalamp added team/control-plane kind/flake Categorizes issue or PR as related to a flaky test. labels Apr 14, 2016
@lavalamp
Member Author

@davidopp to triage. This is breaking kubernetes-e2e-gce-serial right now.

@lavalamp lavalamp added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 14, 2016
@david-mcmahon
Contributor

david-mcmahon commented Apr 18, 2016

kubernetes-e2e-gce-serial hasn't passed for several days. This is a gating job for this week's release.

@dchen1107
Member

I took a quick look at the failed builds of kubernetes-e2e-gce-serial this morning, and all of them were caused by this very test. I quickly checked why and noticed that most of them failed because of fluentd.

23:12:53 Apr 17 23:12:53.746: INFO: fluentd-elasticsearch-jenkins-e2e-minion-8ukz started at (0 container statuses recorded)

@david-mcmahon
Contributor

kubernetes-e2e-gce-serial is still not passing. Are we going to pass on the alpha release tomorrow?
My understanding is that kubernetes-e2e-gce-serial is a gating job for a release.
@zmerlynn @roberthbailey

@david-mcmahon
Contributor

cc @krousey (build cop for a couple of days), though the issue has been around for a couple of weeks.

@krousey
Contributor

krousey commented Apr 21, 2016

ping @davidopp

@davidopp
Member

@gmarek can you take a look at this?

@gmarek
Contributor

gmarek commented Apr 22, 2016

Yup. It's a real bug:

$kubectl describe node e2e-test-gmarek-minion-n7fh
...
Non-terminated Pods:        (111 in total)
...

@gmarek
Contributor

gmarek commented Apr 22, 2016

My best guess is that it's caused by the parallel computation of bindings, as the number of Pods assigned to a Node is read from NodeInfo, not from the 'AssumedPods' struct. @hongchaodeng
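
To make that hypothesis concrete, here is a toy Go sketch (made-up names, not scheduler code) of the failure mode being described: if bindings are computed in parallel and each one checks a shared pod count that is only updated after a binding actually lands, every in-flight binding sees room for one more pod.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const maxPods = 110
	// Shared view of the node, analogous to reading the count from NodeInfo:
	// it is only bumped once a binding has landed, not when a pod is merely
	// "assumed" onto the node.
	var podsOnNode int64 = 109

	var admitted int64
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ { // five pods being bound "in parallel"
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each binder checks the stale shared count and sees 109 < 110.
			if atomic.LoadInt64(&podsOnNode) < maxPods {
				atomic.AddInt64(&admitted, 1)
				// In this toy, podsOnNode is never updated in time.
			}
		}()
	}
	wg.Wait()
	fmt.Printf("pods admitted: %d (only 1 more actually fits)\n", admitted)
}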

@gmarek
Contributor

gmarek commented Apr 22, 2016

The fun thing is that the tests started to fail in this way on Apr 12th, between 12:44 and 16:26 UTC, and as far as I can tell nothing was merged in the scheduler codebase back then.

@gmarek
Contributor

gmarek commented Apr 22, 2016

Run 1038 is the last 'good' one that I was able to find.

@hongchaodeng
Contributor

How can we reproduce it?

$kubectl describe node e2e-test-gmarek-minion-n7fh
...
Non-terminated Pods: (111 in total)

My best guess is that it's caused by parallel computation of bindings, as number of Pods assigned to a Node is read from NodeInfo

@gmarek Can you help clarify in more detail what's happening?

@gmarek
Contributor

gmarek commented Apr 22, 2016

Easily - just run the MaxPods test - it reliably fails :). If you want to dig a bit deeper, you can add the --delete-namespace=false flag when running the tests.

The scheduler assigned 111 pods to the node. It certainly looks like the binding issue, but as I wrote, nothing was merged at the time the failures started.

@wojtek-t
Member

But parallel binding wasn't merged at that point... so there has to be at least one other issue.

@gmarek
Contributor

gmarek commented Apr 22, 2016

Yeah - my point exactly. It may be that we were always broken and some environment change caused this.

@wojtek-t
Member

Hmm - take a look here:
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithm/predicates/predicates.go#L425

It seems we are not checking the maxPods limit at all. It's only part of PodFitsResources, which I don't think we are calling by default.
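
For readers following the pointer above, a minimal self-contained sketch of the shape of that check (toy types and names, not the real scheduler API): the max-pods cap is evaluated inside the PodFitsResources-style predicate, so if that predicate is not part of the default set, nothing enforces the cap at scheduling time.

package main

import "fmt"

// Toy stand-in for the scheduler's per-node bookkeeping; the real code keeps
// this in a NodeInfo structure.
type node struct {
	podCount int // pods already assigned to the node
	maxPods  int // the node's --max-pods capacity
	// CPU/memory fields omitted for brevity.
}

// podFitsResources sketches the predicate under discussion: the max-pods
// check is part of the resource-fit predicate, so dropping that predicate
// from the default set silently drops the pod-count cap as well.
func podFitsResources(n node) (bool, string) {
	if n.podCount+1 > n.maxPods {
		return false, "exceeded max pods on node"
	}
	// ...CPU and memory fit checks would follow here...
	return true, ""
}

func main() {
	n := node{podCount: 110, maxPods: 110}
	ok, reason := podFitsResources(n)
	fmt.Println(ok, reason) // false exceeded max pods on node
}

With podCount equal to maxPods, the 111th pod should be rejected - which is exactly what the e2e test expects and what the failing runs show not happening.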

@gmarek
Contributor

gmarek commented Apr 22, 2016

#20204 was merged at the time.

@gmarek
Contributor

gmarek commented Apr 22, 2016

(I was looking in the wrong place when checking what was merged back then) - I'll try to revert it.

@gmarek
Contributor

gmarek commented Apr 22, 2016

Yeah - #20204 broke our scheduler. The check for maxPods is not included in podFitsResourcesInternal, which is currently the only predicate registered (instead of PodFitsResources).

cc @davidopp @HaiyangDING @lavalamp

@lavalamp
Member Author

Sounds like this could have been caught by a unit or integration test.
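
For illustration, a sketch of the sort of table-driven unit test that could have caught this. It exercises a toy predicate (same assumed shape as the sketch earlier in the thread); a real regression test would have to run against the scheduler's actual default predicate set, since the bug was the max-pods check dropping out of the registered predicate rather than the check itself being wrong.

package scheduler

import "testing"

// Toy predicate standing in for the real resource-fit predicate.
func podFitsResources(podCount, maxPods int) bool {
	return podCount+1 <= maxPods
}

// TestMaxPodsEnforced: a node that already holds max-pods worth of pods must
// reject one more.
func TestMaxPodsEnforced(t *testing.T) {
	cases := []struct {
		name     string
		podCount int
		maxPods  int
		want     bool
	}{
		{"room left", 109, 110, true},
		{"node full", 110, 110, false},
	}
	for _, c := range cases {
		if got := podFitsResources(c.podCount, c.maxPods); got != c.want {
			t.Errorf("%s: podFitsResources(%d, %d) = %v, want %v",
				c.name, c.podCount, c.maxPods, got, c.want)
		}
	}
}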


@gmarek
Contributor

gmarek commented Apr 22, 2016

Yup - if you convince someone that rewriting the SchedulerPredicates tests into integration tests is a P0 more important than other things, I'm happy to do it :). Or I can help someone.

k8s-github-robot pushed a commit that referenced this issue Apr 23, 2016
Automatic merge from submit-queue

Enforce --max-pods in kubelet admission; previously was only enforced in scheduler

This is an ugly hack - I spent some time trying to understand what one NodeInfo has in common with the other, but at some point decided that I just don't have time for that.

Fixes #24262
Fixes #20263

cc @HaiyangDING @lavalamp
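
A minimal sketch (assumed toy types, not the kubelet's real admission API) of what enforcing --max-pods at kubelet admission means: before running a pod the scheduler has assigned to it, the kubelet compares its current pod count against its own --max-pods value and refuses the pod if the cap would be exceeded.

package main

import (
	"errors"
	"fmt"
)

// Toy kubelet-side admission check; the real kubelet wires this into its
// pod admission path so the cap holds even if the scheduler misbehaves.
type kubelet struct {
	runningPods int
	maxPods     int // value of --max-pods
}

var errTooManyPods = errors.New("node has reached its --max-pods limit")

func (k *kubelet) admitPod() error {
	if k.runningPods+1 > k.maxPods {
		return errTooManyPods
	}
	k.runningPods++
	return nil
}

func main() {
	k := &kubelet{runningPods: 110, maxPods: 110}
	if err := k.admitPod(); err != nil {
		fmt.Println("pod rejected at admission:", err) // the 111th pod is refused
	}
}

This duplicates the scheduler-side cap, so the limit still holds if the scheduler's predicate configuration regresses, as it did here.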