Broken test: [k8s.io] SchedulerPredicates [Serial] validates MaxPods limit number of pods that are allowed to run [Slow] #24262

Closed
lavalamp opened this issue Apr 14, 2016 · 22 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@lavalamp
Member

This isn't a flake; the test is just broken. Starting from here:

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-serial/1064/

https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-serial/1064/

14:41:13 
14:41:13 • Failure [289.101 seconds]
14:41:13 [k8s.io] SchedulerPredicates [Serial]
14:41:13 /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:420
14:41:13   validates MaxPods limit number of pods that are allowed to run [Slow] [It]
14:41:13   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduler_predicates.go:266
14:41:13 
14:41:13   Not scheduled Pods: []api.Pod(nil)
14:41:13   Expected
14:41:13       <int>: 0
14:41:13   to equal
14:41:13       <int>: 1
14:41:13 
14:41:13   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduler_predicates.go:107

I guess maybe something dies while it is scheduling everything?

@lavalamp lavalamp added team/control-plane kind/flake Categorizes issue or PR as related to a flaky test. labels Apr 14, 2016
@lavalamp
Member Author

@davidopp to triage. This is breaking kubernetes-e2e-gce-serial right now.

@lavalamp lavalamp added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 14, 2016
@david-mcmahon
Contributor

david-mcmahon commented Apr 18, 2016

kubernetes-e2e-gce-serial hasn't passed for several days. This is a gating job for this week's release.

@dchen1107
Member

I took a quick look at the failed builds of kubernetes-e2e-gce-serial this morning, and all of them were caused by this very test. I quickly checked why and noticed that most of them failed because of fluentd.

23:12:53 Apr 17 23:12:53.746: INFO: fluentd-elasticsearch-jenkins-e2e-minion-8ukz started at (0 container statuses recorded)

@david-mcmahon
Contributor

kubernetes-e2e-gce-serial is still not passing. Are we going to pass on the alpha release tomorrow?
My understanding is that kubernetes-e2e-gce-serial is a gating job for a release.
@zmerlynn @roberthbailey

@david-mcmahon
Contributor

cc @krousey (build cop for a couple of days), though the issue has been around for a couple of weeks.

@krousey
Contributor

krousey commented Apr 21, 2016

ping @davidopp

@davidopp
Member

@gmarek can you take a look at this?

@gmarek
Contributor

gmarek commented Apr 22, 2016

Yup. It's a real bug:

$kubectl describe node e2e-test-gmarek-minion-n7fh
...
Non-terminated Pods:        (111 in total)
...

@gmarek
Contributor

gmarek commented Apr 22, 2016

My best guess is that it's caused by the parallel computation of bindings, as the number of Pods assigned to a Node is read from NodeInfo, not from the 'AssumedPods' struct. @hongchaodeng
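
To make that hypothesis concrete, here is a toy Go sketch (made-up names, not scheduler code) of the failure mode being described: if bindings are computed in parallel and each one checks a shared pod count that is only updated after a binding actually lands, every in-flight binding sees room for one more pod.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	const maxPods = 110
	// Shared view of the node, analogous to reading the count from NodeInfo:
	// it is only bumped once a binding has landed, not when a pod is merely
	// "assumed" onto the node.
	var podsOnNode int64 = 109

	var admitted int64
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ { // five pods being bound "in parallel"
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each binder checks the stale shared count and sees 109 < 110.
			if atomic.LoadInt64(&podsOnNode) < maxPods {
				atomic.AddInt64(&admitted, 1)
				// In this toy, podsOnNode is never updated in time.
			}
		}()
	}
	wg.Wait()
	fmt.Printf("pods admitted: %d (only 1 more actually fits)\n", admitted)
}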

@gmarek
Contributor

gmarek commented Apr 22, 2016

The fun thing is that the tests started to fail in this way on Apr 12th, between 12:44 and 16:26 UTC, and as far as I can tell nothing was merged in the scheduler codebase back then.

@gmarek
Contributor

gmarek commented Apr 22, 2016

Run 1038 is the last 'good' one that I was able to find.

@hongchaodeng
Contributor

How can we reproduce it?

$kubectl describe node e2e-test-gmarek-minion-n7fh
...
Non-terminated Pods: (111 in total)

My best guess is that it's caused by parallel computation of bindings, as number of Pods assigned to a Node is read from NodeInfo

@gmarek Can you help clarify in more detail what's happening?

@gmarek
Contributor

gmarek commented Apr 22, 2016

Easily - just run the MaxPods test - it reliably fails :). If you want to dig a bit deeper, you can add the --delete-namespace=false flag when running the tests.

The scheduler assigned 111 pods to the node. It certainly looks like the binding issue, but as I wrote, nothing was merged at the time the failures started.

@wojtek-t
Member

But parallel binding wasn't merged at that point... so there has to be at least one other issue.

@gmarek
Contributor

gmarek commented Apr 22, 2016

Yeah - my point exactly. It may be that we were always broken and some environment change caused this.

@wojtek-t
Member

Hmm - take a look here:
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/scheduler/algorithm/predicates/predicates.go#L425

It seems we are not checking the maxPods limit at all. It's only part of PodFitsResources, which I don't think we are calling by default.
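
For readers following the pointer above, a minimal self-contained sketch of the shape of that check (toy types and names, not the real scheduler API): the max-pods cap is evaluated inside the PodFitsResources-style predicate, so if that predicate is not part of the default set, nothing enforces the cap at scheduling time.

package main

import "fmt"

// Toy stand-in for the scheduler's per-node bookkeeping; the real code keeps
// this in a NodeInfo structure.
type node struct {
	podCount int // pods already assigned to the node
	maxPods  int // the node's --max-pods capacity
	// CPU/memory fields omitted for brevity.
}

// podFitsResources sketches the predicate under discussion: the max-pods
// check is part of the resource-fit predicate, so dropping that predicate
// from the default set silently drops the pod-count cap as well.
func podFitsResources(n node) (bool, string) {
	if n.podCount+1 > n.maxPods {
		return false, "exceeded max pods on node"
	}
	// ...CPU and memory fit checks would follow here...
	return true, ""
}

func main() {
	n := node{podCount: 110, maxPods: 110}
	ok, reason := podFitsResources(n)
	fmt.Println(ok, reason) // false exceeded max pods on node
}

With podCount equal to maxPods, the 111th pod should be rejected - which is exactly what the e2e test expects and what the failing runs show not happening.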

@gmarek
Contributor

gmarek commented Apr 22, 2016

#20204 was merged at the time.

@gmarek
Contributor

gmarek commented Apr 22, 2016

(I was looking in the wrong place when checking what was merged back then) - I'll try to revert it.

@gmarek
Contributor

gmarek commented Apr 22, 2016

Yeah - #20204 broke our scheduler. The check for maxPods is not included in podFitsResourcesInternal, which is currently the only predicate registered (instead of PodFitsResources).

cc @davidopp @HaiyangDING @lavalamp

@lavalamp
Member Author

Sounds like this could have been caught by a unit or integration test.
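
For illustration, a sketch of the sort of table-driven unit test that could have caught this. It exercises a toy predicate (same assumed shape as the sketch earlier in the thread); a real regression test would have to run against the scheduler's actual default predicate set, since the bug was the max-pods check dropping out of the registered predicate rather than the check itself being wrong.

package scheduler

import "testing"

// Toy predicate standing in for the real resource-fit predicate.
func podFitsResources(podCount, maxPods int) bool {
	return podCount+1 <= maxPods
}

// TestMaxPodsEnforced: a node that already holds max-pods worth of pods must
// reject one more.
func TestMaxPodsEnforced(t *testing.T) {
	cases := []struct {
		name     string
		podCount int
		maxPods  int
		want     bool
	}{
		{"room left", 109, 110, true},
		{"node full", 110, 110, false},
	}
	for _, c := range cases {
		if got := podFitsResources(c.podCount, c.maxPods); got != c.want {
			t.Errorf("%s: podFitsResources(%d, %d) = %v, want %v",
				c.name, c.podCount, c.maxPods, got, c.want)
		}
	}
}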


@gmarek
Contributor

gmarek commented Apr 22, 2016

Yup - if you convince someone that rewriting the SchedulerPredicates tests into integration tests is a P0 more important than other things, I'm happy to do it :). Or I can help someone.

k8s-github-robot pushed a commit that referenced this issue Apr 23, 2016
Automatic merge from submit-queue

Enforce --max-pods in kubelet admission; previously was only enforced in scheduler

This is an ugly hack - I spent some time trying to understand what one NodeInfo has in common with the other, but at some point decided that I just don't have time for that.

Fixes #24262
Fixes #20263

cc @HaiyangDING @lavalamp
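
A minimal sketch (assumed toy types, not the kubelet's real admission API) of what enforcing --max-pods at kubelet admission means: before running a pod the scheduler has assigned to it, the kubelet compares its current pod count against its own --max-pods value and refuses the pod if the cap would be exceeded.

package main

import (
	"errors"
	"fmt"
)

// Toy kubelet-side admission check; the real kubelet wires this into its
// pod admission path so the cap holds even if the scheduler misbehaves.
type kubelet struct {
	runningPods int
	maxPods     int // value of --max-pods
}

var errTooManyPods = errors.New("node has reached its --max-pods limit")

func (k *kubelet) admitPod() error {
	if k.runningPods+1 > k.maxPods {
		return errTooManyPods
	}
	k.runningPods++
	return nil
}

func main() {
	k := &kubelet{runningPods: 110, maxPods: 110}
	if err := k.admitPod(); err != nil {
		fmt.Println("pod rejected at admission:", err) // the 111th pod is refused
	}
}

This duplicates the scheduler-side cap, so the limit still holds if the scheduler's predicate configuration regresses, as it did here.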