Scheduler & Kubelet are not responding immediately on work to do #6059

Closed
wojtek-t opened this issue Mar 27, 2015 · 104 comments · Fixed by etcd-io/etcd#2622 or #6567
Labels: area/etcd, priority/important-soon, sig/scalability
Milestone: v1.0

Comments

@wojtek-t
Member

While running performance tests (starting 30 pods per node in a 50-node cluster) I faced the following problem.

  1. Create replication controller and resize it to 1500. I saw the following logs:
    I0327 14:45:09.036772 14025 density.go:109] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 0 pods out of 1500
    I0327 14:45:16.918259 14025 density.go:109] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 835 pods out of 1500
    I0327 14:45:24.465116 14025 density.go:129] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 1500 pods out of 1500
  2. However, looking into the scheduler logs, the first time it attempts to schedule any of those pods corresponds to this log line:
    I0327 13:45:55.474005 4551 factory.go:166] About to try and schedule pod my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39bfjo6e

This is roughly ~45 seconds after the first pod was created. The question is what the scheduler was doing in the meantime - the only logs from that time are:
I0327 13:40:52.230407 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:40:52.231778 4551 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [4053/16]) [5052]
I0327 13:45:53.272954 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
I0327 13:45:53.273448 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:45:53.277590 4551 reflector.go:149] watch of *api.Pod ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [12989/5082]) [13988]
E0327 13:45:53.277823 4551 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [12989/5082]) [13988]
I0327 13:45:55.474005 4551 factory.go:166] About to try and schedule pod my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39bfjo6e

A very similar thing happened in the Kubelet. During the same test, the first pod was bound to minion "*0jxm" at:
I0327 13:46:00.291905 4551 factory.go:303] Attempting to bind my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b7aukl to e2e-test-wojtekt-minion-0jxm.c.groovy-sentry-504.internal

However, Kubelet reacted to it 50 seconds later:
I0327 13:46:50.560684 4712 kubelet.go:1099] No Infra Container for "my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b7aukl_default" found. All containers will be restarted.

Again, it seems that nothing was really happening during those 50 seconds. The Kubelet logs from that time are:
I0327 13:44:55.038542 4712 server.go:654] GET /healthz: (886.441µs) 0 [[monit/5.4] 127.0.0.1:37322]
E0327 13:46:43.817523 4712 reflector.go:149] watch of *api.Node ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17177/7110]) [18176]
I0327 13:46:46.492099 4712 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
I0327 13:46:46.492181 4712 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:46:47.256804 4712 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17354/6384]) [18353]
E0327 13:46:47.256950 4712 reflector.go:149] watch of *api.Pod ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17352/8087]) [18351]
I0327 13:46:50.558471 4712 kubelet.go:1099] No Infra Container for "my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b8hwgv_default" found. All containers will be restarted.
...
[and it has been working since then].

Investigate what is happening during those ~45-second periods of "nothing".
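
For context, the "401: The event in requested index is outdated and cleared" lines above come from etcd's v2 keys API, which only keeps a sliding window of the most recent changes (roughly 1000 events). Below is a minimal sketch of how the same error can be triggered directly against etcd; it assumes a local etcd v2 listening on the legacy 127.0.0.1:4001 client port, and the /registry/pods prefix and waitIndex value are only illustrative, not taken from the actual components.

// Sketch: open a watch from an index that has already fallen out of etcd's
// event-history window. etcd answers with error code 401 ("The event in
// requested index is outdated and cleared"), which is exactly what the
// reflector logs above show.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// waitIndex=1 is almost certainly older than the history window on a
	// cluster that has already seen thousands of writes.
	url := "http://127.0.0.1:4001/v2/keys/registry/pods?wait=true&recursive=true&waitIndex=1"
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	// Expected body (roughly): {"errorCode":401,"message":"The event in
	// requested index is outdated and cleared", ...}
	fmt.Println(string(body))
}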

cc @fgrzadkowski

wojtek-t added the sig/scalability and misc/perf-1.0 labels on Mar 27, 2015
wojtek-t self-assigned this on Mar 27, 2015
wojtek-t added this to the v1.0 milestone on Mar 27, 2015
@wojtek-t
Member Author

My feeling is that it might be related to some watch issues.

BTW - since I won't be able to work on it today, feel free to pick it up. If no one picks it up, I can start looking into it on Monday.

wojtek-t removed their assignment on Mar 27, 2015
@satnam6502
Contributor

/sub

@bprashanth
Contributor

However, looking into the scheduler logs, the first time it attempts to schedule any of those pods corresponds to this log line:

I can only reproduce tail latency (#3954). @wojtek-t are your performance tests checked in so I can play with them (preferably with cluster setup instructions)?

Edit: Oh, did you just run @satnam6502 's soak test?

@fgrzadkowski
Contributor

I've seen a similar problem, but in my case the scheduler never noticed new pods. Investigating...

AFAIU @wojtek-t has a similar setup to mine:

  • e2e cluster with 50 n1-standard-1 nodes and an n1-standard-4 master.
  • running density test with 30 pods per node.

@wojtek-t
Member Author

Yes - as Filip wrote, I'm using the Density e2e test to start 30 pods per node (see also #6051).

Also note that this is not happening all the time - it's just a kind of "flakiness".

@yujuhong
Contributor

To isolate the Kubelet performance further, I ran a simple test which resized the RC to create pods with PodSpec.Host set to a specific node. This bypassed the scheduler completely.

In short, it took ~30 seconds for 30 pods to be reported running on one node. This includes:

  • RC creates 30 pods
  • Kubelet receives pod updates from watch
  • Kubelet starts containers (I used kubernetes/pause images, so there should be no pulling time).
  • Kubelet reports the status back to the apiserver

The cluster size is irrelevant in this case, since only one node is being tested.

@wojtek-t
Member Author

I was running a very similar experiment (i.e. on a 1-node cluster) some time ago and also didn't observe it at all. Thus my feeling is that it might be related neither to the scheduler nor the Kubelet, but rather to etcd and/or the apiserver, which cannot handle the bigger load.

davidopp added the team/master and priority/important-soon labels on Mar 30, 2015
@wojtek-t
Member Author

In fact I think this is related to watch [I think that for some reason some changes reach the component through watch with a significant delay].

@fgrzadkowski
Contributor

@yujuhong I don't think cluster size is irrelevant in this experiment, because all the data is read from apiserver/etcd, whose performance is strictly related to the size of the cluster. As @wojtek-t said, we suspect that the problem is somewhere close to watch itself, but we need to dig deeper to understand it better.

@wojtek-t
Member Author

I've added some logging and I can confirm that this problem is not related to the scheduler or Kubelet themselves - the issue is (as I thought) that some changes reach those components through watch with a significant delay (e.g. 1 minute). I will continue investigating whether it's related to watch itself or maybe to high load on the apiserver or etcd.

@yujuhong
Contributor

@fgrzadkowski, I meant that for the Kubelet startup time alone, the cluster size was irrelevant (which was the case in my test). As for when the Kubelet receives the update for the first time, that's a different story :)
My test was simply meant to provide another data point, to help rule out the possibility that the Kubelet's performance is the bottleneck.

@dchen1107
Member

@zmerlynn reported an issue through #6139. After debugging, I think we ran into this issue a lot:

#6139 (comment)

I observed that both the scheduler and the Kubelet don't pick up the work immediately. The only pattern common to both components is the use of a watcher.

@lavalamp
Member

1500 writes to etcd is a lot. It's enough to expire the 1000-change history. This could mean that if the scheduler happens to be doing a re-list when the writes start, it won't be able to watch when it's done with the list, so it will have to do another list.

We should measure how long a list takes for the scheduler.

How sure are you that the clocks on your local system and the scheduler are relatively well synchronized?
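
To make that failure mode concrete, here is a minimal, self-contained sketch of the list-then-watch cycle being described; the list/watch functions are hypothetical stand-ins, not the real reflector or etcd client API:

// Sketch of the list-then-watch cycle: list a full snapshot, then try to
// watch from the version the snapshot was taken at. If that version has
// already fallen out of etcd's ~1000-event history (because the list itself
// took long enough for >1000 writes to land), the watch is rejected and the
// component has to pay for another full list.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errIndexCleared = errors.New("401: the requested history has been cleared")

func syncLoop(list func() (uint64, error), watch func(fromVersion uint64) error) {
	for {
		start := time.Now()
		version, err := list()
		if err != nil {
			continue
		}
		fmt.Printf("list finished at version %d after %v\n", version, time.Since(start))

		err = watch(version)
		if errors.Is(err, errIndexCleared) {
			fmt.Println("watch window already expired during the list; re-listing")
			continue
		}
		if err == nil {
			fmt.Println("watch established")
			return
		}
	}
}

func main() {
	// Toy stand-ins: the watch is rejected twice (as if >1000 writes happened
	// during each list) before finally being accepted.
	failures := 2
	syncLoop(
		func() (uint64, error) { return 4053, nil },
		func(uint64) error {
			if failures > 0 {
				failures--
				return errIndexCleared
			}
			return nil
		},
	)
}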

@wojtek-t
Member Author

@dchen1107 : thanks for looking into it too - what you observed in #6139 (comment) is exactly what I also observed for bigger clusters. I didn't observe it in e2e tests, but I also didn't try to find it in e2e tests. But my feeling is that it's much easier to replicate it in bigger clusters.

@lavalamp: yes, I agree that this generates high load on etcd - I will try to look deeper at how long "listing pods" takes - thanks for the suggestion!
Also - I'm pretty sure it's not related to time skew (I also had this hypothesis at some point but verified that it's not the case).

wojtek-t self-assigned this on Mar 31, 2015
@wojtek-t
Member Author

I tried to debug it deeper and have a few more findings. I've added a bunch of different logs and here is what is happening:

  1. The scheduler (or Kubelet) "opens" a watch for pods.
  2. For some time (say 3 minutes) no pods are changing, so there are no pod writes in etcd.
  3. In the meantime, a bunch of other writes to etcd are happening (e.g. we're updating NodeStatus once per ~2 seconds per node). In a 50-node cluster that is ~25 writes per second, so it's clear that more than 1000 writes can happen during that time.
  4. Then new pods are created (by the replication controller). However, they are not sent through this watch to the scheduler [this is something that I still don't understand].
  5. 5 minutes after the watch was opened, it expires. So we reopen it (still with the previous resourceVersion).
  6. Etcd reports an error: "The event in requested index is outdated and cleared","cause":"the requested history has been cleared"
  7. After that we "start from the beginning": we list all pods, update the internal storage and open the watch again. From that point everything starts to work fine (see the sketch below).

So the thing that we need to understand is point 4, i.e. why the changes are not delivered through a watch that was opened some time ago when more than 1000 other changes happened in the meantime (this appears to be the trigger). (I'm not 100% sure that those 1000 other changes are not a red herring, but this seems to be the most probable cause.)
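
To make step 7 concrete, the recovery at the etcd v2 HTTP level is roughly: do a plain GET (the "list"), read the current index from the X-Etcd-Index response header, and reopen the watch just past it. A minimal sketch follows; it assumes a local etcd v2 on the legacy 127.0.0.1:4001 client port and uses the /registry/pods prefix, but it is an illustration, not the actual apiserver/reflector code:

// Sketch of step 7: after the old waitIndex is rejected with error 401,
// re-list to learn the current etcd index and reopen the watch from there.
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

const base = "http://127.0.0.1:4001/v2/keys/registry/pods"

func main() {
	// "List": a plain recursive GET returns the full current state and the
	// cluster-wide index it corresponds to in the X-Etcd-Index header.
	resp, err := http.Get(base + "?recursive=true")
	if err != nil {
		fmt.Println("list failed:", err)
		return
	}
	resp.Body.Close()

	index, err := strconv.ParseUint(resp.Header.Get("X-Etcd-Index"), 10, 64)
	if err != nil {
		fmt.Println("no usable X-Etcd-Index header:", err)
		return
	}

	// Reopen the watch just past the index we listed at; anything older than
	// the ~1000-event history window would be rejected again.
	watchURL := fmt.Sprintf("%s?wait=true&recursive=true&waitIndex=%d", base, index+1)
	fmt.Println("re-watching from", watchURL)
	watchResp, err := http.Get(watchURL) // blocks until the next change under the prefix
	if err != nil {
		fmt.Println("watch failed:", err)
		return
	}
	watchResp.Body.Close()
	fmt.Println("received first event after re-list")
}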

@dchen1107
Member

cc/ @xiang90 coreos etcd expert.

Xiang, is this the same / similar issue as etcd-io/etcd/issues/2048? We are running etcd 2.X, and go-etcd client is 0.4.6.

@smarterclayton
Contributor

This is the same issue.

@derekwaynecarr
Member

I don't think item 4 in his sequence is the same issue...

@xiang90
Contributor

xiang90 commented Mar 31, 2015

5 minutes after the watch was opened, it expires. So we reopen it (still with the previous resourceVersion).

This has been fixed in etcd2 I think. @wojtek-t

@dchen1107
Member

@smarterclayton and @derekwaynecarr I think it is the same issue - partially fixed, but not fully, in etcd 2.X - and item 4 is the new symptom.

@wojtek-t
Member Author

wojtek-t commented Apr 3, 2015

@xiang90 Thanks a lot for fixing it. Regarding canceling - yes, we are aware of it. There is already code that handles it.
One more thing - IIUC, in order for this change to be effective, we need a new release of etcd, right? So when can we expect the new release?

@lavalamp
Member

lavalamp commented Apr 3, 2015

Yeah getting dropped due to high load should be handled correctly already.

@xiang90
Contributor

xiang90 commented Apr 3, 2015

@wojtek-t Yes. I will have a release early next week. You can try with the current master though.

@davidopp
Member

davidopp commented Apr 6, 2015

@wojtek-t Can you take responsibility for testing the new version of etcd (once it is released) to verify that it fixes this problem? You should also roll back @lavalamp's workaround (#6381).

@wojtek-t
Member Author

wojtek-t commented Apr 7, 2015

@davidopp Sure - will take care of it.

@wojtek-t
Member Author

wojtek-t commented Apr 7, 2015

@xiang90 Can you please ping this thread (or me) when the new etcd release is built? Thanks!

@dchen1107
Member

@xiang90 since the fix is very simple, can we have a patch release here?

@xiang90
Contributor

xiang90 commented Apr 7, 2015

@wojtek-t @dchen1107 We are doing it. We will release it later today or early tomorrow.

@dchen1107
Member

Please let us know when it is ready. We can rebuild our Docker image with the release and start the test again.

@yichengq
Contributor

yichengq commented Apr 7, 2015

@dchen1107 We just cut the etcd v2.0.9: https://github.com/coreos/etcd/tree/v2.0.9
It contains the bug fix for the watcher issue.

@dchen1107
Member

Great. I will build our Docker image for this; @wojtek-t, you can pick it up tonight (Pacific time) for your performance validation. Thanks @yichengq!

@dchen1107
Member

@wojtek-t You can upgrade your cluster to 2.0.9 and validate the fix now.

brendandburns pushed a commit to brendandburns/kubernetes that referenced this issue Apr 8, 2015
(cherry picked from commit 6be09c5)
@wojtek-t
Member Author

wojtek-t commented Apr 8, 2015

@dchen1107 Will do it today. Thanks for creating the image.

wojtek-t added a commit to wojtek-t/kubernetes that referenced this issue Apr 8, 2015
@wojtek-t
Member Author

wojtek-t commented Apr 8, 2015

I wasn't able to reproduce this problem with the new version of etcd (it was very easy to reproduce with the previous version).
So I'm going to revert the temporary workaround done by @lavalamp and close this issue.

fgrzadkowski added a commit that referenced this issue Apr 8, 2015
Revert "Stop the bleeding for #6059"
@davidopp
Member

davidopp commented Apr 8, 2015

Yay!
