Scheduler & Kubelet are not responding immediately on work to do #6059
My feeling is that it might be related to some watch issues. BTW - since I won't be able to work on it today, feel free to pick it up. If no one picks it up, I can start looking into it on Monday.
/sub
I can only reproduce tail latency (#3954). @wojtek-t are your performance tests checked in so I can play with them (preferably with cluster setup instructions)? Edit: Oh, did you just run @satnam6502's soak test?
I've seen a similar problem, but in my case the scheduler never noticed new pods. Investigating... AFAIU @wojtek-t has a similar setup to mine.
Yes - as Filip wrote, I'm using the Density e2e test to start 30 pods per node (see also #6051). Also note that this is not happening all the time - it's just a kind of "flakiness".
To isolate the kubelet performance more, I ran a simple test which resized the RC to create pods. In short, it took ~30 seconds for 30 pods to be reported running on one node.
The cluster size is irrelevant in this case, since only one node is being tested.
I was running a very similar experiment (i.e. a 1-node cluster) some time ago and also didn't observe it at all. Thus my feeling is that it might not be related to either the scheduler or the kubelet, but probably to etcd and/or the apiserver, which cannot handle a bigger load.
In fact I think this is related to watch [I think that for some reason some changes reach the component through watch with a significant delay]
@yujuhong I don't think cluster size is irrelevant in this experiment, because all the data is read from apiserver/etcd, whose performance is strictly related to the size of the cluster. As @wojtek-t said, we suspect that the problem is somewhere close to watch itself, but we need to dig deeper to understand it better.
I've added some logging and I can confirm that this problem is not related to the scheduler or kubelet themselves - the issue is (as I thought) that some changes reach those components through watch with a significant delay (e.g. 1 minute). I will continue investigating whether it's related to watch itself or maybe to high load on the apiserver or etcd.
@fgrzadkowski, I meant that for the kubelet startup time alone, the cluster size is irrelevant (which was the case in my test). As for when the kubelet receives the update for the first time, that's a different story :)
1500 writes to etcd is a lot. It's enough to expire the 1000-entry change history. This could mean that if the scheduler happens to be doing a re-list when the writes start, it won't be able to watch once it's done with the list, so it will have to do another list. We should measure how long a list takes the scheduler. How sure are you that the clocks on your local system and the scheduler are relatively well synchronized?
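For illustration, here is a minimal Go sketch of the list-then-watch cycle described above. The listPods/watchPods helpers and all numbers are hypothetical stand-ins (not the real reflector or etcd client API); the point is only to show why a slow LIST during a write burst can force repeated re-lists:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errIndexCleared stands in for etcd v2's error 401
// ("The event in requested index is outdated and cleared").
var errIndexCleared = errors.New("401: event index outdated and cleared")

// currentIndex and remainingBurst simulate etcd's global index advancing
// while a burst of ~1500 pod writes is in flight. Purely illustrative numbers.
var (
	currentIndex   uint64 = 5082
	remainingBurst uint64 = 1500
)

// listPods is a hypothetical stand-in for the reflector's LIST call: it takes
// a while on a loaded apiserver and returns the resourceVersion it observed.
func listPods() uint64 {
	time.Sleep(100 * time.Millisecond)
	rv := currentIndex
	step := remainingBurst // writes that land while we are still listing
	if step > 1200 {
		step = 1200
	}
	currentIndex += step
	remainingBurst -= step
	return rv
}

// watchPods is a hypothetical stand-in for the WATCH call. etcd v2 keeps only
// the last 1000 events, so resuming from an older index is rejected.
func watchPods(fromRV uint64) error {
	const etcdEventHistory = 1000
	if currentIndex-fromRV > etcdEventHistory {
		return errIndexCleared
	}
	return nil // in reality: block here and stream events to the scheduler
}

func main() {
	for {
		rv := listPods()
		if err := watchPods(rv); errors.Is(err, errIndexCleared) {
			// The listed resourceVersion already fell out of etcd's event
			// history, so the only recovery is another full LIST.
			fmt.Println("watch expired, re-listing:", err)
			continue
		}
		fmt.Println("watch established")
		return
	}
}
```

Under a sustained burst this loop keeps falling back to LIST, which from the outside looks like the scheduler doing nothing for tens of seconds.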
@dchen1107: thanks for looking into it too - what you observed in #6139 (comment) is exactly what I also observed for bigger clusters. I didn't observe it in e2e tests, but I also didn't try to find it there. Still, my feeling is that it's much easier to replicate in bigger clusters. @lavalamp: yes, I agree that this generates high load on etcd - I will try to look deeper at how long "listing pods" lasts - thanks for the suggestion!
I tried to debug it deeper and have a few more findings. I've added a bunch of different logs and here is what is happening:
So the thing that we need to understand is point 4, i.e. why the changes are not delivered through a watch that was opened some time ago when more than 1000 other changes happened in the meantime (I'm not 100% sure that those 1000 other changes are not a red herring, but this seems to be the most probable cause).
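For reference, a minimal sketch of the server side, assuming etcd v2's documented behavior of keeping only the last 1000 events; the types, field names, and exact off-by-one details are illustrative, not etcd's real code. It shows why a watch resumed from an old index gets the 401 error seen in the logs below:

```go
package main

import "fmt"

// A simplified model of etcd v2's in-memory event history: a sliding window
// over the most recent writes (1000 entries in etcd). Types and fields are
// illustrative, not etcd's actual implementation.
type eventHistory struct {
	capacity     uint64
	currentIndex uint64
}

// startIndex is the oldest index still held in the window.
func (h eventHistory) startIndex() uint64 {
	if h.currentIndex <= h.capacity {
		return 0
	}
	return h.currentIndex - h.capacity
}

// watchFrom rejects a watch that tries to resume from an index that has
// already been evicted, which is what the 401 errors in the logs report.
func (h eventHistory) watchFrom(sinceIndex uint64) error {
	if sinceIndex < h.startIndex() {
		return fmt.Errorf("401: the event in requested index is outdated and cleared "+
			"(the requested history has been cleared [%d/%d])", h.startIndex(), sinceIndex)
	}
	return nil // otherwise: replay the buffered events from sinceIndex onward
}

func main() {
	// Indexes roughly borrowed from the scheduler log in this issue: the
	// client tries to resume from 5082 while the window has moved far past it.
	h := eventHistory{capacity: 1000, currentIndex: 13989}
	fmt.Println(h.watchFrom(5082)) // prints the [12989/5082]-style error
}
```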
cc/ @xiang90, CoreOS etcd expert. Xiang, is this the same/similar issue as etcd-io/etcd/issues/2048? We are running etcd 2.X, and the go-etcd client is 0.4.6.
This is the same issue.
I don't think item 4 in his sequence is the same issue...
This has been fixed in etcd2, I think. @wojtek-t
@smarterclayton and @derekwaynecarr I think it is the same issue; some fixes are in etcd 2.X but not the full fix, and item 4 is the new symptom.
@xiang90 Thanks a lot for fixing it. Regarding canceling - yes, we are aware of it. There is already code that handles it.
Yeah, getting dropped due to high load should be handled correctly already.
@wojtek-t Yes. I will have a release early next week. You can try with the current master though.
@davidopp Sure - will take care of it.
@xiang90 Can you please ping this thread/me when the new etcd release is built? Thanks!
@xiang90 since the fix is very simple, can we have a patch release here?
@wojtek-t @dchen1107 We are doing it. We will release it later today or early tomorrow.
Please let us know when it is ready. We can rebuild our Docker image with the release and start the test again.
@dchen1107 We just cut etcd v2.0.9: https://github.com/coreos/etcd/tree/v2.0.9
@wojtek-t You can upgrade your cluster to 2.0.9 and validate the fix now.
(cherry picked from commit 6be09c5)
@dchen1107 Will do it today. Thanks for creating the image.
This reverts commit 6be09c5.
I wasn't able to reproduce this problem with the new version of etcd (it was very easy to reproduce with the previous version).
Revert "Stop the bleeding for #6059"
Yay!
While running performance tests (starting 30 pods per node in a 50-node cluster) I faced the following problem.
I0327 14:45:09.036772 14025 density.go:109] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 0 pods out of 1500
I0327 14:45:16.918259 14025 density.go:109] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 835 pods out of 1500
I0327 14:45:24.465116 14025 density.go:129] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 1500 pods out of 1500
I0327 13:45:55.474005 4551 factory.go:166] About to try and schedule pod my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39bfjo6e
This is roughly ~45 seconds after the first pod was created. The question is what the scheduler was doing in the meantime - the only logs from that time are:
I0327 13:40:52.230407 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:40:52.231778 4551 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [4053/16]) [5052]
I0327 13:45:53.272954 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
I0327 13:45:53.273448 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:45:53.277590 4551 reflector.go:149] watch of *api.Pod ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [12989/5082]) [13988]
E0327 13:45:53.277823 4551 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [12989/5082]) [13988]
I0327 13:45:55.474005 4551 factory.go:166] About to try and schedule pod my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39bfjo6e
A very similar thing happened in the Kubelet. During the same test, the first pod was bound to minion "*0jxm" at:
I0327 13:46:00.291905 4551 factory.go:303] Attempting to bind my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b7aukl to e2e-test-wojtekt-minion-0jxm.c.groovy-sentry-504.internal
However, Kubelet reacted to it 50 seconds later:
I0327 13:46:50.560684 4712 kubelet.go:1099] No Infra Container for "my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b7aukl_default" found. All containers will be restarted.
Again, it seems that nothing was really happening during those 50 seconds. The Kubelet logs from that time are:
I0327 13:44:55.038542 4712 server.go:654] GET /healthz: (886.441µs) 0 [[monit/5.4] 127.0.0.1:37322]
E0327 13:46:43.817523 4712 reflector.go:149] watch of *api.Node ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17177/7110]) [18176]
I0327 13:46:46.492099 4712 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
I0327 13:46:46.492181 4712 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:46:47.256804 4712 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17354/6384]) [18353]
E0327 13:46:47.256950 4712 reflector.go:149] watch of *api.Pod ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17352/8087]) [18351]
I0327 13:46:50.558471 4712 kubelet.go:1099] No Infra Container for "my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b8hwgv_default" found. All containers will be restarted.
...
[and it has been working since then].
Investigate what is happening during those ~45-second periods of "nothing".
cc @fgrzadkowski