Scheduler & Kubelet are not responding immediately on work to do #6059

Closed
wojtek-t opened this issue Mar 27, 2015 · 104 comments · Fixed by etcd-io/etcd#2622 or #6567
Labels: area/etcd, priority/important-soon, sig/scalability
Milestone: v1.0

Comments

@wojtek-t
Member

While running performance tests (starting 30 pods per node in a 50-node cluster) I faced the following problem.

  1. Create replication controller and resize it to 1500. I saw the following logs:
    I0327 14:45:09.036772 14025 density.go:109] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 0 pods out of 1500
    I0327 14:45:16.918259 14025 density.go:109] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 835 pods out of 1500
    I0327 14:45:24.465116 14025 density.go:129] Controller my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b: Found 1500 pods out of 1500
  2. However, looking into the scheduler logs, the first time it attempts to schedule any of those pods corresponds to this log line:
    I0327 13:45:55.474005 4551 factory.go:166] About to try and schedule pod my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39bfjo6e

This is roughly ~45 seconds after the first pod was created. The question is what the scheduler was doing in the meantime - the only logs from that time are:
I0327 13:40:52.230407 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:40:52.231778 4551 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [4053/16]) [5052]
I0327 13:45:53.272954 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
I0327 13:45:53.273448 4551 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:45:53.277590 4551 reflector.go:149] watch of *api.Pod ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [12989/5082]) [13988]
E0327 13:45:53.277823 4551 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [12989/5082]) [13988]
I0327 13:45:55.474005 4551 factory.go:166] About to try and schedule pod my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39bfjo6e

A very similar thing happened in the Kubelet. During the same test, the first pod was bound to minion "*0jxm" at:
I0327 13:46:00.291905 4551 factory.go:303] Attempting to bind my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b7aukl to e2e-test-wojtekt-minion-0jxm.c.groovy-sentry-504.internal

However, Kubelet reacted to it 50 seconds later:
I0327 13:46:50.560684 4712 kubelet.go:1099] No Infra Container for "my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b7aukl_default" found. All containers will be restarted.

Again, it seems that nothing was really happening during those 50 seconds. The Kubelet logs from that time are:
I0327 13:44:55.038542 4712 server.go:654] GET /healthz: (886.441µs) 0 [[monit/5.4] 127.0.0.1:37322]
E0327 13:46:43.817523 4712 reflector.go:149] watch of *api.Node ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17177/7110]) [18176]
I0327 13:46:46.492099 4712 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
I0327 13:46:46.492181 4712 iowatcher.go:91] Unexpected EOF during watch stream event decoding: unexpected EOF
E0327 13:46:47.256804 4712 reflector.go:149] watch of *api.Service ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17354/6384]) [18353]
E0327 13:46:47.256950 4712 reflector.go:149] watch of *api.Pod ended with error: 401: The event in requested index is outdated and cleared (the requested history has been cleared [17352/8087]) [18351]
I0327 13:46:50.558471 4712 kubelet.go:1099] No Infra Container for "my-hostname-density30-7bb294b4-d487-11e4-9336-a0481cabf39b8hwgv_default" found. All containers will be restarted.
...
[and it has been working since then].

Investigate what is happening during those ~45-second periods of "nothing".
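
For context, the "401: The event in requested index is outdated and cleared" lines above come from etcd's v2 keys API, which only keeps a sliding window of the most recent changes (roughly 1000 events). Below is a minimal sketch of how the same error can be triggered directly against etcd; it assumes a local etcd v2 listening on the legacy 127.0.0.1:4001 client port, and the /registry/pods prefix and waitIndex value are only illustrative, not taken from the actual components.

// Sketch: open a watch from an index that has already fallen out of etcd's
// event-history window. etcd answers with error code 401 ("The event in
// requested index is outdated and cleared"), which is exactly what the
// reflector logs above show.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	// waitIndex=1 is almost certainly older than the history window on a
	// cluster that has already seen thousands of writes.
	url := "http://127.0.0.1:4001/v2/keys/registry/pods?wait=true&recursive=true&waitIndex=1"
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	// Expected body (roughly): {"errorCode":401,"message":"The event in
	// requested index is outdated and cleared", ...}
	fmt.Println(string(body))
}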

cc @fgrzadkowski

wojtek-t added the sig/scalability and misc/perf-1.0 labels on Mar 27, 2015
wojtek-t self-assigned this on Mar 27, 2015
wojtek-t added this to the v1.0 milestone on Mar 27, 2015
@wojtek-t
Member Author

My feeling is that it might be related to some watch issues.

BTW - since I won't be able to work on it today, feel free to pick it up. If no one picks it up, I can start looking into it on Monday.

wojtek-t removed their assignment on Mar 27, 2015
@satnam6502
Contributor

/sub

@bprashanth
Contributor

However, looking into the scheduler logs, the first time it attempts to schedule any of those pods corresponds to this log line:

I can only reproduce tail latency (#3954). @wojtek-t are your performance tests checked in so I can play with them (preferably with cluster setup instructions)?

Edit: Oh, did you just run @satnam6502 's soak test?

@fgrzadkowski
Contributor

I've seen a similar problem, but in my case the scheduler never noticed new pods. Investigating...

AFAIU @wojtek-t has a similar setup to mine:

  • e2e cluster with 50 n1-standard-1 nodes and an n1-standard-4 master.
  • running density test with 30 pods per node.

@wojtek-t
Member Author

Yes - as Filip wrote, I'm using the Density e2e test to start 30 pods per node (see also #6051).

Also note that this is not happening all the time - it's just a kind of "flakiness".

@yujuhong
Contributor

To isolate the Kubelet performance further, I ran a simple test which resized the RC to create pods with PodSpec.Host set to a specific node. This bypassed the scheduler completely.

In short, it took ~30 seconds for 30 pods to be reported running on one node. This includes:

  • RC creates 30 pods
  • Kubelet receives pod updates from watch
  • Kubelet starts containers (I used kubernetes/pause images, so there should be no pulling time).
  • Kubelet reports the status back to the apiserver

The cluster size is irrelevant in this case, since only one node is being tested.

@wojtek-t
Member Author

I was running a very similar experiment (i.e. on a 1-node cluster) some time ago and also didn't observe it at all. Thus my feeling is that it might be related neither to the scheduler nor the Kubelet, but rather to etcd and/or the apiserver, which cannot handle the bigger load.

davidopp added the team/master and priority/important-soon labels on Mar 30, 2015
@wojtek-t
Member Author

In fact I think this is related to watch [I think that for some reason some changes reach the component through watch with a significant delay].

@fgrzadkowski
Contributor

@yujuhong I don't think cluster size is irrelevant in this experiment, because all the data is read from apiserver/etcd, whose performance is strictly related to the size of the cluster. As @wojtek-t said, we suspect that the problem is somewhere close to watch itself, but we need to dig deeper to understand it better.

@wojtek-t
Member Author

I've added some logging and I can confirm that this problem is not related to the scheduler or Kubelet themselves - the issue is (as I thought) that some changes reach those components through watch with a significant delay (e.g. 1 minute). I will continue investigating whether it's related to watch itself or maybe to high load on the apiserver or etcd.

@yujuhong
Contributor

@fgrzadkowski, I meant that for the Kubelet startup time alone, the cluster size was irrelevant (which was the case in my test). As for when the Kubelet receives the update for the first time, that's a different story :)
My test was simply meant to provide another data point, to help rule out the possibility that the Kubelet's performance is the bottleneck.

@dchen1107
Member

@zmerlynn reported an issue through #6139. After debugging, I think we ran into this issue a lot:

#6139 (comment)

I observed that both the scheduler and the Kubelet don't pick up the work immediately. The only pattern common to both components is the use of a watcher.

@lavalamp
Member

1500 writes to etcd is a lot. It's enough to expire the 1000-change history. This could mean that if the scheduler happens to be doing a re-list when the writes start, it won't be able to watch when it's done with the list, so it will have to do another list.

We should measure how long a list takes for the scheduler.

How sure are you that the clocks on your local system and the scheduler are relatively well synchronized?
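
To make that failure mode concrete, here is a minimal, self-contained sketch of the list-then-watch cycle being described; the list/watch functions are hypothetical stand-ins, not the real reflector or etcd client API:

// Sketch of the list-then-watch cycle: list a full snapshot, then try to
// watch from the version the snapshot was taken at. If that version has
// already fallen out of etcd's ~1000-event history (because the list itself
// took long enough for >1000 writes to land), the watch is rejected and the
// component has to pay for another full list.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errIndexCleared = errors.New("401: the requested history has been cleared")

func syncLoop(list func() (uint64, error), watch func(fromVersion uint64) error) {
	for {
		start := time.Now()
		version, err := list()
		if err != nil {
			continue
		}
		fmt.Printf("list finished at version %d after %v\n", version, time.Since(start))

		err = watch(version)
		if errors.Is(err, errIndexCleared) {
			fmt.Println("watch window already expired during the list; re-listing")
			continue
		}
		if err == nil {
			fmt.Println("watch established")
			return
		}
	}
}

func main() {
	// Toy stand-ins: the watch is rejected twice (as if >1000 writes happened
	// during each list) before finally being accepted.
	failures := 2
	syncLoop(
		func() (uint64, error) { return 4053, nil },
		func(uint64) error {
			if failures > 0 {
				failures--
				return errIndexCleared
			}
			return nil
		},
	)
}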

@wojtek-t
Member Author

@dchen1107 : thanks for looking into it too - what you observed in #6139 (comment) is exactly what I also observed for bigger clusters. I didn't observe it in e2e tests, but I also didn't try to find it in e2e tests. But my feeling is that it's much easier to replicate it in bigger clusters.

@lavalamp: yes, I agree that this generates high load on etcd - I will try to look deeper at how long "listing pods" takes - thanks for the suggestion!
Also - I'm pretty sure it's not related to time skew (I also had this hypothesis at some point but verified that it's not the case).

wojtek-t self-assigned this on Mar 31, 2015
@wojtek-t
Member Author

I tried to debug it deeper and have a few more findings. I've added a bunch of different logs and here is what is happening:

  1. The scheduler (or Kubelet) "opens" a watch for pods.
  2. For some time (say 3 minutes) no pods are changing, so there are no pod writes in etcd.
  3. In the meantime, a bunch of other writes to etcd are happening (e.g. we're updating NodeStatus once per ~2 seconds per node). In a 50-node cluster that is ~25 writes per second, so it's clear that more than 1000 writes can happen during that time.
  4. Then new pods are created (by the replication controller). However, they are not sent through this watch to the scheduler [this is something that I still don't understand].
  5. 5 minutes after the watch was opened, it expires. So we reopen it (still with the previous resourceVersion).
  6. Etcd reports an error: "The event in requested index is outdated and cleared","cause":"the requested history has been cleared"
  7. After that we "start from the beginning": we list all pods, update the internal storage and open the watch again. From that point everything starts to work fine (see the sketch below).

So the thing that we need to understand is point 4, i.e. why the changes are not delivered through a watch that was opened some time ago when more than 1000 other changes happened in the meantime (this appears to be the trigger). (I'm not 100% sure that those 1000 other changes are not a red herring, but this seems to be the most probable cause.)
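
To make step 7 concrete, the recovery at the etcd v2 HTTP level is roughly: do a plain GET (the "list"), read the current index from the X-Etcd-Index response header, and reopen the watch just past it. A minimal sketch follows; it assumes a local etcd v2 on the legacy 127.0.0.1:4001 client port and uses the /registry/pods prefix, but it is an illustration, not the actual apiserver/reflector code:

// Sketch of step 7: after the old waitIndex is rejected with error 401,
// re-list to learn the current etcd index and reopen the watch from there.
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

const base = "http://127.0.0.1:4001/v2/keys/registry/pods"

func main() {
	// "List": a plain recursive GET returns the full current state and the
	// cluster-wide index it corresponds to in the X-Etcd-Index header.
	resp, err := http.Get(base + "?recursive=true")
	if err != nil {
		fmt.Println("list failed:", err)
		return
	}
	resp.Body.Close()

	index, err := strconv.ParseUint(resp.Header.Get("X-Etcd-Index"), 10, 64)
	if err != nil {
		fmt.Println("no usable X-Etcd-Index header:", err)
		return
	}

	// Reopen the watch just past the index we listed at; anything older than
	// the ~1000-event history window would be rejected again.
	watchURL := fmt.Sprintf("%s?wait=true&recursive=true&waitIndex=%d", base, index+1)
	fmt.Println("re-watching from", watchURL)
	watchResp, err := http.Get(watchURL) // blocks until the next change under the prefix
	if err != nil {
		fmt.Println("watch failed:", err)
		return
	}
	watchResp.Body.Close()
	fmt.Println("received first event after re-list")
}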

@dchen1107
Member

cc/ @xiang90 coreos etcd expert.

Xiang, is this the same / similar issue as etcd-io/etcd/issues/2048? We are running etcd 2.X, and go-etcd client is 0.4.6.

@smarterclayton
Contributor

This is the same issue.

@derekwaynecarr
Member

I don't think item 4 in his sequence is the same issue...

@xiang90
Contributor

xiang90 commented Mar 31, 2015

5 minutes after the watch was opened, it expires. So we reopen it (still with the previous resourceVersion).

This has been fixed in etcd2 I think. @wojtek-t

@dchen1107
Member

@smarterclayton and @derekwaynecarr I think it is the same issue - partially fixed, but not fully, in etcd 2.X - and item 4 is the new symptom.

@wojtek-t
Member Author

wojtek-t commented Apr 3, 2015

@xiang90 Thanks a lot for fixing it. Regarding canceling - yes, we are aware of it. There is already code that handles it.
One more thing - IIUC, in order for this change to be effective, we need a new release of etcd, right? So when can we expect the new release?

@lavalamp
Member

lavalamp commented Apr 3, 2015

Yeah getting dropped due to high load should be handled correctly already.

@xiang90
Contributor

xiang90 commented Apr 3, 2015

@wojtek-t Yes. I will have a release early next week. You can try with the current master though.

@davidopp
Member

davidopp commented Apr 6, 2015

@wojtek-t Can you take responsibility for testing the new version of etcd (once it is released) to verify that it fixes this problem? You should also roll back @lavalamp's workaround (#6381).

@wojtek-t
Member Author

wojtek-t commented Apr 7, 2015

@davidopp Sure - will take care of it.

@wojtek-t
Member Author

wojtek-t commented Apr 7, 2015

@xiang90 Can you please ping this thread (or me) when the new etcd release is built? Thanks!

@dchen1107
Member

@xiang90 since the fix is very simple, can we have a patch release here?

@xiang90
Contributor

xiang90 commented Apr 7, 2015

@wojtek-t @dchen1107 We are doing it. We will release it later today or early tomorrow.

@dchen1107
Member

Please let us know when it is ready. We can rebuild our Docker image with the release and start the test again.

@yichengq
Contributor

yichengq commented Apr 7, 2015

@dchen1107 We just cut the etcd v2.0.9: https://github.com/coreos/etcd/tree/v2.0.9
It contains the bug fix for the watcher issue.

@dchen1107
Member

Great. I will build our Docker image for this; @wojtek-t, you can pick it up tonight (Pacific time) for your performance validation. Thanks @yichengq!

@dchen1107
Member

@wojtek-t You can upgrade your cluster to 2.0.9 and validate the fix now.

brendandburns pushed a commit to brendandburns/kubernetes that referenced this issue Apr 8, 2015
(cherry picked from commit 6be09c5)
@wojtek-t
Member Author

wojtek-t commented Apr 8, 2015

@dchen1107 Will do it today. Thanks for creating the image.

wojtek-t added a commit to wojtek-t/kubernetes that referenced this issue Apr 8, 2015
@wojtek-t
Member Author

wojtek-t commented Apr 8, 2015

I wasn't able to reproduce this problem with the new version of etcd (it was very easy to reproduce with the previous version).
So I'm going to revert the temporary workaround done by @lavalamp and close this issue.

fgrzadkowski added a commit that referenced this issue Apr 8, 2015
Revert "Stop the bleeding for #6059"
@davidopp
Member

davidopp commented Apr 8, 2015

Yay!
