Nodes seem to be updating its status way too frequently #5864
Taking a look, thanks for reporting @fabioy! |
@fabioy these look like node statuses not pod statuses? |
You're probably right. I couldn't readily tell from the logs. |
Load-balancing to @dchen1107 :) assigning to you |
Found the bug in the code, will send out a fix shortly. |
It turns out this is not a bug, and it works as intended. Under the current design, we treat nodeStatus as a heartbeat message: when the kubelet first starts up, it posts nodeStatus every 500 milliseconds for the first 2 seconds for a faster cluster startup. After that, the kubelet posts nodeStatus every 2 seconds. The NodeController processes those heartbeat messages, and if it misses 4 consecutive heartbeats, it marks the node unreachable. The above intervals seem reasonable to me, but we could tune them. I am going to close the issue. cc/ @ddysher @bgrant0607 |
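To make the timing concrete, here is a minimal, self-contained sketch of the heartbeat scheme described above. The constants and function names are taken from this description, not from the actual kubelet code:

```go
package main

import (
	"fmt"
	"time"
)

// Interval values taken from the comment above; the real kubelet values may differ.
const (
	fastStatusInterval    = 500 * time.Millisecond // faster updates during startup
	fastStatusWindow      = 2 * time.Second        // how long the fast updates last
	regularStatusInterval = 2 * time.Second        // steady-state heartbeat interval
	missedHeartbeatLimit  = 4                      // NodeController marks the node unreachable after this many misses
)

// postNodeStatus is a stand-in for the PUT the kubelet issues to the apiserver.
func postNodeStatus() {
	fmt.Println("PUT node status at", time.Now().Format(time.StampMilli))
}

// runHeartbeatLoop posts status on the fast interval during the startup
// window, then falls back to the regular interval.
func runHeartbeatLoop(stop <-chan struct{}) {
	start := time.Now()
	for {
		interval := regularStatusInterval
		if time.Since(start) < fastStatusWindow {
			interval = fastStatusInterval
		}
		postNodeStatus()
		select {
		case <-stop:
			return
		case <-time.After(interval):
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runHeartbeatLoop(stop)
	// With these numbers a node would be considered unreachable roughly
	// missedHeartbeatLimit * regularStatusInterval = 8s after its last update.
	time.Sleep(5 * time.Second)
	close(stop)
}
```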
As an aside, TCP keepalives are a very cheap way to do these sorts of remote heartbeats (all in-kernel). Then you can potentially do low latency edge-triggered rather than level triggered event reporting. |
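For reference, here is a minimal sketch of what in-kernel TCP keepalives look like from Go. The address and keepalive period are placeholder values, not anything Kubernetes actually configures:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Placeholder address; not a real master endpoint.
	conn, err := net.Dial("tcp", "master.example.com:8080")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	tcpConn := conn.(*net.TCPConn)
	// Once enabled, the kernel sends keepalive probes on the idle connection,
	// so a dead peer is detected without any application-level traffic.
	if err := tcpConn.SetKeepAlive(true); err != nil {
		log.Fatal(err)
	}
	if err := tcpConn.SetKeepAlivePeriod(30 * time.Second); err != nil {
		log.Fatal(err)
	}
	// A broken peer then surfaces as a read/write error on the connection.
}
```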
@quinton-hoole I raised the same point when first introducing nodeStatus as a heartbeat. If I remember correctly, the answer was that we can evolve it later. @ddysher Correct me if I'm wrong here. |
Based on the log timing, it's just for faster startup, like @dchen1107 said. As to performance, yes, we wanted to evolve it later (which is 'now'). |
Is there another issue tracking the reduction of these PUTs? |
Just as a heads up, I started up a completely idle 0.14.2 500-node cluster with an n1-standard-8 instance as the master, and the master is 100% loaded to the point that 50% of requests return 429 due to node status GETs and PUTs. I know the target 1.0 size is 100 nodes, but it would be great to optimize this at least a bit. |
In my opinion, the first thing that we should do is to increase the "heartbeat duration". |
The issue was filed because we believed there was a bug, but it turns out to be the configured interval parameters at startup time. Tuning the NodeStatus interval and other NodeStatus-related performance issues are covered by #5953 and several other issues. |
May be a result of PR #5714. On GCE, the master kube-apiserver.log is flooded with entries like:
I0324 18:53:46.443512 9483 handlers.go:109] GET /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (2.978675593s) 200 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:46823]
I0324 18:53:46.444040 9483 handlers.go:109] GET /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (2.979451401s) 200 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:46800]
I0324 18:53:46.450698 9483 handlers.go:109] PUT /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (3.778122997s) 409 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:41179]
I0324 18:53:46.451696 9483 handlers.go:109] PUT /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (3.772748134s) 409 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:46706]
The log is filling at a rate of hundreds of these types of requests per second (this is during an e2e test on GCE).
At the very least, there should be a rate limiter on how often pods update/fetch their status.
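As an illustration of that suggestion, here is a minimal sketch of client-side rate limiting using golang.org/x/time/rate; the limiter settings and the sendNodeStatus helper are hypothetical, not the kubelet's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// sendNodeStatus is a hypothetical stand-in for the status PUT.
func sendNodeStatus() {
	fmt.Println("PUT node status at", time.Now().Format(time.StampMilli))
}

func main() {
	// Allow at most one status update every 2 seconds, with no bursting.
	limiter := rate.NewLimiter(rate.Every(2*time.Second), 1)
	ctx := context.Background()

	for i := 0; i < 5; i++ {
		// Wait blocks until the limiter permits the next update, so even if
		// callers try to post more often, the apiserver only sees one
		// request per interval.
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		sendNodeStatus()
	}
}
```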