
Nodes seem to be updating its status way too frequently #5864

Closed
fabioy opened this issue Mar 24, 2015 · 16 comments
Assignees
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@fabioy
Contributor

fabioy commented Mar 24, 2015

This may be a result of PR #5714. On GCE, the master kube-apiserver.log is flooded with entries like:

I0324 18:53:46.443512 9483 handlers.go:109] GET /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (2.978675593s) 200 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:46823]
I0324 18:53:46.444040 9483 handlers.go:109] GET /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (2.979451401s) 200 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:46800]
I0324 18:53:46.450698 9483 handlers.go:109] PUT /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (3.778122997s) 409 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:41179]
I0324 18:53:46.451696 9483 handlers.go:109] PUT /api/v1beta1/minions/e2e-test-fabioy-minion-ipwx.c.fabioy-cloud-test-1.internal: (3.772748134s) 409 [[kubelet/v0.10.0 (linux/amd64) kubernetes/9707a94] 10.240.151.207:46706]

The log is filling at a rate of hundreds of these requests per second (this is during an e2e test on GCE).

At the very least, there should be a rate limiter on how often pods update/fetch their status.
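
As a rough illustration of what such a client-side limiter could look like (a minimal sketch using golang.org/x/time/rate; updateNodeStatus is a hypothetical stand-in, not actual kubelet code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// updateNodeStatus is a hypothetical stand-in for the kubelet's status PUT.
func updateNodeStatus() {
	fmt.Println("PUT node status at", time.Now().Format(time.StampMilli))
}

func main() {
	// Allow at most one status update every 2 seconds, with no extra burst.
	limiter := rate.NewLimiter(rate.Every(2*time.Second), 1)
	ctx := context.Background()

	for i := 0; i < 5; i++ {
		// Wait blocks until the limiter permits the next update.
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		updateNodeStatus()
	}
}
```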

@fabioy
Contributor Author

fabioy commented Mar 24, 2015

@fgrzadkowski, @vmarmol

@fabioy fabioy added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/kubelet kind/bug Categorizes issue or PR as related to a bug. labels Mar 24, 2015
@vmarmol vmarmol self-assigned this Mar 24, 2015
@vmarmol
Contributor

vmarmol commented Mar 24, 2015

Taking a look, thanks for reporting @fabioy!

@vmarmol vmarmol added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 24, 2015
@vmarmol
Contributor

vmarmol commented Mar 24, 2015

@fabioy these look like node statuses not pod statuses?

@fabioy
Contributor Author

fabioy commented Mar 24, 2015

You're probably right. I couldn't readily tell from the logs.


@vmarmol
Contributor

vmarmol commented Mar 24, 2015

Load-balancing to @dchen1107 :) assigning to you

@vmarmol vmarmol assigned dchen1107 and unassigned vmarmol Mar 24, 2015
@brendandburns brendandburns added this to the v1.0 milestone Mar 24, 2015
@dchen1107 dchen1107 changed the title Pods seem to be updating its status way too frequently Nodes seem to be updating its status way too frequently Mar 24, 2015
@dchen1107
Member

Found the bug in the code, will send out a fix shortly.

@dchen1107
Member

It turns out this is not a bug; it works as intended. Under the current design, we treat nodeStatus as a heartbeat message:

When the kubelet first starts up, it posts nodeStatus every 500 milliseconds for the first 2 seconds, for faster cluster startup. After that, the kubelet posts nodeStatus every 2 seconds. The NodeController processes those heartbeat messages, and if 4 consecutive heartbeats are missed, it marks the node unreachable.

The above intervals seem reasonable to me, though we could tune them. I am going to close the issue. (A rough sketch of this heartbeat loop follows below.)

cc/ @ddysher @bgrant0607
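
For illustration, the behavior described above boils down to a loop roughly like this (a minimal sketch mirroring the stated intervals; postNodeStatus and the surrounding wiring are placeholders, not the actual kubelet code):

```go
package main

import (
	"fmt"
	"time"
)

const (
	startupInterval = 500 * time.Millisecond // fast heartbeats right after startup
	startupWindow   = 2 * time.Second        // how long the fast phase lasts
	steadyInterval  = 2 * time.Second        // normal heartbeat interval
	missedThreshold = 4                      // NodeController marks a node unreachable after this many missed heartbeats
)

// postNodeStatus is a placeholder for the kubelet's real nodeStatus PUT.
func postNodeStatus() {
	fmt.Println("nodeStatus posted at", time.Now().Format(time.StampMilli))
}

func main() {
	fmt.Println("worst-case detection latency:", missedThreshold*steadyInterval)

	start := time.Now()
	for {
		postNodeStatus()
		if time.Since(start) < startupWindow {
			time.Sleep(startupInterval) // faster cluster startup
		} else {
			time.Sleep(steadyInterval)
		}
	}
}
```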

@ghost

ghost commented Mar 25, 2015

As an aside, TCP keepalives are a very cheap way to do these sorts of remote heartbeats (all in-kernel). Then you can potentially do low-latency, edge-triggered rather than level-triggered event reporting.
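
For reference, enabling in-kernel keepalives on a Go TCP connection looks roughly like this (a minimal sketch; the address and periods are arbitrary, and this is not how the kubelet reports status today):

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Ask the dialer to enable keepalive probes; the kernel sends them,
	// so the application pays essentially nothing per heartbeat.
	dialer := net.Dialer{KeepAlive: 30 * time.Second}
	conn, err := dialer.Dial("tcp", "example.com:80")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The probe period can also be tuned on the raw TCP connection.
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		tcpConn.SetKeepAlive(true)
		tcpConn.SetKeepAlivePeriod(15 * time.Second)
	}
}
```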

@dchen1107
Member

@quinton-hoole I raised the same point when nodeStatus was first introduced as a heartbeat. If I remember correctly, the answer was that we can evolve it later. @ddysher Correct me if I'm wrong here.

@ddysher
Contributor

ddysher commented Mar 25, 2015

Based on the log timing, it's just for faster startup, like @dchen1107 said.

As to performance, yes, we wanted to evolve it later (which is 'now').

@ghodss
Contributor

ghodss commented Apr 7, 2015

Is there another issue tracking the reduction of these PUTs?

@ghodss
Contributor

ghodss commented Apr 7, 2015

Just as a heads up, I started a completely idle 0.14.2 500-node cluster with an n1-standard-8 instance as the master, and the master is pegged at 100% to the point that 50% of requests return 429 due to node status GETs and PUTs. I know the target 1.0 size is 100 nodes, but it would be great to optimize this at least a bit.

@gmarek
Contributor

gmarek commented Apr 7, 2015

@wojtek-t @fgrzadkowski

@wojtek-t
Member

wojtek-t commented Apr 7, 2015

In my opinion, the first thing we should do is increase the heartbeat interval.
IIUC, the main reason for the heartbeat is to move pods running on a broken machine to another machine when that machine becomes unreachable. But I think this case is pretty similar to restarting a pod that failed (e.g. due to a crash), and we currently examine pods only every 10 seconds, so sending a heartbeat from a node every 2 seconds doesn't buy us much (rough numbers below).
But I also agree that we should probably come up with a better mechanism than sending the whole NodeStatus from the Kubelet as a heartbeat.
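
To make that comparison concrete (a back-of-the-envelope sketch using the 2-second heartbeat and 4 missed beats described above, plus the 10-second pod sync period mentioned here):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	heartbeat := 2 * time.Second
	missedBeats := 4
	podSyncPeriod := 10 * time.Second

	// A dead node is declared unreachable after ~4 missed heartbeats,
	// yet pods are only re-examined every 10 seconds anyway.
	fmt.Println("node failure detected after:", time.Duration(missedBeats)*heartbeat) // 8s
	fmt.Println("pod sync period:", podSyncPeriod)                                    // 10s
}
```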

@dchen1107
Member

The issue was filed because we believed there was a bug, but it turned out to be the configured intervals at startup time. Tuning the NodeStatus interval and other NodeStatus-related performance issues are covered by #5953 and several other issues.

@dchen1107
Member

@ghodss Let's move the discussion to #5953.


8 participants