fluentd-gcp memory consumption has increased #77492
/sig scalability
We can rule out a fluentd-gcp upgrade: the version we use (v3.2.0) hasn't changed since 1.13.
Since the version didn't change, my guess is that some component(s) started generating more logs. Did CPU usage increase as well?
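One hedged way to test the "more logs" hypothesis (everything below is illustrative and not taken from the runs above): sum the container log bytes per pod on a node and compare a good run against a bad run. The `/var/log/pods` layout is an assumption and varies between Kubernetes versions.

```python
#!/usr/bin/env python3
# Rough sketch: sum container log bytes per pod on a node. The /var/log/pods
# layout (directories named <namespace>_<pod-name>_<uid>) is an assumption and
# may differ between Kubernetes versions; adjust LOG_ROOT accordingly.
import os
from collections import defaultdict

LOG_ROOT = "/var/log/pods"

def log_bytes_by_pod(root=LOG_ROOT):
    totals = defaultdict(int)
    for dirpath, _, filenames in os.walk(root):
        top = os.path.relpath(dirpath, root).split(os.sep)[0]
        key = "_".join(top.split("_")[:2])  # keep namespace_podname, drop the uid
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                totals[key] += os.path.getsize(path)
    return totals

if __name__ == "__main__":
    for pod, size in sorted(log_bytes_by_pod().items(), key=lambda kv: -kv[1])[:20]:
        print(f"{size / 1e6:10.1f} MB  {pod}")
```

Whichever pods grew the most between a good and a bad run would point at the component generating the extra log volume.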
I believe it slightly increased, but it's not as visible as the memory increase, see http://perf-dash.k8s.io/#/?jobname=gce-100Nodes&metriccategoryname=E2E&metricname=DensityResources&PodName=fluentd-gcp%2Ffluentd-gcp&Resource=CPU
Interesting. Can we rule out a backend latency increase somehow? If the mean latency increased, buffer utilization would follow. Unfortunately, fluentd doesn't expose request latency metrics today, only successes/failures: https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/blob/master/lib/fluent/plugin/out_google_cloud.rb
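For intuition on that point (the numbers are purely illustrative, not measured in these runs): by Little's law, the bytes sitting in the buffers are roughly log throughput × mean flush latency, so at ~100 KB/s of logs per node a 5 s mean latency keeps ~500 KB in flight, and a doubled latency roughly doubles buffer memory even with unchanged log volume.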
Is it possible to perform some sort of bisection to narrow the scope of the search? Thanks
Not sure if it's related, but we observed a difference in memory consumption between k8s versions < 1.10 and versions >= 1.10: fluent/fluentd#2236 (comment). Originally we thought it was triggered by the log rotation change in k8s, but that's only a hypothesis that has not been proven.
Bisection would be possible, but may be hard due to the high variance of the memory consumption (see the first graph). Also, we don't really have the manpower to do it; we're currently facing a few other regressions that have priority over this one. Any other ideas on how to debug this? Does fluentd-gcp expose any Prometheus metrics that might be useful here? Can we compare fluentd-gcp logs from a good run and a bad run to see if anything stands out?
I'm not sure eyeballing fluentd logs would expose memory problems. The fluentd_status_buffer_queue_length and fluentd_status_buffer_total_bytes metrics might be worth looking at, to confirm that the memory increase is indeed coming from the buffers.
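A minimal sketch of how those metrics could be pulled, assuming the fluentd-gcp pods expose the fluentd prometheus plugin endpoint and that it has been port-forwarded locally (the port, path, and namespace below are assumptions; check the daemonset spec):

```python
#!/usr/bin/env python3
# Minimal sketch: scrape a fluentd Prometheus endpoint and print buffer metrics.
# Assumes the pod's metrics port has been forwarded locally first, e.g.:
#   kubectl -n kube-system port-forward <fluentd-gcp-pod> 24231:24231
# Port and path follow the fluentd prometheus plugin defaults and may differ
# in a given deployment.
import urllib.request

METRICS_URL = "http://localhost:24231/metrics"
WANTED = ("fluentd_status_buffer_queue_length", "fluentd_status_buffer_total_bytes")

def buffer_metrics(url=METRICS_URL):
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    for line in text.splitlines():
        if line.startswith(WANTED):
            yield line

if __name__ == "__main__":
    for line in buffer_metrics():
        print(line)
```

Comparing these values over time between a good run and a bad run should show whether the extra memory is actually sitting in the buffers.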
I was wrong when I said that the version of fluentd-gcp hasn't changed.
Sorry, this got deprioritized. The test still flakes because of this; I think at this point the only reasonable action is to assume that the increase was caused by the fluentd version upgrade, and to raise the constraints to add some headroom and deflake the test. Will send a PR shortly.
This should deflake the test; the increase was most likely caused by the fluentd version upgrade. Ref. kubernetes/kubernetes#77492
Ref. #80212
It's causing flakiness in the pull-kubernetes-e2e-gce-100-performance job; see #73884 (comment).
It's also visible in the perf-dash graph.
We'll increase the memory constraint in kubernetes/perf-tests#524 in order to deflake the presubmits, but we should figure out what caused the increase in memory usage.