fluentd-gcp memory consumption has increased #77492

Closed
mm4tt opened this issue May 6, 2019 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

mm4tt (Contributor) commented May 6, 2019

This is causing flakiness in pull-kubernetes-e2e-gce-100-performance; see #73884 (comment).

It's also visible in the perf-dash graph:
[perf-dash graph: fluentd-gcp memory usage]

We'll increase the memory constraint in kubernetes/perf-tests#524 in order to deflake the presubmits, but we should figure out what caused the increase in memory usage.

mm4tt added the kind/bug label May 6, 2019
k8s-ci-robot added the needs-sig label May 6, 2019
mm4tt (Contributor, Author) commented May 6, 2019

/sig scalability
/assign

k8s-ci-robot added the sig/scalability label and removed the needs-sig label May 6, 2019
mm4tt (Contributor, Author) commented May 6, 2019

We can rule out a fluentd-gcp upgrade; the version we use (v3.2.0) hasn't changed since 1.13.

mm4tt (Contributor, Author) commented May 6, 2019

/assign @x13n

@x13n, could you take a look and help us debug?

x13n (Member) commented May 6, 2019

Since the version didn't change, my guess is that some component(s) started generating more logs. Did CPU usage increase as well?
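
One rough way to check that guess on a node is to sum log volume per pod under the kubelet's container-log directory. A minimal sketch, assuming the standard /var/log/containers layout and the <pod>_<namespace>_<container>-<id>.log naming convention (both assumptions about a typical GCE node, not something verified in this thread):

```python
#!/usr/bin/env python3
"""Rough per-pod log volume on a node (illustrative sketch)."""
import collections
import glob
import os

LOG_DIR = "/var/log/containers"  # assumed default kubelet location for container log symlinks

sizes = collections.Counter()
for path in glob.glob(os.path.join(LOG_DIR, "*.log")):
    pod = os.path.basename(path).split("_")[0]  # pod name is the first underscore-separated field
    try:
        sizes[pod] += os.path.getsize(path)  # follows the symlink to the actual log file
    except OSError:
        pass  # file may have been rotated away in the meantime

# Print the 20 noisiest pods by current on-disk log size.
for pod, size in sizes.most_common(20):
    print(f"{size / (1 << 20):8.1f} MiB  {pod}")
```

Comparing this output from runs before and after the regression window would show whether any component's log volume jumped.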

mm4tt (Contributor, Author) commented May 6, 2019

I believe it slightly increased, but it's not as visible as the memory increase; see http://perf-dash.k8s.io/#/?jobname=gce-100Nodes&metriccategoryname=E2E&metricname=DensityResources&PodName=fluentd-gcp%2Ffluentd-gcp&Resource=CPU

x13n (Member) commented May 6, 2019

Interesting. Can we rule out a backend latency increase somehow? If the mean latency increased, buffer utilization would follow. Unfortunately, fluentd doesn't expose request latency metrics today, only successes/failures: https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/blob/master/lib/fluent/plugin/out_google_cloud.rb

/cc @qingling128 @bmoyles0117
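
For intuition on why buffer usage tracks backend latency: at steady state, the bytes sitting in buffers are roughly ingest rate times mean flush latency (Little's law). A toy calculation with made-up numbers, just to show the scaling:

```python
# Back-of-the-envelope: steady-state buffered bytes ~= ingest rate * mean flush latency.
# Both numbers below are hypothetical, not measurements from this issue.
ingest_rate = 200 * 1024  # bytes/s of log traffic per node (assumed)

for latency_s in (0.5, 2.0):  # hypothetical mean flush latencies to the backend
    buffered_kib = ingest_rate * latency_s / 1024
    print(f"flush latency {latency_s:.1f}s -> ~{buffered_kib:.0f} KiB held in buffers")
```

So a 4x latency increase alone would roughly quadruple buffer residency, without any change in log volume.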

tedyu (Contributor) commented May 6, 2019

Is it possible to perform some sort of bisection to narrow the scope of the search?

Thanks

qingling128 (Contributor) commented

Not sure if it's related, but we observed a memory consumption difference between k8s versions < 1.10 and versions >= 1.10: fluent/fluentd#2236 (comment). Originally we thought it was triggered by the log rotation change in k8s, but that is only a hypothesis that hasn't been proven.

mm4tt (Contributor, Author) commented May 7, 2019

Bisection would be possible, but it may be hard due to the high variance of the memory consumption (see the first graph). Also, we don't really have the manpower to do it; we're currently facing a few other regressions that take priority over this one.

Any other ideas on how to debug this? Does fluentd-gcp expose any Prometheus metrics that might be useful here? Can we compare fluentd-gcp logs from a good run and a bad run to see if anything stands out?

x13n (Member) commented May 8, 2019

I'm not sure eyeballing fluentd logs would expose memory problems.

The fluentd_status_buffer_queue_length and fluentd_status_buffer_total_bytes metrics might be worth looking at, to confirm that the memory increase is indeed coming from the buffers.
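
A quick way to read those two metrics off a running pod; this sketch assumes the addon's fluentd exposes fluent-plugin-prometheus metrics on the plugin's default port 24231 and that the port has been forwarded locally first (e.g. kubectl -n kube-system port-forward <fluentd-gcp-pod> 24231:24231):

```python
#!/usr/bin/env python3
"""Print fluentd buffer metrics from its Prometheus endpoint (illustrative sketch)."""
import urllib.request

METRICS_URL = "http://localhost:24231/metrics"  # assumes a local port-forward to the pod
WANTED = ("fluentd_status_buffer_queue_length", "fluentd_status_buffer_total_bytes")

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    for line in resp.read().decode("utf-8").splitlines():
        if line.startswith(WANTED):  # keep only the buffer-related samples
            print(line)
```

Sampling these during a test run and correlating them with the memory graph would confirm (or rule out) buffers as the source of the growth.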

mm4tt (Contributor, Author) commented May 15, 2019

I was wrong in saying that the fluentd-gcp version hasn't changed.
It was bumped from 1.6.0 to 1.6.8 in #77224.
The timing correlates with the increase we see in perf-dash, so that is most likely the culprit.

mm4tt (Contributor, Author) commented Jul 16, 2019

Sorry, this got deprioritized. The test still flakes because of this. I think at this point the only reasonable action is to assume that the increase was caused by the fluentd version upgrade, and to raise the constraints to add some headroom and deflake the test.

Will send a PR shortly.

mm4tt added a commit to mm4tt/perf-tests that referenced this issue Jul 16, 2019
This should deflake the test; the increase was most likely caused by the fluentd version upgrade.

Ref. kubernetes/kubernetes#77492
mm4tt (Contributor, Author) commented Jul 16, 2019

Ref. #80212

mm4tt closed this as completed Jul 31, 2019