fluentd-gcp memory consumption has increased #77492
/sig scalability
We can rule out a fluentd-gcp upgrade: the version we use (v3.2.0) hasn't changed since 1.13.
Since the version didn't change, my guess is that some component(s) started generating more logs. Did CPU usage increase as well?
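One hedged way to test the "more logs" hypothesis (everything below is illustrative and not taken from the runs above): sum the container log bytes per pod on a node and compare a good run against a bad run. The `/var/log/pods` layout is an assumption and varies between Kubernetes versions.

```python
#!/usr/bin/env python3
# Rough sketch: sum container log bytes per pod on a node. The /var/log/pods
# layout (directories named <namespace>_<pod-name>_<uid>) is an assumption and
# may differ between Kubernetes versions; adjust LOG_ROOT accordingly.
import os
from collections import defaultdict

LOG_ROOT = "/var/log/pods"

def log_bytes_by_pod(root=LOG_ROOT):
    totals = defaultdict(int)
    for dirpath, _, filenames in os.walk(root):
        top = os.path.relpath(dirpath, root).split(os.sep)[0]
        key = "_".join(top.split("_")[:2])  # keep namespace_podname, drop the uid
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                totals[key] += os.path.getsize(path)
    return totals

if __name__ == "__main__":
    for pod, size in sorted(log_bytes_by_pod().items(), key=lambda kv: -kv[1])[:20]:
        print(f"{size / 1e6:10.1f} MB  {pod}")
```

Whichever pods grew the most between a good and a bad run would point at the component generating the extra log volume.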
I believe it slightly increased, but it's not as visible as the memory increase, see http://perf-dash.k8s.io/#/?jobname=gce-100Nodes&metriccategoryname=E2E&metricname=DensityResources&PodName=fluentd-gcp%2Ffluentd-gcp&Resource=CPU
Interesting. Can we rule out a backend latency increase somehow? If the mean latency increased, buffer utilization would follow. Unfortunately, fluentd doesn't expose request latency metrics today, only successes/failures: https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/blob/master/lib/fluent/plugin/out_google_cloud.rb
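For intuition on that point (the numbers are purely illustrative, not measured in these runs): by Little's law, the bytes sitting in the buffers are roughly log throughput × mean flush latency, so at ~100 KB/s of logs per node a 5 s mean latency keeps ~500 KB in flight, and a doubled latency roughly doubles buffer memory even with unchanged log volume.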
Is it possible to perform some sort of bisection to narrow the scope of the search? Thanks
Not sure if it's related, but we observed a difference in memory consumption between k8s versions < 1.10 and versions >= 1.10: fluent/fluentd#2236 (comment). Originally we thought it was triggered by the log rotation change in k8s, but that's only a hypothesis that has not been proven.
Bisection would be possible, but may be hard due to the high variance of the memory consumption (see the first graph). Also, we don't really have the manpower to do it; we're currently facing a few other regressions that have priority over this one. Any other ideas on how to debug this? Does fluentd-gcp expose any Prometheus metrics that might be useful here? Can we compare fluentd-gcp logs from a good run and a bad run to see if anything stands out?
I'm not sure eyeballing fluentd logs would expose memory problems. The fluentd_status_buffer_queue_length and fluentd_status_buffer_total_bytes metrics might be worth looking at, to confirm that the memory increase is indeed coming from the buffers.
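A minimal sketch of how those metrics could be pulled, assuming the fluentd-gcp pods expose the fluentd prometheus plugin endpoint and that it has been port-forwarded locally (the port, path, and namespace below are assumptions; check the daemonset spec):

```python
#!/usr/bin/env python3
# Minimal sketch: scrape a fluentd Prometheus endpoint and print buffer metrics.
# Assumes the pod's metrics port has been forwarded locally first, e.g.:
#   kubectl -n kube-system port-forward <fluentd-gcp-pod> 24231:24231
# Port and path follow the fluentd prometheus plugin defaults and may differ
# in a given deployment.
import urllib.request

METRICS_URL = "http://localhost:24231/metrics"
WANTED = ("fluentd_status_buffer_queue_length", "fluentd_status_buffer_total_bytes")

def buffer_metrics(url=METRICS_URL):
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    for line in text.splitlines():
        if line.startswith(WANTED):
            yield line

if __name__ == "__main__":
    for line in buffer_metrics():
        print(line)
```

Comparing these values over time between a good run and a bad run should show whether the extra memory is actually sitting in the buffers.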
I was wrong when I said that the version of fluentd-gcp hasn't changed.
Sorry, this got deprioritized. The test still flakes because of this; I think at this point the only reasonable action is to assume that the increase was caused by the fluentd version upgrade, and to raise the constraints to add some headroom and deflake the test. Will send a PR shortly.
This should deflake the test; the increase was most likely caused by the fluentd version upgrade. Ref. kubernetes/kubernetes#77492
Ref. #80212
It's causing flakiness in the pull-kubernetes-e2e-gce-100-performance job; see #73884 (comment).
It's also visible in the perf-dash graph.
We'll increase the memory constraint in kubernetes/perf-tests#524 in order to deflake the presubmits, but we should figure out what caused the increase in memory usage.