Recently I turned on buffer_strategy = "disk" for an application writing line protocol into telegraf's inputs.influxdb_v2_listener, with telegraf writing into outputs.influxdb_v2.
Telegraf version: 1.33.0 (and 1.32.x before it)
amd64 Ubuntu (20.04 and 22.04)
Background
At the time, I was using 1.32 and the setup was:
app.1 ----\                        /-----> InfluxDB Dedicated
           \                      /
app.2 -----> 8186:telegraf ------+-------> local network OSS 2.x
           /                      \
app.3 ----/                        \-----> local network OSS 1.x
app.4 ---/
This was all working fine with memory. After changing to disk, things worked for a while, but then I started getting client errors of this form:
Nov 13 17:18:26 host1 gunicorn[406]: ERROR:app.1: 2024-11-13 17:18:26,082 "Error while writing after 5 retries, giving up"
Nov 13 17:18:26 host1 gunicorn[406]: ERROR:app.1: 2024-11-13 17:18:26,082 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999086358991917)"
Nov 13 17:19:59 host1 gunicorn[434]: ERROR:app.4: 2024-11-13 17:19:59,665 "Error while writing after 5 retries, giving up"
Nov 13 17:19:59 host1 gunicorn[434]: ERROR:app.4: 2024-11-13 17:19:59,665 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.997719279956073)"
Nov 13 17:20:56 host1 gunicorn[410]: ERROR:app.3: 2024-11-13 17:20:56,406 "Error while writing after 5 retries, giving up"
Nov 13 17:20:56 host1 gunicorn[410]: ERROR:app.3: 2024-11-13 17:20:56,406 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999581227020826)"
Nov 13 17:21:22 host1 gunicorn[408]: ERROR:app.2: 2024-11-13 17:21:22,407 "Error while writing after 5 retries, giving up"
Nov 13 17:21:22 host1 gunicorn[408]: ERROR:app.2: 2024-11-13 17:21:22,408 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999311759951524)"
There are no corresponding telegraf logs. The timeout is curious, as it is always just under the flush interval, which is set to flush_interval = "10s" (but see below).
Once I set it back to memory, the client errors went away.
As an additional data point, the client was seeing some intermittent network issues which might cause disconnects or slowdowns. My goal was to deploy telegraf as a proxy between my clients and InfluxDB with buffer_strategy = "disk" as a safeguard, so as not to lose points in the event of network errors and the like.
Reproducer
In an effort to create a simple(r) reproducer, I created a Python script to emulate a 'slow server':
#!/usr/bin/env python3
import getopt
from http.server import BaseHTTPRequestHandler, HTTPServer
import random
import sys
import time


class SlowServerHandler(BaseHTTPRequestHandler):
    # Responds to any GET/POST with an empty JSON body after a configurable delay.
    delay: int = 10
    flaky: int = 0

    def slowdown(self, path: str) -> None:
        if path.startswith("/slow"):
            # /slowN: delay N seconds for this request only
            t = int(path[5:])
            if t == 0:
                return
            if t > 0 and t < 600:  # 1 sec to 10 min
                print("%s: delaying %d sec(s)..." % (path, t))
                time.sleep(t)
                return
            raise ValueError("should be integer >= 0 and < 600")
        else:
            d = self.delay
            if self.flaky:
                d = random.randint(0, self.flaky)
            print("%s: delaying %d sec(s)..." % (path, d))
            time.sleep(d)
            return

    def fake_request(self):
        code = 200
        try:
            self.slowdown(self.path)
        except Exception:
            code = 500
        self.send_response(code)
        if code == 200:
            self.send_header("Content-type", "application/json")
        self.end_headers()
        if code == 200:
            try:
                self.wfile.write(b"{}\n")
            except Exception:
                print("%s: broken pipe (client disconnect?)" % self.path)

    def do_GET(self):
        self.fake_request()

    def do_POST(self):
        self.fake_request()


def run(delay: int, port: int, flaky: int):
    # XXX: validate delay and port
    addr = ('', port)
    httpd = HTTPServer(addr, SlowServerHandler)
    if flaky > 0:
        SlowServerHandler.flaky = flaky
    else:
        SlowServerHandler.delay = delay
    httpd.serve_forever()


if __name__ == '__main__':
    try:
        opts, rest = getopt.getopt(sys.argv[1:], "hf:d:p:", [])
    except Exception:
        print("Bad args. See -h")
        sys.exit(1)
    port: int = 8000
    delay: int = 10
    flaky: int = 0
    for opt, arg in opts:
        if opt == "-h":
            print("""Usage: slow-server.py ARGS
-h this message
-d N set response delay in seconds (default=10)
-p N listen on port N
-f N simulate flaky server (0-N delay on every request)
GET and POST are supported on all paths and simply return an empty JSON
document after N delay seconds (as configured with '-d').
If the URL path starts with /slowN, then delay N seconds.
""")
            sys.exit(0)
        elif opt == "-d":
            delay = int(arg)
        elif opt == "-p":
            port = int(arg)
        elif opt == "-f":
            flaky = int(arg)
    try:
        run(delay, port, flaky)
    except KeyboardInterrupt:
        pass
Eg:
$ ./slow-server.py -h
Usage: slow-server.py ARGS
-h this message
-d N set response delay in seconds (default=10)
-p N listen on port N
-f N simulate flaky server (0-N delay on every request)
GET and POST are supported on all paths and simply return an empty JSON
document after N delay seconds (as configured with '-d').
If the URL path starts with /slowN, then delay N seconds.
I then set up telegraf to listen on port 8086 and write out to the local slow server, and used curl in a loop to write line protocol into telegraf. Specifically:
In one terminal, start the server without any added delays (ie, it isn't 'slow' yet): $ python3 ./slow-server.py -d 0
Create a telegraf configuration to proxy and use disk buffer
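For reference, a minimal sketch of the proxy configuration, reconstructed from the details in this report (listener on 127.0.0.1:8086, output pointed at the slow server on port 8000, metric_batch_size = 1000, flush_interval = "10s", buffer directory /tmp/tg); the token/organization/bucket values are placeholders since the fake server ignores them:
[agent]
  interval = "10s"
  flush_interval = "10s"
  metric_batch_size = 1000
  buffer_strategy = "disk"
  buffer_directory = "/tmp/tg"

[[inputs.influxdb_v2_listener]]
  service_address = "127.0.0.1:8086"

[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8000"]
  token = "unused"
  organization = "example"
  bucket = "example"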
create the disk buffer directory: mkdir /tmp/tg
start telegraf with telegraf --config ./telegraf.conf
$ ~/code/telegraf --config ./telegraf.conf
2024-12-16T21:00:58Z I! Loading config: ./telegraf.conf
2024-12-16T21:00:58Z W! Using disk buffer strategy for plugin outputs.influxdb_v2, this is an experimental feature
2024-12-16T21:00:58Z I! WAL file not found for plugin outputs.influxdb_v2 (14c21c01a6d4a9f123336ae9b187b07a76ef66f7fc104dce4060725e0035d58a), this can safely be ignored if you added this plugin instance for the first time
2024-12-16T21:00:58Z I! Starting Telegraf 1.33.0 brought to you by InfluxData the makers of InfluxDB
2024-12-16T21:00:58Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2024-12-16T21:00:58Z I! Loaded inputs: influxdb_v2_listener
2024-12-16T21:00:58Z I! Loaded aggregators:
2024-12-16T21:00:58Z I! Loaded processors:
2024-12-16T21:00:58Z I! Loaded secretstores:
2024-12-16T21:00:58Z I! Loaded outputs: influxdb_v2
2024-12-16T21:00:58Z I! Tags enabled: host=sec
2024-12-16T21:00:58Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"sec", Flush Interval:10s
2024-12-16T21:00:58Z I! [inputs.influxdb_v2_listener] Started HTTP listener service on 127.0.0.1:8086
...at this point I hit Ctrl-c; notice no errors above from telegraf...
2024-12-16T21:03:26Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-12-16T21:03:27Z I! [agent] Stopping running outputs
download a line protocol generator: go install github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen@latest
then send line protocol into telegraf in a loop. After a short while, telegraf stops responding and curl reports curl: (52) Empty reply from server instead of HTTP/1.1 204 No Content. Eg:
$ while /bin/true ; do echo "# $(date)" ; ~/go/bin/bulk_data_gen | head -9999 > /tmp/lp && curl -i -X POST http://127.0.0.1:8086/api/v2/write --data-binary @/tmp/lp && sleep 10 ; done
# Mon Dec 16 15:01:22 CST 2024
using random seed 232901789
2024/12/16 15:01:22 Using sampling interval 10s
2024/12/16 15:01:22 Using cardinality of 9
HTTP/1.1 100 Continue
HTTP/1.1 204 No Content
Date: Mon, 16 Dec 2024 21:01:30 GMT
# Mon Dec 16 15:01:40 CST 2024
using random seed 644661411
2024/12/16 15:01:40 Using sampling interval 10s
2024/12/16 15:01:40 Using cardinality of 9
HTTP/1.1 100 Continue
HTTP/1.1 204 No Content
Date: Mon, 16 Dec 2024 21:01:49 GMT
# Mon Dec 16 15:01:59 CST 2024
using random seed 179797224
2024/12/16 15:01:59 Using sampling interval 10s
2024/12/16 15:01:59 Using cardinality of 9
HTTP/1.1 100 Continue
HTTP/1.1 204 No Content
Date: Mon, 16 Dec 2024 21:02:08 GMT
# Mon Dec 16 15:02:18 CST 2024
using random seed 729922716
2024/12/16 15:02:18 Using sampling interval 10s
2024/12/16 15:02:18 Using cardinality of 9
HTTP/1.1 100 Continue
curl: (52) Empty reply from server
# Mon Dec 16 15:02:29 CST 2024
using random seed 79912443
2024/12/16 15:02:29 Using sampling interval 10s
2024/12/16 15:02:29 Using cardinality of 9
HTTP/1.1 100 Continue
curl: (52) Empty reply from server
# Mon Dec 16 15:02:40 CST 2024
using random seed 602442651
2024/12/16 15:02:40 Using sampling interval 10s
2024/12/16 15:02:40 Using cardinality of 9
HTTP/1.1 100 Continue
curl: (52) Empty reply from server
With this setup, I can see files getting populated in /tmp/tg. The client curl starts out outputting HTTP/1.1 204 No Content, but after about a minute I start seeing curl: (52) Empty reply from server. I believe this successfully demonstrates the problem. Interestingly, the Python server shows telegraf writing throughout, 1000 points at a time due to metric_batch_size = 1000 (see the terminal output where the server is running).
NOTE 1: once telegraf starts sending back empty responses, it does not recover even if I pause the client (i.e., whenever the client resumes, telegraf only sends back empty responses). This is true whether telegraf is restarted or not. Only if I empty the buffer_directory will telegraf start sending back 204s (but it eventually only sends empty responses again).
NOTE 2: in the above I slept 10 seconds after the curl. If I use other values (e.g., 11 seconds, 20 seconds, 30 seconds, all after starting with an empty buffer_directory), I see the same thing. With a 10 second sleep, it took ~1 minute before seeing the empty responses. With a 20 second sleep it took ~2 minutes before seeing empty responses. Since I'm sending 9999 lines at a time with curl (and telegraf is breaking that up into chunks of 1000), the number of metrics might be a factor (e.g., ~60,000 metrics before it starts to show the problem).
NOTE 3: with the above, I haven't introduced 'slowness' with either -f or -d, but I see similar issues when doing so (though I might then see did not complete within its flush interval from telegraf, depending on how slow I make the server). I provided the full script in case it is useful for further debugging and for developing a fix for this issue.