buffer_strategy = "disk" stops working after a while #16314

Open
jdstrand opened this issue Dec 16, 2024 · 0 comments
jdstrand commented Dec 16, 2024

Recently I turned on buffer_strategy = "disk" for an application writing line protocol into telegraf's inputs.influxdb_v2_listener and telegraf writing into outputs.influxdb_v2.

  • Telegraf version: 1.33.0 (and 1.32.x before it)
  • amd64 Ubuntu (20.04 and 22.04)

Background

At the time, I was using 1.32 and the setup was:

app.1 ----              ------> InfluxDB Dedicated
          v            /
app.2 --->8186:telegraf ------> local network OSS 2.x
          ^ ^          \
app.3 ----  |           ------> local network OSS 1.x
app.4 ------

This was all working fine with the memory strategy. After changing to disk, things worked for a while, but then I started getting client errors of this form:

Nov 13 17:18:26 host1 gunicorn[406]: ERROR:app.1: 2024-11-13 17:18:26,082 "Error while writing after 5 retries, giving up"
Nov 13 17:18:26 host1 gunicorn[406]: ERROR:app.1: 2024-11-13 17:18:26,082 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999086358991917)"
Nov 13 17:19:59 host1 gunicorn[434]: ERROR:app.4: 2024-11-13 17:19:59,665 "Error while writing after 5 retries, giving up"
Nov 13 17:19:59 host1 gunicorn[434]: ERROR:app.4: 2024-11-13 17:19:59,665 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.997719279956073)"
Nov 13 17:20:56 host1 gunicorn[410]: ERROR:app.3: 2024-11-13 17:20:56,406 "Error while writing after 5 retries, giving up"
Nov 13 17:20:56 host1 gunicorn[410]: ERROR:app.3: 2024-11-13 17:20:56,406 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999581227020826)"
Nov 13 17:21:22 host1 gunicorn[408]: ERROR:app.2: 2024-11-13 17:21:22,407 "Error while writing after 5 retries, giving up"
Nov 13 17:21:22 host1 gunicorn[408]: ERROR:app.2: 2024-11-13 17:21:22,408 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999311759951524)"

There are no corresponding telegraf logs. The timeout is curious: each read timeout is just under 10 seconds, matching the configured flush_interval = "10s" (but see below).
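The client-side symptom can be reproduced in isolation. This is a minimal sketch (stdlib only, not the application code) of how a server that stalls past the client's read timeout yields the same "Read timed out" failure shape:

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request


class StallingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.5)  # stall past the client's read timeout
        try:
            self.send_response(204)
            self.end_headers()
        except ConnectionError:
            pass  # client already gave up

    def log_message(self, *args):
        pass  # silence request logging


server = http.server.HTTPServer(("127.0.0.1", 0), StallingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
try:
    # read timeout (0.1s) is shorter than the server's stall (0.5s)
    urllib.request.urlopen(url, timeout=0.1)
    result = "ok"
except (TimeoutError, socket.timeout, urllib.error.URLError):
    result = "read timed out"
finally:
    server.shutdown()

print(result)
```

The delays here are scaled down (0.1s/0.5s instead of 10s); the error path is the same one the apps hit against telegraf.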

Once I set it back to memory, the client errors went away.

As an additional data point, the client was seeing some intermittent network issues which might cause disconnects or slowdowns. My goal was to deploy telegraf as a proxy between my clients and InfluxDB, with buffer_strategy = "disk" as a safeguard against losing points in the event of network errors and the like.

Reproducer

In an effort to create a simple(r) reproducer, I created a Python script to emulate a 'slow server':

#!/usr/bin/env python3

import getopt
from http.server import BaseHTTPRequestHandler, HTTPServer
import random
import sys
import time

class SlowServerHandler(BaseHTTPRequestHandler):
    delay: int = 10
    flaky: int = 0
    def slowdown(self, path: str) -> None:
        if path.startswith("/slow"):
            t = int(path[5:])
            if t == 0:
                return
            if 0 < t <= 600:  # 1 sec to 10 min
                print("%s: delaying %d sec(s)..." % (path, t))
                time.sleep(t)
                return
            raise ValueError("should be an integer >= 0 and <= 600")
        else:
            d = self.delay
            if self.flaky:
                d = random.randint(0, self.flaky)
            print("%s: delaying %d sec(s)..." % (path, d))
            time.sleep(d)
            return

    def fake_request(self):
        code = 200
        try:
            self.slowdown(self.path)
        except Exception:
            code = 500

        self.send_response(code)
        if code == 200:
            self.send_header("Content-type", "application/json")
        self.end_headers()
        if code == 200:
            try:
                self.wfile.write(b"{}\n")
            except Exception:
                print("%s: broken pipe (client disconnect?)" % self.path)

    def do_GET(self):
        self.fake_request()

    def do_POST(self):
        self.fake_request()

def run(delay: int, port: int, flaky: int):
    # XXX: validate delay and port
    addr = ('', port)
    httpd = HTTPServer(addr, SlowServerHandler)
    if flaky > 0:
        SlowServerHandler.flaky = flaky
    else:
        SlowServerHandler.delay = delay
    httpd.serve_forever()

if __name__ == '__main__':
    try:
        opts, rest = getopt.getopt(sys.argv[1:], "hf:d:p:", [])
    except Exception:
        print("Bad args. See -h")
        sys.exit(1)

    port: int = 8000
    delay: int = 10
    flaky: int = 0
    for opt, arg in opts:
        if opt == "-h":
            print("""Usage: slow-server.py ARGS
 -h     this message
 -d N   set response delay in seconds (default=10)
 -p N   listen on port N
 -f N   simulate flaky server (0-N delay on every request)

GET and POST are supported on all paths and simply return an empty JSON
document after N delay seconds (as configured with '-d').

If the URL path starts with /slowN, then delay N seconds.
""")
            sys.exit(0)
        elif opt == "-d":
            delay = int(arg)
        elif opt == "-p":
            port = int(arg)
        elif opt == "-f":
            flaky = int(arg)
    try:
        run(delay, port, flaky)
    except KeyboardInterrupt:
        pass

Eg:

$ ./slow-server.py -h
Usage: slow-server.py ARGS
 -h     this message
 -d N   set response delay in seconds (default=10)
 -p N   listen on port N
 -f N   simulate flaky server (0-N delay on every request)

GET and POST are supported on all paths and simply return an empty JSON
document after N delay seconds (as configured with '-d').

If the URL path starts with /slowN, then delay N seconds.

I then set up telegraf to listen on port 8086 and write out to the local slow server, and used curl in a loop to send line protocol to telegraf. Specifically:

  1. In one terminal, start the server without any added delays (ie, it isn't 'slow' yet): $ python3 ./slow-server.py -d 0

  2. Create a telegraf configuration to proxy and use disk buffer

    $ cat ./telegraf.conf
    [agent]
      interval = "10s"
      round_interval = true
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      collection_jitter = "0s"
      flush_interval = "10s"
      flush_jitter = "0s"
      precision = "0s"
      hostname = ""
      omit_hostname = false
    
      # 'memory' or 'disk'
      buffer_strategy = "disk"
      buffer_directory = "/tmp/tg"
    
    # egress to slow server (must serve /api/v2/write)
    [[outputs.influxdb_v2]]
      urls = ["http://127.0.0.1:8000"]
    
    # ingress from curl, etc
    [[inputs.influxdb_v2_listener]]
      service_address = "127.0.0.1:8086"
      read_timeout = "10s"
      write_timeout = "10s"
      max_body_size = 0
    
  3. create the disk buffer directory: mkdir /tmp/tg

  4. start telegraf with telegraf --config ./telegraf.conf

    $ ~/code/telegraf --config ./telegraf.conf 
    2024-12-16T21:00:58Z I! Loading config: ./telegraf.conf
    2024-12-16T21:00:58Z W! Using disk buffer strategy for plugin outputs.influxdb_v2, this is an experimental feature
    2024-12-16T21:00:58Z I! WAL file not found for plugin outputs.influxdb_v2 (14c21c01a6d4a9f123336ae9b187b07a76ef66f7fc104dce4060725e0035d58a), this can safely be ignored if you added this plugin instance for the first time
    2024-12-16T21:00:58Z I! Starting Telegraf 1.33.0 brought to you by InfluxData the makers of InfluxDB
    2024-12-16T21:00:58Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
    2024-12-16T21:00:58Z I! Loaded inputs: influxdb_v2_listener
    2024-12-16T21:00:58Z I! Loaded aggregators:
    2024-12-16T21:00:58Z I! Loaded processors:
    2024-12-16T21:00:58Z I! Loaded secretstores:
    2024-12-16T21:00:58Z I! Loaded outputs: influxdb_v2
    2024-12-16T21:00:58Z I! Tags enabled: host=sec
    2024-12-16T21:00:58Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"sec", Flush Interval:10s
    2024-12-16T21:00:58Z I! [inputs.influxdb_v2_listener] Started HTTP listener service on 127.0.0.1:8086
    ...at this point I hit Ctrl-c; notice no errors above from telegraf...
    2024-12-16T21:03:26Z I! [agent] Hang on, flushing any cached metrics before shutdown
    2024-12-16T21:03:27Z I! [agent] Stopping running outputs
    
  5. download a line protocol generator: go install github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen@latest

  6. then send line protocol into telegraf in a loop. After a short while, curl starts reporting curl: (52) Empty reply from server instead of HTTP/1.1 204 No Content. Eg:

    $ while /bin/true ; do echo "# $(date)" ; ~/go/bin/bulk_data_gen | head -9999 > /tmp/lp && curl -i -X POST http://127.0.0.1:8086/api/v2/write --data-binary @/tmp/lp && sleep 10 ; done
    # Mon Dec 16 15:01:22 CST 2024
    using random seed 232901789
    2024/12/16 15:01:22 Using sampling interval 10s
    2024/12/16 15:01:22 Using cardinality of 9
    HTTP/1.1 100 Continue
    
    HTTP/1.1 204 No Content
    Date: Mon, 16 Dec 2024 21:01:30 GMT
    
    # Mon Dec 16 15:01:40 CST 2024
    using random seed 644661411
    2024/12/16 15:01:40 Using sampling interval 10s
    2024/12/16 15:01:40 Using cardinality of 9
    HTTP/1.1 100 Continue
    
    HTTP/1.1 204 No Content
    Date: Mon, 16 Dec 2024 21:01:49 GMT
    
    # Mon Dec 16 15:01:59 CST 2024
    using random seed 179797224
    2024/12/16 15:01:59 Using sampling interval 10s
    2024/12/16 15:01:59 Using cardinality of 9
    HTTP/1.1 100 Continue
    
    HTTP/1.1 204 No Content
    Date: Mon, 16 Dec 2024 21:02:08 GMT
    
    # Mon Dec 16 15:02:18 CST 2024
    using random seed 729922716
    2024/12/16 15:02:18 Using sampling interval 10s
    2024/12/16 15:02:18 Using cardinality of 9
    HTTP/1.1 100 Continue
    
    curl: (52) Empty reply from server
    # Mon Dec 16 15:02:29 CST 2024
    using random seed 79912443
    2024/12/16 15:02:29 Using sampling interval 10s
    2024/12/16 15:02:29 Using cardinality of 9
    HTTP/1.1 100 Continue
    
    curl: (52) Empty reply from server
    # Mon Dec 16 15:02:40 CST 2024
    using random seed 602442651
    2024/12/16 15:02:40 Using sampling interval 10s
    2024/12/16 15:02:40 Using cardinality of 9
    HTTP/1.1 100 Continue
    
    curl: (52) Empty reply from server
    

With this setup, I can see files getting populated in /tmp/tg. The client curl starts out receiving HTTP/1.1 204 No Content, but after about a minute I start seeing curl: (52) Empty reply from server. I believe this successfully demonstrates the problem. Interestingly, the python server shows telegraf writing throughout, 1000 points at a time due to metric_batch_size = 1000 (see the terminal output where the server is running):

$ ./slow-server.py -d 0
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:23] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:24] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:24] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:25] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:26] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:27] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:28] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:28] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:29] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:29] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:38] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:41] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:42] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:43] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:44] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:45] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:45] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:46] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:47] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:48] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:49] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:01:58] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:00] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:01] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:02] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:03] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:04] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:04] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:05] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:06] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:07] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:08] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:18] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:19] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:20] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:21] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:22] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:23] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:24] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:25] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:26] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:28] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:28] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:29] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:30] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:31] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:33] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:34] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:35] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:36] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:37] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:38] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:38] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:40] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:41] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:42] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:43] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:45] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:46] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:47] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:48] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:49] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:50] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:51] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:53] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:54] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:55] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:57] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:58] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:02:59] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:00] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:02] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:03] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:05] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:06] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:07] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:09] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:10] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:12] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:13] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:15] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:17] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:18] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:19] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:21] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:23] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:24] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:26] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -
/api/v2/write?bucket=&org=: delaying 0 sec(s)...
127.0.0.1 - - [16/Dec/2024 15:03:27] "POST /api/v2/write?bucket=&org= HTTP/1.1" 200 -

NOTE 1: once telegraf starts sending back empty responses, it does not recover even if I pause the client (whenever the client resumes, telegraf still only sends back empty responses). This is true whether or not telegraf is restarted. Only if I empty the buffer_directory does telegraf start sending back 204s again (but it eventually returns to sending only empty responses).

NOTE 2: in the above I slept 10 seconds after each curl. If I use other values (eg, 11, 20, or 30 seconds, each after starting with an empty buffer_directory), I see the same thing. With a 10 second sleep, it took ~1 minute before seeing the empty responses; with a 20 second sleep, it took ~2 minutes. Since I'm sending 9999 points at a time with curl (and telegraf is breaking that up into chunks of 1000), the number of metrics might be a factor (eg, ~60,000 metrics before the problem shows up).
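The back-of-the-envelope numbers in this note can be checked directly (the values below are taken from the reproducer above; time_to_failure is the observed ~1 minute):

```python
import math

# Values from the reproducer above.
points_per_post = 9999      # head -9999 of the generated line protocol
metric_batch_size = 1000    # from telegraf.conf
sleep_between_posts = 10    # seconds, from the curl loop
time_to_failure = 60        # ~1 minute before empty replies appeared

# telegraf splits each POST into write batches of metric_batch_size
batches_per_post = math.ceil(points_per_post / metric_batch_size)

# roughly one POST per sleep interval before the failure
posts_before_failure = time_to_failure // sleep_between_posts
metrics_before_failure = posts_before_failure * points_per_post

print(batches_per_post)        # 10 batches per POST
print(metrics_before_failure)  # 59994, i.e. ~60,000 metrics
```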

NOTE 3: with the above, I haven't introduced 'slowness' with either -f or -d, but I see similar issues when doing so (though I might also see "did not complete within its flush interval" from telegraf, depending on how slow I make the server). I provided the full script in case it is useful for other debugging and for developing a fix for this issue.

@DStrand1 DStrand1 self-assigned this Dec 20, 2024