Recently I turned on buffer_strategy = "disk" for an application writing line protocol into telegraf's inputs.influxdb_v2_listener, with telegraf writing into outputs.influxdb_v2.
Telegraf version: 1.33.0 (and 1.32.x before it)
amd64 Ubuntu (20.04 and 22.04)
Background
At the time, I was using 1.32 and the setup was:
app.1 ----\                        /-----> InfluxDB Dedicated
           \                      /
app.2 -----> 8186:telegraf ------+-------> local network OSS 2.x
           /                      \
app.3 ----/                        \-----> local network OSS 1.x
app.4 ---/
This was all working fine with memory. After changing to disk, things worked for a while, but then I started getting client errors of this form:
Nov 13 17:18:26 host1 gunicorn[406]: ERROR:app.1: 2024-11-13 17:18:26,082 "Error while writing after 5 retries, giving up"
Nov 13 17:18:26 host1 gunicorn[406]: ERROR:app.1: 2024-11-13 17:18:26,082 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999086358991917)"
Nov 13 17:19:59 host1 gunicorn[434]: ERROR:app.4: 2024-11-13 17:19:59,665 "Error while writing after 5 retries, giving up"
Nov 13 17:19:59 host1 gunicorn[434]: ERROR:app.4: 2024-11-13 17:19:59,665 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.997719279956073)"
Nov 13 17:20:56 host1 gunicorn[410]: ERROR:app.3: 2024-11-13 17:20:56,406 "Error while writing after 5 retries, giving up"
Nov 13 17:20:56 host1 gunicorn[410]: ERROR:app.3: 2024-11-13 17:20:56,406 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999581227020826)"
Nov 13 17:21:22 host1 gunicorn[408]: ERROR:app.2: 2024-11-13 17:21:22,407 "Error while writing after 5 retries, giving up"
Nov 13 17:21:22 host1 gunicorn[408]: ERROR:app.2: 2024-11-13 17:21:22,408 "Error while writing: HTTPConnectionPool(host='localhost', port=8186): Read timed out. (read timeout=9.999311759951524)"
There are no corresponding telegraf logs. The timeout is curious, as it is always just under the flush interval, which is set to flush_interval = "10s" (but see below).
Once I set it back to memory, the client errors went away.
As an additional data point, the client was seeing some intermittent network issues which might cause disconnects or slowdowns. My goal was to deploy telegraf as a proxy between my clients and InfluxDB with buffer_strategy = "disk" as a safeguard, so as not to lose points in the event of network errors and the like.
Reproducer
In an effort to create a simple(r) reproducer, I created a Python script to emulate a 'slow server':
#!/usr/bin/env python3
import getopt
from http.server import BaseHTTPRequestHandler, HTTPServer
import random
import sys
import time


class SlowServerHandler(BaseHTTPRequestHandler):
    # Responds to any GET/POST with an empty JSON body after a configurable delay.
    delay: int = 10
    flaky: int = 0

    def slowdown(self, path: str) -> None:
        if path.startswith("/slow"):
            # /slowN: delay N seconds for this request only
            t = int(path[5:])
            if t == 0:
                return
            if t > 0 and t < 600:  # 1 sec to 10 min
                print("%s: delaying %d sec(s)..." % (path, t))
                time.sleep(t)
                return
            raise ValueError("should be integer >= 0 and < 600")
        else:
            d = self.delay
            if self.flaky:
                d = random.randint(0, self.flaky)
            print("%s: delaying %d sec(s)..." % (path, d))
            time.sleep(d)
            return

    def fake_request(self):
        code = 200
        try:
            self.slowdown(self.path)
        except Exception:
            code = 500
        self.send_response(code)
        if code == 200:
            self.send_header("Content-type", "application/json")
        self.end_headers()
        if code == 200:
            try:
                self.wfile.write(b"{}\n")
            except Exception:
                print("%s: broken pipe (client disconnect?)" % self.path)

    def do_GET(self):
        self.fake_request()

    def do_POST(self):
        self.fake_request()


def run(delay: int, port: int, flaky: int):
    # XXX: validate delay and port
    addr = ('', port)
    httpd = HTTPServer(addr, SlowServerHandler)
    if flaky > 0:
        SlowServerHandler.flaky = flaky
    else:
        SlowServerHandler.delay = delay
    httpd.serve_forever()


if __name__ == '__main__':
    try:
        opts, rest = getopt.getopt(sys.argv[1:], "hf:d:p:", [])
    except Exception:
        print("Bad args. See -h")
        sys.exit(1)
    port: int = 8000
    delay: int = 10
    flaky: int = 0
    for opt, arg in opts:
        if opt == "-h":
            print("""Usage: slow-server.py ARGS
-h this message
-d N set response delay in seconds (default=10)
-p N listen on port N
-f N simulate flaky server (0-N delay on every request)
GET and POST are supported on all paths and simply return an empty JSON
document after N delay seconds (as configured with '-d').
If the URL path starts with /slowN, then delay N seconds.
""")
            sys.exit(0)
        elif opt == "-d":
            delay = int(arg)
        elif opt == "-p":
            port = int(arg)
        elif opt == "-f":
            flaky = int(arg)
    try:
        run(delay, port, flaky)
    except KeyboardInterrupt:
        pass
Eg:
$ ./slow-server.py -h
Usage: slow-server.py ARGS
-h this message
-d N set response delay in seconds (default=10)
-p N listen on port N
-f N simulate flaky server (0-N delay on every request)
GET and POST are supported on all paths and simply return an empty JSON
document after N delay seconds (as configured with '-d').
If the URL path starts with /slowN, then delay N seconds.
I then set up telegraf to listen on port 8086 and write out to the local slow server, and used curl in a loop to write line protocol into telegraf. Specifically:
In one terminal, start the server without any added delays (ie, it isn't 'slow' yet): $ python3 ./slow-server.py -d 0
Create a telegraf configuration to proxy and use disk buffer
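For reference, a minimal sketch of the proxy configuration, reconstructed from the details in this report (listener on 127.0.0.1:8086, output pointed at the slow server on port 8000, metric_batch_size = 1000, flush_interval = "10s", buffer directory /tmp/tg); the token/organization/bucket values are placeholders since the fake server ignores them:
[agent]
  interval = "10s"
  flush_interval = "10s"
  metric_batch_size = 1000
  buffer_strategy = "disk"
  buffer_directory = "/tmp/tg"

[[inputs.influxdb_v2_listener]]
  service_address = "127.0.0.1:8086"

[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8000"]
  token = "unused"
  organization = "example"
  bucket = "example"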
create the disk buffer directory: mkdir /tmp/tg
start telegraf with telegraf --config ./telegraf.conf
$ ~/code/telegraf --config ./telegraf.conf
2024-12-16T21:00:58Z I! Loading config: ./telegraf.conf
2024-12-16T21:00:58Z W! Using disk buffer strategy for plugin outputs.influxdb_v2, this is an experimental feature
2024-12-16T21:00:58Z I! WAL file not found for plugin outputs.influxdb_v2 (14c21c01a6d4a9f123336ae9b187b07a76ef66f7fc104dce4060725e0035d58a), this can safely be ignored if you added this plugin instance for the first time
2024-12-16T21:00:58Z I! Starting Telegraf 1.33.0 brought to you by InfluxData the makers of InfluxDB
2024-12-16T21:00:58Z I! Available plugins: 236 inputs, 9 aggregators, 33 processors, 26 parsers, 63 outputs, 6 secret-stores
2024-12-16T21:00:58Z I! Loaded inputs: influxdb_v2_listener
2024-12-16T21:00:58Z I! Loaded aggregators:
2024-12-16T21:00:58Z I! Loaded processors:
2024-12-16T21:00:58Z I! Loaded secretstores:
2024-12-16T21:00:58Z I! Loaded outputs: influxdb_v2
2024-12-16T21:00:58Z I! Tags enabled: host=sec
2024-12-16T21:00:58Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"sec", Flush Interval:10s
2024-12-16T21:00:58Z I! [inputs.influxdb_v2_listener] Started HTTP listener service on 127.0.0.1:8086
...at this point I hit Ctrl-c; notice no errors above from telegraf...
2024-12-16T21:03:26Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-12-16T21:03:27Z I! [agent] Stopping running outputs
download a line protocol generator: go install github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen@latest
then send line protocol into telegraf in a loop. After a short while, telegraf stops responding and curl reports curl: (52) Empty reply from server instead of HTTP/1.1 204 No Content. Eg:
$ while /bin/true ; do echo "# $(date)" ; ~/go/bin/bulk_data_gen | head -9999 > /tmp/lp && curl -i -X POST http://127.0.0.1:8086/api/v2/write --data-binary @/tmp/lp && sleep 10 ; done
# Mon Dec 16 15:01:22 CST 2024
using random seed 232901789
2024/12/16 15:01:22 Using sampling interval 10s
2024/12/16 15:01:22 Using cardinality of 9
HTTP/1.1 100 Continue
HTTP/1.1 204 No Content
Date: Mon, 16 Dec 2024 21:01:30 GMT
# Mon Dec 16 15:01:40 CST 2024
using random seed 644661411
2024/12/16 15:01:40 Using sampling interval 10s
2024/12/16 15:01:40 Using cardinality of 9
HTTP/1.1 100 Continue
HTTP/1.1 204 No Content
Date: Mon, 16 Dec 2024 21:01:49 GMT
# Mon Dec 16 15:01:59 CST 2024
using random seed 179797224
2024/12/16 15:01:59 Using sampling interval 10s
2024/12/16 15:01:59 Using cardinality of 9
HTTP/1.1 100 Continue
HTTP/1.1 204 No Content
Date: Mon, 16 Dec 2024 21:02:08 GMT
# Mon Dec 16 15:02:18 CST 2024
using random seed 729922716
2024/12/16 15:02:18 Using sampling interval 10s
2024/12/16 15:02:18 Using cardinality of 9
HTTP/1.1 100 Continue
curl: (52) Empty reply from server
# Mon Dec 16 15:02:29 CST 2024
using random seed 79912443
2024/12/16 15:02:29 Using sampling interval 10s
2024/12/16 15:02:29 Using cardinality of 9
HTTP/1.1 100 Continue
curl: (52) Empty reply from server
# Mon Dec 16 15:02:40 CST 2024
using random seed 602442651
2024/12/16 15:02:40 Using sampling interval 10s
2024/12/16 15:02:40 Using cardinality of 9
HTTP/1.1 100 Continue
curl: (52) Empty reply from server
With this setup, I can see files getting populated in /tmp/tg. The client curl starts out outputting HTTP/1.1 204 No Content, but after about a minute I start seeing curl: (52) Empty reply from server. I believe this successfully demonstrates the problem. Interestingly, the Python server shows telegraf writing throughout, 1000 points at a time due to metric_batch_size = 1000 (see the terminal output where the server is running).
NOTE 1: once telegraf starts sending back empty responses, it does not recover even if I pause the client (i.e., whenever the client resumes, telegraf only sends back empty responses). This is true whether telegraf is restarted or not. Only if I empty the buffer_directory will telegraf start sending back 204s (but it eventually only sends empty responses again).
NOTE 2: in the above I slept 10 seconds after the curl. If I use other values (e.g., 11 seconds, 20 seconds, 30 seconds, all after starting with an empty buffer_directory), I see the same thing. With a 10 second sleep, it took ~1 minute before seeing the empty responses. With a 20 second sleep it took ~2 minutes before seeing empty responses. Since I'm sending 9999 lines at a time with curl (and telegraf is breaking that up into chunks of 1000), the number of metrics might be a factor (e.g., ~60,000 metrics before it starts to show the problem).
NOTE 3: with the above, I haven't introduced 'slowness' with either -f or -d, but I see similar issues when doing so (though I might then see did not complete within its flush interval from telegraf, depending on how slow I make the server). I provided the full script in case it is useful for further debugging and for developing a fix for this issue.