Compression on v0.14 causes high CPU usage and timeouts #1605
Description
I have small highly compressible logs, so I enabled network and disk compression to save some space with my config:
<source>
  @type forward
  @id input1
  port 24224
</source>
<match p2pspider.**>
  @type forward
  send_timeout 3m
  recover_wait 10s
  heartbeat_interval 5s
  phi_threshold 16
  hard_timeout 10m
  require_ack_response true
  compress gzip
  <server>
    name coreos3
    host 192.168.1.202
    port 24224
    weight 60
  </server>
  <buffer>
    @type file
    path /fluentd/log/buffer
    retry_max_interval 10m
    retry_forever true
    compress gzip
    chunk_limit_size 5MB
  </buffer>
</match>
<match **>
  @type null
</match>
However, in compressable.rb fluentd seems to compress and decompress each record separately (producing a multi-member gzip archive) rather than compressing whole blocks of data at once. Compression also appears to run on the main thread rather than in a separate process. I'm not a Ruby expert, so I may be wrong about this, but the way compression is done seems very inefficient.
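To illustrate the cost, here's a rough sketch (my own toy comparison, not fluentd's actual code) of per-record gzip members versus compressing the whole chunk once. With many small, similar records, the per-member header/footer overhead and the reset compression dictionary make a dramatic difference in both size and work done:

```ruby
require 'zlib'

records = Array.new(10_000) { |i| "record #{i}: some highly compressible log line\n" }

# One gzip member per record, concatenated into a multi-member archive
# (roughly what per-record compression amounts to).
per_record = records.map { |r| Zlib.gzip(r) }.join

# One gzip member for the whole chunk.
whole_chunk = Zlib.gzip(records.join)

puts "per-record:  #{per_record.bytesize} bytes"
puts "whole-chunk: #{whole_chunk.bytesize} bytes"
```

On data like mine, the per-record output ends up many times larger, because every member carries its own header and checksum and deflate can never exploit redundancy across records.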
This results in the fluentd process on the receiving end hitting 100% CPU usage, not sending responses in time, and then going back to a normal load while it waits for another chunk to be sent. This cycle repeats for hours with very few chunks actually being forwarded.
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7af8fe5f9459644f08038b1268dd" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b08ad849a36fe16bc144c14800d" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b3238ee97c4cd1de5b9e0adff66" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b41eaf04ad6a449acd937e173d0" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b6b76c1c688fd03a6ca13fb3852" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b7b34168e6326001068f9c29078" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7ba4aadc94dace0f8f1f59e333dc" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7bb4b14fc2d79a1356c053eefe70" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7bdde4495b58e988b2e7288972f2" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7beda1f92ed713f60638fbd13b13" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c171dd5578fc76f674ea4a5aa93" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c26ee7ccef4437abb59f3dfe59f" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c505addc388640b50cbc368857d" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c603d9f8f8f70e0fa6235501ed8" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c8993a3f597fccb1a37bbffbe39" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c9944f3503ea506cb094e7479b5" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7cc2ca49e3734d07cb64467a1534" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
When fluentd is at 100% CPU usage, the load is split almost entirely between user-space and kernel-space work, with negligible iowait, meaning it's not waiting on a slow filesystem or network.
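This split can be checked directly on Linux (a generic diagnostic, not fluentd-specific; the PID below is a placeholder):

```shell
# Fields 14 (utime) and 15 (stime) of /proc/<pid>/stat are the process's
# accumulated user and kernel CPU time in clock ticks; time spent blocked
# on I/O would show up as iowait in `top`/`vmstat` instead, not here.
pid=1234   # replace with the fluentd worker's PID
awk '{printf "user=%s ticks, kernel=%s ticks\n", $14, $15}' "/proc/$pid/stat"
```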
For comparison, I can use rsync to move the same logs with compression at about 10 MB/s on the same hardware. Obviously fluentd and rsync are entirely different programs, but my point is that compression can be done without nearly this much CPU.
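As a rough sanity check that zlib itself isn't the bottleneck (a standalone benchmark on made-up log-shaped data, unrelated to fluentd's internals), compressing several megabytes in a single call is fast on commodity hardware; the overhead seems to come from how compression is invoked per record, not from gzip:

```ruby
require 'zlib'
require 'benchmark'

# ~8 MB of highly compressible, vaguely log-shaped data.
payload = "2017-06-16 18:21:36 +0000 [warn]: sample log line\n" * 160_000

elapsed = Benchmark.realtime { Zlib.gzip(payload) }
puts format("gzipped %.1f MB in %.0f ms",
            payload.bytesize / (1024.0 * 1024.0), elapsed * 1000)
```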
With network compression disabled (but disk compression kept on), the result is largely the same: network timeouts and few chunks being forwarded. However, in that case it's the forwarding server that shows high CPU usage rather than the receiving one. I haven't tested this configuration much; I just ended up manually recovering the logs from the saved chunks.
Without any compression enabled, it seems to work well. CPU usage is around 10-20% on each node and no errors are reported.
- fluentd version: Docker image fluentd:v0.14 with image ID 860efe232d88
- Environment: Docker running on Arch