Compression on v0.14 causes high CPU usage and timeouts #1605
Description
I have small highly compressible logs, so I enabled network and disk compression to save some space with my config:
<source>
  @type forward
  @id input1
  port 24224
</source>
<match p2pspider.**>
  @type forward
  send_timeout 3m
  recover_wait 10s
  heartbeat_interval 5s
  phi_threshold 16
  hard_timeout 10m
  require_ack_response true
  compress gzip
  <server>
    name coreos3
    host 192.168.1.202
    port 24224
    weight 60
  </server>
  <buffer>
    @type file
    path /fluentd/log/buffer
    retry_max_interval 10m
    retry_forever true
    compress gzip
    chunk_limit_size 5MB
  </buffer>
</match>
<match **>
  @type null
</match>
However, in compressable.rb fluentd seems to compress and decompress each record separately (producing a multi-member gzip archive) rather than compressing whole blocks of data at once. Compression also appears to run on the main thread rather than in a separate process. I'm not a Ruby expert, so I may be wrong about this, but the way compression is done seems very inefficient.
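To illustrate the cost, here's a rough sketch (my own toy comparison, not fluentd's actual code) of per-record gzip members versus compressing the whole chunk once. With many small, similar records, the per-member header/footer overhead and the reset compression dictionary make a dramatic difference in both size and work done:

```ruby
require 'zlib'

records = Array.new(10_000) { |i| "record #{i}: some highly compressible log line\n" }

# One gzip member per record, concatenated into a multi-member archive
# (roughly what per-record compression amounts to).
per_record = records.map { |r| Zlib.gzip(r) }.join

# One gzip member for the whole chunk.
whole_chunk = Zlib.gzip(records.join)

puts "per-record:  #{per_record.bytesize} bytes"
puts "whole-chunk: #{whole_chunk.bytesize} bytes"
```

On data like mine, the per-record output ends up many times larger, because every member carries its own header and checksum and deflate can never exploit redundancy across records.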
This results in the fluentd process on the receiving end hitting 100% CPU usage, not sending responses in time, and then going back to a normal load while it waits for another chunk to be sent. This cycle repeats for hours with very few chunks actually being forwarded.
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7af8fe5f9459644f08038b1268dd" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b08ad849a36fe16bc144c14800d" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b3238ee97c4cd1de5b9e0adff66" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b41eaf04ad6a449acd937e173d0" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b6b76c1c688fd03a6ca13fb3852" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b7b34168e6326001068f9c29078" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7ba4aadc94dace0f8f1f59e333dc" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7bb4b14fc2d79a1356c053eefe70" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7bdde4495b58e988b2e7288972f2" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7beda1f92ed713f60638fbd13b13" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c171dd5578fc76f674ea4a5aa93" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c26ee7ccef4437abb59f3dfe59f" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c505addc388640b50cbc368857d" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c603d9f8f8f70e0fa6235501ed8" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c8993a3f597fccb1a37bbffbe39" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c9944f3503ea506cb094e7479b5" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7cc2ca49e3734d07cb64467a1534" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
When fluentd is at 100% CPU usage, the load is split almost entirely between user-space and kernel-space work, with negligible iowait, meaning it's not waiting on a slow filesystem or network.
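This split can be checked directly on Linux (a generic diagnostic, not fluentd-specific; the PID below is a placeholder):

```shell
# Fields 14 (utime) and 15 (stime) of /proc/<pid>/stat are the process's
# accumulated user and kernel CPU time in clock ticks; time spent blocked
# on I/O would show up as iowait in `top`/`vmstat` instead, not here.
pid=1234   # replace with the fluentd worker's PID
awk '{printf "user=%s ticks, kernel=%s ticks\n", $14, $15}' "/proc/$pid/stat"
```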
For comparison, I can use rsync to move the same logs with compression at about 10 MB/s on the same hardware. Obviously fluentd and rsync are entirely different programs, but my point is that compression can be done without nearly this much CPU.
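As a rough sanity check that zlib itself isn't the bottleneck (a standalone benchmark on made-up log-shaped data, unrelated to fluentd's internals), compressing several megabytes in a single call is fast on commodity hardware; the overhead seems to come from how compression is invoked per record, not from gzip:

```ruby
require 'zlib'
require 'benchmark'

# ~8 MB of highly compressible, vaguely log-shaped data.
payload = "2017-06-16 18:21:36 +0000 [warn]: sample log line\n" * 160_000

elapsed = Benchmark.realtime { Zlib.gzip(payload) }
puts format("gzipped %.1f MB in %.0f ms",
            payload.bytesize / (1024.0 * 1024.0), elapsed * 1000)
```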
With network compression disabled (but disk compression kept on), the result is largely the same: network timeouts and few chunks being forwarded. However, in that case it's the forwarding server that shows high CPU usage rather than the receiving one. I haven't tested this configuration much; I just ended up manually recovering the logs from the saved chunks.
Without any compression enabled, it seems to work well. CPU usage is around 10-20% on each node and no errors are reported.
- fluentd version: Docker image fluentd:v0.14 with image ID 860efe232d88
- Environment: Docker running on Arch