Skip to content

Compression on v0.14 causes high CPU usage and timeouts #1605

Open
@notslang

Description

I have small highly compressible logs, so I enabled network and disk compression to save some space with my config:

<source>
  @type  forward
  @id    input1
  port  24224
</source>

<match p2pspider.**>
  @type forward
  send_timeout 3m
  recover_wait 10s
  heartbeat_interval 5s
  phi_threshold 16
  hard_timeout 10m
  require_ack_response true
  compress gzip

  <server>
    name coreos3
    host 192.168.1.202
    port 24224
    weight 60
  </server>

  <buffer>
    @type file
    path /fluentd/log/buffer
    retry_max_interval 10m
    retry_forever true
    compress gzip
    chunk_limit_size 5MB
  </buffer>
</match>

<match **>
  @type null
</match>

However, in compressable.rb fluentd seems to compress/decompress each record separately (resulting in a multi-member archive) rather than compressing entire blocks of data at a time. Plus, compression seems to be handled inside of the main thread, rather than in a separate process. I'm not a Ruby expert, so I might be entirely wrong about this, but the way compression is done seems very inefficient.

This results in the fluentd process on the receiving end hitting 100% CPU usage, not sending responses in time, and then going back to a normal load while it waits for another chunk to be sent. This cycle repeats for hours with very few chunks actually being forwarded.

2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7af8fe5f9459644f08038b1268dd" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b08ad849a36fe16bc144c14800d" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b3238ee97c4cd1de5b9e0adff66" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b41eaf04ad6a449acd937e173d0" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b6b76c1c688fd03a6ca13fb3852" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7b7b34168e6326001068f9c29078" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7ba4aadc94dace0f8f1f59e333dc" flushed_at=2017-06-16 18:18:24 +0000
2017-06-16 18:21:36 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7bb4b14fc2d79a1356c053eefe70" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7bdde4495b58e988b2e7288972f2" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7beda1f92ed713f60638fbd13b13" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c171dd5578fc76f674ea4a5aa93" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c26ee7ccef4437abb59f3dfe59f" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c505addc388640b50cbc368857d" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c603d9f8f8f70e0fa6235501ed8" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c8993a3f597fccb1a37bbffbe39" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7c9944f3503ea506cb094e7479b5" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 failed to flush the buffer chunk, timeout to commit. chunk_id="551d7cc2ca49e3734d07cb64467a1534" flushed_at=2017-06-16 18:18:25 +0000
2017-06-16 18:21:37 +0000 [warn]: #0 no response from node. regard it as unavailable. host="192.168.1.226" port=24224

When fluentd is at 100% CPU usage, load is divided almost entirely between user space and kernel space work, meaning it's not waiting on a slow filesystem or network.

For comparison, I can use rsync to move logs with compression at about 10MB/sec on the same hardware. Obviously fluentd and rsync are totally different programs, but my point is that there is a way to do compression that doesn't take nearly as much CPU.

Without network compression enabled (but disk compression kept on) the result is largely the same - network timeouts and few chunks being forwarded. However, it seems that the forwarding server has high CPU usage, rather than the receiving server. I haven't tested this configuration much - I just ended up manually recovering the logs from saved chunks.

Without any compression enabled, it seems to work well. CPU usage is around 10-20% on each node and no errors are reported.

  • fluentd version: Docker image fluentd:v0.14 with image id 860efe232d88
  • Environment: Docker running on Arch

Metadata

Assignees

Labels

enhancementFeature request or improve operationsv1

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions