
collector.queue-size-memory option doesn't appear to account for all of the memory used by enqueued trace spans #2715

Open
@nickebbitt

Description

Describe the bug
The documentation related to constraining the memory used by the queue is unclear.

We have attempted to use the collector.queue-size-memory option (on k8s, so via the env variable COLLECTOR_QUEUE_SIZE_MEMORY), but our expectations based on the docs did not match how it actually works.

From a user perspective, this option feels like it should allow us to constrain the total memory used by the spans on the queue.

This would be very useful when looking to control the overall memory used by the collector, particularly when changes in the trace spans and their associated metadata/tags could cause unexpected growth in memory usage, ultimately resulting in processes being OOM killed.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Jaeger collector to k8s (or equivalent) with:
    • memory request = 200MB
    • memory limit = 250MB
    • COLLECTOR_QUEUE_SIZE_MEMORY = 80
  2. Break connectivity with the storage backend (Elasticsearch in our case)
  3. The queue fills up, memory grows to > 250MB, and the pod is OOM killed by k8s

Expected behavior
We expected the traces/spans on the queue to be constrained to a memory size of 80MB.

With our configuration this would leave 120MB of memory free, allowing for queue resizing and other memory usage by the process.

Assuming my understanding of this is correct, either the docs should call out that the items on the queue will use additional memory beyond the configured limit.

Or, if this is in fact a bug, then the calculation of the queue size based on spans should also account for the additional memory footprint that doesn't currently appear to be considered.
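
To make the distinction concrete, below is a minimal Go sketch of how a memory-bounded queue capacity could be derived. It assumes the capacity is computed by dividing the configured byte budget by an observed average serialized span size; the constants, names, and the overhead factor are hypothetical illustrations of the idea, not Jaeger's actual code:

```go
package main

import "fmt"

// Hypothetical numbers for illustration only.
const (
	queueMemoryBudget = 80 * 1024 * 1024 // COLLECTOR_QUEUE_SIZE_MEMORY = 80 (bytes)
	avgSerializedSize = 1024             // observed average serialized span size (bytes)

	// Assumed ratio between a span's in-memory footprint (structs, tag
	// slices, string headers, pointers) and its serialized size. This is
	// the part that currently appears to be unaccounted for.
	inMemoryOverheadFactor = 5
)

func main() {
	// What appears to happen today: capacity based purely on the
	// serialized span size, so the queue can hold far more bytes of
	// live heap than the configured budget suggests.
	currentCapacity := queueMemoryBudget / avgSerializedSize

	// What we expected: the budget also covers the decoded in-memory
	// representation of each span.
	expectedCapacity := queueMemoryBudget / (avgSerializedSize * inMemoryOverheadFactor)

	fmt.Printf("capacity from serialized size only: %d spans\n", currentCapacity)
	fmt.Printf("capacity with in-memory overhead:   %d spans\n", expectedCapacity)
}
```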

Version (please complete the following information):

  • OS: Linux (using the Jaeger Docker images)
  • Jaeger version: 1.21.0
  • Deployment: Kubernetes

What troubleshooting steps did you try?
Debugged using pprof and experimented with increased memory until we reached a point where the queue maxed out, spans were dropped, and the process's memory usage stabilised as expected.
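
For reference, a rough sketch of one way to pull a heap profile from a running collector (equivalent to pointing go tool pprof at the heap endpoint). It assumes the collector exposes the standard net/http/pprof handlers on its admin port; the host name and port below are placeholders for our setup, not something this issue verifies:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Assumed admin endpoint of the collector; adjust host/port as needed.
	const heapProfileURL = "http://jaeger-collector:14269/debug/pprof/heap"

	resp, err := http.Get(heapProfileURL)
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.out")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	// Save the profile so it can be inspected offline, e.g. with
	// `go tool pprof -inuse_space heap.out`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing heap profile: %v", err)
	}
}
```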

For our setup, increasing the memory request to 1GB while keeping COLLECTOR_QUEUE_SIZE_MEMORY = 80MB allows us to see that when the queue is full it consumes ~750MB of RAM, roughly nine times the configured 80MB budget.

The following is the heap analysis from pprof for inuse_space with the above configuration. As can be observed, zipkin.toDomain.getTags (amongst others) consumes much more than the 80MB we specified.

heap-1gb-mem9.out.zip

[screenshot: pprof inuse_space heap profile]

Additional context
Whilst debugging the problem I used pprof to analyse the inuse_space on the heap and observed that a large chunk of memory was being used by zipkin.toDomain.getTags.

This led me to believe that the 80MB relates specifically to the lower-level memory used by the queue structure and its sizing, but not to the additional objects and memory associated with the items held on the queue.
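
One way to sanity-check that belief, independent of the collector itself, is to measure how much heap a batch of decoded, tag-heavy span-like objects actually retains. The sketch below uses a simplified stand-in struct rather than Jaeger's model.Span, so the numbers are only indicative of the kind of overhead involved:

```go
package main

import (
	"fmt"
	"runtime"
)

// spanLike is a simplified stand-in for a decoded span: the real span
// model carries more fields, but the dominant cost here is the same,
// per-tag entries, string headers and their backing arrays.
type spanLike struct {
	operationName string
	tags          map[string]string
}

func makeSpan(i int) *spanLike {
	tags := make(map[string]string, 10)
	for t := 0; t < 10; t++ {
		tags[fmt.Sprintf("tag-key-%d-%d", i, t)] = fmt.Sprintf("some moderately long tag value %d", t)
	}
	return &spanLike{operationName: fmt.Sprintf("op-%d", i), tags: tags}
}

// heapAlloc forces a GC so HeapAlloc reflects live objects, then reads it.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	const n = 100_000

	before := heapAlloc()
	spans := make([]*spanLike, 0, n)
	for i := 0; i < n; i++ {
		spans = append(spans, makeSpan(i))
	}
	after := heapAlloc()

	perSpan := (after - before) / n
	fmt.Printf("approx. in-memory footprint: %d bytes/span (%d spans retained)\n", perSpan, len(spans))
	runtime.KeepAlive(spans)
}
```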

Labels: bug, help wanted