
collector.queue-size-memory option doesn't appear to account for all of the memory used by enqueued trace spans #2715

Open
@nickebbitt

Description

Describe the bug
The documentation related to constraining the memory used by the queue is unclear.

We have attempted to use the collector.queue-size-memory option (on k8s, so via the env variable COLLECTOR_QUEUE_SIZE_MEMORY), but our expectations based on the docs did not match how it actually works.

From a user perspective, this option feels like it should allow us to constrain the total memory used by the spans on the queue.

This would be very useful when looking to control the overall memory used by the collector, particularly when changes in the trace spans and their associated metadata/tags could cause unexpected growth in memory usage, ultimately resulting in processes being OOM killed.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Jaeger collector to k8s (or equivalent) with:
    • memory request = 200MB
    • memory limit = 250MB
    • COLLECTOR_QUEUE_SIZE_MEMORY = 80
  2. Break connectivity with the storage backend (Elasticsearch in our case)
  3. The queue fills up, memory grows to > 250MB, and the pod is OOM killed by k8s

Expected behavior
We expected the traces/spans on the queue to be constrained to a memory size of 80MB.

With our configuration this would leave 120MB of memory free, allowing for queue resizing and other memory usage by the process.

Assuming my understanding of this is correct, either the docs should call out that the items on the queue will use additional memory beyond the configured limit.

Or, if this is in fact a bug, then the calculation of the queue size based on spans should also account for the additional memory footprint that doesn't currently appear to be considered.
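
To make the distinction concrete, below is a minimal Go sketch of how a memory-bounded queue capacity could be derived. It assumes the capacity is computed by dividing the configured byte budget by an observed average serialized span size; the constants, names, and the overhead factor are hypothetical illustrations of the idea, not Jaeger's actual code:

```go
package main

import "fmt"

// Hypothetical numbers for illustration only.
const (
	queueMemoryBudget = 80 * 1024 * 1024 // COLLECTOR_QUEUE_SIZE_MEMORY = 80 (bytes)
	avgSerializedSize = 1024             // observed average serialized span size (bytes)

	// Assumed ratio between a span's in-memory footprint (structs, tag
	// slices, string headers, pointers) and its serialized size. This is
	// the part that currently appears to be unaccounted for.
	inMemoryOverheadFactor = 5
)

func main() {
	// What appears to happen today: capacity based purely on the
	// serialized span size, so the queue can hold far more bytes of
	// live heap than the configured budget suggests.
	currentCapacity := queueMemoryBudget / avgSerializedSize

	// What we expected: the budget also covers the decoded in-memory
	// representation of each span.
	expectedCapacity := queueMemoryBudget / (avgSerializedSize * inMemoryOverheadFactor)

	fmt.Printf("capacity from serialized size only: %d spans\n", currentCapacity)
	fmt.Printf("capacity with in-memory overhead:   %d spans\n", expectedCapacity)
}
```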

Version (please complete the following information):

  • OS: Linux (using the Jaeger Docker images)
  • Jaeger version: 1.21.0
  • Deployment: Kubernetes

What troubleshooting steps did you try?
Debugged using pprof and experimented with increased memory until we reached a point where the queue maxed out, spans were dropped, and the process's memory usage stabilised as expected.
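
For reference, a rough sketch of one way to pull a heap profile from a running collector (equivalent to pointing go tool pprof at the heap endpoint). It assumes the collector exposes the standard net/http/pprof handlers on its admin port; the host name and port below are placeholders for our setup, not something this issue verifies:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Assumed admin endpoint of the collector; adjust host/port as needed.
	const heapProfileURL = "http://jaeger-collector:14269/debug/pprof/heap"

	resp, err := http.Get(heapProfileURL)
	if err != nil {
		log.Fatalf("fetching heap profile: %v", err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.out")
	if err != nil {
		log.Fatalf("creating output file: %v", err)
	}
	defer out.Close()

	// Save the profile so it can be inspected offline, e.g. with
	// `go tool pprof -inuse_space heap.out`.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing heap profile: %v", err)
	}
}
```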

For our setup, increasing the memory request to 1GB while keeping COLLECTOR_QUEUE_SIZE_MEMORY = 80MB allows us to see that when the queue is full it consumes ~750MB of RAM, roughly nine times the configured 80MB budget.

The following is the heap analysis from pprof for inuse_space with the above configuration. As can be observed, zipkin.toDomain.getTags (amongst others) consumes much more than the 80MB we specified.

heap-1gb-mem9.out.zip

[screenshot: pprof inuse_space heap profile]

Additional context
Whilst debugging the problem I used pprof to analyse the inuse_space on the heap and observed that a large chunk of memory was being used by zipkin.toDomain.getTags.

This led me to believe that the 80MB relates specifically to the lower-level memory used by the queue structure and its sizing, but not to the additional objects and memory associated with the items held on the queue.
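
One way to sanity-check that belief, independent of the collector itself, is to measure how much heap a batch of decoded, tag-heavy span-like objects actually retains. The sketch below uses a simplified stand-in struct rather than Jaeger's model.Span, so the numbers are only indicative of the kind of overhead involved:

```go
package main

import (
	"fmt"
	"runtime"
)

// spanLike is a simplified stand-in for a decoded span: the real span
// model carries more fields, but the dominant cost here is the same,
// per-tag entries, string headers and their backing arrays.
type spanLike struct {
	operationName string
	tags          map[string]string
}

func makeSpan(i int) *spanLike {
	tags := make(map[string]string, 10)
	for t := 0; t < 10; t++ {
		tags[fmt.Sprintf("tag-key-%d-%d", i, t)] = fmt.Sprintf("some moderately long tag value %d", t)
	}
	return &spanLike{operationName: fmt.Sprintf("op-%d", i), tags: tags}
}

// heapAlloc forces a GC so HeapAlloc reflects live objects, then reads it.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	const n = 100_000

	before := heapAlloc()
	spans := make([]*spanLike, 0, n)
	for i := 0; i < n; i++ {
		spans = append(spans, makeSpan(i))
	}
	after := heapAlloc()

	perSpan := (after - before) / n
	fmt.Printf("approx. in-memory footprint: %d bytes/span (%d spans retained)\n", perSpan, len(spans))
	runtime.KeepAlive(spans)
}
```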

Labels: bug, help wanted