High Memory Allocation with snapteld v0.19 #1478
Comments
@daniellee, thanks for the details. What's your expectation? Would you please elaborate?
@daniellee, those pprof profiles are really helpful. We can easily identify the memory-hogging methods. We'll look into it. Please feel free to update this issue with your insights anytime. Thanks.
Here is a flame graph of the Snap daemon's memory heap allocation, generated from @daniellee's profiles. It brings to light three kinds of hotspots that consume a lot of memory.

Gob and gRPC encoding/decoding
How to highlight it?
How to fix it: Short term, the Snap workflow is designed so that every metric goes through Snap, so we need to allocate it. There might be a long-term solution: if we change the Snap workflow so data flows directly from one plugin to another, it would save metrics from going back and forth between plugins and Snap.

UpdateCache
How to highlight it?
How to fix it:

ToMetric
How to highlight it?
How to fix it: Unify the structure.

It's important to know that all those numbers scale as we scale Snap. @bjray @mjbrender we may need an "area/performance" tag for that.
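To make the UpdateCache point concrete, here is a minimal sketch of a TTL-bounded metric cache. This is not Snap's actual code; the `MetricCache` type and its fields are hypothetical, and it only illustrates the general idea of evicting stale entries instead of letting the cache grow with every namespace the daemon has ever seen.

```go
package cache

import (
	"sync"
	"time"
)

// entry pairs a cached metric payload with the time it was stored.
type entry struct {
	value   interface{}
	addedAt time.Time
}

// MetricCache is a hypothetical TTL-bounded cache. Snap's real
// UpdateCache logic differs; this only sketches bounding memory by
// evicting stale entries.
type MetricCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	data map[string]entry
}

func NewMetricCache(ttl time.Duration) *MetricCache {
	return &MetricCache{ttl: ttl, data: make(map[string]entry)}
}

// Update stores a metric under its namespace key and evicts entries
// older than the TTL, so memory stays proportional to the working
// set rather than to the lifetime of the daemon.
func (c *MetricCache) Update(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	c.data[key] = entry{value: value, addedAt: now}
	for k, e := range c.data {
		if now.Sub(e.addedAt) > c.ttl {
			delete(c.data, k)
		}
	}
}
```

The eviction loop here is O(n) per update for simplicity; a real implementation would likely amortize it with a periodic background sweep.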
@daniellee sent me a CPU profile of snapteld, so now we can quantify the actual performance loss.

Garbage collector
The memory heap hotspots we highlighted will cause more garbage collection. If we search for ...

goRPC / gRPC encode / decode
The CPU usage of ...

Namespace.String()
In UpdateCache we previously noticed that ToMetric ...

Short term we should definitely prioritize optimizing ...
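As an illustration of why a namespace String() method can dominate both the CPU and allocation profiles, here is a hedged sketch (a hypothetical `Namespace` type, not Snap's implementation): joining the elements on every call allocates a new string each time, while caching the joined form removes both the CPU cost and the garbage-collector pressure.

```go
package namespace

import "strings"

// Namespace is a hypothetical namespace type: an ordered list of
// elements such as ["intel", "docker", "stats", "memory_stats", "usage"].
type Namespace struct {
	elements []string
	joined   string // cached result of String()
}

// String joins the elements with "/" exactly once and caches the
// result, so repeated calls (for every metric on every collection
// interval) stop allocating new strings.
func (n *Namespace) String() string {
	if n.joined == "" {
		n.joined = "/" + strings.Join(n.elements, "/")
	}
	return n.joined
}
```

Called once per metric per 10-second interval across 5000-10000 metrics, the difference between re-joining and reusing a cached string adds up quickly.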
There are two ways that come to mind that could address this:

1. Change how metrics accumulate in the cache.
2. Default the namespace separator to the forward slash (a collector that needs something else could specify its own).

I think it'd be interesting to also test with an all-GoRPC and an all-gRPC setup to see what differences (if any) there might be.
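To quantify the encode/decode cost that the heap profiles keep pointing at, something like the following benchmark sketch could be used. The `metric` struct is an assumption (not Snap's type), and the same benchmark shape could wrap the gRPC/protobuf path for the all-GoRPC vs. all-gRPC comparison mentioned above.

```go
package encoding_test

import (
	"bytes"
	"encoding/gob"
	"testing"
	"time"
)

// metric is a stand-in for Snap's metric type; the field names are assumed.
type metric struct {
	Namespace []string
	Data      float64
	Timestamp time.Time
}

// BenchmarkGobEncode measures the time and allocations needed to
// gob-encode a batch of 5000 metrics, roughly the batch size from
// this issue (5000-10000 metrics every 10 seconds).
func BenchmarkGobEncode(b *testing.B) {
	batch := make([]metric, 5000)
	for i := range batch {
		batch[i] = metric{
			Namespace: []string{"intel", "docker", "stats", "memory_stats", "usage"},
			Data:      float64(i),
			Timestamp: time.Now(),
		}
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(batch); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it with `go test -bench=.` reports bytes and allocations per encoded batch, which gives a baseline to compare against a protobuf-based encoder.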
To get around this problem in our production environment, I reduced the number of metrics we are saving from 5000-10000 metrics every 10 seconds down to 1000 metrics every 10 seconds. This reduced the memory used by Snap from 1-2GB down to 180-300MB per instance. @candysmurf, you asked about my expectations: 180MB is much closer to what I would expect a collector to be using. IMO a collector should have as small a memory and CPU footprint as possible.
@daniellee, thanks for the details. It helps. @IRCody & @kindermoumoute, looking at the profiling hotspots, tackling the metrics accumulated inside the cache is a good first step. #1 proposed by Cody seems easy and doable right away. Let's do that first and see how much improvement it brings. Then we can work on future solutions for the other areas.
@IRCody, I agree on #2 as well; that's what Oliver and I have been talking about. We can default the separator to the forward slash. If a collector does not want to use that, it can specify its own separator and Snap should honor it. The change should not be significant, but it needs documentation and communication. Looking at the in-use profiling graph, that method didn't form a hotspot, which is why I was more inclined to change the cache first. Thinking about it now, if we go with #2, the caching issue is automatically resolved. Should we go with #2 directly? It seems like the more complete fix.
Even if we allow a collector to specify its own separator, it should be constrained to a limited set of characters so it won't bring us further marshal/unmarshal/encode/decode issues.
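A minimal sketch of what that constraint could look like, assuming a hypothetical `ValidateSeparator` helper and an arbitrary whitelist (the actual set of safe characters would need to be agreed on):

```go
package namespace

import "fmt"

// allowedSeparators is a hypothetical whitelist of separator characters
// assumed to round-trip safely through marshal/unmarshal and
// gob/gRPC encode/decode.
var allowedSeparators = map[rune]bool{
	'/': true,
	'.': true,
	'|': true,
	'%': true,
}

// ValidateSeparator rejects any separator outside the whitelist,
// nudging callers toward the default forward slash.
func ValidateSeparator(sep rune) error {
	if !allowedSeparators[sep] {
		return fmt.Errorf("separator %q is not in the allowed set; use '/' instead", sep)
	}
	return nil
}
```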
Environment
OS version: Ubuntu Xenial
Snap version: 0.19
Environment details (virtual, physical, etc.): Kubernetes pod - our Docker container which is based on the latest intelsdi/snap:xenial container.
Steps to reproduce
Using the following plugins:
Collectors:
Publisher:
graphite (version 5)
One task running with the following metrics:
The collection interval is 10 seconds.
The number of metrics being collected is between 5000 and 10000 (to put it another way: 5000 - 10000 series are saved to graphite every 10 seconds).
Actual results
On one Kubernetes cluster the Snap container is using between 1.3GB and 2.5GB according to the memory_stats/usage/usage metric from the docker collector. On the smaller cluster it is using around 600MB. I ran snapteld with pprof on the smaller cluster. The snapteld process was allocated ~300MB and the graphite publisher was allocated ~300MB.

Here are the pprof profiles:
In use memory for snapteld:
Allocated memory for snapteld:
Allocated objects for snapteld: snapteld-alloc_objects.pdf
CPU profile for snapteld, 2 hours: cpu.pprof.tar.gz
In use memory for the graphite publisher: inuse.pdf
Allocated memory for the graphite publisher: alloc.pdf
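For reference, profiles like the ones attached above can be collected from any Go process that exposes the standard net/http/pprof endpoints. The sketch below is generic (snapteld's own way of enabling profiling may differ) and only shows where this kind of data comes from.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Heap profile:  go tool pprof http://localhost:6060/debug/pprof/heap
	// CPU profile:   go tool pprof http://localhost:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```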
Result of running: ps aux --sort -rss | head -n 2
Expected results
I'm not sure, but the RSS memory usage is higher than I expected.