Improved memory usage measurements #12422
A container's memory limit is implemented by setting /sys/fs/cgroup/memory/$CGROUP_PATH/memory.limit_in_bytes. In order for a vertical pod autoscaler (#10782) or a human to set a memory limit accurately, an accurate measurement of memory usage is needed.

"Needed" is not the same as "usage". We want to measure what is needed so that, for example, we don't count pages that were read at startup, aren't needed anymore, but haven't been freed because there happened to be no memory pressure on the machine.

Currently we use cAdvisor to tell us this "needed" number, which cAdvisor calls MemoryStats.WorkingSet. cAdvisor gets this from memory.stat; in particular, it subtracts total_inactive_anon and total_inactive_file from the total usage.

This is good, but it has some problems.
One problem is that the kernel only moves pages between the inactive and active lists when there is memory pressure. So you may see a bumpy time series of memory usage, with bumps whenever there is a memory-pressure event. Also, if you have container A on machine 1 and container B on machine 2, both from the same replication controller and with very similar memory usage patterns, you can still get very different measurements of usage because, by some twist of fate, machine 1 and machine 2 have different levels of memory pressure. For vertical pod autosizing, you'd like to aggregate data from multiple pods that you believe are otherwise identical, but that doesn't work well when there is this random noise in the data.

Another problem is that, all else being equal, the kernel arbitrarily tries to classify pages as inactive/active in a 50/50 ratio, which might not be the right ratio at all. Thanks to Greg Thelen for explaining this to me.
So, what you want is something that applies gentle, steady, consistent pressure to memory, so that you can discover the actual memory needs of a container. About 4 years ago, Linux kernel patches were proposed that did this; they are called idle page tracking / working set estimation, and I think they work great. Upstream did not take the patch. Now a different patch set for idle memory tracking is under discussion. If it is merged, it will be a good thing for us.
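For context, this kind of interface exposes per-page-frame idle bits through a bitmap file. The sketch below is illustrative only: it assumes a kernel that provides /sys/kernel/mm/page_idle/bitmap (the form the feature later took upstream), runs with root privileges, and skips the /proc/&lt;pid&gt;/pagemap step that would be needed to find which page frames actually back a given container's memory.

```go
// idlescan.go: rough sketch of polling the page_idle bitmap. Each 64-bit
// word covers 64 page frames; writing 1 bits marks those frames idle, and a
// bit that still reads as 1 later means the page was not touched in between.
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"os"
)

const bitmapPath = "/sys/kernel/mm/page_idle/bitmap"

// markIdle sets the idle bit for nWords*64 page frames starting at startPFN
// (startPFN must be a multiple of 64).
func markIdle(startPFN, nWords int64) error {
	f, err := os.OpenFile(bitmapPath, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	buf := make([]byte, 8*nWords)
	for i := range buf {
		buf[i] = 0xff
	}
	_, err = f.WriteAt(buf, startPFN/64*8)
	return err
}

// countIdle returns how many of those frames still have the idle bit set.
func countIdle(startPFN, nWords int64) (int, error) {
	f, err := os.Open(bitmapPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	buf := make([]byte, 8*nWords)
	if _, err := f.ReadAt(buf, startPFN/64*8); err != nil {
		return 0, err
	}
	idle := 0
	for i := int64(0); i < nWords; i++ {
		// Words are stored in native byte order; little-endian on x86.
		idle += bits.OnesCount64(binary.LittleEndian.Uint64(buf[i*8:]))
	}
	return idle, nil
}

func main() {
	// Scan an arbitrary window of page frames; a real scanner would walk
	// the PFNs actually backing a container's pages.
	const startPFN, words = 1 << 20, 1024
	if err := markIdle(startPFN, words); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// ... wait for a sampling interval, then:
	n, err := countIdle(startPFN, words)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d of %d page frames stayed idle\n", n, words*64)
}
```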
FYI interested people:
Not sure there is anything to do on this issue until something happens with the kernel patch and kernels start to pick it up.
Also, if we decide to do overcommitment separately from vertical autosizing, this will help the scheduler make better decisions.
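For reference, here is a minimal sketch of the computation described at the top of this issue: read a cgroup's total usage and subtract the inactive counters reported in memory.stat. It assumes cgroup v1 file names, and the cgroup path is a placeholder.

```go
// workingset.go: sketch of "working set" = usage minus inactive pages.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readUint reads a single integer from a cgroup control file.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

// readMemoryStat parses the key/value pairs in memory.stat.
func readMemoryStat(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats, sc.Err()
}

func main() {
	cgroup := "/sys/fs/cgroup/memory/mypod" // hypothetical $CGROUP_PATH
	usage, err := readUint(filepath.Join(cgroup, "memory.usage_in_bytes"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	stats, err := readMemoryStat(filepath.Join(cgroup, "memory.stat"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Pages on the inactive LRU lists are treated as outside the working set.
	workingSet := usage
	for _, k := range []string{"total_inactive_anon", "total_inactive_file"} {
		if workingSet >= stats[k] {
			workingSet -= stats[k]
		}
	}
	fmt.Printf("usage=%d bytes, workingSet=%d bytes\n", usage, workingSet)
}
```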
Yep. We've discussed creating memory pressure by continuously changing the limit. In lieu of userspace OOM delegation, we've discussed using "canary" containers with higher OOM scores. cc @vishh
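To make the "continuously changing the limit" idea concrete, here is a rough sketch, not anything Kubernetes actually implements: briefly lower a pod cgroup's memory.limit_in_bytes a little below current usage so the kernel reclaims cold pages, then restore the real limit. The cgroup path, step size, and interval are invented, and a real implementation would have to handle failed writes and OOM risk far more carefully.

```go
// squeeze.go: illustrative limit-squeezing loop for a cgroup v1 memcg.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

const cgroup = "/sys/fs/cgroup/memory/mypod" // hypothetical pod cgroup

func readUint(name string) (uint64, error) {
	b, err := os.ReadFile(filepath.Join(cgroup, name))
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func writeUint(name string, v uint64) error {
	return os.WriteFile(filepath.Join(cgroup, name),
		[]byte(strconv.FormatUint(v, 10)), 0644)
}

func main() {
	realLimit, err := readUint("memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	const step = 16 << 20 // squeeze 16 MiB per pass (arbitrary)
	for {
		usage, err := readUint("memory.usage_in_bytes")
		if err == nil && usage > step {
			// Setting the limit below current usage asks the kernel to
			// reclaim the excess; the write fails if it cannot. Either
			// way, restore the real limit immediately afterwards.
			_ = writeUint("memory.limit_in_bytes", usage-step)
			_ = writeUint("memory.limit_in_bytes", realLimit)
		}
		time.Sleep(30 * time.Second)
	}
}
```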
@erictune Do you have data which shows the results of the change set(s) across a diversified group of workloads? E.g., what's the net gain in memory utilization for what appears to be a modification for better packing?
@erictune Thanks for reporting the issue. As I mentioned at the Resource Management Summit and several other meetings, to have QoS support ready there is a ton of work required at the node level, either in the kernel or in userland. This is one of them. Long before we had that idle page tracking / working set estimation kernel patch internally, the internal kubelet-ish agent had a userland solution to work around this issue by applying some "fake" memory pressure:
@timothysc Define memory utilization.
@dchen1107 upstream has drop_caches.
@erictune You are right, drop_caches is the "equivalent" interface in the upstream kernel, but it comes with a large performance penalty. We don't use it internally because:
I don't think we pushed the memory.try_to_free_pages patch to the upstream kernel because
Without this kernel patch, we might have to do something smarter in the kubelet to trigger the per-memcg reclaimer without too much CPU cost, until the proper kernel patch lands. In any case, the kubelet / node has to do something to detect and remedy out-of-resource conditions caused by overcommitment, and especially to prevent a real system OOM, since a system OOM degrades performance for the entire node. I mentioned this to @AnanyaKumar and @vishh, and at the resource management summit too; I think it is captured in the QoS proposal already. On the other hand, I don't want the node team to jump into the issue mentioned here given the current goals and priorities on resource management and autopilot.
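As a sketch of the kind of node-level detection described above (not the kubelet's actual eviction logic), an agent could poll /proc/meminfo and react before the system OOM killer has to. The threshold and the "remedy" below are placeholders.

```go
// oomwatch.go: flag the node as under memory pressure before a system OOM.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// memAvailableKB parses the MemAvailable line from /proc/meminfo.
func memAvailableKB() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("MemAvailable not found")
}

func main() {
	const thresholdKB = 256 * 1024 // 256 MiB, made-up pressure threshold
	for {
		avail, err := memAvailableKB()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
		} else if avail < thresholdKB {
			// A real agent would evict or throttle best-effort pods here.
			fmt.Printf("memory pressure: only %d kB available\n", avail)
		}
		time.Sleep(10 * time.Second)
	}
}
```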
@mikedanese was asking me some details about this today that I didn't know, so he might be interested in this issue.
I think what we have now should be sufficient for auto-scaling if we implement update and allow task limits to be updated. The current mechanism gives us a good starting point, and as the load on the machine increases we can re-calibrate based on newer working sets. Instead of trying to induce pressure on the nodes locally and affecting performance, we can simply make the auto-scalers more aggressive and get more realistic values naturally. We shouldn't be pruning cold pages if the cluster as a whole doesn't have more work to do. FWIW, we could abuse force_empty to drop unused memory within a cgroup, but I think it's being deprecated. In any case, it feels like worrying about true usage is over-optimizing at this stage.
@rjnagal I agree with you on priority and roadmap. There are many must-have features at the node level to make sure we can overcommit a node and still provide QoS in a reliable way: for example, using --parent-cgroup for pods, introducing a best-effort cgroup, preventing the node from hitting a real system OOM, preventing a batch job from being starved of resources forever, etc. I listed those options above as userspace workarounds when required. On the other hand, using force_empty on a memcg group that is not about to be removed might cause more memory-accounting issues; we only use it to speed up the rmdir operation on a cgroup and to reparent its pages to the parent memcg group. This is one of the reasons I want, at the API level, to account for the left-over resource usage when a pod's processes are killed, so that we don't reparent it all to the root cgroup. cc @bgrant0607. But this is also not high priority.
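For completeness, the force_empty knob discussed above is just a write to a cgroup v1 memcg control file; as both comments note, it is really meant for emptying a cgroup right before removing it, not for routine measurement. The cgroup path below is a placeholder.

```go
// forceempty.go: ask the kernel to reclaim as many pages as possible from a
// memcg by writing to memory.force_empty (cgroup v1, heavy-handed).
package main

import (
	"fmt"
	"os"
)

func main() {
	cgroup := "/sys/fs/cgroup/memory/mypod" // hypothetical pod cgroup
	err := os.WriteFile(cgroup+"/memory.force_empty", []byte("0"), 0644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```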
This doesn't look like a sig/autoscaling problem.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.