
Improved memory usage measurements #12422

Closed
erictune opened this issue Aug 7, 2015 · 21 comments
Labels
area/cadvisor area/isolation area/kubelet lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@erictune
Member

erictune commented Aug 7, 2015

A container's memory limit is implemented by setting /sys/fs/cgroup/memory/$CGROUP_PATH/memory.limit_in_bytes.
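
For concreteness, a minimal sketch of that mechanism, assuming cgroup v1 and a made-up container cgroup path (this is illustrative, not the kubelet's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	// Hypothetical container cgroup path and a 256 MiB hard limit; the kubelet
	// (via its runtime) effectively performs this write to enforce a memory limit.
	limitFile := filepath.Join("/sys/fs/cgroup/memory", "docker/abc123", "memory.limit_in_bytes")
	limit := int64(256 * 1024 * 1024)
	if err := os.WriteFile(limitFile, []byte(strconv.FormatInt(limit, 10)), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```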

In order for a vertical pod autoscaler (#10782) or a human to set a memory limit accurately, an accurate measurement of memory usage is needed.

"Needed" is not the same as "usage". We want to use "Needed" so that, for example, we don't count pages that were read at startup but aren't needed anymore, but haven't been freed because there happened to be no memory pressure on the machine.

Currently, we use cadvisor to tell us this "needed" number, which cadvisor calls "MemoryStats.WorkingSet".

Cadvisor gets this from memory.stat, and in particular it subtracts total_inactive_anon and total_inactive_file from the total usage, here.
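
To make that subtraction concrete, here is a rough sketch of the same computation done by reading the cgroup v1 files directly (my own simplification, not cAdvisor's actual code; the cgroup path in main is made up):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// workingSet approximates the "needed" number described above for a cgroup-v1
// memory cgroup: total usage minus the pages the kernel has marked inactive.
func workingSet(cgroupDir string) (uint64, error) {
	usageRaw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.usage_in_bytes"))
	if err != nil {
		return 0, err
	}
	usage, err := strconv.ParseUint(strings.TrimSpace(string(usageRaw)), 10, 64)
	if err != nil {
		return 0, err
	}

	statFile, err := os.Open(filepath.Join(cgroupDir, "memory.stat"))
	if err != nil {
		return 0, err
	}
	defer statFile.Close()

	var inactive uint64
	scanner := bufio.NewScanner(statFile)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		if fields[0] == "total_inactive_anon" || fields[0] == "total_inactive_file" {
			v, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			inactive += v
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}

	if inactive > usage {
		return 0, nil // clamp; the two files are not read atomically
	}
	return usage - inactive, nil
}

func main() {
	ws, err := workingSet("/sys/fs/cgroup/memory/docker/abc123") // illustrative path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("working set: %d bytes\n", ws)
}
```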

This is good, but it has some problems.

@erictune
Member Author

erictune commented Aug 7, 2015

One problem is that the kernel only moves pages between inactive and active state when there is memory pressure.

So, you may see bumpy time series of memory usage, with bumps when there is a memory pressure event.

Also, if container A on machine 1 and container B on machine 2 come from the same replication controller and actually have very similar memory usage patterns, you could still get very different usage measurements, simply because machine 1 and machine 2 happen to be under different levels of memory pressure.

For vertical pod autosizing, you'd like to aggregate data from multiple pods that you think are otherwise identical. But that doesn't work so well when there is this random noise in the data.

Another problem is that, all else being equal, the kernel arbitrarily tries to keep pages classified as inactive/active in a 50/50 ratio, which might not be the right ratio at all.

Thanks Greg Thelen for explaining this to me.

@erictune
Member Author

erictune commented Aug 7, 2015

So, what you want is something that applies gentle, steady, consistent pressure to memory, so that you can discover the actual memory needs of a container.

About 4 years ago, Linux kernel patches were proposed which did this. They are called idle page tracking / working set estimation. I think they work great. Upstream did not take the patch.

Now, a different patch set for idle memory tracking is under discussion.

If this is merged, it will be a good thing for us.
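
For reference, a rough sketch of how a userspace estimator could use such an interface, assuming it lands roughly in the form being discussed: a per-page-frame bitmap at /sys/kernel/mm/page_idle/bitmap, where each 8-byte word covers 64 PFNs, writing set bits marks frames idle, and a later read shows which frames were not touched in between. Everything here is an assumption about the eventual interface, not merged kernel behavior:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"os"
)

const bitmapPath = "/sys/kernel/mm/page_idle/bitmap" // assumed path; may differ when merged

// markIdle sets the idle bit for the 64 page frames starting at pfn
// (pfn is assumed to be a multiple of 64).
func markIdle(f *os.File, pfn uint64) error {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, ^uint64(0))
	_, err := f.WriteAt(buf, int64(pfn/64*8))
	return err
}

// countStillIdle returns how many of those 64 frames are still idle,
// i.e. were not touched since markIdle was called.
func countStillIdle(f *os.File, pfn uint64) (int, error) {
	buf := make([]byte, 8)
	if _, err := f.ReadAt(buf, int64(pfn/64*8)); err != nil {
		return 0, err
	}
	return bits.OnesCount64(binary.LittleEndian.Uint64(buf)), nil
}

func main() {
	f, err := os.OpenFile(bitmapPath, os.O_RDWR, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, err) // kernels without the patch won't have this file
		return
	}
	defer f.Close()

	// A real estimator would walk a container's mappings via /proc/<pid>/pagemap
	// to find its PFNs; here we just probe one arbitrary 64-frame block.
	const pfn = 1 << 20
	if err := markIdle(f, pfn); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// ... wait for an observation interval, then:
	idle, err := countStillIdle(f, pfn)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%d of 64 frames still idle\n", idle)
}
```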

@erictune
Member Author

erictune commented Aug 7, 2015

FYI interested people:
@andreslagarcavilla
@rjnagal
@vmarmol
@dchen1107
@thockin

@erictune erictune added help-wanted sig/node Categorizes an issue or PR as relevant to SIG Node. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Aug 7, 2015
@erictune
Member Author

erictune commented Aug 7, 2015

Not sure there is anything to do with this issue until something happens with the kernel patch, and kernels start to pick it up.

@erictune erictune added kind/enhancement area/kubelet area/cadvisor priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Aug 7, 2015
@erictune
Member Author

erictune commented Aug 7, 2015

@bgrant0607

@erictune
Member Author

erictune commented Aug 7, 2015

Also, if we decide to do overcommitment separate from vertical autosizing, this will help the scheduler make better decisions.

@bgrant0607
Member

Yep. We've discussed creating memory pressure by continuously changing the limit.

In lieu of userspace OOM delegation, we've discussed using "canary" containers with higher OOM scores.
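
A minimal sketch of that "canary" idea, assuming it is implemented by raising oom_score_adj on the canary's processes so the kernel OOM killer prefers them; the pid and the helper are hypothetical, not an existing kubelet feature:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// makeOOMCanary raises a process's OOM score adjustment (range -1000..1000) so
// that, under global memory pressure, the kernel kills it before "real"
// workloads. The canary dying is then the userspace signal that the node is
// running out of memory.
func makeOOMCanary(pid int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(1000)), 0o644)
}

func main() {
	// Hypothetical pid of a canary container's main process.
	if err := makeOOMCanary(12345); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```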

cc @vishh

@bgrant0607
Member

cc @jszczepkowski @fgrzadkowski

@bgrant0607
Member

cc @ncdc @eparis @smarterclayton

@timothysc
Member

@erictune do you have data showing the results of the change set(s) across a diversified group of workloads?

E.g., what's the net gain in memory utilization for what appears to be a modification aimed at better packing?

@dchen1107
Member

@erictune Thanks for reporting the issue. As I mentioned at the Resource Management Summit and several other meetings, to have QoS support ready there is a ton of work required at the node level, either in the kernel or in userland. This is one of those items.

Well before we had that idle page tracking / working set estimation kernel patch internally, our internal kubelet-ish agent had a userland workaround for this issue that applied some "fake" memory pressure:

  1. Periodically change memory.soft_limit_in_bytes or memory.limit_in_bytes to trigger the memcg reclaimer, based on the QoS class of a given memcg (see the sketch below). Cons: extra CPU usage from kswapd.
  2. Periodically write to memory.try_to_free_pages on the root memcg to trigger the global reclaimer, with minimal CPU cost. Unfortunately, memory.try_to_free_pages is another interface we carried internally. There is a roughly equivalent interface in the upstream kernel, with some performance penalty, but I forget which one.
  3. ...
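
A rough sketch of approach 1, assuming the agent periodically tightens memory.soft_limit_in_bytes toward observed usage to nudge the memcg reclaimer; the interval, margin, and cgroup path are made-up knobs, not anything the kubelet actually does:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// applyGentlePressure periodically sets the memcg soft limit slightly below
// current usage, so the kernel reclaims cold pages from this cgroup first when
// memory gets tight, without imposing a hard cap.
func applyGentlePressure(cgroupDir string, margin float64, interval time.Duration) {
	for range time.Tick(interval) {
		raw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.usage_in_bytes"))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		usage, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		soft := uint64(float64(usage) * margin) // e.g. margin = 0.9 -> 10% below usage
		softFile := filepath.Join(cgroupDir, "memory.soft_limit_in_bytes")
		if err := os.WriteFile(softFile, []byte(strconv.FormatUint(soft, 10)), 0o644); err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
	}
}

func main() {
	// Hypothetical burstable-QoS container cgroup, squeezed every 5 minutes.
	applyGentlePressure("/sys/fs/cgroup/memory/docker/abc123", 0.9, 5*time.Minute)
}
```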

@erictune
Member Author

@timothysc Define memory utilization.

@erictune
Member Author

@dchen1107 upstream has drop_caches.
I don't see any per-cgroup interface for trying to free pages.

@dchen1107
Member

@erictune You are right, drop_caches is the "equivalent" interface in the upstream kernel, but it carries a large performance penalty. We don't use it internally because:

  1. It does not trigger page reclaim efficiently.
  2. It simply drops caches without any awareness of QoS class, and thus degrades the performance of every workload on the node indiscriminately.
  3. There is no way to specify how many pages need to be reclaimed / freed.
    ...

I don't think we pushed the memory.try_to_free_pages patch to the upstream kernel because:

  1. We later dropped all of these workarounds thanks to the kstaled patch you mentioned above; along with several other patches, it simplified our userland memory resource management significantly.
  2. We don't think the upstream kernel would accept the patch, since no one else runs such a tight userspace control loop the way we do internally.
  3. memory.try_to_free_pages is an easy patch to maintain internally compared to the others.

Without that kernel support, we might have to do something smarter in the kubelet to trigger the per-memcg reclaimer without too much CPU cost, until the proper kernel patch lands.

Anyway, the kubelet / node has to do something to detect and remedy out-of-resource conditions caused by overcommit, and especially to prevent a real system OOM situation. Please note that a system OOM degrades performance for the entire node. I mentioned this to @AnanyaKumar and @vishh, and at the resource management summit too. I think this is already captured in the QoS proposal.

On the other hand, I don't want the node team to jump into the issue mentioned here, given the current goals and priorities around resource management and autopilot.

@davidopp
Member

@mikedanese was asking me some details about this today that I didn't know, so he might be interested in this issue.

@rjnagal
Contributor

rjnagal commented Aug 10, 2015

I think what we have now should be sufficient for auto-scaling if we implement update and allow task limits to be updated. The current mechanism gives us a good starting point, and as the load on the machine increases, we can re-calibrate based on newer working sets. Instead of trying to induce pressure on the nodes locally and hurting performance, we can simply make the auto-scalers more aggressive and get more realistic values naturally. We shouldn't be pruning cold pages if the cluster as a whole doesn't have more work to do.

fwiw, we could abuse 'force_empty' to drop unused memory within a cgroup, but I think it's being deprecated. In any case, worrying about true usage feels like over-optimizing at this stage.
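
For reference, a minimal sketch of what that would look like on cgroup v1: writing any value to memory.force_empty asks the kernel to reclaim as many pages as it can from that memcg right now (the cgroup path is illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical container cgroup; writing to memory.force_empty triggers an
	// immediate best-effort reclaim of the memcg's pages.
	f := filepath.Join("/sys/fs/cgroup/memory/docker/abc123", "memory.force_empty")
	if err := os.WriteFile(f, []byte("0"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```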

@dchen1107
Member

@rjnagal I agree with you on priority and roadmap. There are many must-have features at the node level to allow overcommitting a node while still providing QoS reliably: for example, using --parent-cgroup for pods, introducing a best-effort cgroup, preventing a real system OOM on the node, preventing a batch job from being starved of resources forever, etc. I listed the options above as userspace workarounds for when they are required.

On the other hand, using force_empty on a memcg that is not about to be removed might cause more memory accounting issues. We only use it to speed up the rmdir operation on a cgroup and to reparent those pages to the parent memcg. This is one of the reasons I want, at the API level, to account for the left-over resource usage when a pod's processes are killed, so that we don't reparent everything to the root cgroup. cc/ @bgrant0607 But this is also not high priority.

@mwielgus
Contributor

This doesn't look like a sig/autoscaling problem.

@mwielgus mwielgus removed the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Jun 21, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 29, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 28, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
