
Improved memory usage measurements #12422

Closed
erictune opened this issue Aug 7, 2015 · 21 comments
Labels
area/cadvisor area/isolation area/kubelet lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@erictune
Member

erictune commented Aug 7, 2015

A container's memory limit is implemented by setting /sys/fs/cgroup/memory/$CGROUP_PATH/memory.limit_in_bytes.
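
For concreteness, a minimal sketch of that mechanism, assuming cgroup v1 and a made-up container cgroup path (this is illustrative, not the kubelet's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	// Hypothetical container cgroup path and a 256 MiB hard limit; the kubelet
	// (via its runtime) effectively performs this write to enforce a memory limit.
	limitFile := filepath.Join("/sys/fs/cgroup/memory", "docker/abc123", "memory.limit_in_bytes")
	limit := int64(256 * 1024 * 1024)
	if err := os.WriteFile(limitFile, []byte(strconv.FormatInt(limit, 10)), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```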

In order for a vertical pod autoscaler (#10782) or a human to set a memory limit accurately, an accurate measurement of memory usage is needed.

"Needed" is not the same as "usage". We want to use "Needed" so that, for example, we don't count pages that were read at startup but aren't needed anymore, but haven't been freed because there happened to be no memory pressure on the machine.

Currently, we use cadvisor to tell us this "needed" number, which cadvisor calls "MemoryStats.WorkingSet".

Cadvisor gets this from memory.stat, and in particular it subtracts total_inactive_anon and total_inactive_file from the total usage, here.
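
To make that subtraction concrete, here is a rough sketch of the same computation done by reading the cgroup v1 files directly (my own simplification, not cAdvisor's actual code; the cgroup path in main is made up):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// workingSet approximates the "needed" number described above for a cgroup-v1
// memory cgroup: total usage minus the pages the kernel has marked inactive.
func workingSet(cgroupDir string) (uint64, error) {
	usageRaw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.usage_in_bytes"))
	if err != nil {
		return 0, err
	}
	usage, err := strconv.ParseUint(strings.TrimSpace(string(usageRaw)), 10, 64)
	if err != nil {
		return 0, err
	}

	statFile, err := os.Open(filepath.Join(cgroupDir, "memory.stat"))
	if err != nil {
		return 0, err
	}
	defer statFile.Close()

	var inactive uint64
	scanner := bufio.NewScanner(statFile)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		if fields[0] == "total_inactive_anon" || fields[0] == "total_inactive_file" {
			v, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			inactive += v
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}

	if inactive > usage {
		return 0, nil // clamp; the two files are not read atomically
	}
	return usage - inactive, nil
}

func main() {
	ws, err := workingSet("/sys/fs/cgroup/memory/docker/abc123") // illustrative path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("working set: %d bytes\n", ws)
}
```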

This is good, but it has some problems.

@erictune
Member Author

erictune commented Aug 7, 2015

One problem is that the kernel only moves pages between inactive and active state when there is memory pressure.

So, you may see bumpy time series of memory usage, with bumps when there is a memory pressure event.

Also, if container A on machine 1 and container B on machine 2 come from the same replication controller and actually have very similar memory usage patterns, you could still get very different usage measurements, simply because machine 1 and machine 2 happen to be under different levels of memory pressure.

For vertical pod autosizing, you'd like to aggregate data from multiple pods that you think are otherwise identical. But that doesn't work so well when there is this random noise in the data.

Another problem is that, all else being equal, the kernel arbitrarily tries to keep pages classified as inactive/active in a 50/50 ratio, which might not be the right ratio at all.

Thanks Greg Thelen for explaining this to me.

@erictune
Member Author

erictune commented Aug 7, 2015

So, what you want is something that applies gentle, steady, consistent pressure to memory, so that you can discover the actual memory needs of a container.

About 4 years ago, Linux kernel patches were proposed which did this. They are called idle page tracking / working set estimation. I think they work great. Upstream did not take the patch.

Now, a different patch set for idle memory tracking is under discussion.

If this is merged, it will be a good thing for us.
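
For reference, a rough sketch of how a userspace estimator could use such an interface, assuming it lands roughly in the form being discussed: a per-page-frame bitmap at /sys/kernel/mm/page_idle/bitmap, where each 8-byte word covers 64 PFNs, writing set bits marks frames idle, and a later read shows which frames were not touched in between. Everything here is an assumption about the eventual interface, not merged kernel behavior:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
	"os"
)

const bitmapPath = "/sys/kernel/mm/page_idle/bitmap" // assumed path; may differ when merged

// markIdle sets the idle bit for the 64 page frames starting at pfn
// (pfn is assumed to be a multiple of 64).
func markIdle(f *os.File, pfn uint64) error {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, ^uint64(0))
	_, err := f.WriteAt(buf, int64(pfn/64*8))
	return err
}

// countStillIdle returns how many of those 64 frames are still idle,
// i.e. were not touched since markIdle was called.
func countStillIdle(f *os.File, pfn uint64) (int, error) {
	buf := make([]byte, 8)
	if _, err := f.ReadAt(buf, int64(pfn/64*8)); err != nil {
		return 0, err
	}
	return bits.OnesCount64(binary.LittleEndian.Uint64(buf)), nil
}

func main() {
	f, err := os.OpenFile(bitmapPath, os.O_RDWR, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, err) // kernels without the patch won't have this file
		return
	}
	defer f.Close()

	// A real estimator would walk a container's mappings via /proc/<pid>/pagemap
	// to find its PFNs; here we just probe one arbitrary 64-frame block.
	const pfn = 1 << 20
	if err := markIdle(f, pfn); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// ... wait for an observation interval, then:
	idle, err := countStillIdle(f, pfn)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("%d of 64 frames still idle\n", idle)
}
```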

@erictune
Member Author

erictune commented Aug 7, 2015

FYI interested people:
@andreslagarcavilla
@rjnagal
@vmarmol
@dchen1107
@thockin

@erictune erictune added help-wanted sig/node Categorizes an issue or PR as relevant to SIG Node. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. labels Aug 7, 2015
@erictune
Member Author

erictune commented Aug 7, 2015

Not sure there is anything to do with this issue until something happens with the kernel patch, and kernels start to pick it up.

@erictune erictune added kind/enhancement area/kubelet area/cadvisor priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Aug 7, 2015
@erictune
Member Author

erictune commented Aug 7, 2015

@bgrant0607

@erictune
Member Author

erictune commented Aug 7, 2015

Also, if we decide to do overcommitment separate from vertical autosizing, this will help the scheduler make better decisions.

@bgrant0607
Member

Yep. We've discussed creating memory pressure by continuously changing the limit.

In lieu of userspace OOM delegation, we've discussed using "canary" containers with higher OOM scores.
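
A minimal sketch of that "canary" idea, assuming it is implemented by raising oom_score_adj on the canary's processes so the kernel OOM killer prefers them; the pid and the helper are hypothetical, not an existing kubelet feature:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// makeOOMCanary raises a process's OOM score adjustment (range -1000..1000) so
// that, under global memory pressure, the kernel kills it before "real"
// workloads. The canary dying is then the userspace signal that the node is
// running out of memory.
func makeOOMCanary(pid int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(strconv.Itoa(1000)), 0o644)
}

func main() {
	// Hypothetical pid of a canary container's main process.
	if err := makeOOMCanary(12345); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```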

cc @vishh

@bgrant0607
Member

cc @jszczepkowski @fgrzadkowski

@bgrant0607
Member

cc @ncdc @eparis @smarterclayton

@timothysc
Member

@erictune do you have data showing the results of the change set(s) across a diversified group of workloads?

E.g., what's the net gain in memory utilization for what appears to be a modification aimed at better packing?

@dchen1107
Member

@erictune Thanks for reporting the issue. As I mentioned at the Resource Management Summit and several other meetings, to have QoS support ready there is a ton of work required at the node level, either in the kernel or in userland. This is one of those items.

Well before we had that idle page tracking / working set estimation kernel patch internally, our internal kubelet-ish agent had a userland workaround for this issue that applied some "fake" memory pressure:

  1. Periodically change memory.soft_limit_in_bytes or memory.limit_in_bytes to trigger the memcg reclaimer, based on the QoS class of a given memcg (see the sketch below). Cons: extra CPU usage from kswapd.
  2. Periodically write to memory.try_to_free_pages on the root memcg to trigger the global reclaimer, with minimal CPU cost. Unfortunately, memory.try_to_free_pages is another interface we carried internally. There is a roughly equivalent interface in the upstream kernel, with some performance penalty, but I forget which one.
  3. ...
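
A rough sketch of approach 1, assuming the agent periodically tightens memory.soft_limit_in_bytes toward observed usage to nudge the memcg reclaimer; the interval, margin, and cgroup path are made-up knobs, not anything the kubelet actually does:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// applyGentlePressure periodically sets the memcg soft limit slightly below
// current usage, so the kernel reclaims cold pages from this cgroup first when
// memory gets tight, without imposing a hard cap.
func applyGentlePressure(cgroupDir string, margin float64, interval time.Duration) {
	for range time.Tick(interval) {
		raw, err := os.ReadFile(filepath.Join(cgroupDir, "memory.usage_in_bytes"))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		usage, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		soft := uint64(float64(usage) * margin) // e.g. margin = 0.9 -> 10% below usage
		softFile := filepath.Join(cgroupDir, "memory.soft_limit_in_bytes")
		if err := os.WriteFile(softFile, []byte(strconv.FormatUint(soft, 10)), 0o644); err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
	}
}

func main() {
	// Hypothetical burstable-QoS container cgroup, squeezed every 5 minutes.
	applyGentlePressure("/sys/fs/cgroup/memory/docker/abc123", 0.9, 5*time.Minute)
}
```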

@erictune
Member Author

@timothysc Define memory utilization.

@erictune
Member Author

@dchen1107 upstream has drop_caches.
I don't see any per-cgroup interface for trying to free pages.

@dchen1107
Member

@erictune You are right, drop_caches is the "equivalent" interface in the upstream kernel, but it carries a large performance penalty. We don't use it internally because:

  1. It does not trigger page reclaim efficiently.
  2. It simply drops caches without any awareness of QoS class, and thus degrades the performance of every workload on the node indiscriminately.
  3. There is no way to specify how many pages need to be reclaimed / freed.
    ...

I don't think we pushed the memory.try_to_free_pages patch to the upstream kernel because:

  1. We later dropped all of these workarounds thanks to the kstaled patch you mentioned above; along with several other patches, it simplified our userland memory resource management significantly.
  2. We don't think the upstream kernel would accept the patch, since no one else runs such a tight userspace control loop the way we do internally.
  3. memory.try_to_free_pages is an easy patch to maintain internally compared to the others.

Without that kernel support, we might have to do something smarter in the kubelet to trigger the per-memcg reclaimer without too much CPU cost, until the proper kernel patch lands.

Anyway, the kubelet / node has to do something to detect and remedy out-of-resource conditions caused by overcommit, and especially to prevent a real system OOM situation. Please note that a system OOM degrades performance for the entire node. I mentioned this to @AnanyaKumar and @vishh, and at the resource management summit too. I think this is already captured in the QoS proposal.

On the other hand, I don't want the node team to jump into the issue mentioned here, given the current goals and priorities around resource management and autopilot.

@davidopp
Member

@mikedanese was asking me some details about this today that I didn't know, so he might be interested in this issue.

@rjnagal
Contributor

rjnagal commented Aug 10, 2015

I think what we have now should be sufficient for auto-scaling if we implement update and allow task limits to be updated. The current mechanism gives us a good starting point, and as the load on the machine increases, we can re-calibrate based on newer working sets. Instead of trying to induce pressure on the nodes locally and hurting performance, we can simply make the auto-scalers more aggressive and get more realistic values naturally. We shouldn't be pruning cold pages if the cluster as a whole doesn't have more work to do.

fwiw, we could abuse 'force_empty' to drop unused memory within a cgroup, but I think it's being deprecated. In any case, worrying about true usage feels like over-optimizing at this stage.
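
For reference, a minimal sketch of what that would look like on cgroup v1: writing any value to memory.force_empty asks the kernel to reclaim as many pages as it can from that memcg right now (the cgroup path is illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical container cgroup; writing to memory.force_empty triggers an
	// immediate best-effort reclaim of the memcg's pages.
	f := filepath.Join("/sys/fs/cgroup/memory/docker/abc123", "memory.force_empty")
	if err := os.WriteFile(f, []byte("0"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```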

@dchen1107
Member

@rjnagal I agree with you on priority and roadmap. There are many must-have features at the node level to allow overcommitting a node while still providing QoS reliably: for example, using --parent-cgroup for pods, introducing a best-effort cgroup, preventing a real system OOM on the node, preventing a batch job from being starved of resources forever, etc. I listed the options above as userspace workarounds for when they are required.

On the other hand, using force_empty on a memcg that is not about to be removed might cause more memory accounting issues. We only use it to speed up the rmdir operation on a cgroup and to reparent those pages to the parent memcg. This is one of the reasons I want, at the API level, to account for the left-over resource usage when a pod's processes are killed, so that we don't reparent everything to the root cgroup. cc/ @bgrant0607 But this is also not high priority.

@mwielgus
Contributor

This doesn't look like a sig/autoscaling problem.

@mwielgus mwielgus removed the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Jun 21, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 29, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 28, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
