
kubelet eviction on inode exhaustion #30311

Merged: 3 commits merged into kubernetes:master on Aug 18, 2016

Conversation

@derekwaynecarr (Member) commented Aug 9, 2016

Add support for kubelet to monitor for inode exhaustion of either image or rootfs, and in response, attempt to reclaim node level resources and/or evict pods.
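For orientation, a minimal Go sketch of the idea (the names and numbers here are hypothetical, not this PR's actual code): observe the free inode count for a filesystem and compare it against a configured minimum to decide whether reclaim or eviction should kick in.

// Sketch only; hypothetical names and numbers, not the PR's real API.
package main

import "fmt"

// fsStats holds inode figures for a filesystem (counts, not bytes).
type fsStats struct {
	inodesFree int64
	inodes     int64
}

// inodesThresholdMet reports whether free inodes dropped below a configured minimum.
func inodesThresholdMet(fs fsStats, minFreeInodes int64) bool {
	return fs.inodesFree < minFreeInodes
}

func main() {
	rootfs := fsStats{inodesFree: 2592837, inodes: 3276800}  // values from the df -i example later in this thread
	imagefs := fsStats{inodesFree: 900000, inodes: 1310720}  // illustrative numbers
	fmt.Println(inodesThresholdMet(rootfs, 100000))          // false: rootfs has plenty of inodes free
	fmt.Println(inodesThresholdMet(imagefs, 1000000))        // true: imagefs would trigger reclaim/eviction
}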



@derekwaynecarr derekwaynecarr added this to the v1.4 milestone Aug 9, 2016
@derekwaynecarr derekwaynecarr added the release-note-none Denotes a PR that doesn't merit a release note. label Aug 9, 2016
@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 9, 2016
@derekwaynecarr (Member Author)

/cc @vishh @ronnielai - per our chat today.

A few things I am trying to reason through:

  1. We have no way of knowing the number of inodes reclaimed in response to image GC.
  2. I am not sure whether I am double counting inode consumption; more eyes appreciated.

The issue with (1) is that we may have already reclaimed enough inodes that we do not need to kill a pod (see the sketch below).
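A small sketch of that concern, with hypothetical names: image GC reports bytes freed but not inodes freed, so after node-level reclaim the kubelet cannot tell whether the inode threshold is still crossed without re-measuring the filesystem.

// Hypothetical sketch of concern (1); not the PR's code.
package eviction

// runImageGC frees disk but reports only bytes, not inodes.
func reclaimThenMaybeEvict(freeInodes, minFreeInodes int64, runImageGC func() (bytesFreed int64)) bool {
	if freeInodes >= minFreeInodes {
		return false // no inode pressure; nothing to do
	}
	_ = runImageGC() // frees an unknown number of inodes
	// Without an inode figure from GC, the conservative path is to continue to
	// pod eviction even though enough inodes may already have been reclaimed.
	return true
}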

@derekwaynecarr (Member Author)

I am double counting pod inode consumption, but it's double counted consistently (logs and rootfs), so I don't think it has a material impact.

@ronnielai (Contributor)

If inode consumption is double counted, won't the eviction-minimum-reclaim calculation be affected?

@derekwaynecarr (Member Author)

@ronnielai - the double counting happens when calculating per-pod usage, primarily between the rootfs and the logs fs. In that case, the count is only used to choose the greediest consumer, and as long as the counting is consistent, we will still pick the greediest pod. For min-reclaim, the value is evaluated relative to the actual rootfs or imagefs inodesFree, so there is no double counting there. What I am trying to figure out is whether there is any benefit to counting per-pod inodes for anything other than the rootfs. Thoughts? (A sketch of the ranking logic follows below.)
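To illustrate the point about consistent double counting (hypothetical types, not the PR's code): every pod's usage is the sum of its rootfs and log inodes, and the same accounting rule is applied to every pod, so the comparison used to pick the greediest consumer stays apples-to-apples.

package eviction // sketch only; hypothetical types

type podInodeStats struct {
	name         string
	rootfsInodes int64
	logsInodes   int64
}

// inodeUsage may double count inodes shared between rootfs and logs,
// but it does so consistently for every pod.
func inodeUsage(p podInodeStats) int64 {
	return p.rootfsInodes + p.logsInodes
}

// greediest returns the pod with the highest (possibly inflated) inode usage.
// Assumes pods is non-empty.
func greediest(pods []podInodeStats) podInodeStats {
	max := pods[0]
	for _, p := range pods[1:] {
		if inodeUsage(p) > inodeUsage(max) {
			max = p
		}
	}
	return max
}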

@@ -41,10 +41,16 @@ const (
message = "The node was low on compute resources."
// disk, in bytes. internal to this module, used to account for local disk usage.
resourceDisk api.ResourceName = "disk"
// inodes, in bytes. internal to this module, used to account for local disk inode consumption.
Contributor

The comment (and the following ones) says the units of inodes are in bytes, which seems incorrect.

Member Author

Good point, I got logically tripped up when using df -ih. In practice, it is actually quite useful to express these in powers of 1024 ;-)

$ curl <stats>
     "rootfs": {
      "availableBytes": 15132864512,
      "capacityBytes": 52710469632,
      "inodesFree": 2592837,
      "inodes": 3276800
     },
$ df -i
Filesystem                Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/fedora-root  3276800 683963  2592837   21% /
$ df -ih
Filesystem              Inodes IUsed IFree IUse% Mounted on
/dev/mapper/fedora-root   3.2M  668K  2.5M   21% /
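For context, the inodes / inodesFree figures in the stats summary are plain counts, matching what df -i reports from statfs(2) (f_files / f_ffree). A minimal Linux-only Go sketch of reading them directly:

// Linux-only sketch: read inode counts for a filesystem the same way df -i does.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var st syscall.Statfs_t
	if err := syscall.Statfs("/", &st); err != nil {
		panic(err)
	}
	// st.Files is the total number of inodes; st.Ffree is the number free.
	fmt.Printf("inodes=%d inodesFree=%d\n", st.Files, st.Ffree)
}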

Contributor

It still says "in bytes" :)

Member Author

Oops, I swear I changed it to say "number" at some point. Will fix up.

@ronnielai (Contributor) commented Aug 11, 2016

Please add inode testing in TestMakeSignalObservations() and TestThresholdsMet().

LGTM with nits

@ronnielai ronnielai self-assigned this Aug 11, 2016
@derekwaynecarr (Member Author)

I was experimenting with this more, and realized the following:

## HOST MACHINE
$ df -i
Filesystem                Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/fedora-root  3276800 714759  2562041   22% /
...
## INSIDE CONTAINER (overlay driver)
$ df -i       
Filesystem               Inodes  IUsed   IFree IUse% Mounted on
overlay                 3276800 714759 2562041   22% /
tmpfs                   1498577     18 1498559    1% /dev
tmpfs                   1498577     16 1498561    1% /sys/fs/cgroup
/dev/mapper/fedora-root 3276800 714759 2562041   22% /etc/hosts
shm                     1498577      1 1498576    1% /dev/shm
tmpfs                   1498577      9 1498568    1% /run/secrets/kubernetes.io/serviceaccount

Notice that the / inode consumption inside the container matches the host. This means that if I have multiple containers on a node, trying to rank pods by greediest consumer of inodes does not work with the existing cAdvisor support. In practice, our pods will be ranked by QoS only, unless we have a way of knowing the number of inodes unique to the docker image + COW layer per container.

Given that disk is best-effort in 1.4, are we OK with that behavior until we figure out a way to return meaningful values? Given that inode consumption is independent of disk usage (i.e. a touch foo.txt consumes an inode just as a 15GB file does), it seems ranking pods by core QoS measures may be the best behavior for now. (A sketch of QoS-only ranking follows below.)

/cc @vishh @ronnielai
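A minimal sketch of what QoS-only ranking for inode pressure could look like (hypothetical types; the real kubelet has its own QoS and ranking machinery): BestEffort pods are evicted before Burstable, and Burstable before Guaranteed.

package eviction // sketch only; hypothetical types

import "sort"

type qosClass int

const (
	bestEffort qosClass = iota // evicted first
	burstable
	guaranteed // evicted last
)

type podInfo struct {
	name string
	qos  qosClass
}

// rankByQOS orders pods for eviction by QoS class alone, since per-container
// inode usage is not distinguishable with the overlay-driver output above.
func rankByQOS(pods []podInfo) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].qos < pods[j].qos
	})
}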

@derekwaynecarr (Member Author)

Unless we want to run find / -type f | wc -l inside the container fs, which is just awful ;-)

@ronnielai (Contributor)

I think we can put that as a limitation in https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/kubelet-eviction.md for now. @vishh, what do you think?

@vishh (Contributor) commented Aug 15, 2016

I'm totally fine with just using QoS for inode exhaustion. Given that tracking inodes is not possible across all storage drivers, let's not do it for now.

@vishh (Contributor) commented Aug 15, 2016

@derekwaynecarr we can have cAdvisor track inodes similarly to how it tracks per-container disk usage.

@derekwaynecarr (Member Author)

I will update this PR and poke when ready.

@derekwaynecarr derekwaynecarr changed the title WIP: kubelet eviction on inode exhaustion kubelet eviction on inode exhaustion Aug 16, 2016
@derekwaynecarr (Member Author)

@vishh @ronnielai - updated. For now, I left a placeholder that reports 0 for inode_usage per container so it is a one-line update when we get per-container stats in cAdvisor. Added test cases. Should be good to go. (A sketch of the placeholder follows below.)
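A sketch of the placeholder approach described above (hypothetical name, not the PR's actual function): report 0 for per-container inode usage until cAdvisor exposes it, so wiring in the real value later is a one-line change.

// Hypothetical sketch of the placeholder; not the PR's actual function.
package eviction

func containerInodeUsage() int64 {
	// TODO: return real per-container inode usage once cAdvisor reports it.
	return 0
}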

@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Aug 16, 2016
@vishh (Contributor) commented Aug 16, 2016

Except for the pending "bytes" comment, this PR LGTM.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2016
@derekwaynecarr derekwaynecarr removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2016
@derekwaynecarr (Member Author)

Rebased, fixed the lingering "bytes" comment, tagging the PR for merge.

@derekwaynecarr derekwaynecarr added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 17, 2016
@derekwaynecarr (Member Author)

@k8s-bot test this issue #30462

@derekwaynecarr (Member Author)

Added a commit so that we pass golint and get past the Jenkins verification error. Retagging.

@derekwaynecarr derekwaynecarr added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Aug 17, 2016
@k8s-bot commented Aug 17, 2016

GCE e2e build/test passed for commit 8261520.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Aug 18, 2016

GCE e2e build/test passed for commit 8261520.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit ff58d04 into kubernetes:master Aug 18, 2016
@vishh (Contributor) commented Aug 18, 2016

㊗️ @derekwaynecarr & @ronnielai

@ronnielai (Contributor)

#21546

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Automatic merge from submit-queue

kubelet eviction on inode exhaustion

Add support for kubelet to monitor for inode exhaustion of either image or rootfs, and in response, attempt to reclaim node level resources and/or evict pods.