
Local Ephemeral Storage limit not working #78865

Closed
dashpole opened this issue Jun 10, 2019 · 30 comments
Labels
kind/bug · lifecycle/rotten · priority/important-longterm · sig/node · sig/storage

Comments

@dashpole
Contributor

@arunbpt7 opened an issue in kubernetes/enhancements. I am moving it here.
/kind bug
/priority important-longterm
/sig node

As discussed in #361, I am looking for a way to restrict pods' ephemeral storage usage. Ephemeral storage is shared across all pods, so the pods' writable layers and logs frequently fill up /var/lib/docker, causing high utilization of that filesystem. I would like to cap each pod at a defined size (say 20G), so that a pod can use at most 20G of ephemeral storage and must use persistent volumes for larger storage requirements. The remaining space on /var/lib/docker would then stay available to the other pods, each likewise capped at 20G.

I have defined an ephemeral-storage request and limit in the container resources of the deployment (resources.requests.ephemeral-storage, resources.limits.ephemeral-storage) and verified that evictionHard is enabled for imagefs and nodefs on the node. But after deploying the pod, the limit is not enforced: when creating large files inside the container, it can still write more than the defined ephemeral-storage request and limit.

evictionHard:
  imagefs.available: 15%
  memory.available: 100Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
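
(For context, a minimal sketch of where this block lives, assuming the kubelet is driven by a KubeletConfiguration file — the file path varies by distribution:)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  imagefs.available: "15%"
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"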

containers:
- name: busybox
  image:
  resources:
    requests:
      ephemeral-storage: "500Mi"
    limits:
      ephemeral-storage: "500Mi"
@dashpole added the kind/bug label Jun 10, 2019
@k8s-ci-robot added the priority/important-longterm and sig/node labels Jun 10, 2019
@dashpole
Contributor Author

Can you share monitoring that shows the pod exceeding its limit for an extended period of time (a few minutes)?

@arunbpt7

containers:
- name: busybox
  image:
  resources:
    requests:
      ephemeral-storage: "500Mi"
    limits:
      ephemeral-storage: "500Mi"

State:          Running
  Started:      Mon, 10 Jun 2019 12:48:57 -0400
Ready:          True
Restart Count:  0
Limits:
  ephemeral-storage:  500Mi
Requests:
  ephemeral-storage:  500Mi
Environment:

kubectl get po busybox-7cc68d968c-mb47z -n testns
NAME                       READY   STATUS    RESTARTS   AGE
busybox-7cc68d968c-mb47z   1/1     Running   0          82m

kubectl exec -it busybox-7cc68d968c-mb47z -n testns -- bash
bash-4.2$ fallocate -l 2G /var/tmp/test2
bash-4.2$ du -sh /var/tmp/*
1.0G /var/tmp/test
2.0G /var/tmp/test2

bash-4.2$ exit

kubectl get po busybox-7cc68d968c-mb47z -n testns
NAME                       READY   STATUS    RESTARTS   AGE
busybox-7cc68d968c-mb47z   1/1     Running   0          83m

@poidag-zz
Contributor

I was able to recreate this issue on a 1.13.5 cluster.

For the node under test, an EBS volume was attached to the instance and mounted as an XFS filesystem at /var/lib/docker.

The deployment with resources set had a pod scheduled to the node in question.

"Execing" into the pod and running.

fallocate -l 2G /var/tmp/test1

created a file larger than the configured ephemeral-storage limit of 500Mi. The pod was not evicted, even after waiting up to 10 minutes.

Starting again with a fresh volume and deployment.

Creating a 4G file within the new pod, with an underlying 5G volume mounted at /var/lib/docker,

fallocate -l 4G /var/tmp/test1

caused the imageGCManager to kick in due to the node DiskPressure condition, rather than the kubelet honouring the ephemeral-storage limit and evicting that one pod first:

Jun 13 02:47:35 ip-10-0-2-15 kubelet[27133]: W0613 02:47:35.771392 27133 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
Jun 13 02:47:35 ip-10-0-2-15 kubelet[27133]: I0613 02:47:35.771424 27133 container_gc.go:85] attempting to delete unused containers
Jun 13 02:47:35 ip-10-0-2-15 kubelet[27133]: I0613 02:47:35.782369 27133 image_gc_manager.go:317] attempting to delete unused images
Jun 13 02:47:35 ip-10-0-2-15 kubelet[27133]: I0613 02:47:35.794272 27133 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
Jun 13 02:47:35 ip-10-0-2-15 kubelet[27133]: I0613 02:47:35.794493 27133 eviction_manager.go:362] eviction manager: pods ranked for eviction: debug-887cd4775-2brw9_test(d9f7c5ee-8d84-11e9-b987-02f54a20dc4c), canal-js4tm_kube-system(53c05e18-8d66-11e9-b987-02f54a20dc4c), debug-887cd4775-fwvhq_test(a7ac5932-8d84-11e9-b987-02f54a20dc4c), debug-887cd4775-r9zcc_test(0695b54b-8d85-11e9-b987-02f54a20dc4c), debug-887cd4775-l6kxx_test(d18c07b5-8d84-11e9-b987-02f54a20dc4c), debug-887cd4775-rvv

The ranking for eviction, however, was correct: debug-887cd4775-2brw9 was the pod in which the file had been created.

@msau42
Member

msau42 commented Jun 13, 2019

cc @kubernetes/sig-storage-bugs @jingxu97

@k8s-ci-robot added the sig/storage label Jun 13, 2019
@jingxu97
Contributor

@pickledrick @arunbpt7 Could you please share your pod yaml file? You can also email me jinxu at google.com if you prefer. Thanks!

@arunbpt7

arunbpt7 commented Jun 13, 2019

@jingxu97

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      securityContext:
        runAsUser: 99
        fsGroup: 99
      containers:
      - name: busybox
        image:
        resources:
          requests:
            ephemeral-storage: "500Mi"
          limits:
            ephemeral-storage: "500Mi"

@jingxu97
Contributor

@arunbpt7 did you omit part of the yaml file?

@poidag-zz
Contributor

poidag-zz commented Jun 13, 2019

@jingxu97

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: debug
  name: debug
spec:
  selector:
    matchLabels:
      app: debug
  template:
    metadata:
      labels:
        app: debug
    spec:
      containers:
      - image: quay.io/pickledrick/debug
        imagePullPolicy: Always
        name: debug
        resources:
          limits:
            ephemeral-storage: 500Mi
          requests:
            ephemeral-storage: 500Mi

@dashpole
Contributor Author

@arunbpt7 can you query the summary api (localhost:10255/stats/summary) from the node that the pod is running on, to make sure it is measuring disk space correctly?

@dashpole
Contributor Author

Our tests for this are not super consistent: https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=LocalStorageCapacityIsolationEviction, but are mostly green. I'll try and bump the timeout on the serial tests to see if we can get a clearer signal.

@arunbpt7

arunbpt7 commented Jun 17, 2019

I ran curl -s http://localhost:10255/stats/summary on the node where the pod is running, and it shows nothing.

@dashpole
Contributor Author

It sounds like that is probably your problem then. If you don't have any metrics, the kubelet can't do its monitoring or eviction. Can you share your kubelet logs, or see if there are any errors related to metrics?
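
A couple of ways to check (a sketch; substitute your node name, and the log command assumes a systemd-managed kubelet):

# Fetch the kubelet's stats summary through the API server proxy,
# which works even when the read-only port 10255 is disabled
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary"

# Scan kubelet logs for stats/eviction-related errors on a systemd node
journalctl -u kubelet | grep -iE 'stats|summary|eviction'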

@arunbpt7

@pickledrick, can you share the kubelet logs?

@poidag-zz
Contributor

Hi all,

The insecure stats API appears to be deprecated.

#59666
kubernetes/kubeadm#732

@yastij
Member

yastij commented Jun 28, 2019

I'm not able to reproduce it on a 1.13.5 cluster; I'm seeing the following:

debug-887cd4775-ckp4l   0/1     Evicted   0          12m
debug-887cd4775-hlvxw   1/1     Running   0          112s

@pickledrick - if you have access to the generated certs, you can use the secure port.
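
For example (a sketch; 10250 is the kubelet's default secure port, and the cert paths below are kubeadm-style placeholders — adjust them for your PKI layout and serving-cert hostnames):

curl -s --cacert /etc/kubernetes/pki/ca.crt \
     --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
     --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
     https://localhost:10250/stats/summary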

@poidag-zz
Contributor

Hi @yastij, I'm no longer able to reproduce it either; it seems that in my environment evictions are happening eventually. I am gathering more information to see if there is something else set in the original reporter's configuration.

@arunbpt7

arunbpt7 commented Jul 9, 2019

@pickledrick

/var/lib/docker is a separate filesystem, apart from the node root fs.

@poidag-zz
Contributor

Hi @arunbpt7

Yes, my test environment reflects this. Can you confirm the Docker version you are using in this environment?

@cpearring

I ran into something similar as well, but on a 1.12.8 cluster and using emptyDir. Initially the pod never got evicted; in subsequent runs, however, evictions worked properly, and I haven't run into this since.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: storage-test
  namespace: storage-test
spec:
  template:
    metadata:
      labels:
        app: storage-test
    spec:
      containers:
      - name: storage-test-container
        image: k8s.gcr.io/ubuntu-slim:0.1
        resources:
          requests:
            ephemeral-storage: 10Mi
          limits:
            ephemeral-storage: 10Mi
        command: ["/bin/sh"]
        args: ["-c", "dd if=/dev/urandom of=/cache/file.txt count=100 bs=1048576; sleep 1h"]
        volumeMounts:
        - mountPath: "/cache"
          name: cache-volume
      volumes:
      - name: cache-volume
        emptyDir: {}
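
(As an aside, for emptyDir volumes a per-volume cap can also be set via sizeLimit, which is enforced by the same local-storage eviction logic — a minimal sketch of the alternative volumes stanza:)

      volumes:
      - name: cache-volume
        emptyDir:
          sizeLimit: 10Mi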

@RobertKrawitz
Contributor

RobertKrawitz commented Aug 16, 2019

This happens with the writable layer when the runtime directory (e.g. /var/lib/crio, /var/lib/docker) is not on the Kubernetes root filesystem; it's due to this code in the eviction manager.

It's not apparent to me why this was done, and the author of the code (in changeset 27901ad) appears to have moved on. I'm planning to open a PR on it.
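
Paraphrasing the check in question (a simplified sketch of the fsStatsType selection, not a verbatim copy of eviction_manager.go): when a dedicated image filesystem is configured, the per-pod ephemeral-storage limit check drops the rootfs (writable layer) stats, so only logs and local volumes are counted against the limit.

// Simplified paraphrase of the kubelet eviction manager's stat selection
fsStatsSet := []fsStatsType{fsStatsRoot, fsStatsLogs, fsStatsLocalVolumeSource}
if dedicatedImageFs {
    // The writable layer (fsStatsRoot) is excluded here, so a pod whose
    // runtime directory is on a separate filesystem can exceed its
    // ephemeral-storage limit without being evicted.
    fsStatsSet = []fsStatsType{fsStatsLogs, fsStatsLocalVolumeSource}
}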

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 14, 2019
@wu0407

wu0407 commented Nov 19, 2019

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 19, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 17, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 18, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rushins

rushins commented Apr 19, 2020

Hello,

I got the same issue on K8s 1.16. My Docker image is 200GB, I have plenty of space in /var/lib/docker (450GB free out of 500GB), and I am still getting an "ephemeral storage" error. Can someone tell me what the fix should be?

@rushins

rushins commented Apr 19, 2020

I have this error: The node was low on resource: ephemeral-storage. Container k8stst was using 112619704Ki, which exceeds its request of 0

@andreamaruccia

andreamaruccia commented Apr 29, 2020

Hm, maybe not 100% related to this issue, but I had a problem caused by the fact that I had two filesystems, and K8s couldn't handle that. See kubernetes/enhancements#361 (comment).

So I ended up mounting the filesystem and making docker and kubelet use the same partition via symlinks, which solved the issue.
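
Roughly like this (a sketch with a placeholder device and mount point, not verbatim from my setup):

# Stop the services before relocating their data directories
systemctl stop kubelet docker
mount /dev/sdb1 /data
mv /var/lib/docker /data/docker && ln -s /data/docker /var/lib/docker
mv /var/lib/kubelet /data/kubelet && ln -s /data/kubelet /var/lib/kubelet
systemctl start docker kubelet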
