Node e2e memory eviction test #28693

mtaufen · 2016-07-08T16:35:01Z

This tests memory evictions.
See related issue #28619 and the cadvisor fix google/cadvisor#1380.

cc @vishh @derekwaynecarr @timstclair

vishh · 2016-07-11T20:30:00Z

@k8s-bot node e2e test this github issue #IGNORE

pwittrock · 2016-07-14T18:05:36Z

@vishh Reassigning to you since I have a PR backlog

mtaufen · 2016-07-21T18:53:31Z

@vishh PTAL

derekwaynecarr · 2016-08-05T18:46:28Z

test/e2e_node/memory_eviction_test.go

+
+})
+
+func createMemhogPod(f *framework.Framework, genName string, ctnName string, res api.ResourceRequirements) *api.Pod {


so I have an image that uses memhog itself (derekwaynecarr/memhog).

you can see some sample pods using it here:
https://github.com/derekwaynecarr/kubernetes/tree/examples-eviction/demo/kubelet-eviction

maybe we can just have a dedicated image we share?

On that note, I created vish/stress because memhog was a large image and it does not let us control the rate of consumption of memory.

$ docker run --rm vish/stress --help
Usage of /stress:
  -alsologtostderr
        log to standard error as well as files
  -cpus int
        total number of CPUs to utilize
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace (default :0)
  -log_dir string
        If non-empty, write log files in this directory
  -logtostderr
        log to standard error instead of files
  -mem-alloc-size string
        amount of memory to be consumed in each allocation (default "4Ki")
  -mem-alloc-sleep duration
        duration to sleep between allocations (default 1ms)
  -mem-total string
        total memory to be consumed. Memory will be consumed via multiple allocations.
  -stderrthreshold value
        logs at or above this threshold go to stderr
  -v value
        log level for V logs
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging

vishh · 2016-08-05T20:18:55Z

@mtaufen If you can address the pod image comments, this PR can go in.

mtaufen · 2016-08-07T00:32:23Z

Rebased and switched to @vishh's image. However, I just saw a burstable pod fail before a best-effort pod while running the test. I used to set --eviction-hard to memory.available<500Mi in e2e_service.go, but it is now set via a test-context flag (see RegisterNodeFlags in test/e2e/framework/test_context.go) and doesn't appear to be set by default. I think leaving it unset probably means eviction stays disabled for e2e_node tests, so pods chew up memory until they get OOM-killed (which ignores QoS) rather than until they get evicted. I just changed the default to the value I used to set in e2e_service.go and am re-testing to see if that helps.

mtaufen · 2016-08-08T15:35:33Z

I've run a few scenarios to start getting a ballpark for the thresholds where I see proper eviction order and where I don't (I'm operating under the assumption that when I don't, it's due to an OOM-kill, but I haven't actually investigated that yet):

Passed:
--eviction-hard=memory.available<500Mi x1
--eviction-hard=memory.available<300Mi x1
--eviction-hard=memory.available<250Mi x1
--eviction-hard=memory.available<230Mi x1
--eviction-hard=memory.available<220Mi x1
--eviction-hard=memory.available<210Mi x1
--eviction-hard=memory.available<205Mi x1

Failed:
--eviction-hard=memory.available<203Mi x1
--eviction-hard=memory.available<201Mi x1
--eviction-hard=memory.available<200Mi x2
--eviction-hard=memory.available<100Mi x1

It seems that failures start to crop up somewhere between a threshold of 200Mi and 230Mi, currently testing 220Mi to see what happens there. Edit: Passed at 220Mi. I'll keep adding results here as I get them.
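
For readers unfamiliar with the threshold syntax, the following Go sketch shows one way to interpret a value such as memory.available<250Mi. It is a simplified illustration, not the kubelet's parser: the function names are hypothetical and only the Mi suffix is handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryAvailable parses a threshold of the form
// "memory.available<NMi" and returns the limit in bytes.
// Hypothetical helper: only the memory.available signal and the Mi
// suffix are supported here.
func parseMemoryAvailable(threshold string) (int64, error) {
	const prefix = "memory.available<"
	if !strings.HasPrefix(threshold, prefix) {
		return 0, fmt.Errorf("unsupported threshold %q", threshold)
	}
	qty := strings.TrimPrefix(threshold, prefix)
	if !strings.HasSuffix(qty, "Mi") {
		return 0, fmt.Errorf("unsupported quantity %q", qty)
	}
	n, err := strconv.ParseInt(strings.TrimSuffix(qty, "Mi"), 10, 64)
	if err != nil {
		return 0, err
	}
	return n << 20, nil // Mi to bytes
}

// thresholdMet reports whether availableBytes has dropped below the limit.
func thresholdMet(threshold string, availableBytes int64) (bool, error) {
	limit, err := parseMemoryAvailable(threshold)
	if err != nil {
		return false, err
	}
	return availableBytes < limit, nil
}

func main() {
	met, err := thresholdMet("memory.available<250Mi", 200<<20)
	fmt.Println(met, err)
}
```

A higher threshold triggers eviction while more memory is still free, which is consistent with the higher values in the list above passing more reliably.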

vishh · 2016-08-08T18:48:14Z

I'm ok with the thresholds being higher for now. We can lower them in a subsequent PR.

This test creates three pods with QoS classes of besteffort, burstable, and guaranteed, respectively. Each pod contains a container that tries to consume almost all of the available memory at a rate of about 12Mi every 10 seconds. The expectation is that eviction is initiated when the hard memory.available<250Mi threshold is crossed, and that it proceeds in the order besteffort, then burstable. Since guaranteed pods should only be evicted if something charged to the host uses more resources than were reserved for it, the test currently ends once besteffort and burstable have both been evicted. Note that this commit also sets --eviction-hard=memory.available<250Mi to enable eviction during tests.
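
The ordering expectation described above can be sketched as a small check in Go. This is an illustrative sketch only, not the code in test/e2e_node/memory_eviction_test.go: the qosClass type and checkEvictionOrder function are hypothetical names, and the input is assumed to be the QoS classes of pods in the order their evictions were observed.

```go
package main

import "fmt"

// qosClass is a hypothetical label for a pod's QoS class.
type qosClass string

const (
	bestEffort qosClass = "besteffort"
	burstable  qosClass = "burstable"
	guaranteed qosClass = "guaranteed"
)

// checkEvictionOrder returns nil only if the first observed eviction was
// the besteffort pod and the second was the burstable pod. Anything after
// the second eviction is ignored, mirroring a test that ends once both
// have been evicted.
func checkEvictionOrder(evicted []qosClass) error {
	want := []qosClass{bestEffort, burstable}
	if len(evicted) < len(want) {
		return fmt.Errorf("only %d pods evicted, want at least %d", len(evicted), len(want))
	}
	for i, w := range want {
		if evicted[i] != w {
			return fmt.Errorf("eviction %d was %s, want %s", i+1, evicted[i], w)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkEvictionOrder([]qosClass{bestEffort, burstable}))
	fmt.Println(checkEvictionOrder([]qosClass{burstable, bestEffort}))
}
```

A burstable pod failing before the besteffort pod, as reported earlier in the thread, would show up here as an error on the first eviction.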

mtaufen · 2016-08-08T23:00:35Z

Ok, set to 250Mi for now.

k8s-bot · 2016-08-08T23:40:16Z

GCE e2e build/test passed for commit 736f1cb.

mtaufen · 2016-08-10T17:51:06Z

@vishh PTAL this can probably go in.

k8s-github-robot · 2016-08-10T19:05:16Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot · 2016-08-10T19:06:44Z

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

k8s-bot · 2016-08-10T19:38:50Z

GCE e2e build/test passed for commit 736f1cb.

k8s-bot · 2016-08-10T19:38:54Z

GCE e2e build/test passed for commit 736f1cb.

k8s-github-robot · 2016-08-10T19:39:58Z

Automatic merge from submit-queue

googlebot added the cla: yes label Jul 8, 2016

mtaufen added the area/node-e2e and priority/critical-urgent labels Jul 8, 2016

mtaufen force-pushed the eviction branch from 7e1da3f to 58d7ed1 Compare July 8, 2016 16:37

k8s-github-robot assigned pwittrock Jul 8, 2016

k8s-github-robot added the size/L and release-note-label-needed labels Jul 8, 2016

mtaufen force-pushed the eviction branch from 58d7ed1 to 5d2d522 Compare July 11, 2016 14:54

mtaufen force-pushed the eviction branch 5 times, most recently from 489514e to d88a696 Compare July 12, 2016 17:09

pwittrock assigned vishh and unassigned pwittrock Jul 14, 2016

mtaufen mentioned this pull request Jul 15, 2016

Modify working set memory stats calculation google/cadvisor#1380

Merged

mtaufen force-pushed the eviction branch 3 times, most recently from 08015ae to b0682a3 Compare July 21, 2016 18:52

derekwaynecarr reviewed Aug 5, 2016

mtaufen force-pushed the eviction branch from b0682a3 to 8cdcc68 Compare August 7, 2016 00:22

k8s-github-robot removed the needs-rebase label Aug 7, 2016

mtaufen force-pushed the eviction branch from 2644e73 to 736f1cb Compare August 8, 2016 23:00

mtaufen changed the title from "Add node e2e memory eviction test and temporarily disable e2e_node tests marked Serial" to "Node e2e memory eviction test" Aug 8, 2016

mtaufen added the release-note-none label and removed the do-not-merge and release-note-label-needed labels Aug 8, 2016

mtaufen mentioned this pull request Aug 9, 2016

Bug? "This change is Reviewable" button image not loading. Reviewable/Reviewable#383

Closed

vishh added the lgtm label Aug 10, 2016

k8s-github-robot merged commit 42553b9 into kubernetes:master Aug 10, 2016

Random-Liu mentioned this pull request Aug 12, 2016

Node E2E: Memory eviction test cause pods in following test being evicted. #30550

Closed
