Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node e2e memory eviction test #28693

Merged
merged 1 commit into from
Aug 10, 2016
Merged

Conversation

mtaufen
Copy link
Contributor

@mtaufen mtaufen commented Jul 8, 2016

This tests memory evictions.
See related issue #28619 and fix to cadvisor google/cadvisor#1380.

cc @vishh @derekwaynecarr @timstclair


This change is Reviewable

@mtaufen mtaufen added area/node-e2e priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jul 8, 2016
@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jul 8, 2016
@vishh
Copy link
Contributor

vishh commented Jul 11, 2016

@k8s-bot node e2e test this github issue #IGNORE

@mtaufen mtaufen force-pushed the eviction branch 5 times, most recently from 489514e to d88a696 Compare July 12, 2016 17:09
@pwittrock pwittrock assigned vishh and unassigned pwittrock Jul 14, 2016
@pwittrock
Copy link
Member

@vishh Reassigning to you since I have a PR backlog

@mtaufen
Copy link
Contributor Author

mtaufen commented Jul 21, 2016

@vishh PTAL

@k8s-github-robot k8s-github-robot added do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. retest-not-required-docs-only needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed retest-not-required-docs-only labels Jul 28, 2016

})

func createMemhogPod(f *framework.Framework, genName string, ctnName string, res api.ResourceRequirements) *api.Pod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I have an image that uses memhog itself (derekwaynecarr/memhog).

you can see some sample pods using it here:
https://github.com/derekwaynecarr/kubernetes/tree/examples-eviction/demo/kubelet-eviction

maybe we can just have a dedicated image we share?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that note, I created vish/stress because memhog was a large image and it does not let us control the rate of consumption of memory.

$ docker run --rm vish/stress --help
Usage of /stress:
-alsologtostderr
log to standard error as well as files
-cpus int
total number of CPUs to utilize
-log_backtrace_at value
when logging hits line file:N, emit a stack trace (default :0)
-log_dir string
If non-empty, write log files in this directory
-logtostderr
log to standard error instead of files
-mem-alloc-size string
amount of memory to be consumed in each allocation (default "4Ki")
-mem-alloc-sleep duration
duration to sleep between allocations (default 1ms)
-mem-total string
total memory to be consumed. Memory will be consumed via multiple allocations.
-stderrthreshold value
logs at or above this threshold go to stderr
-v value
log level for V logs
-vmodule value
comma-separated list of pattern=N settings for file-filtered logging

@vishh
Copy link
Contributor

vishh commented Aug 5, 2016

@mtaufen If you can address the pod image comments, this PR can go in.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 7, 2016
@mtaufen
Copy link
Contributor Author

mtaufen commented Aug 7, 2016

Rebased and transitioned to @vishh's image. But I just saw a burstable fail before a best-effort while running the test. I used to set --eviction-hard to memory.available<500Mi in e2e_service.go, but now it's set via a flag for the test context (see test/e2e/framework/test_context.go.RegisterNodeFlags), and doesn't appear to be set by default. I think not setting it by default probably results in eviction remaining disabled as a feature for e2e_node tests, and as a result pods chew up memory until they get OOM-killed (which ignores QoS) rather than until they get evicted. I just changed the default to what I used to set in e2e_service.go and am re-testing to see if that helps.

@mtaufen
Copy link
Contributor Author

mtaufen commented Aug 8, 2016

I've run a few scenarios to start getting a ballpark for the thresholds where I see proper eviction order and where I don't (I'm operating under the assumption that when I don't, it's due to an OOM-kill, but I haven't actually investigated that yet):

Passed:
--eviction-hard=memory.available<500Mi x1
--eviction-hard=memory.available<300Mi x1
--eviction-hard=memory.available<250Mi x1
--eviction-hard=memory.available<230Mi x1
--eviction-hard=memory.available<220Mi x1
--eviction-hard=memory.available<210Mi x1
--eviction-hard=memory.available<205Mi x1

Failed:
--eviction-hard=memory.available<203Mi x1
--eviction-hard=memory.available<201Mi x1
--eviction-hard=memory.available<200Mi x2
--eviction-hard=memory.available<100Mi x1

It seems that failures start to crop up somewhere between a threshold of 200Mi and 230Mi, currently testing 220Mi to see what happens there. Edit: Passed at 220Mi. I'll keep adding results here as I get them.

@vishh
Copy link
Contributor

vishh commented Aug 8, 2016

I'm ok with the thresholds being higher for now. We can lower them in a
subsequent PR.

On Mon, Aug 8, 2016 at 8:36 AM, Michael Taufen notifications@github.com
wrote:

I've run a few scenarios to start getting a ballpark for the thresholds
where I see proper eviction order and where I don't (I'm operating under
the assumption that when I don't, it's due to an OOM-kill, but I haven't
actually investigated that yet):

Passed:
--eviction-hard=memory.available<500Mi x1
--eviction-hard=memory.available<300Mi x1
--eviction-hard=memory.available<250Mi x1
--eviction-hard=memory.available<230Mi x1

Failed:
--eviction-hard=memory.available<200Mi x2
--eviction-hard=memory.available<100Mi x1

It seems that failures start to crop up somewhere between a threshold of
200Mi and 230Mi, currently testing 220Mi to see what happens there.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#28693 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGvIKIOZkKQBYOoEZun5P4Jr42tSkeHFks5qd0z3gaJpZM4JIMYb
.

This test creates three pods with QoS of besteffort, burstable, and
guaranteed, respectively, which each contain a container that tries to
consume almost all the available memory at a rate of about 12Mi/10sec.

The expectation is that eviction will be initiated when the hard
memory.available<250Mi threshold is triggered, and that eviction will proceed
in the order of besteffort, then burstable. Since guaranteed pods should
only be evicted if something charged to the host uses more resources
than were reserved for it, we currently end the test when besteffort and
burstable have both been evicted.

Note that this commit also sets --eviction-hard=memory.available<250Mi
to enable eviction during tests.
@mtaufen
Copy link
Contributor Author

mtaufen commented Aug 8, 2016

Ok, set to 250Mi for now.

@mtaufen mtaufen changed the title Add node e2e memory eviction test and temporarily disable e2e_node tests marked Serial Node e2e memory eviction test Aug 8, 2016
@mtaufen mtaufen added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. release-note-label-needed labels Aug 8, 2016
@k8s-bot
Copy link

k8s-bot commented Aug 8, 2016

GCE e2e build/test passed for commit 736f1cb.

@mtaufen
Copy link
Contributor Author

mtaufen commented Aug 10, 2016

@vishh PTAL this can probably go in.

@vishh vishh added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2016
@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

@k8s-bot
Copy link

k8s-bot commented Aug 10, 2016

GCE e2e build/test passed for commit 736f1cb.

@k8s-bot
Copy link

k8s-bot commented Aug 10, 2016

GCE e2e build/test passed for commit 736f1cb.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/node-e2e lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants