Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kubelet] Fix oom-score-adj policy in kubelet #32251

Merged
merged 3 commits into from
Sep 15, 2016

Conversation

vishh
Copy link
Contributor

@vishh vishh commented Sep 8, 2016

Docker daemon and kubelet needs to be protected by setting oom-score-adj to -999.
Fixes #32238


This change is Reviewable

@vishh vishh added this to the v1.4 milestone Sep 8, 2016
@vishh
Copy link
Contributor Author

vishh commented Sep 8, 2016

The tests need some more work. Posting the PR for feedback while I fix the tests.

@vishh vishh added kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 8, 2016
@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 8, 2016
@dchen1107 dchen1107 removed the kind/design Categorizes issue or PR as related to design. label Sep 8, 2016
@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Sep 8, 2016
@vishh
Copy link
Contributor Author

vishh commented Sep 8, 2016

@dchen1107 The tests have been fixed. I verified the fix manually as well. Improved logging along the way. PTAL. This PR is ready to go!

@vishh vishh removed the kind/design Categorizes issue or PR as related to design. label Sep 8, 2016
@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Sep 8, 2016
@dchen1107
Copy link
Member

Review status: 0 of 10 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


test/e2e_node/container_manager_test.go, line 55 [r1] (raw file):

}

func validateOOMScoreAdjSettingIsInRange(pid int, expectedMinOOMScoreAdj, expectedMaxOOMScoreAdj int32) {

Can I have one more test case: validate docker's oom_score_adj, then restart docker daemon, revalidate?


Comments from Reviewable

@vishh
Copy link
Contributor Author

vishh commented Sep 9, 2016

Review status: 0 of 10 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


test/e2e_node/container_manager_test.go, line 55 [r1] (raw file):

Previously, dchen1107 (Dawn Chen) wrote…

Can I have one more test case: validate docker's oom_score_adj, then restart docker daemon, revalidate?

I tested this manually. Can we unblock this PR and add that test in a subsequent PR? I'm don't have uninterrupted time now to add that to this PR.

Comments from Reviewable

@dchen1107
Copy link
Member

Ok, could you please file a separate issue for node e2e test too? The test are failed but not due to your pr.

LGTM

@dchen1107 dchen1107 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2016
@vishh
Copy link
Contributor Author

vishh commented Sep 11, 2016

@k8s-bot test this github issue #IGNORE

@rmmh
Copy link
Contributor

rmmh commented Sep 11, 2016

@k8s-bot gke test this issue #32431

@dchen1107
Copy link
Member

@vishh I think some recent test-infra changes completely break builds related to GCI (also coreos).

@dchen1107
Copy link
Member

@k8s-bot test this github issue #IGNORE

@vishh
Copy link
Contributor Author

vishh commented Sep 12, 2016

The node e2e tests are passing when I run manually. Puzzling. Looking into cluster level tests now

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2016
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh
Copy link
Contributor Author

vishh commented Sep 14, 2016

@k8s-bot test e2e github issue #32548

@vishh vishh removed the kind/design Categorizes issue or PR as related to design. label Sep 15, 2016
@k8s-github-robot k8s-github-robot added the kind/design Categorizes issue or PR as related to design. label Sep 15, 2016
@vishh
Copy link
Contributor Author

vishh commented Sep 15, 2016

@k8s-bot test e2e github issue #32548

@eparis
Copy link
Contributor

eparis commented Sep 15, 2016

@k8s-bot e2e test this issue: #32548

@k8s-bot
Copy link

k8s-bot commented Sep 15, 2016

GCE e2e build/test passed for commit ba6feb2.

@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@pwittrock
Copy link
Member

@vishh This seems pretty risky for this late in the cycle

@k8s-bot
Copy link

k8s-bot commented Sep 15, 2016

GCE e2e build/test passed for commit ba6feb2.

@saad-ali
Copy link
Member

Manually merging per @vishh. The failing unit test etcd TestWatch appears to be unrelated (#32770).

@saad-ali saad-ali merged commit cb88d88 into kubernetes:master Sep 15, 2016
@eparis
Copy link
Contributor

eparis commented Sep 15, 2016

I'm not comfortable with it really. Why did we change from using mnt namespace to using pid namespace to determine 'is in container' ? In any case, this does not seem to be fixing something I'd consider a blocker bug. Yes, it stinks if docker get oomed, and it'll render the node pretty much useless, but the system as a whole will recover. This is not a regression vs 1.3. I suggest we either don't fix in 1.4 or we punt to 1.4.1.

@vishh
Copy link
Contributor Author

vishh commented Sep 16, 2016

We have been having this regression since v1.3. It is critical for GKE/GCE deployments of k8s because docker daemon has a high likelihood of being OOM killed which will end up nuking all containers.
The reason for moving from mnt to pid is that docker daemon moves itself into a new mnt namespace :(

@vishh
Copy link
Contributor Author

vishh commented Sep 16, 2016

@pwittrock This is fixing a scale reliability regression that will be visible to customers. The priority call belongs to you and @eparis.

@eparis
Copy link
Contributor

eparis commented Sep 16, 2016

I'm oging to remove the cherrypick-candidate label since this broke master. You can suggest a pick of the replacement, but my fears are not abated.

@vishh
Copy link
Contributor Author

vishh commented Sep 16, 2016

Yeah. SGTM!

On Thu, Sep 15, 2016 at 8:50 PM, Eric Paris notifications@github.com
wrote:

I'm oging to remove the cherrypick-candidate label since this broke
master. You can suggest a pick of the replacement, but my fears are not
abated.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#32251 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGvIKEX5nliF3TPc3wdtdfxpBvjSWWxwks5qqhIegaJpZM4J3hfy
.

k8s-github-robot pushed a commit that referenced this pull request Sep 16, 2016
Automatic merge from submit-queue

Add separate build process for node test.

This PR is part of #31093. However, because currently node e2e is built on `KUBE_TEST_PLATFORMS`, which includes linux/amd64, darwin/amd64, windows/amd64 and linux/arm, it caused #32251 to fail.

In fact, node e2e is running on the same node with kubelet, and it also has built-in apiserver, etcd and namespace controller. All of them are only built on `KUBE_SERVER_PLATFORMS`, so node e2e should also only be built on those platforms.
```
KUBE_SERVER_PLATFORMS=(
     linux/amd64
     linux/arm
     linux/arm64
   )
```

This PR added a separate build process for node e2e to address this.

@vishh Do you need this for v1.4? because this blocks your #32251. /cc @dchen1107
eparis pushed a commit to eparis/kubernetes that referenced this pull request Sep 17, 2016
…r-node-e2e

Automatic merge from submit-queue

Add separate build process for node test.

This PR is part of kubernetes#31093. However, because currently node e2e is built on `KUBE_TEST_PLATFORMS`, which includes linux/amd64, darwin/amd64, windows/amd64 and linux/arm, it caused kubernetes#32251 to fail.

In fact, node e2e is running on the same node with kubelet, and it also has built-in apiserver, etcd and namespace controller. All of them are only built on `KUBE_SERVER_PLATFORMS`, so node e2e should also only be built on those platforms.
```
KUBE_SERVER_PLATFORMS=(
     linux/amd64
     linux/arm
     linux/arm64
   )
```

This PR added a separate build process for node e2e to address this.

@vishh Do you need this for v1.4? because this blocks your kubernetes#32251. /cc @dchen1107
(cherry picked from commit dae3bdd)
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…r-node-e2e

Automatic merge from submit-queue

Add separate build process for node test.

This PR is part of kubernetes#31093. However, because currently node e2e is built on `KUBE_TEST_PLATFORMS`, which includes linux/amd64, darwin/amd64, windows/amd64 and linux/arm, it caused kubernetes#32251 to fail.

In fact, node e2e is running on the same node with kubelet, and it also has built-in apiserver, etcd and namespace controller. All of them are only built on `KUBE_SERVER_PLATFORMS`, so node e2e should also only be built on those platforms.
```
KUBE_SERVER_PLATFORMS=(
     linux/amd64
     linux/arm
     linux/arm64
   )
```

This PR added a separate build process for node e2e to address this.

@vishh Do you need this for v1.4? because this blocks your kubernetes#32251. /cc @dchen1107
(cherry picked from commit dae3bdd)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants