
[AppArmor] Hold bad AppArmor pods in pending rather than rejecting #35342

Merged · 1 commit · Nov 6, 2016

Conversation


@timstclair timstclair commented Oct 21, 2016

Fixes #32837

Overview of the fix:

If the Kubelet needs to reject a Pod for a reason the control plane doesn't understand (e.g. which AppArmor profiles are installed on the node), the control plane might continuously try to run the pod on the same rejecting node. This change adds a concept of "soft rejection", in which the Pod is admitted but not allowed to run, and is therefore held in a pending state. This prevents the pod from being retried on other nodes, and also prevents the high churn of repeated rejection and recreation. It is consistent with how other missing local resources (e.g. volumes) are handled.

A side effect of the change is that Pods which are not initially runnable will be retried. This is desired behavior since it avoids a race condition when a new node is brought up but the AppArmor profiles have not yet been loaded on it.
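To make the mechanism concrete, here is a minimal, self-contained sketch of the idea (not the actual PR code): the PodAdmitResult shape mirrors the lifecycle.PodAdmitResult type quoted in the diff below, while the Pod fields, the loadedProfiles map, and the profile names are hypothetical stand-ins.

```go
// Illustrative sketch only, not the PR's implementation.
package main

import "fmt"

// PodAdmitResult mirrors the shape of lifecycle.PodAdmitResult in the diff below.
type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// Pod is a simplified stand-in; the real check reads AppArmor annotations.
type Pod struct {
	Name            string
	AppArmorProfile string
}

// loadedProfiles is a hypothetical view of the profiles loaded on this node.
var loadedProfiles = map[string]bool{"runtime/default": true}

// canRunPod is a simplified version of the node-level "can this pod run here?"
// check. A negative result is a soft rejection: the pod stays Pending with the
// reason and message surfaced in its status, rather than being marked Failed.
// Because the check re-runs on later sync loops, the pod can start once the
// profile is eventually loaded.
func canRunPod(pod *Pod) PodAdmitResult {
	if pod.AppArmorProfile != "" && !loadedProfiles[pod.AppArmorProfile] {
		return PodAdmitResult{
			Admit:   false,
			Reason:  "AppArmor",
			Message: fmt.Sprintf("profile %q is not loaded on this node", pod.AppArmorProfile),
		}
	}
	return PodAdmitResult{Admit: true}
}

func main() {
	pod := &Pod{Name: "secure-app", AppArmorProfile: "localhost/my-profile"}
	if result := canRunPod(pod); !result.Admit {
		// Hold the pod Pending instead of rejecting it outright.
		fmt.Printf("holding pod %s in Pending: %s: %s\n", pod.Name, result.Reason, result.Message)
	}
}
```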

Pods with invalid AppArmor configurations will be held in a Pending state, rather than rejected (failed). Check the pod status message to find out why it is not running.

@kubernetes/sig-node @timothysc @rrati @davidopp



@timstclair timstclair added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 21, 2016
@timstclair timstclair added this to the v1.5 milestone Oct 21, 2016
@k8s-ci-robot
Contributor

Jenkins GCI GCE e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins Kubemark GCE e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot kubemark e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins GCE etcd3 e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce etcd3 e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins GKE smoke e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins GCE e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins GCI GKE smoke e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Contributor

Jenkins unit/integration failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@timstclair
Author

Fixed build error.

@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 21, 2016
@k8s-ci-robot
Contributor

Jenkins GCE Node e2e failed for commit baf07ad. Full PR test history.

The magic incantation to run this job again is @k8s-bot node e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@davidopp
Member

This approach seems reasonable. I didn't look at the code, so my questions are probably answered there, but I'm wondering:
(1) how is this state reflected to the user (e.g. via an event or status)?
(2) what happens if kubelet is restarted?
(3) what happens if the user deletes the pod through the API server?

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2016
@timstclair
Author

(1) how is this state reflected to the user (e.g. via an event or status)?

Both. Currently the Kubelet creates an event, and sets the PodStatus to Failed with an appropriate reason & message. With my change, the event is still published, and the reason & message on the PodStatus are still set, but the Pod is kept in the Pending state.
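For illustration, here is a minimal sketch of the user-visible difference, using local stand-in types rather than the real Kubernetes API package; the exact reason and message strings are assumptions, but the shape matches the behavior described above (the phase stays Pending instead of Failed, with the reason and message preserved).

```go
// Illustrative only; the real fields live on the v1 Pod's PodStatus.
package main

import "fmt"

type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodFailed  PodPhase = "Failed" // what the phase would have been before this change
)

// PodStatus mirrors the relevant fields of the Kubernetes pod status.
type PodStatus struct {
	Phase   PodPhase
	Reason  string
	Message string
}

func main() {
	// After this change the kubelet keeps the phase at PodPending rather than
	// PodFailed, while still recording why the pod is not running.
	status := PodStatus{
		Phase:   PodPending,
		Reason:  "AppArmor",
		Message: `Cannot enforce AppArmor: profile "localhost/my-profile" is not loaded`,
	}
	fmt.Printf("phase=%s reason=%s\nmessage=%s\n", status.Phase, status.Reason, status.Message)
}
```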

(2) what happens if kubelet is restarted?

It shouldn't change anything. The AppArmor check still needs to pass before the Pod is allowed to run.

(3) what happens if user deletes the pod through the API server

Deletion is handled in a separate loop from where we're checking the AppArmor status. The resources in Kubelet will be cleaned up, and there won't be any containers to kill. This should be unaffected by my change.

Member

@dchen1107 dchen1107 left a comment


I have some nit comments; otherwise I am OK with this change, except:

Once this change is in, the compute resources (CPU, memory) allocated to such pending pods will be held forever until the upstream layer takes action. I am OK with this for now to prevent uncontrolled churn from the scheduler / control plane.

@@ -1383,6 +1383,10 @@ func (dm *DockerManager) KillPod(pod *api.Pod, runningPod kubecontainer.Pod, gra

// NOTE(random-liu): The pod passed in could be *nil* when kubelet restarted.
func (dm *DockerManager) killPodWithSyncResult(pod *api.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
// Short circuit if there's nothing to kill.
if len(runningPod.Containers) == 0 {
Member


Does runningPod here include the PodInfraContainer too? I think it does, but I want to be sure. Otherwise, returning early would leak the PodInfraContainer.

Author


Yes, it does (see line 1424 below). This method is a no-op if len(runningPod.Containers) == 0; the check is just an optimization.
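A minimal sketch of the point being made here, with stand-in types rather than the real DockerManager/kubecontainer types; it assumes, as stated above, that runningPod.Containers includes the infra ("pause") container, so an empty list genuinely means there is nothing to kill and the early return cannot leak it.

```go
// Illustrative only; the real code lives in pkg/kubelet/dockertools.
package main

import "fmt"

type Container struct{ Name string }

// Pod is a simplified kubecontainer.Pod: Containers is assumed to hold every
// container the runtime knows about for this pod, including the infra container.
type Pod struct {
	Name       string
	Containers []*Container
}

func killPod(runningPod Pod) {
	// Short circuit if there's nothing to kill. Because the infra container is
	// part of Containers, an empty slice means nothing can be leaked.
	if len(runningPod.Containers) == 0 {
		fmt.Printf("pod %q: nothing to kill\n", runningPod.Name)
		return
	}
	for _, c := range runningPod.Containers {
		fmt.Printf("pod %q: killing container %q\n", runningPod.Name, c.Name)
	}
}

func main() {
	killPod(Pod{Name: "empty-pod"})
	killPod(Pod{Name: "web", Containers: []*Container{{Name: "POD"}, {Name: "nginx"}}})
}
```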

@@ -1516,6 +1548,29 @@ func (kl *Kubelet) canAdmitPod(pods []*api.Pod, pod *api.Pod) (bool, string, str
return true, "", ""
}

func (kl *Kubelet) canRunPod(pod *api.Pod) lifecycle.PodAdmitResult {
Member


Calling this new method canRunPod, the same name as pkg/kubelet/util.go::canRunPod(...) used below, is very confusing to me.

Author


Ack. I left a TODO to get rid of that other method. Do you have a suggestion for a better name?

Author

@timstclair timstclair left a comment


Thanks Dawn,

Once this change is in, the compute resources (CPU, memory) allocated to such pending pods will be held forever until the upstream layer takes action. I am OK with this for now to prevent uncontrolled churn from the scheduler / control plane.

Yes, you're right, but I think this situation is strictly better than what we have today. Long term, we should put some more thought into how we want to deal with this case.


@timstclair
Author

Squashed & rebased.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 2, 2016
@k8s-ci-robot
Contributor

Jenkins verification failed for commit 1f79ef787e3448b553943fe1efc7695d34d1b85b. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@timstclair
Author

Regenerated with hack/update-bazel.sh.

@dchen1107
Member

LGTM

@dchen1107 dchen1107 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 3, 2016
@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue


Successfully merging this pull request may close these issues.

High churn from ReplicationController when Pod cannot schedule
7 participants