[AppArmor] Hold bad AppArmor pods in pending rather than rejecting #35342
Conversation
Jenkins GCI GCE e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Jenkins Kubemark GCE e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Jenkins GCE etcd3 e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Jenkins GKE smoke e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Jenkins GCE e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Jenkins GCI GKE smoke e2e failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Jenkins unit/integration failed for commit 515bf5488bbb37a9d5af8c5fc0089ecf6bad5588. Full PR test history. The magic incantation to run this job again is
Fixed build error.
Jenkins GCE Node e2e failed for commit baf07ad. Full PR test history. The magic incantation to run this job again is
This approach seems reasonable. I didn't look at the code, so my questions are probably answered there, but I'm wondering:
Both. Currently the Kubelet creates an event, and sets the PodStatus to Failed with an appropriate reason & message. With my change, the event is still published, and the reason & message on the PodStatus are still set, but the Pod is kept in the Pending state.
It shouldn't change anything. The AppArmor check still needs to pass before the Pod is allowed to run.
Deletion is handled in a separate loop from where we're checking the AppArmor status. The resources in Kubelet will be cleaned up, and there won't be any containers to kill. This should be unaffected by my change.
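To make the described behavior concrete, here is a minimal, self-contained Go sketch of the idea (the types and the recordSoftRejection helper are simplified stand-ins invented for illustration, not the actual Kubelet code): the rejection reason and message are still recorded on the status, but the phase stays Pending instead of flipping to Failed.

```go
package main

import "fmt"

// Simplified stand-ins for the real PodStatus fields.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodFailed  PodPhase = "Failed"
)

type PodStatus struct {
	Phase   PodPhase
	Reason  string
	Message string
}

// recordSoftRejection keeps the pod Pending while still surfacing the
// reason and message, instead of marking it Failed outright.
func recordSoftRejection(status *PodStatus, reason, message string) {
	status.Phase = PodPending // before this change, the phase would have been PodFailed
	status.Reason = reason
	status.Message = message
	// An event with the same reason/message would also be published here.
}

func main() {
	var status PodStatus
	recordSoftRejection(&status, "AppArmor", `profile "foo" is not loaded`)
	fmt.Printf("%+v\n", status)
}
```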
I have some nit comments; otherwise, overall I am ok with this change, except:
Once this change is in, the compute resources (CPU, memory) allocated to such pending pods will be held forever until an upstream layer takes action. I am ok with this for now to prevent uncontrolled churn from the scheduler / control plane.
@@ -1383,6 +1383,10 @@ func (dm *DockerManager) KillPod(pod *api.Pod, runningPod kubecontainer.Pod, gra

// NOTE(random-liu): The pod passed in could be *nil* when kubelet restarted.
func (dm *DockerManager) killPodWithSyncResult(pod *api.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {
	// Short circuit if there's nothing to kill.
	if len(runningPod.Containers) == 0 {
Does runningPod here include the PodInfraContainer too? I think it does, but I want to be sure. Otherwise, returning early would leak the podInfraContainer.
Yes, it does (see line 1424 below). This method is a no-op if len(runningPod.Containers) == 0; the check is just an optimization.
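For readers following along, a small self-contained sketch of the short-circuit pattern under discussion, using simplified stand-ins for kubecontainer.Pod and kubecontainer.PodSyncResult (the real types and the exact body of the early return may differ): because runningPod.Containers includes the pod infra container, an empty list means there is truly nothing to kill, and returning the empty named result early is harmless.

```go
package main

import "fmt"

// Simplified stand-ins for kubecontainer.Pod and kubecontainer.PodSyncResult.
type Container struct{ Name string }

type Pod struct {
	Name       string
	Containers []*Container // includes the pod infra container when the pod is running
}

type PodSyncResult struct{ SyncResults []string }

// killPod mirrors the short-circuit being discussed: if the running pod has
// no containers at all (regular or infra), there is nothing to kill, so the
// empty named result is returned immediately.
func killPod(runningPod Pod) (result PodSyncResult) {
	// Short circuit if there's nothing to kill.
	if len(runningPod.Containers) == 0 {
		return
	}
	for _, c := range runningPod.Containers {
		result.SyncResults = append(result.SyncResults, "killed "+c.Name)
	}
	return
}

func main() {
	fmt.Printf("%+v\n", killPod(Pod{Name: "empty-pod"}))
	fmt.Printf("%+v\n", killPod(Pod{Name: "running-pod", Containers: []*Container{{Name: "POD"}, {Name: "app"}}}))
}
```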
@@ -1516,6 +1548,29 @@ func (kl *Kubelet) canAdmitPod(pods []*api.Pod, pod *api.Pod) (bool, string, str
	return true, "", ""
}

func (kl *Kubelet) canRunPod(pod *api.Pod) lifecycle.PodAdmitResult {
Naming this new method canRunPod, the same as the existing pkg/kubelet/util.go::canRunPod(...) used below, is very confusing to me.
Ack. I left a TODO to get rid of that other method. Do you have a suggestion for a better name?
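As a rough illustration of what a Kubelet-level canRunPod could do: only the signature comes from the diff above; the softAdmitHandlers field, the handler interface, and the simplified types below are assumptions made for this sketch, not the actual Kubelet code.

```go
package main

import "fmt"

// Simplified stand-ins for api.Pod and lifecycle.PodAdmitResult.
type Pod struct{ Name string }

type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// PodAdmitHandler mirrors the lifecycle admit-handler pattern: each handler
// may veto running the pod and report why.
type PodAdmitHandler interface {
	Admit(pod *Pod) PodAdmitResult
}

type kubelet struct {
	softAdmitHandlers []PodAdmitHandler // hypothetical field; an AppArmor validator would live here
}

// canRunPod runs the node-local checks and returns the first rejection; the
// caller keeps the pod Pending instead of failing it outright.
func (kl *kubelet) canRunPod(pod *Pod) PodAdmitResult {
	for _, h := range kl.softAdmitHandlers {
		if result := h.Admit(pod); !result.Admit {
			return result
		}
	}
	return PodAdmitResult{Admit: true}
}

func main() {
	kl := &kubelet{}
	fmt.Printf("%+v\n", kl.canRunPod(&Pod{Name: "demo"}))
}
```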
Thanks Dawn,
Once this change is in, the compute resources (CPU, memory) allocated to such pending pods will be held forever until an upstream layer takes action. I am ok with this for now to prevent uncontrolled churn from the scheduler / control plane.
Yes, you're right, but I think this situation is strictly better than what we have today. Long term, we should put some more thought into how we want to deal with this case.
Squashed & rebased.
Jenkins verification failed for commit 1f79ef787e3448b553943fe1efc7695d34d1b85b. Full PR test history. The magic incantation to run this job again is
Regenerated.
LGTM
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue
Fixes #32837
Overview of the fix:
If the Kubelet needs to reject a Pod for a reason that the control plane doesn't understand (e.g. which AppArmor profiles are installed on the node), then the control plane might continuously try to run the Pod on the same rejecting node. This change adds a concept of "soft rejection", in which the Pod is admitted but not allowed to run (and is therefore held in a pending state). This prevents the Pod from being retried on other nodes, but it also prevents the high churn. This is consistent with how other missing local resources (e.g. volumes) are handled.
A side effect of the change is that Pods which are not initially runnable will be retried. This is desired behavior since it avoids a race condition when a new node is brought up but the AppArmor profiles have not yet been loaded on it.
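A minimal sketch of the overall "soft rejection" flow described above, with simplified, hypothetical types rather than the actual Kubelet code: a pod that fails a node-local check such as AppArmor is held in Pending and re-evaluated on later sync iterations, instead of being marked Failed and churned through the control plane.

```go
package main

import "fmt"

// Simplified, hypothetical types illustrating the soft-rejection decision.
type PodAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

type syncDecision string

const (
	runPod      syncDecision = "run"
	holdPending syncDecision = "hold in Pending" // soft rejection: the pod is not marked Failed
)

// decide shows the behavior described above: a pod that fails a node-local
// check is held in Pending and re-checked on later sync iterations, which is
// what allows an initially unrunnable pod to start once the profile loads.
func decide(canRun PodAdmitResult) syncDecision {
	if !canRun.Admit {
		return holdPending
	}
	return runPod
}

func main() {
	rejected := PodAdmitResult{Admit: false, Reason: "AppArmor", Message: "profile not loaded"}
	fmt.Println(decide(rejected))                    // hold in Pending
	fmt.Println(decide(PodAdmitResult{Admit: true})) // run
}
```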
@kubernetes/sig-node @timothysc @rrati @davidopp