kubelet will never terminate a pod if image registry is unavailable for a pod container #23742

Closed
@derekwaynecarr

Description

After an unbelievable amount of time debugging the following this past week:

#22045
openshift/origin#8176

I discovered that I could reliably cause the kubelet to never delete a pod that was Terminating if the pod had never been started correctly because the docker image pull returned a RegistryUnavailable error.

I verified this by modifying the kubelet to always return RegistryUnavailable for a particular image name. I then created a namespace, added a replication controller that referenced the image, and watched the pods get stuck in the "Waiting" state without ever reporting a reason. I then deleted the namespace and saw that the kubelet never properly handled the termination of those pods.
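For anyone who wants to reproduce this, the fault injection can be sketched roughly as below. This is only an illustration, not the actual kubelet patch: the `imagePuller` interface, `okPuller`, `faultyPuller`, the image name, and the `errRegistryUnavailable` value are all hypothetical stand-ins for the real kubelet types.

```go
package main

import (
	"errors"
	"fmt"
)

// errRegistryUnavailable is a hypothetical stand-in for the kubelet's
// RegistryUnavailable error value.
var errRegistryUnavailable = errors.New("RegistryUnavailable")

// imagePuller is a hypothetical, minimal interface. The real kubelet
// runtime interface is larger and takes more arguments.
type imagePuller interface {
	PullImage(image string) error
}

// okPuller pretends that every pull succeeds.
type okPuller struct{}

func (okPuller) PullImage(string) error { return nil }

// faultyPuller wraps another puller and always fails pulls of one
// chosen image name with errRegistryUnavailable.
type faultyPuller struct {
	inner    imagePuller
	badImage string
}

func (p faultyPuller) PullImage(image string) error {
	if image == p.badImage {
		return errRegistryUnavailable
	}
	return p.inner.PullImage(image)
}

func main() {
	p := faultyPuller{inner: okPuller{}, badImage: "example.invalid/broken:latest"}
	fmt.Println("broken image:", p.PullImage("example.invalid/broken:latest"))
	fmt.Println("other image:", p.PullImage("busybox"))
}
```

A replication controller that references the chosen image name then produces pods whose image pull always fails this way.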

I was able to fix this issue by modifying:

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/serialized_image_puller.go#L124

to return ErrImagePull when the registry was not available. With this change, my kubelet properly terminated the pod. I think the correct behavior is to not special-case RegistryUnavailable, since in many cases the registry may never become available (it is most definitely not a temporary situation).
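The shape of the change is roughly the following. This is a minimal sketch under assumptions, not the real code in serialized_image_puller.go: `classifyPullError` is a hypothetical helper, and `errImagePull` and `errRegistryUnavailable` are stand-ins for the kubelet's ErrImagePull and RegistryUnavailable values.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the kubelet's error values.
var (
	errRegistryUnavailable = errors.New("RegistryUnavailable")
	errImagePull           = errors.New("ErrImagePull")
)

// classifyPullError is a hypothetical helper that maps the raw error
// from an image pull to the error reported for the container. It does
// not special-case errRegistryUnavailable: every failed pull, including
// an unavailable registry, is reported as errImagePull, and the original
// error text is kept as the message.
func classifyPullError(pullErr error) (string, error) {
	if pullErr == nil {
		return "", nil
	}
	return pullErr.Error(), errImagePull
}

func main() {
	msg, err := classifyPullError(errRegistryUnavailable)
	fmt.Println(err, "/", msg)
}
```

The point of the design is that an unavailable registry goes down the same path as any other pull failure, rather than being treated as a temporary condition.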

I suspect this is a potential source of a number of instances of the "namespace failed to delete" flake, where deletion failed because pods remained.

/cc @kubernetes/sig-node @vishh @dchen1107 @yujuhong @kubernetes/rh-cluster-infra

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
