PLEG: reinspect pods that failed prior inspections #25077

ncdc · 2016-05-03T15:14:41Z

Fix the following sequence of events:

1. relist call 1 successfully inspects a pod (just has infra container)
1. relist call 2 gets an error inspecting the same pod (has infra container and a transient
   container that failed to create) and doesn't update the old/new pod records
1. relist calls 3+ don't inspect the pod any more (just has infra container so it doesn't look like
   anything changed)

This change adds a new list that keeps track of pods that failed inspection and retries them the
next time relist is called. Without this change, a pod in this state would never be inspected again,
its entry in the status cache would never be updated, and the pod worker would never call syncPod
again because the most recent entry in the status cache has an error associated with it. As a
result, the pod would be stuck Terminating forever, unless the user issued a deletion with a grace
period value of 0.
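
For readers who want the idea in code, here is a minimal sketch of the retry list described above. It is not the actual diff in this PR: `relister`, `inspectFunc`, `lastSeen`, `needsReinspection`, and `errTransient` are hypothetical names, and a pod's container state is reduced to a single string for illustration.

```go
package main

import "errors"

// NOTE: every name in this sketch is hypothetical; it only illustrates the
// "remember failed inspections and retry them on the next relist" idea.

// errTransient is a stand-in for a transient inspection failure.
var errTransient = errors.New("transient inspection failure")

type podID string

// inspectFunc inspects one pod and reports whether the inspection failed.
type inspectFunc func(id podID) error

type relister struct {
	inspect inspectFunc
	// lastSeen holds the state recorded at the last successful inspection.
	lastSeen map[podID]string
	// needsReinspection holds pods whose most recent inspection failed.
	needsReinspection map[podID]struct{}
}

func newRelister(inspect inspectFunc) *relister {
	return &relister{
		inspect:           inspect,
		lastSeen:          map[podID]string{},
		needsReinspection: map[podID]struct{}{},
	}
}

// relist inspects every pod whose observed state changed, then retries pods
// that failed a previous inspection even if they look unchanged now.
// It returns the pods it attempted to inspect during this call.
func (r *relister) relist(current map[podID]string) []podID {
	var attempted []podID
	tried := map[podID]bool{}

	for id, state := range current {
		if prev, ok := r.lastSeen[id]; ok && prev == state {
			continue // nothing appears to have changed
		}
		tried[id] = true
		attempted = append(attempted, id)
		if err := r.inspect(id); err != nil {
			// Keep the old record, but remember to try again next time.
			r.needsReinspection[id] = struct{}{}
			continue
		}
		delete(r.needsReinspection, id)
		r.lastSeen[id] = state
	}

	// Retry earlier failures that the change detection above skipped.
	for id := range r.needsReinspection {
		if tried[id] {
			continue // already inspected in this relist
		}
		attempted = append(attempted, id)
		if err := r.inspect(id); err != nil {
			continue // still failing; stays on the list
		}
		delete(r.needsReinspection, id)
		if state, ok := current[id]; ok {
			r.lastSeen[id] = state
		}
	}
	return attempted
}
```

In the three-step sequence above, the second `relist` puts the pod on `needsReinspection`, so the third `relist` inspects it again even though its state matches `lastSeen`.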

Fixes #24819

cc @kubernetes/rh-cluster-infra @kubernetes/sig-node


ncdc · 2016-05-04T13:13:50Z

@yujuhong PTAL, thanks!

yujuhong · 2016-05-04T23:56:23Z

LGTM. Thanks!

kubelet will keep trying to create the container indefinitely with this PR, but I think that's expected and is consistent with other types of failures.

k8s-bot · 2016-05-05T16:34:00Z

GCE e2e build/test passed for commit 3a87bfb.

k8s-github-robot · 2016-05-07T04:35:56Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-bot · 2016-05-07T05:14:04Z

GCE e2e build/test passed for commit 3a87bfb.

k8s-github-robot · 2016-05-07T05:14:08Z

Automatic merge from submit-queue

googlebot added the cla: yes label May 3, 2016

ncdc added the release-note and sig/node labels May 3, 2016

k8s-github-robot assigned yujuhong May 3, 2016

k8s-github-robot added the size/L label May 3, 2016

yujuhong added the lgtm label May 4, 2016

ncdc added this to the v1.3 milestone May 5, 2016

pweil- mentioned this pull request May 6, 2016

PLEG: reinspect pods that failed prior inspections openshift/origin#8778

Merged

k8s-github-robot merged commit 6600506 into kubernetes:master May 7, 2016

derekwaynecarr mentioned this pull request May 11, 2016

router pod stuck in ContainerCreating openshift/openshift-ansible#1878

Closed

xiangpengzhao mentioned this pull request Dec 29, 2016

Pods can get stuck Terminating in certain situations if running a container fails #24819

Closed

ncdc deleted the pleg-retry branch February 13, 2017 17:24
