PLEG: reinspect pods that failed prior inspections #25077

Merged
merged 1 commit into from
May 7, 2016

ncdc
Member

@ncdc ncdc commented May 3, 2016

Fix the following sequence of events:

  1. relist call 1 successfully inspects a pod (just has infra container)
  2. relist call 2 gets an error inspecting the same pod (has infra container and a transient
    container that failed to create) and doesn't update the old/new pod records
  3. relist calls 3+ don't inspect the pod any more (just has infra container so it doesn't look like
    anything changed)

This change adds a new list that keeps track of pods that failed inspection and retries them the
next time relist is called. Without it, a pod in this state would never be inspected again, its
entry in the status cache would never be updated, and the pod worker would never call syncPod
again because the most recent entry in the status cache has an error associated with it. Such pods
would be stuck Terminating forever unless the user issued a deletion with a grace period value
of 0.
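
The retry behavior described above can be illustrated with a minimal Go sketch. This is a simplified model and not the code in this PR: the `relister` type, its fields, `newRelister`, `updateCache`, `errInspect`, and the string "summary" used to detect changes are all hypothetical names chosen for illustration. It assumes relist receives a cheap per-pod listing and only performs the expensive inspection when the listing differs from the last successfully recorded one, or when the pod is on the reinspection list.

```go
package main

import "errors"

// errInspect is a hypothetical error standing in for a failed pod inspection.
var errInspect = errors.New("inspection failed")

// relister is a simplified, hypothetical model of the state relist works with.
type relister struct {
	records         map[string]string        // container summary at the last successful inspection
	statusCache     map[string]error         // nil means the latest inspection succeeded
	podsToReinspect map[string]struct{}      // pods whose most recent inspection failed
	inspect         func(podID string) error // the expensive inspection call (assumed)
}

func newRelister(inspect func(podID string) error) *relister {
	return &relister{
		records:         map[string]string{},
		statusCache:     map[string]error{},
		podsToReinspect: map[string]struct{}{},
		inspect:         inspect,
	}
}

// updateCache inspects one pod and stores the outcome in the status cache.
func (r *relister) updateCache(podID string) error {
	err := r.inspect(podID)
	r.statusCache[podID] = err
	return err
}

// relist inspects pods whose listing changed, then retries pods that failed
// inspection during an earlier relist even if their listing looks unchanged.
func (r *relister) relist(listing map[string]string) {
	needsReinspection := map[string]struct{}{}

	for podID, summary := range listing {
		if r.records[podID] == summary {
			continue // nothing appears to have changed
		}
		// This pod is being inspected now, so it needs no separate retry below.
		delete(r.podsToReinspect, podID)
		if err := r.updateCache(podID); err != nil {
			// The record is left untouched; remember the pod for the next relist.
			needsReinspection[podID] = struct{}{}
			continue
		}
		r.records[podID] = summary
	}

	// Retry the pods left over from previous relists.
	for podID := range r.podsToReinspect {
		if err := r.updateCache(podID); err != nil {
			needsReinspection[podID] = struct{}{}
		}
	}

	r.podsToReinspect = needsReinspection
}
```

In this model, replaying the three relist calls from the list above ends with the pod inspected again on the third call, because it was placed on `podsToReinspect` by the second. Dropping the retry loop reproduces the stuck state: the listing matches the old record, so the error entry in the status cache is never replaced.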

Fixes #24819

cc @kubernetes/rh-cluster-infra @kubernetes/sig-node

@ncdc ncdc added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 3, 2016
@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 3, 2016
@ncdc
Member Author

ncdc commented May 4, 2016

@yujuhong PTAL, thanks!

@yujuhong
Contributor

yujuhong commented May 4, 2016

LGTM. Thanks!

With this PR, kubelet will keep trying to create the container indefinitely, but I think that's expected and is consistent with other types of failures.

@yujuhong yujuhong added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2016
@ncdc ncdc added this to the v1.3 milestone May 5, 2016
@k8s-bot

k8s-bot commented May 5, 2016

GCE e2e build/test passed for commit 3a87bfb.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented May 7, 2016

GCE e2e build/test passed for commit 3a87bfb.

@k8s-github-robot

Automatic merge from submit-queue

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants