
Pods can get stuck Terminating in certain situations if running a container fails #24819

Closed
ncdc opened this issue Apr 26, 2016 · 32 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@ncdc
Member

ncdc commented Apr 26, 2016

In one of our clusters, we are not allowing containers to run images that use VOLUME instructions without a corresponding Kubernetes volume and volumeMount to back them. We've done this by making /var/lib/docker/volumes immutable (chattr +i). An attempt to run a pod like this results in the container creation failing, and there's no way for the pod to run successfully.

What we're seeing is that sometimes, attempts to delete the pod result in the pod getting stuck Terminating unless you delete it with a grace period of 0. This appears to be because of the following:

  1. syncPod processes new pod and tries to create and run the container whose image has a VOLUME
  2. while this is happening, the PLEG relisting happens
    1. list all containers
    2. the list includes the "in-flight" container (that is about to fail creation)
    3. in-flight container is inspected
    4. inspection fails with something like this: API error (500): Unknown device 27038b8e8b60506e1744c7ab15e06a3e24855fa925678e75d72b6e1fd3bd4a31
    5. PLEG writes pod status to pod status cache, which includes the error above
    6. at the end of relisting, PLEG updates the global timestamp for the pod status cache
  3. syncPod finishes in error and requeues this pod
  4. 1 or more PLEG relistings happen, updating the global timestamp
  5. another iteration of the pod worker retrieves the pod status from the cache
    1. the cache's global timestamp is newer than the last sync time for the pod worker, so the cache returns the cached status, which includes the "Unknown device" error
    2. because the error is non-nil, the worker exits its loop early without calling syncPod (see the sketch after this list):
      status, err := p.podCache.GetNewerThan(podID, lastSyncTime)
      if err != nil {
          return err
      }
  6. because syncPod isn't called, the pod isn't deleted
  7. repeat infinitely
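
To make steps 5-7 concrete, here is a minimal, self-contained sketch of the feedback loop in Go. The names (statusCache, podWorkerIteration, syncPod) are simplified stand-ins for the real kubelet types, not the actual implementation:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    type podStatus struct {
        containerStatuses []string // empty here, like the cached status in the log above
    }

    type statusCache struct {
        status          podStatus
        inspectErr      error     // the "Unknown device" error recorded by the relist
        globalTimestamp time.Time // bumped at the end of every relist
    }

    // GetNewerThan hands back the cached entry once the global timestamp has
    // passed minTime, including the inspection error stored alongside it.
    func (c *statusCache) GetNewerThan(minTime time.Time) (podStatus, error) {
        if c.globalTimestamp.After(minTime) {
            return c.status, c.inspectErr // step 5.1: stale entry, old error
        }
        return podStatus{}, errors.New("cache not ready")
    }

    func podWorkerIteration(c *statusCache, lastSyncTime time.Time) error {
        _, err := c.GetNewerThan(lastSyncTime)
        if err != nil {
            return err // step 5.2: bail out before syncPod ever runs
        }
        return syncPod() // step 6 never happens while the cached error persists
    }

    func syncPod() error { return nil } // would normally tear down the terminating pod

    func main() {
        c := &statusCache{
            inspectErr:      errors.New("API error (500): Unknown device ..."),
            globalTimestamp: time.Now(),
        }
        // Relists only bump the timestamp; they never replace the cached entry,
        // so every worker iteration sees the same error (step 7).
        for i := 0; i < 3; i++ {
            fmt.Println(podWorkerIteration(c, time.Now().Add(-time.Minute)))
        }
    }

Because the relist only advances the global timestamp and never replaces the cached entry for this pod, every worker iteration hits the same error, and the only way out is a delete with a grace period of 0.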

@kubernetes/sig-node @kubernetes/rh-cluster-infra @ironcladlou @smarterclayton

@ncdc
Member Author

ncdc commented Apr 26, 2016

This is easy to reproduce by inserting a 2-second sleep before the error return here:

return kubecontainer.ContainerID{}, fmt.Errorf("runContainer: %v", err)

@ncdc ncdc added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 26, 2016
@yujuhong
Contributor

@ncdc, does the container inspection always fail, or does it only fail in a certain window of time? I am trying to understand the scope of the problem.

If it's a permanent docker inspect failure, we've seen similar issues before due to docker bugs. Currently, kubelet doesn't know how to distinguish transient from permanent docker inspect failures, and will keep retrying. #24361 proposed a way to recognize this problem.

@ncdc
Member Author

ncdc commented Apr 27, 2016

@yujuhong it only ever inspects this particular container one time (during the PLEG relist where it sees the 500 Unknown device error), and that inspection fails. That's what's recorded in the PLEG status cache:

PLEG: Write status for bad/default: &{ID:4c480e27-0bca-11e6-9920-001c42b324b7 Name:bad Namespace:default IP: ContainerStatuses:[]} (err: API error (500): Unknown device 27038b8e8b60506e1744c7ab15e06a3e24855fa925678e75d72b6e1fd3bd4a31

If the kubelet or you tried to inspect container 27038b... again, it would fail with a 404 container not found error. However, such an inspection does not occur. This is because when the PLEG does more relists, there are no new events generated for this pod -- the "bad" container doesn't exist any more, and the infra container is exactly the same as it was before.

@ncdc
Member Author

ncdc commented Apr 27, 2016

oldPod and pod are identical, with just the infra container, here:

oldPod := g.podRecords.getOld(pid)
pod := g.podRecords.getCurrent(pid)
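
To illustrate why the later relists stay quiet, here is a simplified, self-contained sketch with stand-in types (not the real podRecords/GenericPLEG code): when the old and current records list the same containers in the same states, the diff is empty, so no events fire and the pod's cache entry, still holding the inspection error, is never refreshed.

    package main

    import "fmt"

    type container struct {
        ID    string
        State string
    }

    // computeEvents is a stand-in for the relist diff: an event is produced
    // only when a container's state differs between the old and current views.
    func computeEvents(oldPod, pod []container) []string {
        prev := map[string]string{}
        for _, c := range oldPod {
            prev[c.ID] = c.State
        }
        var events []string
        for _, c := range pod {
            if prev[c.ID] != c.State {
                events = append(events, "state change for "+c.ID+": "+c.State)
            }
        }
        return events
    }

    func main() {
        infra := container{ID: "infra", State: "running"}
        // The failed "bad" container is already gone, so both views contain
        // only the unchanged infra container and the diff is empty.
        oldPod := []container{infra}
        pod := []container{infra}
        fmt.Println("events:", computeEvents(oldPod, pod)) // events: []
    }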

@yujuhong
Contributor

@ncdc, so the timeline is:

  1. PLEG relists and sees infra container A
  2. PLEG relists and sees new container B and fails the inspection with 500; it updates the cache with the error and maintains the old view of only seeing A (so that it'd retry the next time)
  3. Something deleted container B(?)
  4. PLEG relists and sees only container A, which is consistent with the view at the end of (1). No inspection.

I think PLEG should explicitly record the pods/containers to retry the next time if an error is encountered (as opposed to relying on maintaining the old view of container states). What do you think?
There is one thing that's unclear to me though. What exactly deleted container B in (2)?

@ncdc
Member Author

ncdc commented Apr 27, 2016

@yujuhong the creation of container B failed (permission denied error in /var/lib/docker/volumes). You might argue that the fact that the container even shows up when listing all containers is a bug, since Docker is in the middle of creating the container but that call hasn't returned yet.

I like the idea of recording what to retry. How are you thinking we can implement this?

@ncdc
Member Author

ncdc commented Apr 27, 2016

Also, re "it updates the cache with the error and maintains the old view of only seeing A (so that it'd retry the next time)", it's the following continue statement that keeps it from updating the internal state (oldPod/pod):

@yujuhong
Contributor

@yujuhong the creation of container B failed (permission denied error in /var/lib/docker/volumes). You might argue that the fact that the container even shows up when listing all containers is a bug, since Docker is in the middle of creating the container but that call hasn't returned yet.

So the container appears briefly in docker ps during creation, but then disappears when it fails? That doesn't sound like the best behavior, but in any case, we should make PLEG more robust.

Also, re "it updates the cache with the error and maintains the old view of only seeing A (so that it'd retry the next time)", it's the following continue statement that keeps it from updating the internal state (oldPod/pod):

Yes, it was done intentionally so that PLEG will retry the inspection again in the next relist, assuming the container was still there.

We should send out the events regardless (even if the inspect fails), but record the pod ID internally so that we can retry and update the cache in the next relist. The downside is that we'd decouple the event and cache update even further; the consumer of the event may not see an up-to-date cache. I think it's reasonable to assume that the consumers will retry if the cache is not ready, like what the pod workers do today, as long as this doesn't happen often.

@ncdc
Member Author

ncdc commented Apr 27, 2016

The key is to make sure that the cache doesn't continue to return the same PodStatus over and over again along with the inspection error.

@pmorie
Member

pmorie commented Apr 27, 2016

We need to add context to the event created when the PLEG inspect fails.

@ncdc
Member Author

ncdc commented Apr 29, 2016

I think it's reasonable to assume that the consumers will retry if the cache is not ready, like what the pod workers do today, as long as this doesn't happen often.

I'm actually seeing this happen frequently (the presence of the "Unknown device" errors during PLEG inspection), but it's not a problem because it's happening during the creation of a container, and a subsequent PLEG relist ends up with a successful inspection.

I wrote a sample Go program that creates a container in one goroutine and then inspects it as often and as fast as possible until it gets no error. I was able to make the inspection fail 45-50 times with "Unknown device" before it succeeded. While this isn't really a valid test (hammering the Docker daemon to inspect one container), it does highlight that this can certainly happen.
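
For illustration, a rough, self-contained sketch of that probe. The Docker calls are faked here (simulateCreate and simulateInspect are made-up helpers), so only the concurrency and retry pattern mirrors the real program:

    package main

    import (
        "errors"
        "fmt"
        "sync/atomic"
        "time"
    )

    var created int32 // 0 while the fake daemon is still "creating" the container

    // simulateCreate stands in for the goroutine that asks Docker to create a
    // container; for a short window the container exists but cannot be inspected.
    func simulateCreate() {
        time.Sleep(50 * time.Millisecond)
        atomic.StoreInt32(&created, 1)
    }

    // simulateInspect stands in for docker inspect, failing with the same kind
    // of 500 the PLEG relist hits until creation has settled.
    func simulateInspect(id string) error {
        if atomic.LoadInt32(&created) == 0 {
            return errors.New("API error (500): Unknown device " + id)
        }
        return nil
    }

    func main() {
        const id = "27038b8e8b60"
        go simulateCreate()

        failures := 0
        for simulateInspect(id) != nil {
            failures++ // count how many inspections fail before one succeeds
        }
        fmt.Printf("inspect failed %d times before succeeding\n", failures)
    }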

@ncdc
Member Author

ncdc commented Apr 29, 2016

It appears that removing the continue statement and this log message:

      glog.Errorf("PLEG: Ignoring events for pod %s/%s: %v", pod.Name, pod.Namespace, err)

solves the problem. It does have the side effect that the pod worker will keep trying to create the container over and over again, and it will keep failing (in my example, where we disallow images with Docker volumes). Maybe we need to implement what is suggested here?

      // TODO(bburns) : Perhaps blacklist a container after N failures?
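
Purely as an illustration of what that TODO hints at (none of these names exist in the kubelet), a per-container failure counter could look roughly like this:

    package main

    import "fmt"

    // startBlacklist is a made-up name; it tracks consecutive start failures
    // per container key so the kubelet could stop recreating a hopeless
    // container on every sync.
    type startBlacklist struct {
        failures map[string]int
        limit    int
    }

    func newStartBlacklist(limit int) *startBlacklist {
        return &startBlacklist{failures: map[string]int{}, limit: limit}
    }

    func (b *startBlacklist) recordFailure(key string) { b.failures[key]++ }
    func (b *startBlacklist) recordSuccess(key string) { delete(b.failures, key) }

    // shouldSkip reports whether this container has failed to start too many
    // times in a row and should be left alone (or backed off) for now.
    func (b *startBlacklist) shouldSkip(key string) bool {
        return b.failures[key] >= b.limit
    }

    func main() {
        bl := newStartBlacklist(3)
        key := "default/bad-pod/bad-container"
        for i := 0; i < 3; i++ {
            bl.recordFailure(key)
        }
        fmt.Println("skip further restarts:", bl.shouldSkip(key)) // true
    }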

@ncdc
Member Author

ncdc commented May 2, 2016

@yujuhong what are the implications (if any) of removing that continue statement?

@yujuhong
Contributor

yujuhong commented May 2, 2016

@yujuhong what are the implications (if any) of removing that continue statement?

Once you remove the continue statement, if a normal inspection fails due to some transient error, PLEG will send out the event and it won't retry in the next relist.
For example, suppose you create a container and it starts running. PLEG sees this new container in the listing and tries to inspect it, but the inspection fails. It then updates the cache with the inspection error and sends out the ContainerStarted event. The pod worker wakes up due to the event, but can't sync because of the inspection error. In the next relist, PLEG won't do anything, and the cache stays stale (with the inspection error) indefinitely.

In short, we can't just remove the continue statement. We need to ensure that PLEG will retry the inspection in the next relist, as I said in #24819 (comment).

@ncdc
Member Author

ncdc commented May 2, 2016

So we need to add a new podsToReinspect []*kubecontainer.Pod to GenericPLEG and process that each time through relist, or something to that effect?

@yujuhong
Contributor

yujuhong commented May 2, 2016

So we need to add a new podsToReinspect []*kubecontainer.Pod to GenericPLEG and process that each time through relist, or something to that effect?

Yes, in addition to detecting container changes via docker ps -a, PLEG will also inspect these pods.
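
Roughly, the bookkeeping could look like the following self-contained sketch (simplified types and a fake inspector, not the real GenericPLEG code): pods whose inspection failed are remembered and retried on the next relist, so the cached error is eventually replaced even when the container list itself doesn't change.

    package main

    import (
        "errors"
        "fmt"
    )

    // pleg is a simplified stand-in for GenericPLEG: it keeps a list of pods
    // whose last inspection failed and retries them on every relist.
    type pleg struct {
        podsToReinspect []string         // pod UIDs whose inspection failed
        cache           map[string]error // pod UID -> last inspection result
        failuresLeft    map[string]int   // fake inspector: fail this many more times
    }

    func (g *pleg) inspect(uid string) error {
        if g.failuresLeft[uid] > 0 {
            g.failuresLeft[uid]--
            return errors.New("API error (500): Unknown device")
        }
        return nil
    }

    // relist inspects the pods that changed in this pass plus everything that
    // failed last time, so a cached error is eventually replaced.
    func (g *pleg) relist(changedPods []string) {
        toInspect := append(g.podsToReinspect, changedPods...)
        g.podsToReinspect = nil
        for _, uid := range toInspect {
            err := g.inspect(uid)
            g.cache[uid] = err
            if err != nil {
                g.podsToReinspect = append(g.podsToReinspect, uid)
            }
        }
    }

    func main() {
        g := &pleg{
            cache:        map[string]error{},
            failuresLeft: map[string]int{"bad-pod": 2},
        }
        g.relist([]string{"bad-pod"}) // first relist: inspection fails
        g.relist(nil)                 // nothing changed, but we retry anyway
        g.relist(nil)                 // third relist finally clears the cached error
        fmt.Println("cache entry for bad-pod:", g.cache["bad-pod"])
    }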

@ncdc
Member Author

ncdc commented May 2, 2016

Ok, I can do that 😄

k8s-github-robot pushed a commit that referenced this issue May 7, 2016
Automatic merge from submit-queue

PLEG: reinspect pods that failed prior inspections

Fix the following sequence of events:

1. relist call 1 successfully inspects a pod (just has infra container)
2. relist call 2 gets an error inspecting the same pod (has infra container and a transient container that failed to create) and doesn't update the old/new pod records
3. relist calls 3+ don't inspect the pod any more (just has infra container, so it doesn't look like anything changed)

This change adds a new list that keeps track of pods that failed inspection and retries them the
next time relist is called. Without this change, a pod in this state would never be inspected again,
its entry in the status cache would never be updated, and the pod worker would never call syncPod
again because the most recent entry in the status cache has an error associated with it. Without
this change, pods in this state would be stuck Terminating forever, unless the user issued a
deletion with a grace period value of 0.

Fixes #24819 

cc @kubernetes/rh-cluster-infra @kubernetes/sig-node
@DenisIzmaylov

DenisIzmaylov commented Jun 7, 2016

Has anything changed? It seems to me I've hit this issue.

@ncdc
Member Author

ncdc commented Jun 7, 2016

@DenisIzmaylov this is a very specific issue that has a very particular root cause for why pods get stuck terminating, and it has been fixed by #25077. There may be other reasons why pods get stuck terminating, so I'm wondering if you're maybe hitting a different root cause?

@DenisIzmaylov

Maybe. How can I see the logs for this issue?

@yujuhong
Contributor

yujuhong commented Jun 7, 2016

@DenisIzmaylov, one common reason for pods to be stuck at terminating is that the worker goroutine responsible for the pod is stuck at certain operations (e.g., image pulling, etc).
You can check the output of kubectl describe pod <pod_name> to see the status and events associated with the pod. If there are no events, that may be because all events get deleted after an hour to save space. You can also check the kubelet log (/var/log/kubelet.log, or go through journalctl if it manages your logs) to see what happened to the pod and what the last operation kubelet performed for the pod was.

By the way, you can forcibly delete the pod with kubectl delete pod --grace-period=0 <pod_name>

@DenisIzmaylov

I have kubectl get pods output like this:

...
my-nginx-3869474375-oj1i0    0/1       Terminating   0          20h
my-nginx-3869474375-oui1w    0/1       Terminating   0          20h
my-nginx-3869474375-p0ixa    0/1       Terminating   0          20h
my-nginx-3869474375-p21aq    0/1       Terminating   0          20h
my-nginx-3869474375-p3e3q    0/1       Terminating   0          20h
my-nginx-3869474375-p5qxp    0/1       Terminating   0          20h
my-nginx-3869474375-pcesx    0/1       Terminating   0          20h
my-nginx-3869474375-q238e    0/1       Terminating   0          20h
my-nginx-3869474375-qto18    0/1       Terminating   0          20h
my-nginx-3869474375-r4u6b    0/1       Terminating   0          20h
my-nginx-3869474375-rlv4a    0/1       Terminating   0          20h
my-nginx-3869474375-rucy2    0/1       Terminating   0          20h
my-nginx-3869474375-rvmqo    0/1       Terminating   0          20h
my-nginx-3869474375-ry7bj    0/1       Terminating   0          20h
my-nginx-3869474375-s2mwi    0/1       Terminating   0          20h
...

@yujuhong
Contributor

yujuhong commented Jun 8, 2016

kubectl describe pod <pod_name> will have more details about a single pod.

@DenisIzmaylov

I ran kubectl delete pods --grace-period=0 --all to remove all pods (it's a new cluster, so that's OK). But a few seconds later they came back. :\

Details about a single pod:

Name:               my-nginx-3869474375-zygrz
Namespace:          default
Node:               <removed IP>
Labels:             pod-template-hash=3869474375,run=my-nginx
Status:             Terminating (expires Tue, 07 Jun 2016 20:03:08 -0400)
Termination Grace Period:   30s
IP:             
Controllers:            ReplicaSet/my-nginx-3869474375
Containers:
  my-nginx:
    Image:  nginx
    Port:   81/TCP
    QoS Tier:
      cpu:  BestEffort
      memory:   BestEffort
    Environment Variables:
Volumes:
  default-token-i7wig:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-i7wig
Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  1m        1m      1   {default-scheduler }            Normal      Scheduled       Successfully assigned my-nginx-3869474375-zygrz to <removed IP>
  1m        1m      1   {kubelet <removed IP>}          Warning     MissingClusterDNS   kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to DNSDefault policy.

@DenisIzmaylov

But I haven't configured DNS and there's no mention of it in my cloud configs.

@yujuhong
Contributor

yujuhong commented Jun 8, 2016

In the pod spec, if you don't specify a DNS policy, the default will be "ClusterFirst".

How long did your pods stay in the Terminating state? Did they ever become running before you attempted to delete them?
If you are ok with providing kubelet.log, I can also take a look.

@ncdc
Member Author

ncdc commented Jun 8, 2016

Would it be possible to move this discussion to either StackOverflow or a different issue?

@DenisIzmaylov

@yujuhong Hm, I've not found it:

 $ sudo ls /var/log/
btmp              btmp-20160419.gz  containers/       faillog           journal/          lastlog           wtmp       

@DenisIzmaylov

@ncdc of course

@DenisIzmaylov

DenisIzmaylov commented Jun 8, 2016

I've created a new one.

@xiangpengzhao
Contributor

@ncdc @yujuhong PR #25077 should have fixed this issue already in Kubernetes 1.3.8, but I encountered almost the same issue in our env.

A pod got stuck Terminating after deleting its rc, and then new pods got stuck ContainerCreating when creating their rc. When I tried to delete the rc of the new pod, the new pod got stuck Terminating, too.

Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.8", GitCommit:"693ef591120267007be359f97191a6253e0e4fb5", GitTreeState:"clean", BuildDate:"2016-09-28T03:03:21Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.8", GitCommit:"693ef591120267007be359f97191a6253e0e4fb5", GitTreeState:"clean", BuildDate:"2016-09-28T02:52:25Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Not sure if it's the same situation @ncdc mentioned. I just got things like this in kubelet.log:

E1229 17:10:53.705545    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
E1229 17:10:55.703951    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
E1229 17:10:57.705719    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
I1229 17:10:57.742208    2241 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/secret/9df52036-c2c6-11e6-9c7f-fa163e48f799-default-token-6zqlr" (spec.Name: "default-token-6zqlr") to pod "9df52036-c2c6-11e6-9c7f-fa163e48f799" (UID: "9df52036-c2c6-11e6-9c7f-fa163e48f799"). Volume is already mounted to pod, but remount was requested.
I1229 17:10:57.745750    2241 operation_executor.go:775] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/9df52036-c2c6-11e6-9c7f-fa163e48f799-default-token-6zqlr" (spec.Name: "default-token-6zqlr") pod "9df52036-c2c6-11e6-9c7f-fa163e48f799" (UID: "9df52036-c2c6-11e6-9c7f-fa163e48f799").
E1229 17:10:59.704565    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
E1229 17:11:01.705682    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory

@sellers

sellers commented Apr 3, 2018

I have had this happen to me as part of an issue with resource limits and the cluster autoscaler occurring at the same time as a kops rolling-update. This left the pod in a Terminating state without a ReplicaSet or any Deployment, StatefulSet, Job, etc.

I tried the force delete and it hung for 20 minutes before I stopped it. The describe output shows that the pod didn't scale up another node (the autoscaler was failing), and PodScheduled is False because it was requested to terminate. I suspect that the log entry was cleaned up, as mentioned above, so the kubelet(?) does not know its status and cannot determine it.

Clicking on it in the Dashboard of course throws an error because the ReplicaSet (or parent) is missing.

Does anyone have any other ideas of how to purge it from the list (besides the --grace-period approach)?

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Apr 21, 2020

Bug 1740604: UPSTREAM 89300: Expand the PVC validation messages

Origin-commit: 298daf9157d28e983e515a957886fe2d9c8c5c3f