
Pods can get stuck Terminating in certain situations if running a container fails #24819

Closed
ncdc opened this issue Apr 26, 2016 · 32 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@ncdc
Member

ncdc commented Apr 26, 2016

In one of our clusters, we are not allowing containers to run images that use VOLUME instructions without a corresponding Kubernetes volume and volumeMount to back them. We've done this by making /var/lib/docker/volumes immutable (chattr +i). An attempt to run a pod like this results in the container creation failing, and there's no way for the pod to run successfully.

What we're seeing is that sometimes, attempts to delete the pod result in the pod getting stuck Terminating unless you delete it with a grace period of 0. This appears to be because of the following:

  1. syncPod processes new pod and tries to create and run the container whose image has a VOLUME
  2. while this is happening, the PLEG relisting happens
    1. list all containers
    2. the list includes the "in-flight" container (that is about to fail creation)
    3. in-flight container is inspected
    4. inspection fails with something like this: API error (500): Unknown device 27038b8e8b60506e1744c7ab15e06a3e24855fa925678e75d72b6e1fd3bd4a31
    5. PLEG writes pod status to pod status cache, which includes the error above
    6. at the end of relisting, PLEG updates the global timestamp for the pod status cache
  3. syncPod finishes in error and requeues this pod
  4. 1 or more PLEG relistings happen, updating the global timestamp
  5. another iteration of the pod worker retrieves the pod status from the cache
    1. the cache's global timestamp is newer than the last sync time for the pod worker, so the cache returns the cached status, which includes the "Unknown device" error
    2. because the error is non-nil, the worker exits its loop early without calling syncPod (see the sketch after this list):
      status, err := p.podCache.GetNewerThan(podID, lastSyncTime)
      if err != nil {
          return err
      }
  6. because syncPod isn't called, the pod isn't deleted
  7. repeat infinitely
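
To make steps 5-7 concrete, here is a minimal, self-contained sketch of the feedback loop in Go. The names (statusCache, podWorkerIteration, syncPod) are simplified stand-ins for the real kubelet types, not the actual implementation:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    type podStatus struct {
        containerStatuses []string // empty here, like the cached status in the log above
    }

    type statusCache struct {
        status          podStatus
        inspectErr      error     // the "Unknown device" error recorded by the relist
        globalTimestamp time.Time // bumped at the end of every relist
    }

    // GetNewerThan hands back the cached entry once the global timestamp has
    // passed minTime, including the inspection error stored alongside it.
    func (c *statusCache) GetNewerThan(minTime time.Time) (podStatus, error) {
        if c.globalTimestamp.After(minTime) {
            return c.status, c.inspectErr // step 5.1: stale entry, old error
        }
        return podStatus{}, errors.New("cache not ready")
    }

    func podWorkerIteration(c *statusCache, lastSyncTime time.Time) error {
        _, err := c.GetNewerThan(lastSyncTime)
        if err != nil {
            return err // step 5.2: bail out before syncPod ever runs
        }
        return syncPod() // step 6 never happens while the cached error persists
    }

    func syncPod() error { return nil } // would normally tear down the terminating pod

    func main() {
        c := &statusCache{
            inspectErr:      errors.New("API error (500): Unknown device ..."),
            globalTimestamp: time.Now(),
        }
        // Relists only bump the timestamp; they never replace the cached entry,
        // so every worker iteration sees the same error (step 7).
        for i := 0; i < 3; i++ {
            fmt.Println(podWorkerIteration(c, time.Now().Add(-time.Minute)))
        }
    }

Because the relist only advances the global timestamp and never replaces the cached entry for this pod, every worker iteration hits the same error, and the only way out is a delete with a grace period of 0.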

@kubernetes/sig-node @kubernetes/rh-cluster-infra @ironcladlou @smarterclayton

@ncdc
Member Author

ncdc commented Apr 26, 2016

This is easy to reproduce by inserting a 2-second sleep before the error return here:

return kubecontainer.ContainerID{}, fmt.Errorf("runContainer: %v", err)

@ncdc ncdc added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 26, 2016
@yujuhong
Contributor

@ncdc, does the container inspection always fail, or does it only fail in a certain window of time? I am trying to understand the scope of the problem.

If it's a permanent docker inspect failure, we've seen similar issues before due to docker bugs. Currently, kubelet doesn't know how to distinguish transient from permanent docker inspect failures, and will keep retrying. #24361 proposed a way to recognize this problem.

@ncdc
Member Author

ncdc commented Apr 27, 2016

@yujuhong it only ever inspects this particular container one time (during the PLEG relist where it sees the 500 Unknown device error), and that inspection fails. That's what's recorded in the PLEG status cache:

PLEG: Write status for bad/default: &{ID:4c480e27-0bca-11e6-9920-001c42b324b7 Name:bad Namespace:default IP: ContainerStatuses:[]} (err: API error (500): Unknown device 27038b8e8b60506e1744c7ab15e06a3e24855fa925678e75d72b6e1fd3bd4a31

If the kubelet or you tried to inspect container 27038b... again, it would fail with a 404 container not found error. However, such an inspection does not occur. This is because when the PLEG does more relists, there are no new events generated for this pod -- the "bad" container doesn't exist any more, and the infra container is exactly the same as it was before.

@ncdc
Member Author

ncdc commented Apr 27, 2016

oldPod and pod are identical, with just the infra container, here:

oldPod := g.podRecords.getOld(pid)
pod := g.podRecords.getCurrent(pid)
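
To illustrate why the later relists stay quiet, here is a simplified, self-contained sketch with stand-in types (not the real podRecords/GenericPLEG code): when the old and current records list the same containers in the same states, the diff is empty, so no events fire and the pod's cache entry, still holding the inspection error, is never refreshed.

    package main

    import "fmt"

    type container struct {
        ID    string
        State string
    }

    // computeEvents is a stand-in for the relist diff: an event is produced
    // only when a container's state differs between the old and current views.
    func computeEvents(oldPod, pod []container) []string {
        prev := map[string]string{}
        for _, c := range oldPod {
            prev[c.ID] = c.State
        }
        var events []string
        for _, c := range pod {
            if prev[c.ID] != c.State {
                events = append(events, "state change for "+c.ID+": "+c.State)
            }
        }
        return events
    }

    func main() {
        infra := container{ID: "infra", State: "running"}
        // The failed "bad" container is already gone, so both views contain
        // only the unchanged infra container and the diff is empty.
        oldPod := []container{infra}
        pod := []container{infra}
        fmt.Println("events:", computeEvents(oldPod, pod)) // events: []
    }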

@yujuhong
Contributor

@ncdc, so the timeline is:

  1. PLEG relists and sees infra container A
  2. PLEG relists and sees new container B and fails the inspection with 500; it updates the cache with the error and maintains the old view of only seeing A (so that it'd retry the next time)
  3. Something deleted container B(?)
  4. PLEG relists and sees only container A, which is consistent with the view at the end of (1). No inspection.

I think PLEG should explicitly record the pods/containers to retry the next time if an error is encountered (as opposed to relying on maintaining the old view of container states). What do you think?
There is one thing that's unclear to me though. What exactly deleted container B in (2)?

@ncdc
Member Author

ncdc commented Apr 27, 2016

@yujuhong the creation of container B failed (permission denied error in /var/lib/docker/volumes). You might argue that the fact that the container even shows up when listing all containers is a bug, since Docker is in the middle of creating the container but that call hasn't returned yet.

I like the idea of recording what to retry. How are you thinking we can implement this?

@ncdc
Member Author

ncdc commented Apr 27, 2016

Also, re "it updates the cache with the error and maintains the old view of only seeing A (so that it'd retry the next time)", it's the following continue statement that keeps it from updating the internal state (oldPod/pod):

@yujuhong
Contributor

@yujuhong the creation of container B failed (permission denied error in /var/lib/docker/volumes). You might argue that the fact that the container even shows up when listing all containers is a bug, since Docker is in the middle of creating the container but that call hasn't returned yet.

So the container appears briefly in docker ps during creation, but then disappears when it fails? That doesn't sound like the best behavior, but in any case, we should make PLEG more robust.

Also, re "it updates the cache with the error and maintains the old view of only seeing A (so that it'd retry the next time)", it's the following continue statement that keeps it from updating the internal state (oldPod/pod):

Yes, it was done intentionally so that PLEG will retry the inspection again in the next relist, assuming the container was still there.

We should send out the events regardless (even if the inspect fails), but record the pod ID internally so that we can retry and update the cache in the next relist. The downside is that we'd decouple the event and cache update even further; the consumer of the event may not see an up-to-date cache. I think it's reasonable to assume that the consumers will retry if the cache is not ready, like what the pod workers do today, as long as this doesn't happen often.

@ncdc
Member Author

ncdc commented Apr 27, 2016

The key is to make sure that the cache doesn't continue to return the same PodStatus over and over again along with the inspection error.

@pmorie
Member

pmorie commented Apr 27, 2016

We need to add context to the event created when the PLEG inspect fails.

@ncdc
Member Author

ncdc commented Apr 29, 2016

I think it's reasonable to assume that the consumers will retry if the cache is not ready, like what the pod workers do today, as long as this doesn't happen often.

I'm actually seeing this happen frequently (the presence of the "Unknown device" errors during PLEG inspection), but it's not a problem because it's happening during the creation of a container, and a subsequent PLEG relist ends up with a successful inspection.

I wrote a sample Go program that creates a container in one goroutine and then inspects it as often and as fast as possible until it gets no error. I was able to make the inspection fail 45-50 times with "Unknown device" before it succeeded. While this isn't really a valid test (hammering the Docker daemon to inspect one container), it does highlight that this can certainly happen.
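
For illustration, a rough, self-contained sketch of that probe. The Docker calls are faked here (simulateCreate and simulateInspect are made-up helpers), so only the concurrency and retry pattern mirrors the real program:

    package main

    import (
        "errors"
        "fmt"
        "sync/atomic"
        "time"
    )

    var created int32 // 0 while the fake daemon is still "creating" the container

    // simulateCreate stands in for the goroutine that asks Docker to create a
    // container; for a short window the container exists but cannot be inspected.
    func simulateCreate() {
        time.Sleep(50 * time.Millisecond)
        atomic.StoreInt32(&created, 1)
    }

    // simulateInspect stands in for docker inspect, failing with the same kind
    // of 500 the PLEG relist hits until creation has settled.
    func simulateInspect(id string) error {
        if atomic.LoadInt32(&created) == 0 {
            return errors.New("API error (500): Unknown device " + id)
        }
        return nil
    }

    func main() {
        const id = "27038b8e8b60"
        go simulateCreate()

        failures := 0
        for simulateInspect(id) != nil {
            failures++ // count how many inspections fail before one succeeds
        }
        fmt.Printf("inspect failed %d times before succeeding\n", failures)
    }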

@ncdc
Member Author

ncdc commented Apr 29, 2016

It appears that removing the continue statement and this log message:

      glog.Errorf("PLEG: Ignoring events for pod %s/%s: %v", pod.Name, pod.Namespace, err)

solves the problem. It does have the side effect that the pod worker will keep trying to create the container over and over again, and it will keep failing (in my example, where we disallow images with Docker volumes). Maybe we need to implement what is suggested here?

      // TODO(bburns) : Perhaps blacklist a container after N failures?
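
Purely as an illustration of what that TODO hints at (none of these names exist in the kubelet), a per-container failure counter could look roughly like this:

    package main

    import "fmt"

    // startBlacklist is a made-up name; it tracks consecutive start failures
    // per container key so the kubelet could stop recreating a hopeless
    // container on every sync.
    type startBlacklist struct {
        failures map[string]int
        limit    int
    }

    func newStartBlacklist(limit int) *startBlacklist {
        return &startBlacklist{failures: map[string]int{}, limit: limit}
    }

    func (b *startBlacklist) recordFailure(key string) { b.failures[key]++ }
    func (b *startBlacklist) recordSuccess(key string) { delete(b.failures, key) }

    // shouldSkip reports whether this container has failed to start too many
    // times in a row and should be left alone (or backed off) for now.
    func (b *startBlacklist) shouldSkip(key string) bool {
        return b.failures[key] >= b.limit
    }

    func main() {
        bl := newStartBlacklist(3)
        key := "default/bad-pod/bad-container"
        for i := 0; i < 3; i++ {
            bl.recordFailure(key)
        }
        fmt.Println("skip further restarts:", bl.shouldSkip(key)) // true
    }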

@ncdc
Member Author

ncdc commented May 2, 2016

@yujuhong what are the implications (if any) of removing that continue statement?

@yujuhong
Contributor

yujuhong commented May 2, 2016

@yujuhong what are the implications (if any) of removing that continue statement?

Once you remove the continue statement, if a normal inspection fails due to some transient error, PLEG will send out the event and it won't retry in the next relist.
For example, suppose you create a container and it starts running. PLEG sees this new container in the listing and tries to inspect it, but the inspection fails. It then updates the cache with the inspection error and sends out the ContainerStarted event. The pod worker wakes up due to the event, but can't sync because of the inspection error. In the next relist, PLEG won't do anything, and the cache stays stale (with the inspection error) indefinitely.

In short, we can't just remove the continue statement. We need to ensure that PLEG will retry the inspection in the next relist, as I said in #24819 (comment).

@ncdc
Member Author

ncdc commented May 2, 2016

So we need to add a new podsToReinspect []*kubecontainer.Pod to GenericPLEG and process that each time through relist, or something to that effect?

@yujuhong
Contributor

yujuhong commented May 2, 2016

So we need to add a new podsToReinspect []*kubecontainer.Pod to GenericPLEG and process that each time through relist, or something to that effect?

Yes, in addition to detecting container changes via docker ps -a, PLEG will also inspect these pods.
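
Roughly, the bookkeeping could look like the following self-contained sketch (simplified types and a fake inspector, not the real GenericPLEG code): pods whose inspection failed are remembered and retried on the next relist, so the cached error is eventually replaced even when the container list itself doesn't change.

    package main

    import (
        "errors"
        "fmt"
    )

    // pleg is a simplified stand-in for GenericPLEG: it keeps a list of pods
    // whose last inspection failed and retries them on every relist.
    type pleg struct {
        podsToReinspect []string         // pod UIDs whose inspection failed
        cache           map[string]error // pod UID -> last inspection result
        failuresLeft    map[string]int   // fake inspector: fail this many more times
    }

    func (g *pleg) inspect(uid string) error {
        if g.failuresLeft[uid] > 0 {
            g.failuresLeft[uid]--
            return errors.New("API error (500): Unknown device")
        }
        return nil
    }

    // relist inspects the pods that changed in this pass plus everything that
    // failed last time, so a cached error is eventually replaced.
    func (g *pleg) relist(changedPods []string) {
        toInspect := append(g.podsToReinspect, changedPods...)
        g.podsToReinspect = nil
        for _, uid := range toInspect {
            err := g.inspect(uid)
            g.cache[uid] = err
            if err != nil {
                g.podsToReinspect = append(g.podsToReinspect, uid)
            }
        }
    }

    func main() {
        g := &pleg{
            cache:        map[string]error{},
            failuresLeft: map[string]int{"bad-pod": 2},
        }
        g.relist([]string{"bad-pod"}) // first relist: inspection fails
        g.relist(nil)                 // nothing changed, but we retry anyway
        g.relist(nil)                 // third relist finally clears the cached error
        fmt.Println("cache entry for bad-pod:", g.cache["bad-pod"])
    }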

@ncdc
Member Author

ncdc commented May 2, 2016

Ok, I can do that 😄

k8s-github-robot pushed a commit that referenced this issue May 7, 2016
Automatic merge from submit-queue

PLEG: reinspect pods that failed prior inspections

Fix the following sequence of events:

1. relist call 1 successfully inspects a pod (just has infra container)
2. relist call 2 gets an error inspecting the same pod (has infra container and a transient container that failed to create) and doesn't update the old/new pod records
3. relist calls 3+ don't inspect the pod any more (just has infra container, so it doesn't look like anything changed)

This change adds a new list that keeps track of pods that failed inspection and retries them the
next time relist is called. Without this change, a pod in this state would never be inspected again,
its entry in the status cache would never be updated, and the pod worker would never call syncPod
again because the most recent entry in the status cache has an error associated with it. Without
this change, pods in this state would be stuck Terminating forever, unless the user issued a
deletion with a grace period value of 0.

Fixes #24819 

cc @kubernetes/rh-cluster-infra @kubernetes/sig-node
@DenisIzmaylov

DenisIzmaylov commented Jun 7, 2016

Has anything changed? It seems to me I've hit this issue.

@ncdc
Member Author

ncdc commented Jun 7, 2016

@DenisIzmaylov this is a very specific issue that has a very particular root cause for why pods get stuck terminating, and it has been fixed by #25077. There may be other reasons why pods get stuck terminating, so I'm wondering if you're maybe hitting a different root cause?

@DenisIzmaylov

Maybe. How can I see the logs for this issue?

@yujuhong
Contributor

yujuhong commented Jun 7, 2016

@DenisIzmaylov, one common reason for pods to be stuck at terminating is that the worker goroutine responsible for the pod is stuck at certain operations (e.g., image pulling, etc).
You can check the output of kubectl describe pod <pod_name> to see the status and events associated with the pod. If there are no events, that may be because all events get deleted after an hour to save space. You can also check the kubelet log (/var/log/kubelet.log, or go through journalctl if it manages your logs) to see what happened to the pod and what the last operation kubelet performed for the pod was.

By the way, you can forcibly delete the pod with kubectl delete pod --grace-period=0 <pod_name>

@DenisIzmaylov

I have kubectl get pods output like this:

...
my-nginx-3869474375-oj1i0    0/1       Terminating   0          20h
my-nginx-3869474375-oui1w    0/1       Terminating   0          20h
my-nginx-3869474375-p0ixa    0/1       Terminating   0          20h
my-nginx-3869474375-p21aq    0/1       Terminating   0          20h
my-nginx-3869474375-p3e3q    0/1       Terminating   0          20h
my-nginx-3869474375-p5qxp    0/1       Terminating   0          20h
my-nginx-3869474375-pcesx    0/1       Terminating   0          20h
my-nginx-3869474375-q238e    0/1       Terminating   0          20h
my-nginx-3869474375-qto18    0/1       Terminating   0          20h
my-nginx-3869474375-r4u6b    0/1       Terminating   0          20h
my-nginx-3869474375-rlv4a    0/1       Terminating   0          20h
my-nginx-3869474375-rucy2    0/1       Terminating   0          20h
my-nginx-3869474375-rvmqo    0/1       Terminating   0          20h
my-nginx-3869474375-ry7bj    0/1       Terminating   0          20h
my-nginx-3869474375-s2mwi    0/1       Terminating   0          20h
...

@yujuhong
Contributor

yujuhong commented Jun 8, 2016

kubectl describe pod <pod_name> will have more details about a single pod.

@DenisIzmaylov

I ran kubectl delete pods --grace-period=0 --all to remove all pods (it's a new cluster, so that's OK). But a few seconds later they came back. :\

Details about a single pod:

Name:               my-nginx-3869474375-zygrz
Namespace:          default
Node:               <removed IP>
Labels:             pod-template-hash=3869474375,run=my-nginx
Status:             Terminating (expires Tue, 07 Jun 2016 20:03:08 -0400)
Termination Grace Period:   30s
IP:             
Controllers:            ReplicaSet/my-nginx-3869474375
Containers:
  my-nginx:
    Image:  nginx
    Port:   81/TCP
    QoS Tier:
      cpu:  BestEffort
      memory:   BestEffort
    Environment Variables:
Volumes:
  default-token-i7wig:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-i7wig
Events:
  FirstSeen LastSeen    Count   From            SubobjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  1m        1m      1   {default-scheduler }            Normal      Scheduled       Successfully assigned my-nginx-3869474375-zygrz to <removed IP>
  1m        1m      1   {kubelet <removed IP>}          Warning     MissingClusterDNS   kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to DNSDefault policy.

@DenisIzmaylov

But I haven't configured DNS and there's no mention of it in my cloud configs.

@yujuhong
Contributor

yujuhong commented Jun 8, 2016

In the pod spec, if you don't specify a DNS policy, the default will be "ClusterFirst".

How long did your pods stay in the Terminating state? Did they ever become running before you attempted to delete them?
If you are ok with providing kubelet.log, I can also take a look.

@ncdc
Member Author

ncdc commented Jun 8, 2016

Would it be possible to move this discussion to either StackOverflow or a different issue?

@DenisIzmaylov

@yujuhong Hm, I've not found it:

 $ sudo ls /var/log/
btmp              btmp-20160419.gz  containers/       faillog           journal/          lastlog           wtmp       

@DenisIzmaylov

@ncdc of course

@DenisIzmaylov

DenisIzmaylov commented Jun 8, 2016

I've created a new one.

@xiangpengzhao
Contributor

@ncdc @yujuhong PR #25077 should have fixed this issue already in Kubernetes 1.3.8, but I encountered almost the same issue in our env.

A pod got stuck Terminating after deleting its rc, and then new pods got stuck ContainerCreating when creating their rc. When I tried to delete the rc of the new pod, the new pod got stuck Terminating, too.

Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.8", GitCommit:"693ef591120267007be359f97191a6253e0e4fb5", GitTreeState:"clean", BuildDate:"2016-09-28T03:03:21Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.8", GitCommit:"693ef591120267007be359f97191a6253e0e4fb5", GitTreeState:"clean", BuildDate:"2016-09-28T02:52:25Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Not sure if it's the same situation @ncdc mentioned. I just got things like this in kubelet.log:

E1229 17:10:53.705545    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
E1229 17:10:55.703951    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
E1229 17:10:57.705719    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
I1229 17:10:57.742208    2241 reconciler.go:294] MountVolume operation started for volume "kubernetes.io/secret/9df52036-c2c6-11e6-9c7f-fa163e48f799-default-token-6zqlr" (spec.Name: "default-token-6zqlr") to pod "9df52036-c2c6-11e6-9c7f-fa163e48f799" (UID: "9df52036-c2c6-11e6-9c7f-fa163e48f799"). Volume is already mounted to pod, but remount was requested.
I1229 17:10:57.745750    2241 operation_executor.go:775] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/9df52036-c2c6-11e6-9c7f-fa163e48f799-default-token-6zqlr" (spec.Name: "default-token-6zqlr") pod "9df52036-c2c6-11e6-9c7f-fa163e48f799" (UID: "9df52036-c2c6-11e6-9c7f-fa163e48f799").
E1229 17:10:59.704565    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory
E1229 17:11:01.705682    2241 kubelet_getters.go:248] Could not read directory /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: open /var/lib/kubelet/pods/96ad82fe-c4eb-11e6-9c7f-fa163e48f799/volumes: no such file or directory

@sellers

sellers commented Apr 3, 2018

I have had this happen to me as part of an issue with resource limits and the cluster autoscaler occurring at the same time as a kops rolling-update. This left the pod in a Terminating state without a ReplicaSet or any Deployment, StatefulSet, Job, etc.

I tried the force delete and it hung for 20 minutes before I stopped it. The describe output shows that the pod didn't scale up another node (the autoscaler was failing), and PodScheduled is False because it was requested to terminate. I suspect that the log entry was cleaned up, as mentioned above, so the kubelet(?) does not know its status and cannot determine it.

Clicking on it in the Dashboard of course throws an error because the ReplicaSet (or parent) is missing.

Does anyone have any other ideas of how to purge it from the list (besides the --grace-period approach)?

openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Apr 21, 2020

Bug 1740604: UPSTREAM 89300: Expand the PVC validation messages

Origin-commit: 298daf9157d28e983e515a957886fe2d9c8c5c3f