Upon restart, kubelet tries to start all pods including terminated ones #37447

Closed
fgrzadkowski opened this issue Nov 24, 2016 · 26 comments
Labels
area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-blocker sig/node Categorizes an issue or PR as relevant to SIG Node.

@fgrzadkowski
Contributor

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Kubernetes version (use kubectl version):

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:13:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:16:27Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): default
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened:
I was trying to reproduce #34772 using these instructions. Due to a bug in the scheduler we were assigning too many pods per node. In most cases the kubelet correctly discarded them by marking them as OutOfcpu. However, two strange things happened:

  • upon kubelet restart, it tried to re-add all such OutOfcpu pods, run the predicate checks for them, etc.
  • two pods that were incorrectly assigned (due to the scheduler bug) during the kubelet restart were completely ignored by the kubelet - they are still pending, with a node assigned, after 12h.

I have a cluster that I can add someone to for investigation.

What you expected to happen:

  • OutOfcpu pods should not be retried after a restart
  • new pods assigned during a kubelet restart should be discarded with an OutOfcpu error.

How to reproduce it (as minimally and precisely as possible):

Run these instructions.

Anything else we need to know:

Marking as release-blocker until it's triaged by the node team.

@fgrzadkowski fgrzadkowski added area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-blocker sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 24, 2016
@fgrzadkowski fgrzadkowski added this to the v1.5 milestone Nov 24, 2016
@fgrzadkowski
Contributor Author

I added @dchen1107 and @yujuhong to my project to let them investigate further.

GKE cluster: test-cluster-1 in europe-west1-b
Node: gke-test-cluster-1-default-pool-44d8ae2f-kpn9

@fgrzadkowski
Contributor Author

@gmarek @wojtek-t @kubernetes/sig-node

@dims
Member

dims commented Nov 27, 2016

Picked a couple of folks and assigned them, since this is a release-blocker; please assign someone else if appropriate.

@wojtek-t wojtek-t removed their assignment Nov 28, 2016
@yujuhong
Contributor

@fgrzadkowski, we need the project name/id to access the cluster.

upon kubelet restart, it tried to re-add all such OutOfcpu pods, run the predicate checks for them, etc.

Why did kubelet restart? Did it ever have a chance to persist the pod status to the apiserver?

two pods that were incorrectly assigned (due to the scheduler bug) during the kubelet restart were completely ignored by the kubelet - they are still pending, with a node assigned, after 12h.

What's the output of kubectl describe pods for these two pending pods? Did the kubelet indicate any reason why the pods were pending?

@dchen1107
Member

@fgrzadkowski I couldn't access your cluster without the project ID. Also, there is no longer a notification email when you add us to your project.

I also couldn't reproduce the problem by simply restarting the kubelet.

@fgrzadkowski
Contributor Author

fgrzadkowski commented Nov 29, 2016 via email

@yujuhong
Contributor

I did some basic checks on the cluster, but I will be offline for the next hour or two. I'm posting the findings below so other people can pick this up if they want.

  1. I can't find logs from around the time the kubelet restarted. It's possible that the log is too old and has been rotated away. This means I couldn't examine the kubelet's pod admission steps. @fgrzadkowski, I still don't know why the kubelet restarted in the first place. If you could describe the situation in more detail, it might help.

  2. The two pending pods are not seen by the kubelet at all. I checked the kubelet's /pods endpoint (which exposes the pod specs the kubelet has received from sources such as the apiserver), and those two pods were nowhere to be found. I initially suspected that the kubelet syncloop goroutine might be stuck and the kubelet wasn't working correctly, but the health check seems fine. I also tried two things to verify that the kubelet is functioning OK (since GKE only has the less informative V2 kubelet logs); a rough sketch of these checks appears at the end of this comment:

    • Kill the fluentd infra container. Observe that the kubelet restarts the pod. This proves that the kubelet can detect state changes and wakes up the pod worker to perform syncs.
    • Edit the fluentd pod status using kubectl edit. Observe that the kubelet rectifies the pod status. This proves that the kubelet can receive updates from the apiserver.

This shows that only the two pods were not delivered to the kubelet over the watch. I don't know what the current relisting period is, or whether this particular kubelet has gone through a relisting. One could probably poke at this more by modifying the spec of the pending pods to see if that triggers an update.

  3. The kubelet log is full of errors from the volume manager complaining about non-existent secrets:
Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: I1129 17:25:34.230648    2228 reconciler.go:188] UnmountVolume operation started for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (spec.Name: "default-token-r4hlp") from pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181").
Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: E1129 17:25:34.232112    2228 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp\" (\"fee1aef7-b1ab-11e6-848f-42010a840181\")" failed. No retries permitted until 2016-11-29 17:27:34.232087714 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (volume.spec.Name: "default-token-r4hlp") pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181") with: IsLikelyNotMountPoint("/var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp"): stat /var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp: no such file or directory

/cc @kubernetes/sig-storage @saad-ali @jingxu97 to see what could've caused this.
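
For reference, a rough sketch of the checks described in (2), run from the node itself. This assumes the kubelet's read-only port (10255 by default in this release) is enabled; the pod name and container ID are placeholders.

# Confirm the kubelet still answers its health check.
curl -s http://localhost:10255/healthz

# List the pod specs the kubelet has received from its sources (apiserver, static manifests)
# and check whether a given pending pod shows up at all.
curl -s http://localhost:10255/pods | grep <pending-pod-name>

# Kill the fluentd pod's infra ("pause") container and watch the kubelet recreate it.
docker ps | grep fluentd
docker kill <fluentd-infra-container-id>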

@yujuhong
Contributor

For (2) in my previous comment, I've verified that this is a kubelet bug (i.e., not related to the apiserver watch). What I did was:

  • Edit the image used in the pending pod (kubectl edit pod)
  • Get an UPDATE message from the kubelet log:
Nov 29 21:44:14 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: I1129 21:44:14.988756    2228 kubelet.go:2300] SyncLoop (UPDATE, "api"): "my-app-2781155866-wqtqt_default(72d911fd-b1e4-11e6-848f-42010a840181)"

What this means is that the kubelet itself thinks this is not a new pod, but merely an update. To confirm this further, I accessed the kubelet's /pods endpoint:

            "metadata": {
                "annotations": {
                    "kubernetes.io/config.seen": "2016-11-24T01:24:05.362036637Z",
                    "kubernetes.io/config.source": "api",
                    "kubernetes.io/created-by": "{\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"ReplicaSet\",\"namespace\":\"default\",\"name\":\"my-app-2781155866\",\"uid\":\"48
22986e-b181-11e6-b21f-42010a840181\",\"apiVersion\":\"extensions\",\"resourceVersion\":\"182642\"}}\n"
                },
                "creationTimestamp": "2016-11-24T01:22:21Z",
                "generateName": "my-app-2781155866-",
                "labels": {
                    "name": "my-app",
                    "pod-template-hash": "2781155866"
                },
                "name": "my-app-2781155866-wqtqt",
                "namespace": "default",
                "ownerReferences": [
                    {
                        "apiVersion": "extensions/v1beta1",
                        "controller": true,
                        "kind": "ReplicaSet",
                        "name": "my-app-2781155866",
                        "uid": "4822986e-b181-11e6-b21f-42010a840181"
                    }
                ],
                "resourceVersion": "182846",
                "selfLink": "/api/v1/namespaces/default/pods/my-app-2781155866-wqtqt",
                "uid": "72d911fd-b1e4-11e6-848f-42010a840181"

The kubelet marked the pod as seen on 11/24: "kubernetes.io/config.seen": "2016-11-24T01:24:05.362036637Z"

There are only a few places where this update could have been dropped by the kubelet. I checked the code and found a regression caused by a change in #28570. I will send a fix later.
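
(For anyone repeating this check, a quick sketch, assuming the kubelet's read-only port 10255 is enabled on the node and jq is available; the pod name is the one above.)

# Pull the kubelet's view of the pod and extract the config.seen annotation.
curl -s http://localhost:10255/pods \
  | jq '.items[] | select(.metadata.name == "my-app-2781155866-wqtqt") | .metadata.annotations["kubernetes.io/config.seen"]'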

@saad-ali
Member

The kubelet log is full of errors from the volume manager complaining about non-existent secrets:

Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: I1129 17:25:34.230648    2228 reconciler.go:188] UnmountVolume operation started for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (spec.Name: "default-token-r4hlp") from pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181").
Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: E1129 17:25:34.232112    2228 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp\" (\"fee1aef7-b1ab-11e6-848f-42010a840181\")" failed. No retries permitted until 2016-11-29 17:27:34.232087714 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (volume.spec.Name: "default-token-r4hlp") pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181") with: IsLikelyNotMountPoint("/var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp"): stat /var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp: no such file or directory

/cc @kubernetes/sig-storage @saad-ali @jingxu97 to see what could've caused this.

This may or may not be related to the original issue, but it's worth looking at. I opened #37657 to investigate.

@dchen1107
Member

What @yujuhong found at #37447 (comment) is a real release blocker issue, and she is going to address it separately. I filed a separate issue #37658 to track that one.

But we still need to reproduce the original issue here. @dashpole is going to help us on this. Thanks!

@gmarek
Contributor

gmarek commented Nov 30, 2016

@dchen1107 @dashpole @yujuhong To reproduce the issue you might want to revert #37293 and #37379 (the fixes for the scheduler bug) and follow the instructions linked by @fgrzadkowski - this should make the issue more visible. Another way to reproduce this is to run multiple schedulers, so that we depend on the kubelet to settle conflicts, but this seems harder.

@wojtek-t
Member

@dchen1107 @dashpole @yujuhong - actually, you would need to locally revert 2 PRs:
#37293 and #37379
[those 2 fixed the scheduler bug, and that fix was mitigating the problems in the kubelet].
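
(A rough sketch of one way to do that locally; the merge-commit SHAs are placeholders you would look up first, and the build target is just one common choice.)

# Find the merge commits for the two PRs, revert them, then rebuild.
git log --oneline --grep='Merge pull request #37293'
git log --oneline --grep='Merge pull request #37379'
git revert --no-edit -m 1 <merge-commit-of-37293>
git revert --no-edit -m 1 <merge-commit-of-37379>
make quick-release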

@dashpole
Contributor

Do we have any idea why kubelet restarted?

@yujuhong
Contributor

@dchen1107 @dashpole @yujuhong - actually, you would need to locally revert 2 PRs:
#37293 and #37379
[those 2 fixed the scheduler bug, and that fix was mitigating the problems in the kubelet].

@wojtek-t @gmarek, just trying to make sure I understand the effect of the bug. Was the bug causing the scheduler to continue assigning pods even though the kubelet had already maxed out its capacity (and therefore had to reject the pods)? How fast was the rate of scheduling pods to the node?

Can we reproduce this effect with a simpler setting, e.g., manually assigning pods to a node by creating a replication controller with a fixed NodeName in the pod template?

upon kubelet restart, it tried to re-add all such OutOfcpu pods, run predicate check for them etc.

@fgrzadkowski @wojtek-t @gmarek the cluster I looked at had thousands of OutOfcpu pods. I did a somewhat more extreme experiment: I created an RC with 1200 replicas all assigned to one node. What I found was that even though the kubelet rejected all the pods internally, it couldn't send out the status updates at a rate higher than the rate at which the RC was creating new pods. This was simply caused by the low QPS of the apiserver client. There was always a high number of pending pods around. If one restarts the kubelet at that moment, the kubelet would admit the pods again because their statuses hadn't been persisted to the apiserver yet. Would this explain what you observed?
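
For reference, a rough sketch of the kind of manifest this experiment (and the simpler fixed-NodeName repro suggested above) relies on; the name, image, command, CPU request, and replica count are illustrative placeholders, and nodeName pins every replica to a single node so the pods bypass the scheduler and go straight to that kubelet:

kubectl create -f - <<EOF
apiVersion: v1
kind: ReplicationController
metadata:
  name: overload-test
spec:
  replicas: 1200
  selector:
    name: overload-test
  template:
    metadata:
      labels:
        name: overload-test
    spec:
      # Pin every pod to one node (substitute the node you want to overload).
      nodeName: gke-test-cluster-1-default-pool-44d8ae2f-kpn9
      containers:
      - name: stress
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "500m"   # big enough that the node runs out of CPU after a few pods
EOF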

@dashpole
Contributor

dashpole commented Nov 30, 2016

I was unable to reproduce this issue by doing the following:
I reverted #37293 and #37379.
I created a cluster using ./cluster/kube-up.sh.
I created a ReplicationController with the same spec as the app-simple.yaml from #34772, and assigned 20 pods to one node in the new cluster.
I created the replication controller with the command: kubectl create -f ./replication.yaml.
I verified that it tried to schedule lots of pods to the single node using kubectl describe replicationcontroller:

Replicas:	20 current / 20 desired
Pods Status:	3 Running / 17 Waiting / 0 Succeeded / 2813 Failed

I SSH'd into the node that all the pods were being scheduled to, and verified that it had three stress Docker containers running using docker ps.
I killed the kubelet process, and waited for the kubelet to be restarted.
From my local terminal, I repeatedly ran kubectl get pods --show-all | grep Pending, and verified that all pending pods were less than 20 seconds old (the restart-and-watch part is sketched below).
I was unable, using this process, to get a pod "stuck" in a pending state for an extended period of time.
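
(For reference, a sketch of the restart-and-watch part of these steps; the node and zone are placeholders, and on GCE nodes of this era a supervisor restarted the kubelet automatically after it was killed.)

# On the node, over SSH: kill the kubelet and let the node's supervisor bring it back.
gcloud compute ssh <node-name> --zone <zone>
sudo pkill kubelet

# From a local terminal: check whether any Pending pod lingers for more than ~20 seconds.
while true; do
  kubectl get pods --show-all | grep Pending
  sleep 5
done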

@dashpole
Contributor

I think I was able to reproduce it by restarting the kubelet a few more times...

@wojtek-t
Member

@wojtek-t @gmarek, just trying to make sure I understand the effect of the bug. Was the bug causing the scheduler to continue assigning pods even though the kubelet had already maxed out its capacity (and therefore had to reject the pods)? How fast was the rate of scheduling pods to the node?

@yujuhong - exactly. It was scheduling the next pod when the previous one was rejected by the kubelet, so something like one every few seconds.

@dashpole
Contributor

I am going to try to see if the issue still exists after @yujuhong's fix in #37661.

@gmarek
Contributor

gmarek commented Nov 30, 2016

I don't remember what caused the kubelet restart, but I don't think that's the issue - the kubelet will restart from time to time, and it needs to survive that.

@yujuhong - as @wojtek-t wrote, the scheduler won't assign more than one, maybe two, superfluous pods at the same time, so it's rather unlikely that this is API server client throttling. The problem @fgrzadkowski observed was that after a kubelet restart it processed all the rejected pods again. This should be easy to check by repeating @yujuhong's experiment with a smaller number of pods and restarting the kubelet.

@dashpole
Contributor

@yujuhong's fix appears to prevent pods from being permanently pending. While all pods (including terminated pods) are re-added to the kubelet's internal pod manager after a restart, the kubelet does not appear to "process" rejected pods a second time.

@yujuhong
Contributor

@wojtek-t @gmarek thanks for the clarification. I think we need an explicit check for the pod status before the admission process. I've added that in #37661 to keep the number of cherrypick PRs small.

@yujuhong
Contributor

yujuhong commented Nov 30, 2016

I think we need an explicit check for the pod status before the admission process. I've added that in #37661 to keep the number of cherrypick PRs small.

BTW, the kubelet may re-reject the pod in this case, but it should never restart (i.e., actually create containers for) a terminated pod. I think this is a much smaller problem and definitely not a release blocker.

@fgrzadkowski
Contributor Author

Thank you so much for investigating this issue! You rock :)

@dims
Member

dims commented Dec 1, 2016

@yujuhong @fgrzadkowski looks like #37661 merged, so can we close out this issue?

@yujuhong
Contributor

yujuhong commented Dec 1, 2016

I will close this once the cherry-pick (#37840) is merged.

@saad-ali
Member

saad-ali commented Dec 2, 2016

@yujuhong can you verify the fix in release-1.5 and close the issue?

@yujuhong yujuhong closed this as completed Dec 2, 2016