Upon restart, kubelet tries to start all pods including terminated ones #37447

Closed
fgrzadkowski opened this issue Nov 24, 2016 · 26 comments
Labels
area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-blocker sig/node Categorizes an issue or PR as relevant to SIG Node.

@fgrzadkowski
Contributor

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Kubernetes version (use kubectl version):

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.6", GitCommit:"ae4550cc9c89a593bcda6678df201db1b208133b", GitTreeState:"clean", BuildDate:"2016-08-26T18:13:23Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:16:27Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): default
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened:
I was trying to reproduce #34772 using these instructions. Due to a bug in the scheduler we were assigning too many pods per node. In most cases the kubelet correctly discarded them by marking them as OutOfcpu. However, two strange things happened:

  • upon kubelet restart, it tried to re-add all such OutOfcpu pods, run the predicate checks for them, etc.
  • two pods that were incorrectly assigned (due to the scheduler bug) during the kubelet restart were completely ignored by the kubelet - they are still pending, with a node assigned, after 12h.

I have a cluster that I can add someone to for investigation.

What you expected to happen:

  • OutOfcpu pods should not be retried after a restart
  • new pods assigned during a kubelet restart should be discarded with an OutOfcpu error.

How to reproduce it (as minimally and precisely as possible):

Run these instructions.

Anything else we need to know:

Marking as release-blocker until it's triaged by the node team.

@fgrzadkowski fgrzadkowski added area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-blocker sig/node Categorizes an issue or PR as relevant to SIG Node. labels Nov 24, 2016
@fgrzadkowski fgrzadkowski added this to the v1.5 milestone Nov 24, 2016
@fgrzadkowski
Contributor Author

I added @dchen1107 and @yujuhong to my project to let them investigate further.

GKE cluster: test-cluster-1 in europe-west1-b
Node: gke-test-cluster-1-default-pool-44d8ae2f-kpn9

@fgrzadkowski
Contributor Author

@gmarek @wojtek-t @kubernetes/sig-node

@dims
Member

dims commented Nov 27, 2016

Picked a couple of folks and assigned them, since this is a release-blocker; please assign someone else if appropriate.

@wojtek-t wojtek-t removed their assignment Nov 28, 2016
@yujuhong
Contributor

@fgrzadkowski, we need the project name/id to access the cluster.

upon kubelet restart, it tried to re-add all such OutOfcpu pods, run the predicate checks for them, etc.

Why did kubelet restart? Did it ever have a chance to persist the pod status to the apiserver?

two pods that were incorrectly assigned (due to the scheduler bug) during the kubelet restart were completely ignored by the kubelet - they are still pending, with a node assigned, after 12h.

What's the output of kubectl describe pods for these two pending pods? Did the kubelet indicate any reason why the pods were pending?

@dchen1107
Member

@fgrzadkowski I couldn't access your cluster without the project ID. Also, there is no longer a notification email when you add us to your project.

I also couldn't reproduce the problem by simply restarting the kubelet.

@fgrzadkowski
Contributor Author

fgrzadkowski commented Nov 29, 2016 via email

@yujuhong
Contributor

I did some basic checks on the cluster, but I will be offline for the next hour or two. I'm posting the findings below so other people can pick this up if they want.

  1. I can't find logs from around the time the kubelet restarted. It's possible that the log is too old and has been rotated away. This means I couldn't examine the kubelet's pod admission steps. @fgrzadkowski, I still don't know why the kubelet restarted in the first place. If you could describe the situation in more detail, it might help.

  2. The two pending pods are not seen by the kubelet at all. I checked the kubelet's /pods endpoint (which exposes the pod specs the kubelet has received from sources such as the apiserver), and those two pods were nowhere to be found. I initially suspected that the kubelet syncloop goroutine might be stuck and the kubelet wasn't working correctly, but the health check seems fine. I also tried two things to verify that the kubelet is functioning OK (since GKE only has the less informative V2 kubelet logs); a rough sketch of these checks appears at the end of this comment:

    • Kill the fluentd infra container. Observe that the kubelet restarts the pod. This proves that the kubelet can detect state changes and wakes up the pod worker to perform syncs.
    • Edit the fluentd pod status using kubectl edit. Observe that the kubelet rectifies the pod status. This proves that the kubelet can receive updates from the apiserver.

This shows that only the two pods were not delivered to the kubelet over the watch. I don't know what the current relisting period is, or whether this particular kubelet has gone through a relisting. One could probably poke at this more by modifying the spec of the pending pods to see if that triggers an update.

  3. The kubelet log is full of errors from the volume manager complaining about non-existent secrets:
Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: I1129 17:25:34.230648    2228 reconciler.go:188] UnmountVolume operation started for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (spec.Name: "default-token-r4hlp") from pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181").
Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: E1129 17:25:34.232112    2228 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp\" (\"fee1aef7-b1ab-11e6-848f-42010a840181\")" failed. No retries permitted until 2016-11-29 17:27:34.232087714 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (volume.spec.Name: "default-token-r4hlp") pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181") with: IsLikelyNotMountPoint("/var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp"): stat /var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp: no such file or directory

/cc @kubernetes/sig-storage @saad-ali @jingxu97 to see what could've caused this.
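
For reference, a rough sketch of the checks described in (2), run from the node itself. This assumes the kubelet's read-only port (10255 by default in this release) is enabled; the pod name and container ID are placeholders.

# Confirm the kubelet still answers its health check.
curl -s http://localhost:10255/healthz

# List the pod specs the kubelet has received from its sources (apiserver, static manifests)
# and check whether a given pending pod shows up at all.
curl -s http://localhost:10255/pods | grep <pending-pod-name>

# Kill the fluentd pod's infra ("pause") container and watch the kubelet recreate it.
docker ps | grep fluentd
docker kill <fluentd-infra-container-id>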

@yujuhong
Contributor

For (2) in my previous comment, I've verified that this is a kubelet bug (i.e., not related to the apiserver watch). What I did was:

  • Edit the image used in the pending pod (kubectl edit pod)
  • Get an UPDATE message from the kubelet log:
Nov 29 21:44:14 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: I1129 21:44:14.988756    2228 kubelet.go:2300] SyncLoop (UPDATE, "api"): "my-app-2781155866-wqtqt_default(72d911fd-b1e4-11e6-848f-42010a840181)"

What this means is that the kubelet itself thinks this is not a new pod, but merely an update. To confirm this further, I accessed the kubelet's /pods endpoint:

            "metadata": {
                "annotations": {
                    "kubernetes.io/config.seen": "2016-11-24T01:24:05.362036637Z",
                    "kubernetes.io/config.source": "api",
                    "kubernetes.io/created-by": "{\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"ReplicaSet\",\"namespace\":\"default\",\"name\":\"my-app-2781155866\",\"uid\":\"48
22986e-b181-11e6-b21f-42010a840181\",\"apiVersion\":\"extensions\",\"resourceVersion\":\"182642\"}}\n"
                },
                "creationTimestamp": "2016-11-24T01:22:21Z",
                "generateName": "my-app-2781155866-",
                "labels": {
                    "name": "my-app",
                    "pod-template-hash": "2781155866"
                },
                "name": "my-app-2781155866-wqtqt",
                "namespace": "default",
                "ownerReferences": [
                    {
                        "apiVersion": "extensions/v1beta1",
                        "controller": true,
                        "kind": "ReplicaSet",
                        "name": "my-app-2781155866",
                        "uid": "4822986e-b181-11e6-b21f-42010a840181"
                    }
                ],
                "resourceVersion": "182846",
                "selfLink": "/api/v1/namespaces/default/pods/my-app-2781155866-wqtqt",
                "uid": "72d911fd-b1e4-11e6-848f-42010a840181"

The kubelet marked the pod as seen on 11/24: "kubernetes.io/config.seen": "2016-11-24T01:24:05.362036637Z"

There are only a few places where this update could have been dropped by the kubelet. I checked the code and found a regression caused by a change in #28570. I will send a fix later.
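
(For anyone repeating this check, a quick sketch, assuming the kubelet's read-only port 10255 is enabled on the node and jq is available; the pod name is the one above.)

# Pull the kubelet's view of the pod and extract the config.seen annotation.
curl -s http://localhost:10255/pods \
  | jq '.items[] | select(.metadata.name == "my-app-2781155866-wqtqt") | .metadata.annotations["kubernetes.io/config.seen"]'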

@saad-ali
Member

The kubelet log is full of errors from the volume manager complaining about non-existent secrets:

Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: I1129 17:25:34.230648    2228 reconciler.go:188] UnmountVolume operation started for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (spec.Name: "default-token-r4hlp") from pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181").
Nov 29 17:25:34 gke-test-cluster-1-default-pool-44d8ae2f-kpn9 kubelet[2228]: E1129 17:25:34.232112    2228 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp\" (\"fee1aef7-b1ab-11e6-848f-42010a840181\")" failed. No retries permitted until 2016-11-29 17:27:34.232087714 +0000 UTC (durationBeforeRetry 2m0s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/fee1aef7-b1ab-11e6-848f-42010a840181-default-token-r4hlp" (volume.spec.Name: "default-token-r4hlp") pod "fee1aef7-b1ab-11e6-848f-42010a840181" (UID: "fee1aef7-b1ab-11e6-848f-42010a840181") with: IsLikelyNotMountPoint("/var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp"): stat /var/lib/kubelet/pods/fee1aef7-b1ab-11e6-848f-42010a840181/volumes/kubernetes.io~secret/default-token-r4hlp: no such file or directory

/cc @kubernetes/sig-storage @saad-ali @jingxu97 to see what could've caused this.

This may or may not be related to the original issue, but it's worth looking at. I opened #37657 to investigate.

@dchen1107
Member

What @yujuhong found at #37447 (comment) is a real release blocker issue, and she is going to address it separately. I filed a separate issue #37658 to track that one.

But we still need to reproduce the original issue here. @dashpole is going to help us on this. Thanks!

@gmarek
Contributor

gmarek commented Nov 30, 2016

@dchen1107 @dashpole @yujuhong To reproduce the issue you might want to revert #37293 and #37379 (the fixes for the scheduler bug) and follow the instructions linked by @fgrzadkowski - this should make the issue more visible. Another way to reproduce this is to run multiple schedulers, so that we depend on the kubelet to settle conflicts, but this seems harder.

@wojtek-t
Member

@dchen1107 @dashpole @yujuhong - actually, you would need to locally revert 2 PRs:
#37293 and #37379
[those 2 fixed the scheduler bug, and that fix was mitigating the problems in the kubelet].
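
(A rough sketch of one way to do that locally; the merge-commit SHAs are placeholders you would look up first, and the build target is just one common choice.)

# Find the merge commits for the two PRs, revert them, then rebuild.
git log --oneline --grep='Merge pull request #37293'
git log --oneline --grep='Merge pull request #37379'
git revert --no-edit -m 1 <merge-commit-of-37293>
git revert --no-edit -m 1 <merge-commit-of-37379>
make quick-release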

@dashpole
Contributor

Do we have any idea why kubelet restarted?

@yujuhong
Contributor

@dchen1107 @dashpole @yujuhong - actually, you would need to locally revert 2 PRs:
#37293 and #37379
[those 2 fixed the scheduler bug, and that fix was mitigating the problems in the kubelet].

@wojtek-t @gmarek, just trying to make sure I understand the effect of the bug. Was the bug causing the scheduler to continue assigning pods even though the kubelet had already maxed out its capacity (and therefore had to reject the pods)? How fast was the rate of scheduling pods to the node?

Can we reproduce this effect with a simpler setting, e.g., manually assigning pods to a node by creating a replication controller with a fixed NodeName in the pod template?

upon kubelet restart, it tried to re-add all such OutOfcpu pods, run predicate check for them etc.

@fgrzadkowski @wojtek-t @gmarek the cluster I looked at had thousands of OutOfcpu pods. I did a somewhat more extreme experiment: I created an RC with 1200 replicas all assigned to one node. What I found was that even though the kubelet rejected all the pods internally, it couldn't send out the status updates at a rate higher than the rate at which the RC was creating new pods. This was simply caused by the low QPS of the apiserver client. There was always a high number of pending pods around. If one restarts the kubelet at that moment, the kubelet would admit the pods again because their statuses hadn't been persisted to the apiserver yet. Would this explain what you observed?
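
For reference, a rough sketch of the kind of manifest this experiment (and the simpler fixed-NodeName repro suggested above) relies on; the name, image, command, CPU request, and replica count are illustrative placeholders, and nodeName pins every replica to a single node so the pods bypass the scheduler and go straight to that kubelet:

kubectl create -f - <<EOF
apiVersion: v1
kind: ReplicationController
metadata:
  name: overload-test
spec:
  replicas: 1200
  selector:
    name: overload-test
  template:
    metadata:
      labels:
        name: overload-test
    spec:
      # Pin every pod to one node (substitute the node you want to overload).
      nodeName: gke-test-cluster-1-default-pool-44d8ae2f-kpn9
      containers:
      - name: stress
        image: busybox
        command: ["sleep", "3600"]
        resources:
          requests:
            cpu: "500m"   # big enough that the node runs out of CPU after a few pods
EOF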

@dashpole
Contributor

dashpole commented Nov 30, 2016

I was unable to reproduce this issue by doing the following:
I reverted #37293 and #37379.
I created a cluster using ./cluster/kube-up.sh.
I created a ReplicationController with the same spec as the app-simple.yaml from #34772, and assigned 20 pods to one node in the new cluster.
I created the replication controller with the command: kubectl create -f ./replication.yaml.
I verified that it tried to schedule lots of pods to the single node using kubectl describe replicationcontroller:

Replicas:	20 current / 20 desired
Pods Status:	3 Running / 17 Waiting / 0 Succeeded / 2813 Failed

I SSH'd into the node that all the pods were being scheduled to, and verified that it had three stress Docker containers running using docker ps.
I killed the kubelet process, and waited for the kubelet to be restarted.
From my local terminal, I repeatedly ran kubectl get pods --show-all | grep Pending, and verified that all pending pods were less than 20 seconds old (the restart-and-watch part is sketched below).
I was unable, using this process, to get a pod "stuck" in a pending state for an extended period of time.
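
(For reference, a sketch of the restart-and-watch part of these steps; the node and zone are placeholders, and on GCE nodes of this era a supervisor restarted the kubelet automatically after it was killed.)

# On the node, over SSH: kill the kubelet and let the node's supervisor bring it back.
gcloud compute ssh <node-name> --zone <zone>
sudo pkill kubelet

# From a local terminal: check whether any Pending pod lingers for more than ~20 seconds.
while true; do
  kubectl get pods --show-all | grep Pending
  sleep 5
done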

@dashpole
Contributor

I think I was able to reproduce it by restarting the kubelet a few more times...

@wojtek-t
Member

@wojtek-t @gmarek, just trying to make sure I understand the effect of the bug. Was the bug causing the scheduler to continue assigning pods even though the kubelet had already maxed out its capacity (and therefore had to reject the pods)? How fast was the rate of scheduling pods to the node?

@yujuhong - exactly. It was scheduling the next pod when the previous one was rejected by the kubelet, so something like one every few seconds.

@dashpole
Contributor

I am going to try to see if the issue still exists after @yujuhong's fix in #37661.

@gmarek
Contributor

gmarek commented Nov 30, 2016

I don't remember what caused the kubelet restart, but I don't think that's the issue - the kubelet will restart from time to time, and it needs to survive that.

@yujuhong - as @wojtek-t wrote, the scheduler won't assign more than one, maybe two, superfluous pods at the same time, so it's rather unlikely that this is API server client throttling. The problem @fgrzadkowski observed was that after a kubelet restart it processed all the rejected pods again. This should be easy to check by repeating @yujuhong's experiment with a smaller number of pods and restarting the kubelet.

@dashpole
Contributor

@yujuhong's fix appears to prevent pods from being permanently pending. While all pods (including terminated pods) are re-added to the kubelet's internal pod manager after a restart, the kubelet does not appear to "process" rejected pods a second time.

@yujuhong
Contributor

@wojtek-t @gmarek thanks for the clarification. I think we need an explicit check for the pod status before the admission process. I've added that in #37661 to keep the number of cherrypick PRs small.

@yujuhong
Contributor

yujuhong commented Nov 30, 2016

I think we need an explicit check for the pod status before the admission process. I've added that in #37661 to keep the number of cherrypick PRs small.

BTW, the kubelet may re-reject the pod in this case, but it should never restart (i.e., actually create containers for) a terminated pod. I think this is a much smaller problem and definitely not a release blocker.

@fgrzadkowski
Contributor Author

Thank you so much for investigating this issue! You rock :)

@dims
Member

dims commented Dec 1, 2016

@yujuhong @fgrzadkowski looks like #37661 merged, so can we close out this issue?

@yujuhong
Contributor

yujuhong commented Dec 1, 2016

I will close this once the cherry-pick (#37840) is merged.

@saad-ali
Member

saad-ali commented Dec 2, 2016

@yujuhong can you verify the fix in release-1.5 and close the issue?

@yujuhong yujuhong closed this as completed Dec 2, 2016