Upon restart, kubelet tries to start all pods including terminated ones #37447
Comments
I added @dchen1107 and @yujuhong to my project to let them investigate further. GKE cluster: test-cluster-1 in europe-west1-b
Picked a couple of folks and assigned them as this is a release-blocker; please assign someone else if appropriate.
@fgrzadkowski, we need the project name/id to access the cluster.
Why did kubelet restart? Did it ever have a chance to persist the pod status to the apiserver?
What's the output of …
@fgrzadkowski I couldn't access your cluster without the project id. Also, there is no notification email nowadays when you add us to your project. I also couldn't reproduce the problem by simply restarting Kubelet.
Sent over email.
I did some basic checks on the cluster, but will be offline for the next hour or two. Posting the findings below so other people can pick up if they want.
This shows that only the two pods were not delivered to kubelet over watch. I don't know what the current relisting period is, or whether this particular kubelet has gone through the relisting. One could probably poke this more by modifying the spec of the pending pods to see if it triggers an update.
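If someone wants to try that poke, here is a minimal sketch. It substitutes an annotation for a spec change (an annotation is an easier mutation and still bumps the pod object's resourceVersion, so a healthy watch should deliver the update to kubelet); the pod name and annotation key are made up:

```sh
# Touch one of the stuck pending pods so the apiserver writes a new version of
# the object; if kubelet's watch is healthy, it should receive this update.
kubectl annotate pod my-pending-pod debug-poke="$(date +%s)" --overwrite
```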
/cc @kubernetes/sig-storage @saad-ali @jingxu97 to see what could've caused this.
For (2) in my previous comment, I've verified that this is a kubelet bug (i.e., not related to the apiserver watch). What I did was:
What this means is that kubelet itself thinks this is not a new pod, but merely an update. To further confirm this, I accessed kubelet's /pods endpoint:
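For anyone repeating this check, a minimal sketch of hitting that endpoint; it assumes the kubelet's read-only port (10255 by default in this era) is enabled and reachable, and the node address here is made up:

```sh
# On, or with network access to, the node in question. 10255 is the kubelet's
# default read-only port; adjust host/port for your setup.
NODE_IP=10.240.0.5   # hypothetical node address

# Dump the pod list exactly as kubelet sees it.
curl -s "http://${NODE_IP}:10255/pods" | python -m json.tool | less

# Or just pull out name, phase, and reason per pod (requires jq).
curl -s "http://${NODE_IP}:10255/pods" \
  | jq -r '.items[] | [.metadata.name, .status.phase, .status.reason // "-"] | @tsv'
```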
kubelet marked the pod as …. There are only a few places where this update could've been dropped by kubelet. I checked the code and found a regression caused by a change in #28570. Will send a fix later.
This may or may not be related to the original issue, but it's worth looking at. Opened #37657 to investigate.
What @yujuhong found at #37447 (comment) is a real release blocker issue, and she is going to address it separately. I filed a separate issue #37658 to track that one. But we still need to reproduce the original issue here. @dashpole is going to help us on this. Thanks!
@dchen1107 @dashpole @yujuhong To reproduce the issue you might want to revert #37293 and #37379 (fix for the scheduler bug) and follow the instructions linked by @fgrzadkowski - this should make the issue more visible. Another way to reproduce it is to run multiple schedulers, so that we'll depend on Kubelet to settle conflicts, but this seems harder.
@dchen1107 @dashpole @yujuhong - actually, you would need to locally revert 2 PRs:
Do we have any idea why kubelet restarted?
@wojtek-t @gmarek, just trying to make sure I understand the effect of the bug. Was the bug causing the scheduler to continue assigning pods even though the kubelet had already maxed out its capacity (and therefore had to reject the pods)? How fast is the rate of scheduling pods to the node? Can we reproduce this effect with a simpler setting, e.g., manually assigning pods to a node by creating a replication controller with a fixed …?
@fgrzadkowski @wojtek-t @gmarek the cluster I looked at had thousands of the OutOfcpu pods. I did a somewhat more extreme experiment: I created an RC with 1200 replicas, all assigned to one node. What I found was that even though kubelet rejected all the pods internally, it couldn't send out the status updates at a rate higher than the rate at which the RC was creating new pods. This was simply caused by the low QPS of the apiserver client. There was always a high number of pending pods around. If one restarts kubelet at that moment, kubelet will admit the pods again because their statuses haven't been persisted in the apiserver yet. Would this explain what you guys have observed?
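For reference, a minimal sketch of that kind of experiment. These are not the exact commands used above; the RC name, image, replica count, and CPU request are illustrative, and the idea is simply to pin far more replicas to a single node than it has CPU for, so kubelet rejects them with OutOfcpu:

```sh
# Pick a target node and pin every replica to it via spec.nodeName, which
# bypasses the scheduler entirely and forces kubelet to admit or reject.
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ReplicationController
metadata:
  name: overload-test
spec:
  replicas: 1200
  selector:
    app: overload-test
  template:
    metadata:
      labels:
        app: overload-test
    spec:
      nodeName: ${NODE}
      containers:
      - name: pause
        image: gcr.io/google_containers/pause-amd64:3.0
        resources:
          requests:
            cpu: 100m    # large enough that the node runs out of allocatable CPU quickly
EOF

# Failed pods are hidden by default in this release, so use --show-all (-a) to
# count how many have been rejected with OutOfcpu so far.
kubectl get pods -a -l app=overload-test | grep -c OutOfcpu
```

Restarting kubelet on the node while a backlog of pending pods is still waiting for its rejections to be persisted is the window described above.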
I was unable to reproduce this issue by doing the following:
I ssh'd into the node that all pods were being scheduled to, and verified that it had three stress docker containers running using …
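Presumably something along these lines on the node; the exact command wasn't captured above, so this filter is a guess:

```sh
# Count running containers whose image or command mentions "stress"; expect 3.
sudo docker ps | grep -c stress
```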
I think I was able to reproduce it by restarting kubelet a few more times...
@yujuhong - exactly. It was scheduling the next pod when the previous one was rejected by kubelet, so something like one every few seconds.
I don't remember what caused the Kubelet restart, but I don't think that's the issue - Kubelet will restart from time to time, and it needs to survive that. @yujuhong - as @wojtek-t wrote, the scheduler won't assign more than one, maybe two, superfluous pods at the same time, so it's rather unlikely that it's API server client throttling. The problem @fgrzadkowski observed was that after a kubelet restart it processed all rejected pods again. This should be easy to check by repeating @yujuhong's experiment with a smaller number of pods and restarting the Kubelet, as sketched below.
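A minimal sketch of that check, assuming the RC from the earlier sketch (with a much smaller replica count) and a node whose kubelet is managed by systemd; adjust the restart command for other setups:

```sh
# NODE as chosen in the earlier sketch; one line per pod: name, phase, reason.
SNAPSHOT='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.reason}{"\n"}{end}'

# Snapshot pod statuses before the restart.
kubectl get pods -a -l app=overload-test -o jsonpath="${SNAPSHOT}" | sort > before.txt

# Restart kubelet on the node (however it is supervised there), e.g.:
gcloud compute ssh "${NODE}" -- sudo systemctl restart kubelet

# Give kubelet a minute to resync, snapshot again, and diff; a rejected pod
# that gets processed again would show up as a changed line.
sleep 60
kubectl get pods -a -l app=overload-test -o jsonpath="${SNAPSHOT}" | sort > after.txt
diff before.txt after.txt
```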
@yujuhong's fix appears to prevent pods from being permanently pending. While all pods (including terminated pods) are re-added to the kubelet's internal pod manager after restarting, it does not appear that it "processes" rejected pods a second time.
BTW, kubelet may re-reject the pod in this case, but it should never restart (i.e., actually create containers for) a terminated pod. I think this is a much more minor problem and definitely not a release blocker.
Thank you so much for investigating this issue! You rock :)
@yujuhong @fgrzadkowski looks like #37661 merged, so can we close out this issue?
Will close once the cherrypick (#37840) is merged
@yujuhong can you verify the fix in release-1.5 and close the issue?
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

Kubernetes version (use `kubectl version`):

Environment:
- Kernel (e.g. `uname -a`):

What happened:
I was trying to reproduce #34772 using these instructions. Due to a bug in the scheduler we were assigning too many pods per node. In most cases kubelet was correctly discarding them by marking them as `OutOfcpu`. However, two strange things happened:
- …
- upon restart, kubelet tried to process all the `OutOfcpu` pods, run predicate checks for them, etc.

I have a cluster that I can add someone to for investigation.

What you expected to happen:
`OutOfcpu` pods should not be retried after restart; they should stay failed with the `OutOfcpu` error.

How to reproduce it (as minimally and precisely as possible):
Run these instructions.

Anything else do we need to know:
Marking as `release-blocker` until it's triaged by the node team.