Kubeadm cluster breaks persistently on out-of-order upgrade #65562
@kubernetes/sig-cluster-lifecycle-bugs

@danderson: Reiterating the mentions to trigger a notification. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Definitely a bug, or some dev's f-up.
It's been well documented since nearly the beginning of the project that the order of upgrade is control plane first, then nodes. We could definitely add more warnings, but this ordering requirement has existed since the beginning of the project: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade-1-11/. I'm closing this issue in the main repo; please reopen it in https://github.com/kubernetes/kubeadm
We're chatting on Slack right now to discuss options for preventing this problem.

xref - kubernetes/kubeadm#954
Big thanks for the explanation to get things working again, just want to add two notes:
@davidkarlsen @danderson could you add this step (and maybe also point 1 about touching the config) to your write-up of the workaround?
(#155) * Fixed order of the upgrade process. ref: kubernetes/kubernetes#65562 and kubernetes/kubernetes#65863 * Fixed ordering again, my bad.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
(arguably WAI, but I have thoughts about that in the "what you expected" section)
What happened:
On my 1.10.5 cluster (kubeadm, deb package for kubelet), in preparation for upgrading to 1.11.0, I accidentally upgraded the kubelets before the control plane (`apt update && apt upgrade` with the kubelet version not pinned, which results in kubelet+kubeadm 1.11.0 against a 1.10.5 control plane).

After the kubelet upgrade completed, all kubelets in my cluster were crashlooping due to a missing `/var/lib/kubelet/config.yaml`. This means the control plane was also hard down. I couldn't execute `kubeadm upgrade apply v1.11.0` to fix the version skew, because it needs the control plane to be up.

I tried downgrading kubelet to v1.10.5, but the changes made to the systemd drop-in unit were permanent, so kubelet was still crashlooping after the downgrade. This is what I believe to be a bug: downgrading back to a supported set of component versions did not recover the cluster. In other words, "rollback did not roll back".
The release notes just say that "there needs to be a config.yaml for kubelet now", without specifying what goes in that config, or how to construct one (either manually or automatically).
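For anyone wondering what that file even looks like: it is a serialized KubeletConfiguration object. A heavily trimmed, illustrative sketch follows; the field values here are assumptions for illustration, not what kubeadm generates for any particular cluster (the `kubeadm alpha phase kubelet config download` step in the workaround below regenerates the real one):

```yaml
# /var/lib/kubelet/config.yaml (illustrative sketch only)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests
clusterDomain: cluster.local
clusterDNS:
- 10.96.0.10
authentication:
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
```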
I got lucky, and knew just enough kubeadm to bail myself out. In case someone else with this failure mode finds this bug, the way to fix this that worked for me was the following (I can't promise that it works, but it seems to have worked for me):

1. `apt install kubelet=1.10.5`, then `kubeadm alpha phase kubelet config write-to-disk --config=/var/lib/kubelet/config.yaml`, then `systemctl restart kubelet`. This should get kubelet started again, and thus the control plane should come back.
2. `kubeadm upgrade plan v1.11.0` and `kubeadm upgrade apply v1.11.0` to upgrade the control plane.
3. `kubeadm alpha phase kubelet config download` to generate the config.yaml from the in-cluster ConfigMap (which is now reachable since you repaired the control plane). You might need a `systemctl restart kubelet` to defeat the crashloop backoff timer.

What you expected to happen:
I expected Kubernetes to be robust against accidental version skew and out-of-order upgrading. Especially given the current state of managing k8s clusters, the situation where you end up upgrading the kubelet before the control plane is unfortunately common.
By "robust", I don't necessarily mean "should work", since this configuration is clearly outside the supported version skew. However, I think it should not persistently break the cluster. In other words, the only necessary corrective action should have been `apt install kubelet=1.10.5` to downgrade back to a supported {control plane}x{kubelet} version set.

At minimum, I expected the Debian package to set things up such that a downgrade goes back to a working configuration. Even better would be if kubelet 1.11 had crashlooped with a useful message, e.g. "oh dear, it looks like you did upgrades out of order, because I don't have a config. Downgrade back to 1.10 and go upgrade the control plane first!" That way, people who don't have a rollback reflex as sharp as mine are still guided in the right direction.
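The workaround steps described above can be collected into one shell function for reference. This is a sketch assuming the v1.10/v1.11-era kubeadm subcommands (the `kubeadm alpha phase` tree was reorganized in later releases) and Debian/Ubuntu packaging; the `--allow-downgrades` flag and the exact package version pin are my additions, and `kubeadm upgrade apply` will still prompt interactively:

```shell
# Sketch of the recovery sequence from the workaround above.
# Run on the control-plane node; assumes deb packages and
# v1.10/v1.11-era kubeadm subcommands.
recover_from_skew() {
    # 1. Roll kubelet back to the control plane's version, write a
    #    config.yaml, and restart kubelet so the control plane comes back.
    apt install -y --allow-downgrades kubelet=1.10.5-00 &&
    kubeadm alpha phase kubelet config write-to-disk \
        --config=/var/lib/kubelet/config.yaml &&
    systemctl restart kubelet &&
    # 2. With the control plane up again, upgrade it in the correct order.
    kubeadm upgrade apply v1.11.0 &&
    # 3. Regenerate config.yaml from the in-cluster ConfigMap, then restart
    #    kubelet to defeat the crashloop backoff timer.
    kubeadm alpha phase kubelet config download &&
    systemctl restart kubelet
}
```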
How to reproduce it (as minimally and precisely as possible):
On Debian testing (or Ubuntu Xenial; the failure mode should be identical):

1. `apt install kubelet=1.10.5-00 kubeadm`
2. `kubeadm init` on the control plane, `kubeadm join` on worker nodes as appropriate.
3. `kubectl get nodes` works, and lists all nodes at 1.10.5.
4. `apt install kubelet` to upgrade kubelet to 1.11.0-00.
5. `kubectl get nodes` can no longer connect to the control plane. `journalctl -u kubelet` shows kubelet crashlooping due to the lack of config.yaml.
6. `apt install kubelet=1.10.5-00` to downgrade kubelet.
7. `kubectl get nodes` is still broken; `journalctl -u kubelet` shows the same crashloop.
8. After applying the workaround described above, `kubectl get nodes` works again, and the kubelets have started back up.

Environment:
- Kubernetes version (`kubectl version`): v1.10.5 -> v1.11.0
- Kernel (`uname -a`): Linux prod-01 4.16.0-2-amd64 #1 SMP Debian 4.16.12-1 (2018-05-27) x86_64 GNU/Linux