kube-apiserver occasionally terminates on cluster boot and doesn't get restarted #7241
Description
Environmental Info:
RKE2 Version:
rke2 version v1.31.1+rke2r1 (909d20d)
go version go1.22.6 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
Rocky 9
Linux 5.14.0-427.24.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 8 17:47:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
Running in AWS with 3 servers and 7 agents. Using Cilium. Servers are 2-core, 4 GB RAM.
Describe the bug:
Our cluster is shut down overnight for cost savings and started back up each morning. Twice in the last month a server node has failed to start properly, and this has prevented some agent nodes from joining, despite the other 2 servers being marked as healthy. We were previously on v1.29.3+rke2r1 and never had this issue after running it for several months with the same shutdown behaviour.
Looking through the broken server (server0), kube-apiserver gets terminated during rke2-server startup, which causes numerous other issues. ps aux | grep kube-apiserver returns nothing on server0.
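For reference, the exact check (a minimal sketch; pgrep avoids matching the grep process itself):
$ pgrep -af kube-apiserver          # prints nothing on server0
$ ps aux | grep '[k]ube-apiserver'  # same result; the bracket trick excludes the grep itself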
Steps To Reproduce:
- Install RKE2 on AWS using https://github.com/lablabs/ansible-role-rke2.
- Shut down all cluster nodes.
- Turn all cluster nodes back on (roughly as sketched below).
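The overnight shutdown/startup is roughly equivalent to the following (a sketch with placeholder instance IDs; the actual scheduling automation differs):
$ aws ec2 stop-instances --instance-ids i-0aaa i-0bbb i-0ccc    # all servers + agents, overnight
$ aws ec2 start-instances --instance-ids i-0aaa i-0bbb i-0ccc   # each morning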
Expected behavior:
The cluster starts up successfully.
The cluster is not severely degraded from 1 out of 3 server nodes having issues.
Actual behavior:
The cluster does not always start up successfully.
The cluster is severely degraded from 1 out of 3 server nodes having issues.
Additional context / logs:
etcd appears healthy. On server0, with 2456 being the PID of etcd:
$ nsenter --target 2456 --mount --net --pid --uts -- etcdctl \
--cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status --cluster --write-out=table;
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.23.141:2379 | 12659cee018f1be2 | 3.5.13 | 52 MB | true | false | 248 | 71361783 | 71361783 | |
| https://10.128.23.66:2379 | 857d2c9e2293e976 | 3.5.13 | 52 MB | false | false | 248 | 71361783 | 71361783 | |
| https://10.128.21.171:2379 | f699167f1021d729 | 3.5.13 | 52 MB | false | false | 248 | 71361783 | 71361783 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
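For reference, a plain health probe can be run the same way (a sketch, reusing the nsenter/cert setup above):
$ nsenter --target 2456 --mount --net --pid --uts -- etcdctl \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    endpoint health --cluster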
The kube-apiserver logs on server1 & server2 are filled with:
authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
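Those entries were pulled from the static pod logs on the healthy servers, roughly like this (sketch; exact log paths vary per node):
$ grep -c 'Unable to authenticate the request' \
    /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log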
Cilium is failing to start on server0, but with kube-apiserver down I assume nothing will start.
I have found this comment about load balancer health checks: #5557 (comment). We are not running such health checks (but will implement them), though I don't think that is the issue here, as etcd appears healthy.
Looking at kube-apiserver on server0
/var/log/containers/kube-apiserver-k8s-server0.DOMAIN_kube-system_kube-apiserver-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.log -> /var/log/pods/kube-system_kube-apiserver-k8s-server0.DOMAIN_80bcfddfbe1c3d0b4d90fe8d28b7a51c/kube-apiserver/22.log
(attached as k8s-server0-not-ready-kube-apiserver-pod.log)
The kube-apiserver on server0 is container a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171. In /var/log/messages I can see
Nov 7 06:02:02 k8s-server0 systemd[1]: Started libcontainer container a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.
Nov 7 06:02:26 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.choiNb.mount: Deactivated successfully.
Nov 7 06:02:31 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.BKojHd.mount: Deactivated successfully.
Nov 7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Deactivated successfully.
Nov 7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Consumed 9.180s CPU time.
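(The /var/log/messages lines above were found by grepping for the container ID:)
$ grep a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171 /var/log/messages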
Weirdly, crictl reports it as running on server0:
$ crictl ps --name kube-apiserver
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
a4d23bea00f67 1b44d478ec444 9 hours ago Running kube-apiserver 22 093e8959f7b4d kube-apiserver-k8s-server0.DOMAIN
$ crictl inspect a4d23bea00f67 | jq .info.pid
2372
$ ps aux | grep 2372
root 200950 0.0 0.0 6408 2176 pts/0 S+ 15:22 0:00 grep --color=auto 2372
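i.e. the PID that containerd recorded for the container no longer exists (sketch of the same check via /proc):
$ ls -d /proc/2372   # fails with 'No such file or directory' on server0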
The containerd logs on server0 say the container has been killed, but then something tries to exec into it:
time="2024-11-07T06:02:41.609408926Z" level=info msg="Kill container \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\""
time="2024-11-07T06:02:41.987115383Z" level=error msg="ExecSync for \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\" failed" error="failed to exec in container: failed to start exec \"36daae2d0ca0aa2b27f83e18036018231cccee0bdfb824c5d11e9d9ed4020f3c\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"
I cannot stop the zombie container
$ crictl stop a4d23bea00f67
E1107 16:15:02.506585 219370 remote_runtime.go:366] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="a4d23bea00f67"
FATA[0002] stopping the container "a4d23bea00f67": rpc error: code = DeadlineExceeded desc = context deadline exceeded
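For completeness, the pod-level cleanup we have not yet tried (hypothetical; it may time out the same way):
$ crictl stopp 093e8959f7b4d        # stop the whole kube-apiserver pod sandbox
$ crictl rm --force a4d23bea00f67   # force-remove the stuck container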
Running systemctl restart rke2-server on server0 fixes it and all nodes immediately join, but we would like to understand why this is happening.
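The full workaround plus a quick verification (sketch, assuming kubectl is on PATH):
$ systemctl restart rke2-server
$ crictl ps --name kube-apiserver                               # a new kube-apiserver container comes up
$ kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes    # agents go Ready shortly after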
Log attachments
k8s-agent0-not-ready-rke2-agent.log
k8s-server0-not-ready-containerd.log
k8s-server0-not-ready-rke2-server.log
k8s-server1-rke2-server.log
k8s-server2-rke2-server.log
k8s-agent6-rke2-agent.log
k8s-server0-not-ready-kube-apiserver-pod.log