kube-apiserver occasionally terminates on cluster boot and doesn't get restarted #7241

Closed
@hoo29

Description

Environmental Info:
RKE2 Version:
rke2 version v1.31.1+rke2r1 (909d20d)
go version go1.22.6 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Rocky 9
Linux 5.14.0-427.24.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 8 17:47:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Running in AWS with 3 servers and 7 agents. Using cilium. Servers are 2 core 4 GB ram.

Describe the bug:
Our cluster is shut down overnight for cost saving and started back up each morning. Twice in the last month a server node has failed to start properly, and this has prevented some agent nodes from joining, despite the other 2 servers being marked as healthy. We were previously on v1.29.3+rke2r1 and never had this issue after running it for several months with the same shutdown behaviour.

Looking through the broken server (server0), kube-apiserver gets terminated during rke2-server startup, which causes numerous other issues. ps aux | grep kube-apiserver returns nothing on server0.
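As a quick cross-check (a minimal sketch, assuming the standard RKE2 static control-plane components), the same process check can be broadened to all of them:

$ for c in kube-apiserver kube-controller-manager kube-scheduler; do
    pgrep -f "$c" >/dev/null && echo "$c: running" || echo "$c: NOT running"
  done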

Steps To Reproduce:

Expected behavior:
The cluster starts up successfully.

The cluster is not severely degraded by 1 out of 3 server nodes having issues.

Actual behavior:
The cluster does not always start up successfully.

The cluster is severely degraded by 1 out of 3 server nodes having issues.

Additional context / logs:
etcd appears healthy. On server0, with 2456 being the PID of etcd:

$ nsenter --target 2456 --mount  --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status  --cluster --write-out=table;

+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.23.141:2379 | 12659cee018f1be2 |  3.5.13 |   52 MB |      true |      false |       248 |   71361783 |           71361783 |        |
|  https://10.128.23.66:2379 | 857d2c9e2293e976 |  3.5.13 |   52 MB |     false |      false |       248 |   71361783 |           71361783 |        |
| https://10.128.21.171:2379 | f699167f1021d729 |  3.5.13 |   52 MB |     false |      false |       248 |   71361783 |           71361783 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
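For completeness, the same nsenter wrapper can also run etcdctl's liveness check (a sketch reusing the PID and cert paths from the command above):

$ nsenter --target 2456 --mount --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint health --cluster --write-out=table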

The kube-apiserver logs on server1 & server2 are filled with authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]".
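Those entries can be pulled straight from the static pod logs on the healthy servers (a sketch; the wildcarded path assumes the same /var/log/pods layout shown for server0 below):

$ grep -c 'Unable to authenticate the request' \
  /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log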

Cilium is failing to start on server0, but with kube-apiserver down I assume nothing will start anyway.

I have found this comment about load balancer health checks in #5557 (comment); we are not doing that (but will implement it), however I don't think it is the issue, as etcd appears healthy.

Looking at kube-apiserver on server0

/var/log/containers/kube-apiserver-k8s-server0.DOMAIN_kube-system_kube-apiserver-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.log -> /var/log/pods/kube-system_kube-apiserver-k8s-server0.DOMAIN_80bcfddfbe1c3d0b4d90fe8d28b7a51c/kube-apiserver/22.log (attached as k8s-server0-not-ready-kube-apiserver-pod.log)

The kube-apiserver container on server0 is a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171. In /var/log/messages I can see:

Nov  7 06:02:02 k8s-server0 systemd[1]: Started libcontainer container a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.
Nov  7 06:02:26 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.choiNb.mount: Deactivated successfully.
Nov  7 06:02:31 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.BKojHd.mount: Deactivated successfully.
Nov  7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Deactivated successfully.
Nov  7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Consumed 9.180s CPU time.

Weirdly, crictl reports it as running on server0:

$ crictl ps --name  kube-apiserver
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
a4d23bea00f67       1b44d478ec444       9 hours ago         Running             kube-apiserver      22                  093e8959f7b4d       kube-apiserver-k8s-server0.DOMAIN

$ crictl inspect a4d23bea00f67 | jq .info.pid
2372

$ ps aux | grep 2372
root      200950  0.0  0.0   6408  2176 pts/0    S+   15:22   0:00 grep --color=auto 2372
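A small sketch (assuming crictl is already pointed at RKE2's containerd socket, as in the commands above) to cross-check every PID crictl reports against the process table and spot this kind of "Running but no process" mismatch:

$ for c in $(crictl ps -q); do
    pid=$(crictl inspect "$c" | jq .info.pid)
    # kill -0 only probes for existence; it does not signal the process
    kill -0 "$pid" 2>/dev/null || echo "container $c reports PID $pid but no such process"
  done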

The containerd logs on server0 say the container has been killed, but then something tries to exec into it:

time="2024-11-07T06:02:41.609408926Z" level=info msg="Kill container \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\""
time="2024-11-07T06:02:41.987115383Z" level=error msg="ExecSync for \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\" failed" error="failed to exec in container: failed to start exec \"36daae2d0ca0aa2b27f83e18036018231cccee0bdfb824c5d11e9d9ed4020f3c\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"

I cannot stop the zombie container

$ crictl stop a4d23bea00f67
E1107 16:15:02.506585  219370 remote_runtime.go:366] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="a4d23bea00f67"
FATA[0002] stopping the container "a4d23bea00f67": rpc error: code = DeadlineExceeded desc = context deadline exceeded

Running systemctl restart rke2-server on server0 fixes it and all nodes immediately join, but we would like to understand why this is happening.
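A hedged verification sketch of that workaround, reusing the checks from earlier in this report:

$ systemctl restart rke2-server
$ ps aux | grep '[k]ube-apiserver'    # process exists again
$ crictl ps --name kube-apiserver     # container state now matches a real process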

Log attachments
k8s-agent0-not-ready-rke2-agent.log
k8s-server0-not-ready-containerd.log
k8s-server0-not-ready-rke2-server.log
k8s-server1-rke2-server.log
k8s-server2-rke2-server.log
k8s-agent6-rke2-agent.log
k8s-server0-not-ready-kube-apiserver-pod.log
