kube-apiserver occasionally terminates on cluster boot and doesn't get restarted #7241

Closed
@hoo29

Description

Environmental Info:
RKE2 Version:
rke2 version v1.31.1+rke2r1 (909d20d)
go version go1.22.6 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Rocky 9
Linux 5.14.0-427.24.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jul 8 17:47:19 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Running in AWS with 3 servers and 7 agents. Using cilium. Servers are 2 core 4 GB ram.

Describe the bug:
Our cluster is shut down overnight for cost saving and started back up each morning. Twice in the last month a server node has failed to start properly, and this has prevented some agent nodes from joining, despite the other 2 servers being marked as healthy. We were previously on v1.29.3+rke2r1 and never had this issue after running it for several months with the same shutdown behaviour.

Looking through the broken server (server0), kube-apiserver gets terminated during rke2-server startup, which causes numerous other issues. ps aux | grep kube-apiserver returns nothing on server0.
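As a quick cross-check (a minimal sketch, assuming the standard RKE2 static control-plane components), the same process check can be broadened to all of them:

$ for c in kube-apiserver kube-controller-manager kube-scheduler; do
    pgrep -f "$c" >/dev/null && echo "$c: running" || echo "$c: NOT running"
  done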

Steps To Reproduce:

Expected behavior:
The cluster starts up successfully.

The cluster is not severely degraded by 1 out of 3 server nodes having issues.

Actual behavior:
The cluster does not always start up successfully.

The cluster is severely degraded by 1 out of 3 server nodes having issues.

Additional context / logs:
etcd appears healthy. On server0, with 2456 being the PID of etcd:

$ nsenter --target 2456 --mount  --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint status  --cluster --write-out=table;

+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.23.141:2379 | 12659cee018f1be2 |  3.5.13 |   52 MB |      true |      false |       248 |   71361783 |           71361783 |        |
|  https://10.128.23.66:2379 | 857d2c9e2293e976 |  3.5.13 |   52 MB |     false |      false |       248 |   71361783 |           71361783 |        |
| https://10.128.21.171:2379 | f699167f1021d729 |  3.5.13 |   52 MB |     false |      false |       248 |   71361783 |           71361783 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
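For completeness, the same nsenter wrapper can also run etcdctl's liveness check (a sketch reusing the PID and cert paths from the command above):

$ nsenter --target 2456 --mount --net --pid --uts -- etcdctl \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt endpoint health --cluster --write-out=table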

The kube-apiserver logs on server1 & server2 are filled with authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]".
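Those entries can be pulled straight from the static pod logs on the healthy servers (a sketch; the wildcarded path assumes the same /var/log/pods layout shown for server0 below):

$ grep -c 'Unable to authenticate the request' \
  /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log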

Cilium is failing to start on server0, but with kube-apiserver down I assume nothing will start anyway.

I have found this comment about load balancer health checks in #5557 (comment); we are not doing that (but will implement it), however I don't think it is the issue, as etcd appears healthy.

Looking at kube-apiserver on server0

/var/log/containers/kube-apiserver-k8s-server0.DOMAIN_kube-system_kube-apiserver-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.log -> /var/log/pods/kube-system_kube-apiserver-k8s-server0.DOMAIN_80bcfddfbe1c3d0b4d90fe8d28b7a51c/kube-apiserver/22.log (attached as k8s-server0-not-ready-kube-apiserver-pod.log)

The kube-apiserver container on server0 is a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171. In /var/log/messages I can see:

Nov  7 06:02:02 k8s-server0 systemd[1]: Started libcontainer container a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.
Nov  7 06:02:26 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.choiNb.mount: Deactivated successfully.
Nov  7 06:02:31 k8s-server0 systemd[1]: run-containerd-runc-k8s.io-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171-runc.BKojHd.mount: Deactivated successfully.
Nov  7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Deactivated successfully.
Nov  7 06:02:42 k8s-server0 systemd[1]: cri-containerd-a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171.scope: Consumed 9.180s CPU time.

Weirdly, crictl reports it as running on server0:

$ crictl ps --name  kube-apiserver
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
a4d23bea00f67       1b44d478ec444       9 hours ago         Running             kube-apiserver      22                  093e8959f7b4d       kube-apiserver-k8s-server0.DOMAIN

$ crictl inspect a4d23bea00f67 | jq .info.pid
2372

$ ps aux | grep 2372
root      200950  0.0  0.0   6408  2176 pts/0    S+   15:22   0:00 grep --color=auto 2372
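A small sketch (assuming crictl is already pointed at RKE2's containerd socket, as in the commands above) to cross-check every PID crictl reports against the process table and spot this kind of "Running but no process" mismatch:

$ for c in $(crictl ps -q); do
    pid=$(crictl inspect "$c" | jq .info.pid)
    # kill -0 only probes for existence; it does not signal the process
    kill -0 "$pid" 2>/dev/null || echo "container $c reports PID $pid but no such process"
  done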

The containerd logs on server0 say the container has been killed, but then something tries to exec into it:

time="2024-11-07T06:02:41.609408926Z" level=info msg="Kill container \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\""
time="2024-11-07T06:02:41.987115383Z" level=error msg="ExecSync for \"a4d23bea00f6796f0ae87e8779331cd3abb4557a6208ad6ba52666e2ae544171\" failed" error="failed to exec in container: failed to start exec \"36daae2d0ca0aa2b27f83e18036018231cccee0bdfb824c5d11e9d9ed4020f3c\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"

I cannot stop the zombie container

$ crictl stop a4d23bea00f67
E1107 16:15:02.506585  219370 remote_runtime.go:366] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="a4d23bea00f67"
FATA[0002] stopping the container "a4d23bea00f67": rpc error: code = DeadlineExceeded desc = context deadline exceeded

Running systemctl restart rke2-server on server0 fixes it and all nodes immediately join, but we would like to understand why this is happening.
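A hedged verification sketch of that workaround, reusing the checks from earlier in this report:

$ systemctl restart rke2-server
$ ps aux | grep '[k]ube-apiserver'    # process exists again
$ crictl ps --name kube-apiserver     # container state now matches a real process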

Log attachments
k8s-agent0-not-ready-rke2-agent.log
k8s-server0-not-ready-containerd.log
k8s-server0-not-ready-rke2-server.log
k8s-server1-rke2-server.log
k8s-server2-rke2-server.log
k8s-agent6-rke2-agent.log
k8s-server0-not-ready-kube-apiserver-pod.log
