Cilium does not restore some endpoints and leaves them stale #37077
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.14.18 and lower than v1.15.0
What happened?
In my environment, after the cilium-agent pod was rebuilt on a node, some pods on that node entered a crash state. I found that some local pods could not be pinged from the node, and the ping output looked similar to the following:
```
# ping 172.16.0.13
PING 172.16.0.13 (172.16.0.13) 56(84) bytes of data.
From 172.16.0.1 icmp_seq=1 Time to live exceeded
From 172.16.0.1 icmp_seq=2 Time to live exceeded
```
I searched for the corresponding Cilium endpoint in the cilium-agent container on the node with commands such as `cilium endpoint list | grep $podIP`, but no matching endpoint could be found. However, I can see the log of the Cilium endpoint being created when the pod was created:
```
level=debug msg="Endpoint successfully created" containerID=a7fe558a0ace4be4e6bf06895272cf5eecdc9b1adb41b52b0a230a7b12943cf0 eventUUID=e7a3754e-2222-48ad-af43-733d0db20b1a subsys=cilium-cni
```
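For reference, this is roughly how I checked for the endpoint from inside the cilium-agent container; the pod IP below is one of the affected pods in my environment, and the commands are a sketch rather than a full transcript:

```bash
# IP of one of the affected pods (from the ping output above).
POD_IP=172.16.0.13

# List all endpoints known to the running agent and filter by the pod IP.
# For the broken pods this returned no match at all.
cilium endpoint list | grep "$POD_IP"
```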
I also found a restore log entry in cilium-agent:
```
2025-01-15T01:35:04.106977887+08:00 stdout F level=info msg="Restored endpoint" endpointID=3718 ipAddr="[172.16.0.13 ]" subsys=endpoint
```
However, after the cilium-agent container was rebuilt, a large number of the following errors appeared in the logs:
When I check the endpoint state under the /var/run/cilium/state directory, the directory contents are as follows, and the endpoint cannot be restored from them:
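As a rough sketch of how I compared the on-disk state with what the agent restored (directory names under /var/run/cilium/state correspond to endpoint IDs; the exact IDs from my environment are omitted here):

```bash
# Per-endpoint state directories that the agent is supposed to restore from on startup.
ls /var/run/cilium/state

# Endpoints the running agent actually knows about. Any state directory whose
# endpoint ID does not appear in this list has been left stale on disk.
cilium endpoint list
```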
How can we reproduce the issue?
It is not clear exactly what triggers this problem, but a restart of the cilium-agent container seems to be an important precondition; a rough reproduction sketch is shown below.
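This is only a sketch of how I would try to reproduce it; the namespace, label selector, and placeholder names assume a typical Cilium installation and are not taken from my environment verbatim:

```bash
# Restart the cilium-agent pod on the affected node (assumes the default
# kube-system namespace and the standard k8s-app=cilium label).
NODE=<node-name>
kubectl -n kube-system delete pod -l k8s-app=cilium \
  --field-selector spec.nodeName="$NODE"

# Once the new agent pod is running, check whether all local pod IPs were restored.
CILIUM_POD=<cilium pod on the affected node>
kubectl -n kube-system exec "$CILIUM_POD" -- cilium endpoint list
```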
Cilium Version
v1.12.5
Kernel Version
4.14.105-19-0019
Kubernetes Version
v1.22.5
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct