Cilium does not restore some endpoints and leaves them stale #37077

Open
@fzu-huang

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.14.18 and lower than v1.15.0

What happened?

In my environment, after the cilium pod on a node was recreated, some pods on that node entered a crash state. Some local pods could not be pinged from the node, and the ping output looked like this:

# ping 172.16.0.13
PING 172.16.0.13 (172.16.0.13) 56(84) bytes of data.
From 172.16.0.1 icmp_seq=1 Time to live exceeded
From 172.16.0.1 icmp_seq=2 Time to live exceeded

I searched for the endpoint inside the cilium-agent container on the node with commands such as cilium endpoint list | grep $podIP, but no matching endpoint was found. However, the log does show the endpoint being created when the pod was created:

level=debug msg="Endpoint successfully created" containerID=a7fe558a0ace4be4e6bf06895272cf5eecdc9b1adb41b52b0a230a7b12943cf0 eventUUID=e7a3754e-2222-48ad-af43-733d0db20b1a subsys=cilium-cni
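
One caveat with the grep search above: grep matches substrings and treats each dot as a wildcard, so a search for 172.16.0.13 would also match 172.16.0.130. A minimal sketch of an exact-address search follows, assuming the output of cilium endpoint list has been saved as text. The function name `lines_with_ip` and the sample rows are hypothetical and do not reproduce the real column layout.

```python
import re


def lines_with_ip(listing: str, pod_ip: str) -> list[str]:
    """Return the lines of `listing` that contain pod_ip as a complete address."""
    # Escape the dots and reject matches that are preceded or followed by
    # another digit or dot, so 172.16.0.13 does not match 172.16.0.130.
    pattern = re.compile(r"(?<![\d.])" + re.escape(pod_ip) + r"(?![\d.])")
    return [line for line in listing.splitlines() if pattern.search(line)]


# Hypothetical, simplified rows (not the real `cilium endpoint list` layout).
sample = "3718  172.16.0.13   ready\n3720  172.16.0.130  ready\n"
for line in lines_with_ip(sample, "172.16.0.13"):
    print(line)
```

An empty result from this stricter match would support the observation that the endpoint is really missing from the list, not hidden by a near-match.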

I also found a restore log entry for this endpoint in cilium-agent:

2025-01-15T01:35:04.106977887+08:00 stdout F level=info msg="Restored endpoint" endpointID=3718 ipAddr="[172.16.0.13 ]" subsys=endpoint
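
To see which endpoints the agent reports as restored, the log can be scanned for entries like the one above. This is a minimal sketch that assumes every restore entry has the same shape as that single line. The names `RESTORED` and `restored_endpoints` are hypothetical helpers, not part of Cilium.

```python
import re

# Pattern modelled on the single "Restored endpoint" line quoted in this report.
RESTORED = re.compile(
    r'msg="Restored endpoint"\s+endpointID=(?P<id>\d+)\s+ipAddr="\[(?P<ips>[^\]]*)\]"'
)


def restored_endpoints(log_text: str) -> dict[int, list[str]]:
    """Map each restored endpoint ID to the list of IPs logged for it."""
    restored = {}
    for line in log_text.splitlines():
        match = RESTORED.search(line)
        if match:
            # split() also drops the trailing space seen inside the brackets.
            restored[int(match.group("id"))] = match.group("ips").split()
    return restored


log_line = (
    '2025-01-15T01:35:04.106977887+08:00 stdout F level=info '
    'msg="Restored endpoint" endpointID=3718 ipAddr="[172.16.0.13 ]" subsys=endpoint'
)
print(restored_endpoints(log_line))
```

The resulting IDs and IPs can then be compared against the endpoint list to find endpoints that were logged as restored but are no longer present.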

However, after the cilium-agent container is rebuilt, a large number of the following errors appear in the logs:

[Image: screenshot of the errors in the cilium-agent log]

When I check the endpoint in the /var/run/cilium/state directory, I find the directory in the following state, and the endpoint cannot be recovered from it
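
For attaching the contents of the state directory to a report in text form, a minimal sketch follows. It only lists each subdirectory and the files inside it. It makes no assumption about which files Cilium writes there, and `inventory` is a hypothetical helper name.

```python
from pathlib import Path


def inventory(state_dir: str) -> dict[str, list[str]]:
    """Map each subdirectory name under state_dir to its sorted file names."""
    result = {}
    for entry in sorted(Path(state_dir).iterdir()):
        if entry.is_dir():
            result[entry.name] = sorted(p.name for p in entry.iterdir())
    return result


if __name__ == "__main__":
    # Path taken from this report; run inside the cilium-agent container.
    state = Path("/var/run/cilium/state")
    if state.is_dir():
        for name, files in inventory(str(state)).items():
            print(name, files)
```

An empty or unexpectedly small file list for an endpoint's subdirectory would be worth including in the report.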

How can we reproduce the issue?

It is not clear how this problem is triggered, but a restart of the cilium-agent container appears to be an important condition.

Cilium Version

v1.12.5

Kernel Version

4.14.105-19-0019

Kubernetes Version

v1.22.5

Regression

No response

Sysdump

No response

Relevant log output

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

    Labels

    kind/bug: This is a bug in the Cilium logic.
    kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
    needs/triage: This issue requires triaging to establish severity and next steps.
    sig/agent: Cilium agent related.