After upgrading Cilium to 1.16.6 or 1.17.0rc0, some pods lost network connectivity #37271
Open
Description
I am not sure whether my problem is related or similar to #36432.
I am using version 1.16.2 and it is working fine.
Edit 1: I tested with 1.16.4, and it works fine too.
I tried upgrading to 1.17.0rc1 and 1.16.6. Each time, one of the nodes has problems with some pods.
After I deleted those pods, they were recreated as new pods, but the new pods fail the health check.
From the host, I cannot reach the service port of the pod.
When I checked the logs, I found the following:
Jan 25 05:13:48 server1 k3s[36942]: #011rpc error: code = Unknown desc = failed to setup network for sandbox "a2d3d57563530099f9ff0495817984bdf6aa8d5041c7f24dfbfdc3b1aecbe313": plugin type="cilium-cni" failed (add): unable to connect to Cilium agent: failed to create cilium agent client after 30.000000 seconds timeout: Get "http://localhost/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Jan 25 05:13:48 server1 k3s[36942]: #011rpc error: code = Unknown desc = failed to setup network for sandbox "a2d3d57563530099f9ff0495817984bdf6aa8d5041c7f24dfbfdc3b1aecbe313": plugin type="cilium-cni" failed (add): unable to connect to Cilium agent: failed to create cilium agent client after 30.000000 seconds timeout: Get "http://localhost/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory
Jan 25 05:13:48 server1 k3s[36942]: E0125 05:13:48.926700 36942 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"opentelemetry-operator-646c469bf7-28frj_otel(7c370fef-91f6-43bf-a7a4-3638b45fc073)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"opentelemetry-operator-646c469bf7-28frj_otel(7c370fef-91f6-43bf-a7a4-3638b45fc073)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"a2d3d57563530099f9ff0495817984bdf6aa8d5041c7f24dfbfdc3b1aecbe313\\\": plugin type=\\\"cilium-cni\\\" failed (add): unable to connect to Cilium agent: failed to create cilium agent client after 30.000000 seconds timeout: Get \\\"http://localhost/v1/config\\\": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory\\nIs the agent running?\"" pod="otel/opentelemetry-operator-646c469bf7-28frj" podUID="7c370fef-91f6-43bf-a7a4-3638b45fc073"
When I roll back to version 1.16.2 or 1.16.4, everything works fine.
I am using Fedora 41 with kernel 6.12.7.
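For anyone trying to reproduce this: the error in the logs says the CNI plugin could not dial `/var/run/cilium/cilium.sock` when requesting `/v1/config`. Below is a minimal sketch of that same check, to be run on the affected node. It assumes Python 3 is available on the host and that the socket path is visible there. The helper name `probe_agent_socket` and the returned strings are made up for illustration and are not part of Cilium.

```python
import os
import socket

# Socket path copied from the error message in the k3s logs.
CILIUM_SOCK = "/var/run/cilium/cilium.sock"


def probe_agent_socket(path=CILIUM_SOCK, timeout=5.0):
    """Hypothetical helper: report whether the agent socket exists and answers.

    Returns "missing" when the path does not exist (the case reported in the
    logs), "unreachable: ..." when connecting or reading fails, and
    "responded: <HTTP status line>" when something answers on the socket.
    """
    if not os.path.exists(path):
        return "missing"

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        # Same endpoint the CNI plugin requested according to the log line.
        sock.sendall(b"GET /v1/config HTTP/1.0\r\nHost: localhost\r\n\r\n")
        status_line = sock.recv(4096).split(b"\r\n", 1)[0]
        return "responded: " + status_line.decode("ascii", "replace")
    except OSError as exc:
        # Covers connection refused, permission denied, and timeouts.
        return "unreachable: %s" % exc
    finally:
        sock.close()


if __name__ == "__main__":
    print(probe_agent_socket())
```

A result of `missing` would match the "no such file or directory" error above and suggest the agent on that node never created its socket. It does not explain why the agent failed to start, so the agent pod's own logs would still be needed.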