Ztunnel fails with 'failed to bind to address [::1]:15053: Cannot assign requested address' #52858

kovaxur · 2024-08-26T17:19:46Z

Is this the right place to submit this?

This is not a security vulnerability or a crashing bug
This is not a question about how to use Istio

Bug Description

Hi,
We had a strange issue where the Istio gateway intermittently reported "upstream connect error or disconnect/reset before headers. reset reason: connection termination" for certain endpoints. I went down the HTTP/1.1, HTTP/2, and timeout rabbit hole, but then realized that the pods returning this error are not reachable from the gateway at all: when I curl the pod manually, I get a connection refused error on both port 15008 and port 8080 (the app port).

Then I noticed that the ztunnel pod on one of the cluster's nodes is in a "not ready" state and logs the following error:

2024-08-26T16:58:16.711297Z    info    dns::server    starting local DNS server    address=localhost:15053 component="dns"                                                                
2024-08-26T16:58:16.711448Z    info    inpod::statemanager    retrying workload failed: failed to bind to address [::1]:15053: Cannot assign requested address (os error 99)    uid="c9df3bc2-8943-4f2d-8441-9beb07aa0a04"

container name: istio-proxy
container image: gcr.io/istio-testing/ztunnel:1.24-alpha.d334295f1866d584af78164ad99e86bedd44a6ac-distroless

The container is stuck in the ready=false state and has 0 restarts.

ztunnel-x4kkj:/root$ netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 ztunnel-x4kkj:41944     istiod.istio-system.svc.cluster.local:15012 ESTABLISHED 
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:41562 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15020     grafana-agent-metrics-0.grafana-XXXXX.svc.cluster.local:52036 ESTABLISHED 
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:47706 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:34532 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:43054 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:44438 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:40054 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:49342 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:34520 TIME_WAIT   
tcp        0      0 ztunnel-x4kkj:15021     10-201-0-26.kubelet.default.svc.cluster.local:43056 TIME_WAIT   
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node Path
unix  3      [ ]         SEQPACKET  CONNECTED      32937 /var/run/ztunnel/ztunnel.sock
unix  3      [ ]         SEQPACKET  CONNECTED      32936 
unix  3      [ ]         STREAM     CONNECTED      32909 
unix  3      [ ]         STREAM     CONNECTED      32908

ztunnel-x4kkj:/root$ netstat -tnlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:15000         0.0.0.0:*               LISTEN      -
tcp        0      0 ::1:15000               :::*                    LISTEN      -
tcp        0      0 :::15020                :::*                    LISTEN      -
tcp        0      0 :::15021                :::*                    LISTEN      -

ztunnel-x4kkj:/root$ netstat -unlp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name

ztunnel-x4kkj:/root$ lsof -i :15053
1	/bin/bash	0	/dev/pts/0
1	/bin/bash	1	/dev/pts/0
1	/bin/bash	2	/dev/pts/0
1	/bin/bash	255	/dev/pts/0

Binding to TCP port 15020 fails, which is expected because it is already in use (see the listener above), while binding to UDP port 15053 works without issue:

ztunnel-x4kkj:/root$ nc -l 15020
nc: Address in use
ztunnel-x4kkj:/root$ nc -u -l 15053



^C
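
Note that nc -u -l 15053 above binds a UDP socket on the wildcard address, whereas the log complains about [::1]:15053 specifically. Below is a minimal Python sketch for probing the exact loopback addresses. It is not part of ztunnel, try_bind is a hypothetical helper name, and it only reports what the network namespace it runs in allows. On Linux, errno 99 is EADDRNOTAVAIL, which bind generally returns when the requested address is not assigned to any interface in the current network namespace; whether that is the situation here is not established at this point in the thread.

```python
import errno
import socket


def try_bind(host: str, port: int, sock_type: int = socket.SOCK_STREAM) -> str:
    """Try to bind host:port and return "ok" or the symbolic errno name."""
    # Assumption: a literal containing ':' is an IPv6 address.
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    try:
        with socket.socket(family, sock_type) as sock:
            sock.bind((host, port))
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    # 15053 is the port from the log line. Both loopback addresses are probed
    # because "localhost" can resolve to either one.
    for host in ("127.0.0.1", "::1"):
        for name, kind in (("tcp", socket.SOCK_STREAM), ("udp", socket.SOCK_DGRAM)):
            print(f"{name} {host}:15053 -> {try_bind(host, 15053, kind)}")
```

The result depends on where the script runs: the ztunnel pod and a workload pod have different network namespaces, so a successful bind in one says nothing about the other.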

I'm using the alpha version due to #52260. Can this be related?

Version

Istio:
client version: 1.22.1
control plane version: 1.24-alpha.d334295f1866d584af78164ad99e86bedd44a6ac
data plane version: 1.23.0 (3 proxies), 1.24-alpha.d334295f1866d584af78164ad99e86bedd44a6ac (52 proxies)

Kubernetes:
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6-eks-db838b0

Additional Information

No response

howardjohn · 2024-08-26T18:13:07Z

Is there any chance you can run those same commands in the pod network namespace, plus ip addr?

In the meantime, setting defaults.meshConfig.defaultConfig.proxyMetadata.ISTIO_META_DNS_CAPTURE to false is a known workaround for this problem. We are investigating in istio/ztunnel#1272. Right now, many users have reported it, but we have been unable to get our hands on the debugging info (or a reproduction) we need to solve it.

joke · 2024-08-27T14:13:32Z

I'm facing the same problem.

Unfortunately the workaround (ISTIO_META_DNS_CAPTURE=false) causes DNS resolution failures:

sts.eu-central-1.amazonaws.com on 172.20.0.10:53: read udp 100.64.109.34:57420->172.20.0.10:53: read: connection refused

howardjohn · 2024-08-27T15:36:01Z

We really need the information (netstat -ntlp; netstat -nlup; ip addr; ip link) from inside the failing pod's network namespace (not ztunnel's) to make progress here. Many users have reported this, but no one has provided this information, so we cannot really do anything to solve it.
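
One possible way to gather the four outputs requested above is sketched here in Python. It assumes root access on the node, that nsenter (util-linux), netstat, and ip are installed there, and that you already know the PID of some process in the failing pod; in_netns and collect are hypothetical helper names, and how you look up the PID is left open.

```python
import subprocess

# The four commands asked for in the thread.
DIAG_COMMANDS = [
    ["netstat", "-ntlp"],
    ["netstat", "-nlup"],
    ["ip", "addr"],
    ["ip", "link"],
]


def in_netns(pid: int, command: list[str]) -> list[str]:
    """Wrap a command so it runs in the network namespace of process `pid`."""
    return ["nsenter", "-t", str(pid), "-n", "--", *command]


def collect(pid: int) -> str:
    """Run every diagnostic command and return one combined text report."""
    parts = []
    for command in DIAG_COMMANDS:
        argv = in_netns(pid, command)
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        parts.append(f"$ {' '.join(argv)}\n{result.stdout}{result.stderr}")
    return "\n".join(parts)
```

Calling collect(pid) with the PID of a process in the failing workload pod (not ztunnel) returns text that can be pasted into the issue.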

I would be happy to even jump on a video call to walk someone through it. Please feel free to ping me on slack if you have this issue and are willing to troubleshoot.

> Unfortunately the workaround (ISTIO_META_DNS_CAPTURE=false) causes DNS resolution failures:

Make sure you don't have values.cni.ambient.dnsCapture=true set (which will cause the redirection to the dns server you just disabled).
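
To make the two settings from this thread easier to compare, here is a small Python sketch that expands the dotted paths exactly as written into nested values and prints them as JSON (which is also valid YAML). The expand helper is hypothetical. Writing the metadata value as the string "false" and setting dnsCapture explicitly to false (rather than leaving it unset) are assumptions, and the two paths have different roots, so they may belong to different charts or install methods.

```python
import json


def expand(dotted: dict[str, object]) -> dict:
    """Turn {"a.b.c": v} into {"a": {"b": {"c": v}}}, merging shared prefixes."""
    root: dict = {}
    for path, value in dotted.items():
        node = root
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return root


settings = {
    # Workaround from the thread; the string form of the value is an assumption.
    "defaults.meshConfig.defaultConfig.proxyMetadata.ISTIO_META_DNS_CAPTURE": "false",
    # Must not be true, or DNS is still redirected to the disabled server.
    "values.cni.ambient.dnsCapture": False,
}

print(json.dumps(expand(settings), indent=2))
```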

joke · 2024-08-28T15:36:24Z

@howardjohn contacted you via slack

howardjohn · 2024-08-28T17:59:56Z

Thank you @joke and @bleggett for your help on Slack. We have a fix ready in istio/ztunnel#1284. I was able to reproduce the issue both in a unit test and in a live cluster, and the fix resolves it in both.

One word of warning: the fix makes the retry after the failure succeed. You may still see the error message, but it should resolve itself (the bug was that it never resolved).
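
The behaviour described above (an error is logged, then a later retry succeeds) can be illustrated with a generic Python sketch. This is not ztunnel's code and does not reflect how istio/ztunnel#1284 is implemented; retry and make_flaky are hypothetical names, and the simulated failure simply reuses the error text from the report.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 5, delay: float = 0.0,
          log: Callable[[str], None] = print) -> T:
    """Call `operation` until it succeeds, logging each failure along the way."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            log(f"attempt {attempt} failed: {exc}")
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], str]:
    """Stand-in for a bind that fails `failures` times, then succeeds."""
    remaining = [failures]

    def bind() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OSError(99, "Cannot assign requested address")
        return "bound"

    return bind
```

With retry(make_flaky(2)), two failure messages are logged before the call returns "bound", which mirrors the point that seeing the error once is not by itself a sign that the fix is missing.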

I've slotted the fix to be cherry-picked to 1.23, so it should land in the upcoming 1.23.1 release.

As others have noted, one workaround for this problem in the meantime is to restart ztunnel or the istio-cni pod.

istio-policy-bot added the area/ambient and area/networking labels Aug 26, 2024

istio deleted a comment from MinecraftEarthVillage Aug 26, 2024

howardjohn added a commit to howardjohn/ztunnel that referenced this issue Aug 28, 2024

zds: fix retrying a bad netns

2f4c915

Fixes istio/istio#52858

howardjohn mentioned this issue Aug 28, 2024

zds: fix retrying a bad netns istio/ztunnel#1284

Merged

howardjohn added a commit to howardjohn/ztunnel that referenced this issue Aug 28, 2024

zds: fix retrying a bad netns

5e90298

Fixes istio/istio#52858

istio-testing closed this as completed in istio/ztunnel#1284 Aug 28, 2024

istio-testing closed this as completed in istio/ztunnel@0cb7516 Aug 28, 2024

istio-testing pushed a commit to istio-testing/ztunnel that referenced this issue Aug 28, 2024

zds: fix retrying a bad netns

3be4704

Fixes istio/istio#52858

howardjohn changed the title from "Ztunnel pod: failed to bind to address 15053" to "Ztunnel fails with 'failed to bind to address [::1]:15053: Cannot assign requested address'" Aug 28, 2024

istio-testing added a commit to istio/ztunnel that referenced this issue Aug 28, 2024

[release-1.23] zds: fix retrying a bad netns (#1285)

0382dc6

* zds: fix retrying a bad netns

Fixes istio/istio#52858

* Fix 1.23 changes

---------

Co-authored-by: John Howard <john.howard@solo.io>

r0bj mentioned this issue Sep 11, 2024

ztunnel fails to start with: "failed to bind to address [::1]:15053: Cannot assign requested address (os error 99)" #53087

Closed

2 tasks

antonioberben pushed a commit to antonioberben/ztunnel that referenced this issue Oct 1, 2024

zds: fix retrying a bad netns (istio#1284)

b5c6057

Fixes istio/istio#52858
