Troubleshooting Istio Ambient

Troubleshooting Ambient

This document provides troubleshooting steps for common problems users have encountered when using ambient mode. Many of these failure scenarios are former bugs that have been fixed; where applicable, the section will indicate which version the issue is fixed in.

At the time of writing, this document covers Istio 1.23+. In the future, items impacting only older versions will be removed. It is strongly encourage to run on at least Istio 1.24, which is when ambient mesh went "GA". Despite this, reproducing the issue on the latest version is always a good first diagnostic step.

Install and setup issues

Before doing anything else please make sure you read and follow

Failure to follow these guidelines will result in issues.

Scenario: Ztunnel is not capturing my traffic

Follow these steps to troubleshoot Ztunnel not capturing traffic.

First, check the pod for the ambient.istio.io/redirection. This indicates istio-cni enabled redirection.

$ kubectl get pods shell-5b7cf9f6c4-npqgz -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ambient.istio.io/redirection: enabled

If the annotation is missing: the pod was not enrolled in the mesh.

Check the logs of the istio-cni-node pod on the same node as the pod for errors. Errors during enablement may be blocking the pod from getting traffic from Ztunnel.
Check the logs of the istio-cni-pod on the same node to verify it has ambient enabled. The pod should log AmbientEnabled: true during startup. If this is false, ensure you properly installed Istio with --set profile=ambient.
Check the pod is actually configured to have ambient enabled. The criteria is as follows:

The pod OR namespace must have the istio.io/dataplane-mode=ambient label set
The pod must not have the sidecar.istio.io/status annotation set (which is added automatically when a sidecar is injected)
The pod must not have istio.io/dataplane-mode=none set.
The pod must not have spec.hostNetwork=true

If the annotation is present: this means Istio claims it enabled redirection for the pod, but apparently it isn't working.

Check the iptables rules in the pod. Run a debug shell and run iptables-save. You should see something like below:

# iptables-save
# Generated by iptables-save v1.8.10 on Wed Sep 25 22:06:16 2024
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_PRERT - [0:0]
-A PREROUTING -j ISTIO_PRERT
-A OUTPUT -j ISTIO_OUTPUT
-A ISTIO_OUTPUT -d 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_OUTPUT -p tcp -m mark --mark 0x111/0xfff -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -p tcp -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15001
-A ISTIO_PRERT -s 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_PRERT ! -d 127.0.0.1/32 -p tcp -m tcp ! --dport 15008 -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15006

The exact contents may vary, but if there is anything relating to Istio here, it means iptables rules are installed. 2. Check if ztunnel is running within the pod network. This can be done with netstat -ntl. You should see listeners on a few Istio ports (15001, 15006, etc):

# netstat -ntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:15053         0.0.0.0:*               LISTEN
tcp6       0      0 ::1:15053               :::*                    LISTEN
tcp6       0      0 :::15001                :::*                    LISTEN
tcp6       0      0 :::15006                :::*                    LISTEN
tcp6       0      0 :::15008                :::*                    LISTEN

Check the logs of Ztunnel. When sending traffic, you should see logs like info access connection complete .... Note that these are logged when connections are closed, not when they are opened, so you may not see logs for your application if they use long-lived connections.

Scenario: pod fails to run with `Failed to create pod sandbox`

For pods in the mesh, Istio will run a CNI plugin during the pod 'sandbox' creation. This configures the networking rules. This may intermittently fail, in which case Kubernetes will automatically retry.

This can fail for a few reasons:

no ztunnel connection: this indicates that the CNI plugin is not connected to Ztunnel. Ensure Ztunnel is running on the same node and is healthy.
failed to add IP ... to ipset istio-inpod-probes: exist: this indicates Istio attempted to add the workload. This can be caused by a race condition in the Kubernetes IP allocation, in which a retry can resolve the issue. On Istio 1.22.3 and older, there was a bug causing this to not recover; please upgrade if so. Other occurrences of this may be a bug.

Scenario: Ztunnel fails with `failed to bind to address [::1]:15053: Cannot assign requested address`

This is fixed in Istio 1.23.1+, please upgrade. See issue.

Scenario: Ztunnel fails with `failed to bind to address [::1]:15053: Address family not supported`

This indicates your kernel does not support IPv6. IPv6 support can be turned off by setting IPV6_ENABLED=false on Ztunnel.

Ztunnel Traffic Issues

Understanding logs

When troubleshooting traffic issues, the first step should always be to analyze the access logs in Ztunnel. Note that there may be two Ztunnel pods involved in a request (the source and destination), so its useful to look at both sides.

Access logs by default log on each connection completion. Connection opening logs are available at debug level (see how to set log level).

An example log looks like:

2024-09-25T22:08:30.213996Z     info    access  connection complete     src.addr=10.244.0.33:50676 src.workload="shell-5b7cf9f6c4-7hfkc" src.namespace="default" src.identity="spiffe://cluster.local/ns/default/sa/default" dst.addr=10.244.0.29:15008 dst.hbone_addr=10.96.99.218:80 dst.service="echo.default.svc.cluster.local" dst.workload="waypoint-66f44865c4-l7btm" dst.namespace="default" dst.identity="spiffe://cluster.local/ns/default/sa/waypoint" direction="outbound" bytes_sent=67 bytes_recv=518 duration="2ms"

The src/dst addr, workload, namespace, and identity represent the information about the source and destination of the traffic. Not all information will be available for all traffic:
- identity will only be set when mTLS is used.
- dst.namespace and dst.workload will not be present when traffic is sent to an unknown destination (passthrough traffic)
dst.service represents the destination service, if the call was to a service. This is not always the case, as an application can reach a Pod directly.
dst.hbone_addr is set when using mTLS. In this case, hbone_addr represents the target of the traffic, while dst.addr represents the actual address we connected to (for the tunnel).
bytes_sent and bytes_recv indicate how many bytes were transferred during the connection.
duration indicates how long the connection was open
error, if present, indicates the connection had an error, and why.

In the above log, you can see that while the dst.service is echo, the dst.workload (and dst.addr) are for waypoint-.... This implies the traffic was sent to a waypoint proxy.

Scenario: Traffic timeout with Ztunnel

Traffic is blocked, showing a log, with errors like below:

error   access  connection complete      direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="io error: deadline has elapsed"
error   access  connection complete      direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="connection timed out, maybe a NetworkPolicy is blocking HBONE port 15008: deadline has elapsed"

For the connection timed out error, this means the connection could not be established. This may be due to networking issues reaching the destination. A very common cause (hence the log) is to have a NetworkPolicy or other firewall rule blocking port 15008. Istio mTLS traffic is tunneled over port 15008, so this must be enabled (both on ingress and egress).
For the more generic errors like io error: deadline has elapsed, this generally is the same root causes as above. However, if traffic works without ambient, it is unlikely to be a typical firewall rule, as the traffic should be sent identically as without ambient enabled. This likely indicates an incompatibility with your Kubernetes setup.

Scenario: Readiness probes fail with Ztunnel

After enabling ambient mode, pod readiness probe fails. For example, you may see something like below:

  Warning  Unhealthy               92s (x6 over 4m2s)   kubelet                  Readiness probe failed: Get "http://1.1.1.1:8080/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Ambient mode intends to not capture or impact any readiness probe traffic. It does this by applying a SNAT rule in the host, to rewrite any traffic from kubelet as coming from 169.254.7.127, then skipping redirection for any traffic matching this pattern.

Readiness probe failures that start when enabling ambient typically indicate an environmental issue with this traffic rewrite.

For instance:

Cilium with bpf.masquerade=true breaks this (platform prequisites guide, issue)
Calico, before 3.29, with bpfEnabled set, breaks this (issue)
AWS Security Groups may block this traffic (issue)

Scenario: traffic fails with `timed out waiting for workload from xds`

When traffic is sent from a pod, Ztunnel must first get information about the pod from Istiod (over the XDS protocol). If it fails to do so after 5s, it will reject the connection with this error.

Istiod is generally expected to return information substantially sooner than with 5s. If this error happens intermittently, however, it may indicate this is not happening. This could be caused by istiod being overloaded, or possible modifications that increase PILOT_DEBOUNCE_AFTER (which can slow down updates).

If the issue persistently happens, it is likely a bug; please file an issue.

Warning: prior to Istio 1.24, there were a few bugs that could trigger this unrelated to timing issues.

Scenario: traffic fails with `unknown source`

This indicates Ztunnel was unable to identify the source of traffic. In Istio 1.23, Ztunnel would attempt to map the source IP of traffic to a known workload. If the workload has multiple network interfaces, this may prevent Ztunnel from making this association.

Istio 1.24+ does not require this mapping.

Scenario: traffic fails with `no healthy upstream`

This indicates traffic to a Service had no applicable backends.

We can see how Ztunnel views the Service's health:

$ istioctl zc services
NAMESPACE    SERVICE NAME         SERVICE VIP    WAYPOINT ENDPOINTS
default      echo                 10.96.99.1     None     3/4

This indicates there are 4 endpoints for the service, but 1 was unhealthy.

Next we can look at how Kubernetes views the service:

$ kubectl get endpointslices
NAME          ADDRESSTYPE   PORTS    ENDPOINTS                           AGE
echo-v76p9    IPv4          8080     10.244.0.20,10.244.0.36 + 1 more... 7h50m

Here we also see 3 endpoints.

If Kubernetes shows zero healthy endpoints, it indicates there is not an issue in the Istio setup, but rather the service is actually unhealthy. Check to ensure its' labels select the expected workloads, and that those pods are marked as "ready".

If this is seen for the kubernetes service, this may be fixed in Istio 1.23+ and Istio 1.22.3+.

If this is seen for hostNetwork pods, or other scenarios where multiple workloads have the same IP address, this may be fixed in Istio 1.24+.

Scenario: traffic fails with `http status: ...`

Ztunnel acts as a TCP proxy, and does not parse users HTTP traffic at all. So it may be confusing that Ztunnel reports an HTTP error.

This is the result of the tunneling protocol ("HBONE") ztunnel uses, which is over HTTP CONNECT. An error like this indicates ztunnel was able to establish an HBONE connection, but the stream was rejected.

When communicating to another Ztunnel, this may be caused by various issues:

400 Bad Request: the request was entirely invalid; this may indicate a bug
401 Unauthorized: request was rejected by AuthorizationPolicy rules
503 Service Unavailable: the destination is not available

When communicating with a waypoint proxy (Envoy), there is a wider range of response codes possible. 401 for AuthorizationPolicy rejection and 503 as a general catch-all are common.

Scenario: traffic fails with `connection closed due to connection drain`

When Ztunnel shuts down an instance of a proxy, it will close any outstanding connections. This will be preceded with a log like inpod::statemanager pod delete request, shutting down proxy for the pod.

This can happen:

If the Pod is actually deleted. In this case, the connections are generally already closed, though.
If Ztunnel itself is shutting down.
If the pod was un-enrolled from ambient mode.

See this blog post for more information.

Scenario: ztunnel logs `HBONE ping timeout/error` and `ping timeout`

(Fixed in Istio 1.23.1+)

These logs can be ignored. See issue for details.

Scenario: ztunnel is not sending egress traffic to waypoints

Consider a ServiceEntry like:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: example.com
  labels:
    istio.io/use-waypoint: my-waypoint
spec:
  hosts:
  - example.com
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS

Unlike a typical Service, this will not necesarily have two components needed for traffic capture to work:

It will not have a stable Service IP address known to Istio (example.com may have many, changing, IPs).
We do not have DNS setup to return such a stable IP address, if one did exist.

Istio handles has two features to resolve these:

values.pilot.env.PILOT_ENABLE_IP_AUTOALLOCATE=true (default in Istio 1.25+) enables a controller that will allocate an IP address for the ServiceEntry and write it into the object. You can view it in the ServiceEntry itself:
```
status:
  addresses:
  - host: example.com
    value: 240.240.0.3
  - host: example.com
    value: 2001:2::3
```
values.cni.ambient.dnsCapture=true (default in Istio 1.25+) will enable the Ztunnel to handle DNS, which allows it to respond with the above IP addresses in response to a query to example.com. Note you will need to restart workloads after changing this setting.

Together, this enables egress traffic to traverse a waypoint. To troubleshoot this:

Ensure the ServiceEntry has an IP address in the status.
Check the pod is getting this IP address in DNS lookups.
Check whether this IP shows up as the destination IP address in Ztunnel.

Scenario: traffic from sidecars/gateways fails with `upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end`

(Fixed in Istio 1.25)

This is caused by having a DestinationRule with TLS mode ISTIO_MUTUAL configured. Prior to Istio 1.25, a bug prevented this configuration from working. If you see this error, please upgrade or remove the DestinationRule configuration. Note that Istio will automatically use mTLS when possible, even with any DestinationRule configuration.

Scenario: traffic fails with `Connection timed out`

Ztunnel traffic may show an error with the following symptoms: bytes_sent=5804 bytes_recv=14441 duration="5194336ms" error="io error: Connection timed out (os error 110)". Note the bytes_sent and bytes_recv are non-zero, and the connection is long-lived.

The error message here is slightly misleading: this means its not the connection handshake that has timed out, but some other aspect of the connection.

There are a few possible areas that this can occur on:

TCP Keepalive timeouts. As of Istio 1.24, Ztunnel will enable keepalives on connections by default. If these fail repeatedly, the connection will be terminated with this error.
TCP retransmission timeouts. The kernel will attempt to retransmit TCP messages, and eventually timeout and close the connection.

There are two useful tools in understanding these timeouts:

ss -nto, the -o flag enables a timers column. You may see an output like so:
```
State       Local Address:Port       Peer Address:Port
ESTAB       10.0.0.2:39130           10.0.0.3:8080      timer:(keepalive,15sec,0)
ESTAB       10.0.0.2:39140           10.0.0.3:8080
FIN-WAIT-1  10.0.0.2:52554           10.0.0.4:8080      timer:(on,1min38sec,12)
```
The keepalive timer is from TCP Keepalives added by Istio to the connection. 15sec means the next keepalive will be sent in 15 seconds, and the 0 indicates we are on attempt 0 (which means we got a response to our most recent keepalive probe). Notably, there are two connections: one from the application to ztunnel, and one from ztunnel to the destination. These both appear the same in the output. In the above example, the application does not utilize keepalives, so only 1 of 2 of the connections have a keepalive timer.

The on timer for the FIN-WAIT-1 is a TCP retransmission timer.
nstat gives detailed counters of low-level networking events from the kernel. By default, the command will emit stats between the current and previous call, but this can be configured with -a (show everything) and -r (don't reset for the next call).

This tool can be useful to understand kernel level TCP errors that are otherwise not particularly visible.

A common cause of the Connection timed out error is caused when the connection is not gracefully terminated properly, and attempts to do so are blocked by the server.

This is described in the Kernel Documentation. Reproducing this in an ambient mesh can produce the following counters:

$ nstat
#kernel
IpInReceives                    2                0.0
IpInDelivers                    2                0.0
IpOutRequests                   2                0.0
IpOutTransmits                  2                0.0
TcpInSegs                       2                0.0
TcpOutSegs                      2                0.0
TcpOutRsts                      1                 0.0
TcpExtTCPTimeouts               1                 0.0
TcpExtTCPAbortOnTimeout         1                 0.0
TcpExtTCPOrigDataSent           1                 0.0

Along with the ztunnel error bytes_sent=5804 bytes_recv=14441 duration="5194336ms" error="io error: Connection timed out (os error 110)". Note it is specifically the TcpExtTCPAbortOnTimeout that ultimately triggers a ETIMEDOUT to be returned on the connection.

Waypoint issues

Scenario: traffic is not going through the waypoint

First, we will want to see some signs that indicate traffic is traversing a waypoint:

Requests sent to the waypoint will generally go through Envoy's HTTP processing, which will mutate the request. For example, by default headers will be translated to lowercase and a few Envoy headers are injected:

x-envoy-upstream-service-time: 2
server: istio-envoy
x-envoy-decorator-operation: echo.default.svc.cluster.local:80/*

Note this is not always the case, as traffic may be set as TCP.

Waypoint access logs, if enabled, will log each request. See here to enable access logs.
Ztunnel access logs, if enabled, will log each request. See here for an example log to a waypoint.

Traffic can be sent to a service or directly to a workload. While sending to a service is typical, see the ztunnel access logs to identify the type of traffic. Similarly, a waypoint can be associated with a service, a workload, or both. Mismatches between these can cause the waypoint to not be utilized.

Tip

Cilium with bpf-lb-sock requires bpf-lb-sock-hostns-only to be set, or all traffic will be incorrectly treated as direct-to-workload traffic. (issue).

Next, we can check if Ztunnel is configured to send to a waypoint:

$ istioctl zc services
NAMESPACE    SERVICE NAME         SERVICE VIP  WAYPOINT ENDPOINTS
default      echo                 10.96.0.1    waypoint 1/1
default      no-waypoint          10.96.0.2    None     1/1
$ istioctl zc workloads
NAMESPACE  POD NAME                     ADDRESS     NODE  WAYPOINT     PROTOCOL
default    echo-79dcbf57cc-l2cdp        10.244.0.1  node  None         HBONE
default    product-59896bc9f7-kp4lb     10.244.0.2  node  waypoint     HBONE

This indicates the echo Service, and the product-59896bc9f7-kp4lb Pod are bound to the waypoint. If Ztunnel is configured to use the waypoint for the destination but traffic isn't going to the waypoint, it is likely traffic is actually going to the wrong destination. Check the ztunnel access logs to verify the destination service/workload and ensure it matches.

If there None is found, Ztunnel isn't programmed to use the waypoint.

Check the status on the object. This should give an indication whether it was attached to the waypoint or not. (Note: this is available in 1.24+, and currently only on Service and ServiceEntry)

$ kubectl get svc echo -oyaml
status:
  conditions:
  - lastTransitionTime: "2024-09-25T19:28:16Z"
    message: Successfully attached to waypoint default/waypoint
    reason: WaypointAccepted
    status: "True"
    type: istio.io/WaypointBound

Check what resources have been configure to use a waypoint:

$ kubectl get namespaces -L istio.io/use-waypoint
NAME                           STATUS   AGE    USE-WAYPOINT
namespace/default              Active   1h     waypoint
namespace/istio-system         Active   1h

You will want to look at namespaces in all cases, services and serviceentries for service cases, and pods and workloadentries for workload cases.

This label must be set to associate a resource with a waypoint 3. If the label is present, this may be cause by the waypoint being missing or unhealthy. Check the Gateway objects and ensure the waypoint is deployed.

$ kubectl get gateways.gateway.networking.k8s.io
NAME       CLASS            ADDRESS   PROGRAMMED   AGE
waypoint   istio-waypoint             False        17s

Above shows an example of a waypoint that is deployed, but is not healthy. A waypoint will not be enabled until it becomes healthy at least once. If it is not healthy, check the status for more information.

If the Gateway isn't present at all, deploy one!

Scenario: waypoint fails to startup `cannot bind '0.0.0.0:15008': Address already in use`

When deploying a waypoint, it is fully enabled for usage in the mesh without needing to enable ztunnel for it. This is done by setting the istio.io/dataplane-mode: none label on the pods, which is automatically handled for you.

However, if you explicitly override the waypoint to istio.io/dataplane-mode: ambient it will attempt to add ztunnel to the waypoint pod, meaning there will be two components trying to handle mesh communications, which will conflict with each other.

When deploying a waypoint, you should not set istio.io/dataplane-mode.

Common information

Running a debug shell

Most pods have low privileges and few debug tools available. For some diagnostics its helpful to run an ephemeral container with elevated privileges and utilities. The istio/base image can be used for this, along with kubectl debug --profile sysadmin.

For example:

$ kubectl debug --image istio/base --profile sysadmin --attach -t -i shell-5b7cf9f6c4-npqgz

Setting log level

To view the current log level, run:

$ istioctl zc log ztunnel-cqg6c
ztunnel-cqg6c.istio-system:
current log level is info

To set the log level:

$ istioctl zc log ztunnel-cqg6c --level=info,access=debug
ztunnel-cqg6c.istio-system:
current log level is hickory_server::server::server_future=off,access=debug,info

To set at Ztunnel pod startup, configure the environment variable:

$ kubectl  -n istio-system  set env ds/ztunnel RUST_LOG=info

Troubleshooting Istio Ambient

Troubleshooting Ambient

Install and setup issues

Scenario: Ztunnel is not capturing my traffic

Scenario: pod fails to run with Failed to create pod sandbox

Scenario: Ztunnel fails with failed to bind to address [::1]:15053: Cannot assign requested address

Scenario: Ztunnel fails with failed to bind to address [::1]:15053: Address family not supported

Ztunnel Traffic Issues

Understanding logs

Scenario: Traffic timeout with Ztunnel

Scenario: Readiness probes fail with Ztunnel

Scenario: traffic fails with timed out waiting for workload from xds

Scenario: traffic fails with unknown source

Scenario: traffic fails with no healthy upstream

Scenario: traffic fails with http status: ...

Scenario: traffic fails with connection closed due to connection drain

Scenario: ztunnel logs HBONE ping timeout/error and ping timeout

Scenario: ztunnel is not sending egress traffic to waypoints

Scenario: traffic from sidecars/gateways fails with upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:TLS_error_end

Scenario: traffic fails with Connection timed out

Waypoint issues

Scenario: traffic is not going through the waypoint

Scenario: waypoint fails to startup cannot bind '0.0.0.0:15008': Address already in use

Common information

Running a debug shell

Setting log level

Troubleshooting

Dev Environment

Writing Code

Pull Requests

Testing

Performance

Releases

Misc

Central Istiod

Pilot

Telemetry

Clone this wiki locally