# Troubleshooting Istio Ambient
- Install and setup issues
  - Scenario: Ztunnel is not capturing my traffic
  - Scenario: pod fails to run with `Failed to create pod sandbox`
  - Scenario: Ztunnel fails with `failed to bind to address [::1]:15053: Cannot assign requested address`
  - Scenario: Ztunnel fails with `failed to bind to address [::1]:15053: Address family not supported`
- Ztunnel Traffic Issues
  - Understanding logs
  - Scenario: Traffic timeout with Ztunnel
  - Scenario: Readiness probes fail with Ztunnel
  - Scenario: traffic fails with `timed out waiting for workload from xds`
  - Scenario: traffic fails with `unknown source`
  - Scenario: traffic fails with `no healthy upstream`
  - Scenario: traffic fails with `http status: ...`
  - Scenario: traffic fails with `connection closed due to connection drain`
  - Scenario: ztunnel logs `HBONE ping timeout/error` and `ping timeout`
  - Scenario: ztunnel is not sending egress traffic to waypoints
- Waypoint issues
- Common information
Before doing anything else, please make sure you read and follow:

- the latest Platform Requirements
- the latest Platform-Specific Prerequisites guide for your provider and CNI

Failure to follow these guidelines will result in issues.
## Install and setup issues

### Scenario: Ztunnel is not capturing my traffic

Follow these steps to troubleshoot Ztunnel not capturing traffic.

First, check the pod for the `ambient.istio.io/redirection` annotation. This indicates whether `istio-cni` enabled redirection:
```
$ kubectl get pods shell-5b7cf9f6c4-npqgz -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ambient.istio.io/redirection: enabled
```
If the annotation is missing: the pod was not enrolled in the mesh.
- Check the logs of the `istio-cni-node` pod on the same node as the pod for errors. Errors during enablement may be blocking the pod from getting traffic from Ztunnel. (Example commands for these checks are sketched after this list.)
- Check the logs of the `istio-cni-node` pod on the same node to verify it has ambient enabled. The pod should log `AmbientEnabled: true` during startup. If this is `false`, ensure you properly installed Istio with `--set profile=ambient`.
- Check the pod is actually configured to have ambient enabled. The criteria is as follows:
  - The pod OR namespace must have the `istio.io/dataplane-mode=ambient` label set.
  - The pod must not have the `sidecar.istio.io/status` annotation set (which is added automatically when a sidecar is injected).
  - The pod must not have `istio.io/dataplane-mode=none` set.
  - The pod must not have `spec.hostNetwork=true`.
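A rough sketch of these checks, assuming the `istio-cni-node` pods carry the `k8s-app=istio-cni-node` label and live in `istio-system` (adjust to your installation):

```
# Find the node the workload pod runs on (pod name from the example above)
$ kubectl get pod shell-5b7cf9f6c4-npqgz -o jsonpath='{.spec.nodeName}'

# Find the istio-cni-node pod on that node and scan its logs
$ kubectl -n istio-system get pods -l k8s-app=istio-cni-node -o wide
$ kubectl -n istio-system logs <istio-cni-node-pod> | grep -i -e error -e ambient

# Verify the labels and annotations that control enrollment
$ kubectl get namespace default --show-labels
$ kubectl get pod shell-5b7cf9f6c4-npqgz --show-labels
```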
If the annotation is present: this means Istio claims it enabled redirection for the pod, but apparently it isn't working.
1. Check the iptables rules in the pod. Run a debug shell and run `iptables-save`. You should see something like below:
```
# iptables-save
# Generated by iptables-save v1.8.10 on Wed Sep 25 22:06:16 2024
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_PRERT - [0:0]
-A PREROUTING -j ISTIO_PRERT
-A OUTPUT -j ISTIO_OUTPUT
-A ISTIO_OUTPUT -d 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_OUTPUT -p tcp -m mark --mark 0x111/0xfff -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -p tcp -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15001
-A ISTIO_PRERT -s 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_PRERT ! -d 127.0.0.1/32 -p tcp -m tcp ! --dport 15008 -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15006
```
The exact contents may vary, but if there is anything relating to Istio here, it means iptables rules are installed.
2. Check if `ztunnel` is running within the pod network. This can be done with `netstat -ntl`. You should see listeners on a few Istio ports (15001, 15006, etc.):
```
# netstat -ntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 127.0.0.1:15053 0.0.0.0:* LISTEN
tcp6 0 0 ::1:15053 :::* LISTEN
tcp6 0 0 :::15001 :::* LISTEN
tcp6 0 0 :::15006 :::* LISTEN
tcp6 0 0 :::15008 :::* LISTEN
```
3. Check the logs of Ztunnel. When sending traffic, you should see logs like `info access connection complete ...`. Note that these are logged when connections are closed, not when they are opened, so you may not see logs for your application if it uses long-lived connections.
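For example, to find the Ztunnel pod on the same node and follow its access logs (a sketch, assuming the ztunnel DaemonSet uses the `app=ztunnel` label and runs in `istio-system`):

```
# List ztunnel pods with their nodes, then follow the one colocated with your workload
$ kubectl -n istio-system get pods -l app=ztunnel -o wide
$ kubectl -n istio-system logs -f ztunnel-cqg6c | grep "connection complete"
```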
### Scenario: pod fails to run with `Failed to create pod sandbox`

For pods in the mesh, Istio will run a CNI plugin during pod 'sandbox' creation. This configures the networking rules. This may intermittently fail, in which case Kubernetes will automatically retry.

This can fail for a few reasons:
- `no ztunnel connection`: this indicates that the CNI plugin is not connected to Ztunnel. Ensure Ztunnel is running on the same node and is healthy.
- `failed to add IP ... to ipset istio-inpod-probes: exist`: this indicates Istio attempted to add the workload's IP to the ipset, but it was already present. This can be caused by a race condition in Kubernetes IP allocation, in which case a retry can resolve the issue. On Istio 1.22.3 and older, there was a bug causing this to not recover; please upgrade if so. Other occurrences of this may be a bug.
### Scenario: Ztunnel fails with `failed to bind to address [::1]:15053: Cannot assign requested address`

This is fixed in Istio 1.23.1+, please upgrade. See issue.
### Scenario: Ztunnel fails with `failed to bind to address [::1]:15053: Address family not supported`

This indicates your kernel does not support IPv6. IPv6 support can be turned off by setting `IPV6_ENABLED=false` on Ztunnel.
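One way to apply this, as a sketch: set the variable on the ztunnel DaemonSet environment, the same way `RUST_LOG` is set in the Common information section below. This assumes your install does not manage the DaemonSet spec via Helm or an operator, which could overwrite the change:

```
# Disable IPv6 handling in Ztunnel by setting the environment variable on the DaemonSet
$ kubectl -n istio-system set env ds/ztunnel IPV6_ENABLED=false
```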
## Ztunnel Traffic Issues

### Understanding logs

When troubleshooting traffic issues, the first step should always be to analyze the access logs in Ztunnel. Note that there may be two Ztunnel pods involved in a request (the source and the destination), so it is useful to look at both sides.

By default, access logs are emitted on each connection completion. Connection opening logs are available at `debug` level (see how to set log level).

An example log looks like:

```
2024-09-25T22:08:30.213996Z info access connection complete src.addr=10.244.0.33:50676 src.workload="shell-5b7cf9f6c4-7hfkc" src.namespace="default" src.identity="spiffe://cluster.local/ns/default/sa/default" dst.addr=10.244.0.29:15008 dst.hbone_addr=10.96.99.218:80 dst.service="echo.default.svc.cluster.local" dst.workload="waypoint-66f44865c4-l7btm" dst.namespace="default" dst.identity="spiffe://cluster.local/ns/default/sa/waypoint" direction="outbound" bytes_sent=67 bytes_recv=518 duration="2ms"
```
- The `src`/`dst` `addr`, `workload`, `namespace`, and `identity` represent information about the source and destination of the traffic. Not all information will be available for all traffic:
  - `identity` will only be set when mTLS is used.
  - `dst.namespace` and `dst.workload` will not be present when traffic is sent to an unknown destination (passthrough traffic).
- `dst.service` represents the destination service, if the call was to a service. This is not always the case, as an application can reach a `Pod` directly.
- `dst.hbone_addr` is set when using mTLS. In this case, `hbone_addr` represents the target of the traffic, while `dst.addr` represents the actual address we connected to (for the tunnel).
- `bytes_sent` and `bytes_recv` indicate how many bytes were transferred during the connection.
- `duration` indicates how long the connection was open.
- `error`, if present, indicates the connection had an error, and why.
In the above log, you can see that while the `dst.service` is `echo`, the `dst.workload` (and `dst.addr`) are for `waypoint-...`.
This implies the traffic was sent to a waypoint proxy.
### Scenario: Traffic timeout with Ztunnel

Traffic is blocked, showing logs with errors like below:

```
error access connection complete direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="io error: deadline has elapsed"
error access connection complete direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="connection timed out, maybe a NetworkPolicy is blocking HBONE port 15008: deadline has elapsed"
```
- For the `connection timed out` error, this means the connection could not be established. This may be due to networking issues reaching the destination. A very common cause (hence the log) is a NetworkPolicy or other firewall rule blocking port `15008`. Istio mTLS traffic is tunneled over port `15008`, so this must be allowed (on both ingress and egress). A sketch of such a policy is shown after this list.
- For more generic errors like `io error: deadline has elapsed`, the root causes are generally the same as above. However, if traffic works without ambient, it is unlikely to be a typical firewall rule, as the traffic should be sent identically to how it is sent without ambient enabled. This more likely indicates an incompatibility with your Kubernetes setup.
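If a NetworkPolicy is the culprit, an additional policy explicitly allowing port 15008 can unblock HBONE. A minimal sketch (the policy name is hypothetical; NetworkPolicies are additive, so in a namespace that already has ingress policies this simply allows 15008 alongside your existing rules; a matching egress rule may also be needed if you restrict egress):

```
$ kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-hbone          # hypothetical name
  namespace: default         # namespace of the destination workloads
spec:
  podSelector: {}            # select all pods in the namespace
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP
      port: 15008            # HBONE tunnel port used for Istio mTLS traffic
EOF
```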
### Scenario: Readiness probes fail with Ztunnel

After enabling ambient mode, pod readiness probes fail. For example, you may see something like below:

```
Warning Unhealthy 92s (x6 over 4m2s) kubelet Readiness probe failed: Get "http://1.1.1.1:8080/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```
Ambient mode intends to not capture or impact any readiness probe traffic. It does this by applying a SNAT rule on the host to rewrite any traffic from the `kubelet` as coming from `169.254.7.127`, and then skipping redirection for any traffic matching this pattern (a way to confirm this rewrite is sketched after the list below).

Readiness probe failures that start when enabling ambient typically indicate an environmental issue with this traffic rewrite.
For instance:

- Cilium with `bpf.masquerade=true` breaks this (platform prerequisites guide, issue).
- Calico, before 3.29, with `bpfEnabled` set, breaks this (issue).
- AWS Security Groups may block this traffic (issue).
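To confirm whether the rewrite is in effect, you can watch probe traffic from inside the pod using a privileged debug shell (see Common information below). This is only a sketch and assumes `tcpdump` is available in the debug image:

```
# Attach a privileged debug container to the failing pod, then inside it:
# probe traffic rewritten by the host SNAT rule should arrive from 169.254.7.127
$ kubectl debug --image istio/base --profile sysadmin --attach -t -i <pod-name>
# tcpdump -ni any src host 169.254.7.127
```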
### Scenario: traffic fails with `timed out waiting for workload from xds`

When traffic is sent from a pod, Ztunnel must first get information about the pod from Istiod (over the XDS protocol). If it fails to do so within 5s, it will reject the connection with this error.

Istiod is generally expected to return information well within 5s. If this error happens intermittently, it may indicate that is not happening. This could be caused by `istiod` being overloaded, or by modifications that increase `PILOT_DEBOUNCE_AFTER` (which can slow down updates).

If the issue happens persistently, it is likely a bug; please file an issue.
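A couple of quick checks, as a sketch (`kubectl top` requires metrics-server; `app=istiod` is the default Istiod label):

```
# Is istiod under heavy load?
$ kubectl -n istio-system top pods -l app=istiod

# Has PILOT_DEBOUNCE_AFTER been raised from its default?
$ kubectl -n istio-system get deploy istiod -o yaml | grep -A1 PILOT_DEBOUNCE_AFTER
```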
### Scenario: traffic fails with `unknown source`

This indicates Ztunnel was unable to identify the source of traffic. In Istio 1.23, Ztunnel would attempt to map the source IP of traffic to a known workload. If the workload has multiple network interfaces, this may prevent Ztunnel from making this association.

Istio 1.24+ does not require this mapping.
### Scenario: traffic fails with `no healthy upstream`

This indicates traffic to a Service had no applicable backends. We can see how Ztunnel views the Service's health:
```
$ istioctl zc services
NAMESPACE SERVICE NAME SERVICE VIP WAYPOINT ENDPOINTS
default echo 10.96.99.1 None 3/4
```
This indicates there are 4 endpoints for the service, but 1 was unhealthy.
Next we can look at how Kubernetes views the service:
```
$ kubectl get endpointslices
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
echo-v76p9 IPv4 8080 10.244.0.20,10.244.0.36 + 1 more... 7h50m
```
Here we also see 3 endpoints.
If Kubernetes shows zero healthy endpoints, it indicates there is not an issue in the Istio setup, but rather that the service is actually unhealthy. Check that its selector matches the expected workloads, and that those pods are marked as "ready".
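For example, to cross-check the selector against pod readiness (a sketch using the `echo` service from above; the `app=echo` selector is hypothetical):

```
# What does the service select?
$ kubectl get svc echo -o jsonpath='{.spec.selector}'

# Are the selected pods Ready? (substitute the selector printed above)
$ kubectl get pods -l app=echo -o wide
```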
If this is seen for the `kubernetes` service, this may be fixed in Istio 1.23+ and Istio 1.22.3+.

If this is seen for `hostNetwork` pods, or other scenarios where multiple workloads have the same IP address, this may be fixed in Istio 1.24+.
### Scenario: traffic fails with `http status: ...`

Ztunnel acts as a TCP proxy and does not parse users' HTTP traffic at all, so it may be confusing that Ztunnel reports an HTTP error. This is the result of the tunneling protocol ("HBONE") Ztunnel uses, which runs over HTTP CONNECT. An error like this indicates Ztunnel was able to establish an HBONE connection, but the stream was rejected.

When communicating with another Ztunnel, this may be caused by various issues:
- `400 Bad Request`: the request was entirely invalid; this may indicate a bug.
- `401 Unauthorized`: the request was rejected by AuthorizationPolicy rules.
- `503 Service Unavailable`: the destination is not available.
When communicating with a waypoint proxy (Envoy), there is a wider range of possible response codes. `401` for AuthorizationPolicy rejection and `503` as a general catch-all are common.
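For the `401` case in particular, reviewing the policies in scope can help (a sketch):

```
# List AuthorizationPolicies that could apply to the destination
$ kubectl get authorizationpolicies -A
$ kubectl get authorizationpolicies -n default -o yaml
```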
### Scenario: traffic fails with `connection closed due to connection drain`

When Ztunnel shuts down an instance of a proxy, it will close any outstanding connections. This will be preceded by a log like `inpod::statemanager pod delete request, shutting down proxy` for the pod.
This can happen:
- If the Pod is actually deleted. In this case, the connections are generally already closed, though.
- If Ztunnel itself is shutting down.
- If the pod was un-enrolled from ambient mode.
### Scenario: ztunnel logs `HBONE ping timeout/error` and `ping timeout`

These logs can be ignored. They are removed in Istio 1.23.1+. See issue for details.
### Scenario: ztunnel is not sending egress traffic to waypoints

Consider a `ServiceEntry` like:
```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: example.com
  labels:
    istio.io/use-waypoint: my-waypoint
spec:
  hosts:
  - example.com
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
```
Unlike a typical `Service`, this will not necessarily have the two components needed for traffic capture to work:

- It will not have a stable Service IP address known to Istio (`example.com` may have many, changing IPs).
- We do not have DNS set up to return such a stable IP address, even if one did exist.
Istio has two features to resolve these:

- `values.pilot.env.PILOT_ENABLE_IP_AUTOALLOCATE=true` enables a controller that will allocate an IP address for the `ServiceEntry` and write it into the object. You can view it in the ServiceEntry itself:

  ```yaml
  status:
    addresses:
    - host: example.com
      value: 240.240.0.3
    - host: example.com
      value: 2001:2::3
  ```

- `values.cni.ambient.dnsCapture=true` will enable Ztunnel to handle DNS, which allows it to respond with the above IP addresses to queries for `example.com`. Note you will need to restart workloads after changing this setting.
Together, these enable egress traffic to traverse a waypoint. To troubleshoot this (example commands are sketched below):

- Ensure the `ServiceEntry` has an IP address in the status.
- Check that the pod is getting this IP address in DNS lookups.
- Check whether this IP shows up as the destination IP address in Ztunnel.
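These checks might look like the following (a sketch; the `shell` client deployment, the use of `nslookup`, and the specific address are illustrative):

```
# 1. Does the ServiceEntry have auto-allocated addresses in its status?
$ kubectl get serviceentry example.com -o jsonpath='{.status.addresses}'

# 2. Does a client pod resolve example.com to one of those addresses?
$ kubectl exec deploy/shell -- nslookup example.com

# 3. Does that IP appear as the destination in the Ztunnel access logs?
$ kubectl -n istio-system logs ds/ztunnel | grep 240.240.0.3
```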
## Waypoint issues

First, we will want to see some signs that indicate traffic is traversing a waypoint:

- Requests sent to the waypoint will generally go through Envoy's HTTP processing, which will mutate the request. For example, by default headers will be translated to lowercase, and a few Envoy headers are injected:

  ```
  x-envoy-upstream-service-time: 2
  server: istio-envoy
  x-envoy-decorator-operation: echo.default.svc.cluster.local:80/*
  ```

  Note this is not always the case, as traffic may be treated as plain TCP.

- Waypoint access logs, if enabled, will log each request. See here to enable access logs.
- Ztunnel access logs, if enabled, will log each request. See here for an example log to a waypoint.
Traffic can be sent to a service or directly to a workload. While sending to a service is typical, see the Ztunnel access logs to identify the type of traffic. Similarly, a waypoint can be associated with a service, a workload, or both. Mismatches between these can cause the waypoint to not be utilized.

Tip: Cilium with `bpf-lb-sock` requires `bpf-lb-sock-hostns-only` to be set, or all traffic will be incorrectly treated as direct-to-workload traffic (issue).
Next, we can check if Ztunnel is configured to send to a waypoint:
```
$ istioctl zc services
NAMESPACE SERVICE NAME SERVICE VIP WAYPOINT ENDPOINTS
default echo 10.96.0.1 waypoint 1/1
default no-waypoint 10.96.0.2 None 1/1

$ istioctl zc workloads
NAMESPACE POD NAME ADDRESS NODE WAYPOINT PROTOCOL
default echo-79dcbf57cc-l2cdp 10.244.0.1 node None HBONE
default product-59896bc9f7-kp4lb 10.244.0.2 node waypoint HBONE
```
This indicates the `echo` Service and the `product-59896bc9f7-kp4lb` Pod are bound to the `waypoint`.
If Ztunnel is configured to use the waypoint for the destination but traffic isn't going to the waypoint, it is likely traffic is actually going to the wrong destination.
Check the ztunnel access logs to verify the destination service/workload and ensure it matches.
If `None` is found there, Ztunnel isn't programmed to use the waypoint.
1. Check the status on the object. This should give an indication whether it was attached to the waypoint or not. (Note: this is available in 1.24+, and currently only on `Service` and `ServiceEntry`.)
```
$ kubectl get svc echo -oyaml
status:
  conditions:
  - lastTransitionTime: "2024-09-25T19:28:16Z"
    message: Successfully attached to waypoint default/waypoint
    reason: WaypointAccepted
    status: "True"
    type: istio.io/WaypointBound
```
2. Check what resources have been configured to use a waypoint:
```
$ kubectl get namespaces -L istio.io/use-waypoint
NAME STATUS AGE USE-WAYPOINT
namespace/default Active 1h waypoint
namespace/istio-system Active 1h
```
You will want to look at `namespaces` in all cases, `services` and `serviceentries` for service cases, and `pods` and `workloadentries` for workload cases. This label must be set to associate a resource with a waypoint. Analogous checks for the other resource types are sketched below.
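The analogous checks for the other resource types might look like this (a sketch):

```
$ kubectl get services,serviceentries -L istio.io/use-waypoint
$ kubectl get pods,workloadentries -L istio.io/use-waypoint
```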
3. If the label is present, this may be caused by the waypoint being missing or unhealthy. Check the `Gateway` objects and ensure the waypoint is deployed:
```
$ kubectl get gateways.gateway.networking.k8s.io
NAME CLASS ADDRESS PROGRAMMED AGE
waypoint istio-waypoint False 17s
```
The above shows an example of a waypoint that is deployed but not healthy. A waypoint will not be enabled until it becomes healthy at least once. If it is not healthy, check its status for more information.
If the `Gateway` isn't present at all, deploy one!
## Common information

Most pods have low privileges and few debug tools available. For some diagnostics, it is helpful to run an ephemeral container with elevated privileges and utilities. The `istio/base` image can be used for this, along with `kubectl debug --profile sysadmin`.
For example:

```
$ kubectl debug --image istio/base --profile sysadmin --attach -t -i shell-5b7cf9f6c4-npqgz
```
To view the current log level, run:
```
$ istioctl zc log ztunnel-cqg6c
ztunnel-cqg6c.istio-system:
current log level is info
```
To set the log level:
```
$ istioctl zc log ztunnel-cqg6c --level=info,access=debug
ztunnel-cqg6c.istio-system:
current log level is hickory_server::server::server_future=off,access=debug,info
```
To set the log level at Ztunnel pod startup, configure the environment variable:

```
$ kubectl -n istio-system set env ds/ztunnel RUST_LOG=info
```