iptables kube-proxy could handle UDP backend changes better #19029

Closed · joshk0 opened this issue Dec 22, 2015 · 12 comments · Fixed by #22573

Labels: kind/bug, sig/network

joshk0 commented Dec 22, 2015

I encountered a strange issue after opting in to kube-proxy's iptables support.

Steps to repro (I think):

  1. Create a Service for a UDP port (say 8125; in my case, this was statsd).
  2. Create Pods that back this Service.
  3. Start a client Pod that uses the Service, e.g. via an environment variable that specifies the Service VIP. This client writes a UDP packet to the Service every 10 seconds over the same socket (DialUDP is called only once; a minimal client sketch follows this list).
  4. Log in to the node. Use ngrep -q '' udp port 8125 to view the outgoing UDP traffic from the box to the service port. Observe that destination rewriting has occurred and that the packet is destined for one of the concrete endpoints listed in the Kubernetes Endpoints object.
  5. Delete that endpoint by, for example, killing the Pod the client is communicating with.
  6. Observe that the Pod's IP no longer appears in the endpoints (e.g. using kubectl).
  7. BUG: Observe that ngrep continues to report that the packets are being rewritten to the old endpoint.
  8. BUG (corollary): Observe that conntrack -L -d SERVICE_VIP shows that same socket still being routed to the old endpoint.
  9. WORKAROUND: Restart the client Pod, or have the client Pod call DialUDP each time it needs to send data.
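
Here is a minimal sketch, in Go, of the kind of client described in step 3. It is illustrative rather than the exact client from this report; the environment variable names assume a Service named "statsd".

```go
// Illustrative client sketch: dials the Service VIP once and reuses the
// socket, which is what leaves a stale conntrack entry behind when the
// backing Pod goes away. Env var names assume a Service named "statsd".
package main

import (
	"log"
	"net"
	"os"
	"time"
)

func main() {
	vip := os.Getenv("STATSD_SERVICE_HOST")
	port := os.Getenv("STATSD_SERVICE_PORT")

	raddr, err := net.ResolveUDPAddr("udp", net.JoinHostPort(vip, port))
	if err != nil {
		log.Fatal(err)
	}

	// DialUDP is called exactly once; the kernel picks one concrete endpoint
	// for this 5-tuple and keeps using it via conntrack.
	conn, err := net.DialUDP("udp", nil, raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for range time.Tick(10 * time.Second) {
		if _, err := conn.Write([]byte("test.metric:1|c")); err != nil {
			log.Println("write:", err)
		}
		// Step 9 workaround: call net.DialUDP here on every iteration instead,
		// so each send creates a fresh conntrack entry against a live endpoint.
	}
}
```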

The only solution I can see is that kube-proxy, when it rewrites iptables rules in response to endpoint changes, needs to also reset the tracked connections between local sockets and destroyed endpoints.

This didn't happen with the userspace kube-proxy, because kube-proxy was accepting the packets locally regardless of the endpoints, and would always use the latest endpoints information to forward the packet on.
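
For contrast, here is a rough Go sketch of that per-packet forwarding style. It is purely illustrative, not kube-proxy's actual code: currentEndpoint() is a hypothetical stand-in for the proxy's load balancer, which always reflects the latest endpoints, and the sketch only handles one-way, fire-and-forget traffic.

```go
// Purely illustrative: a per-packet UDP forwarder in the spirit of the
// userspace proxy. currentEndpoint() is a hypothetical stand-in for the
// proxy's load balancer, which always reflects the latest Endpoints.
package main

import (
	"log"
	"net"
)

func currentEndpoint() string {
	// Hypothetical: a real proxy would consult the live endpoints list
	// (kept up to date from the Kubernetes API) on every call.
	return "10.244.1.7:8125"
}

func main() {
	// Accept packets locally on the service port...
	local, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8125})
	if err != nil {
		log.Fatal(err)
	}
	defer local.Close()

	buf := make([]byte, 64*1024)
	for {
		n, _, err := local.ReadFromUDP(buf)
		if err != nil {
			log.Println("read:", err)
			continue
		}
		// ...and pick a backend per packet, so a deleted Pod simply stops
		// being chosen; no kernel NAT state can go stale.
		backend, err := net.ResolveUDPAddr("udp", currentEndpoint())
		if err != nil {
			continue
		}
		if _, err := local.WriteToUDP(buf[:n], backend); err != nil {
			log.Println("write:", err)
		}
	}
}
```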

Sorry for the long bug report, but I think it should be pretty clear by now if you've made it here. :)

thockin commented Dec 23, 2015

With a UDP "connection" you can be sending packets to the old IP to your heart's content and they will go nowhere, and that is just part of using UDP.

Take the proxy out of the picture: if you net.DialUDP to a non-existent port and send data, it will happily "succeed" but there's nobody listening. If you have a UDP "connection" to an IP:port and are sending data and the remote side dies, your app doesn't really know (unless the protocol you define atop UDP detects it).

I guess we could use conntrack -D -p udp, probably with -r (but I'd have to try it), to kill NAT entries. Seems like a pretty easy project for someone to tackle.
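
A sketch of what that could look like, assuming kube-proxy shells out to the conntrack binary whenever a UDP Service's endpoints change. It matches on the original destination with -d (as in the conntrack -L -d SERVICE_VIP command above); whether -r/--reply-src is also needed to catch every stale entry is exactly the "I'd have to try it" part.

```go
// Rough sketch, not kube-proxy's actual implementation: clear stale UDP
// NAT entries by shelling out to the conntrack tool.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// clearUDPConntrack deletes conntrack entries whose original destination is
// the given Service VIP. conntrack exits non-zero when nothing matched
// (printing something like "0 flow entries have been deleted"; exact wording
// may vary by version), so that case is treated as success here.
func clearUDPConntrack(serviceVIP string) error {
	out, err := exec.Command("conntrack", "-D", "-p", "udp", "-d", serviceVIP).CombinedOutput()
	if err != nil && !strings.Contains(string(out), "0 flow entries") {
		return fmt.Errorf("conntrack -D -p udp -d %s failed: %v: %s", serviceVIP, err, out)
	}
	return nil
}

func main() {
	// Hypothetical Service VIP whose UDP endpoints just changed.
	if err := clearUDPConntrack("10.0.0.153"); err != nil {
		fmt.Println(err)
	}
}
```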

joshk0 commented Dec 23, 2015

> I guess we could use conntrack -D -p udp probably with -r (but I'd have to try it) to kill NAT entries. Seems like a pretty easy project for someone to tackle.

Yeah, this is basically my suggestion here. With the userspace kube-proxy in this scenario, as long as the proxy itself stays up, the endpoints can rotate out without affecting reachability.

With iptables, when an endpoint rotates out, the socket will keep sending to a dead endpoint. That's a concrete downside of iptables mode, and one that is hard to figure out without an involved debug session like the one I did for the OP, so I think kube-proxy should try to do something about it.

@fabioy added the kind/bug, sig/network, and sig/node labels on Jan 6, 2016
@dchen1107 added the team/cluster label and removed the sig/node label on Jan 7, 2016

thockin commented Jan 19, 2016

Renaming to better reflect issue

@thockin thockin changed the title iptables kube-proxy, UDP, long lived socket issues iptables kube-proxy could handle UDP backend changes better Jan 19, 2016
erimatnor commented

Running into this issue with pretty serious consequences for DNS resolution. Some nodes in our cluster needed a reboot to update to a newer CoreOS version. One of the nodes was running a DNS pod. To minimize the effect on services, I first scaled the DNS replication controller to two instances. Then I rebooted the nodes one by one. After the update, I noticed that some services on nodes that didn't need a reboot had trouble resolving the new addresses of services on rebooted nodes. The problem appeared to be related to some stale connection tracking state on the node that routed DNS packets to the wrong/old DNS pod IP.

The result was that services couldn't find the new addresses of pods on rebooted nodes since they were still querying an old DNS pod IP.

thockin commented Mar 1, 2016

Yeah. I'd love to get a patch to handle this. I'm personally buried right now and there's no way I will get to this in the immediate future.

This is a great community project - someone out there must be interested in networking stuff and want to contribute....

I'll also tag @freehan in case he has cycles, but this is not as high prio as the myriad other things I know you have going on, too.

freehan commented Mar 1, 2016

I have cycles. I can take a look.

@freehan freehan self-assigned this Mar 1, 2016

thockin commented Mar 4, 2016

Minhan, is this something you're still hoping to look at, or overflowed?

freehan commented Mar 4, 2016

I will submit a PR shortly.

thockin commented Mar 4, 2016

oh, fantastic. Way better answer than I expected.

k8s-github-robot pushed a commit that referenced this issue Apr 20, 2016
Automatic merge from submit-queue

Flush conntrack state for removed/changed UDP Services

fixes: #19029

shamil commented Sep 4, 2016

Still happens to me, at least with Node.js. Each time I recreate the Pods that are part of a UDP Service, I also have to restart the Node.js Pods.

Using k8s v1.3.6 provisioned by kops

thockin commented Sep 5, 2016

@shamil Can you please open a new issue and post a repro case, as simple as you can make it?

Thanks

@girishkalele @kubernetes/sig-network

dlouzan commented Nov 7, 2016

@thockin @shamil
Hello guys, sorry for the necro-bump. I think I am facing the same issue (#26309 (comment)). I see that this ticket is closed and @thockin asked @shamil to open a new issue, but I couldn't find one. What is the status? Thank you.
