iptables kube-proxy could handle UDP backend changes better #19029

Closed · joshk0 opened this issue Dec 22, 2015 · 12 comments · Fixed by #22573

Labels: kind/bug, sig/network

joshk0 commented Dec 22, 2015

I encountered a strange issue after opting in to kube-proxy's iptables support.

Steps to repro (I think):

  1. Create a Service for a UDP port (say 8125; in my case, this was statsd).
  2. Create Pods that back this Service.
  3. Start a client Pod that uses the Service, e.g. via an environment variable that specifies the Service VIP. This client writes a UDP packet to the Service every 10 seconds over the same socket (DialUDP is called only once; a minimal client sketch follows this list).
  4. Log in to the node. Use ngrep -q '' udp port 8125 to view the outgoing UDP traffic from the box to the service port. Observe that destination rewriting has occurred and that the packet is destined for one of the concrete endpoints listed in the Kubernetes Endpoints object.
  5. Delete that endpoint by, for example, killing the Pod the client is communicating with.
  6. Observe that the Pod's IP no longer appears in the endpoints (e.g. using kubectl).
  7. BUG: Observe that ngrep continues to report that the packets are being rewritten to the old endpoint.
  8. BUG (corollary): Observe that conntrack -L -d SERVICE_VIP shows that same socket still being routed to the old endpoint.
  9. WORKAROUND: Restart the client Pod, or have the client Pod call DialUDP each time it needs to send data.
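
Here is a minimal sketch, in Go, of the kind of client described in step 3. It is illustrative rather than the exact client from this report; the environment variable names assume a Service named "statsd".

```go
// Illustrative client sketch: dials the Service VIP once and reuses the
// socket, which is what leaves a stale conntrack entry behind when the
// backing Pod goes away. Env var names assume a Service named "statsd".
package main

import (
	"log"
	"net"
	"os"
	"time"
)

func main() {
	vip := os.Getenv("STATSD_SERVICE_HOST")
	port := os.Getenv("STATSD_SERVICE_PORT")

	raddr, err := net.ResolveUDPAddr("udp", net.JoinHostPort(vip, port))
	if err != nil {
		log.Fatal(err)
	}

	// DialUDP is called exactly once; the kernel picks one concrete endpoint
	// for this 5-tuple and keeps using it via conntrack.
	conn, err := net.DialUDP("udp", nil, raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for range time.Tick(10 * time.Second) {
		if _, err := conn.Write([]byte("test.metric:1|c")); err != nil {
			log.Println("write:", err)
		}
		// Step 9 workaround: call net.DialUDP here on every iteration instead,
		// so each send creates a fresh conntrack entry against a live endpoint.
	}
}
```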

The only solution I can see is that kube-proxy, when it rewrites iptables rules in response to endpoint changes, needs to also reset the tracked connections between local sockets and destroyed endpoints.

This didn't happen with the userspace kube-proxy, because kube-proxy was accepting the packets locally regardless of the endpoints, and would always use the latest endpoints information to forward the packet on.
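
For contrast, here is a rough Go sketch of that per-packet forwarding style. It is purely illustrative, not kube-proxy's actual code: currentEndpoint() is a hypothetical stand-in for the proxy's load balancer, which always reflects the latest endpoints, and the sketch only handles one-way, fire-and-forget traffic.

```go
// Purely illustrative: a per-packet UDP forwarder in the spirit of the
// userspace proxy. currentEndpoint() is a hypothetical stand-in for the
// proxy's load balancer, which always reflects the latest Endpoints.
package main

import (
	"log"
	"net"
)

func currentEndpoint() string {
	// Hypothetical: a real proxy would consult the live endpoints list
	// (kept up to date from the Kubernetes API) on every call.
	return "10.244.1.7:8125"
}

func main() {
	// Accept packets locally on the service port...
	local, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8125})
	if err != nil {
		log.Fatal(err)
	}
	defer local.Close()

	buf := make([]byte, 64*1024)
	for {
		n, _, err := local.ReadFromUDP(buf)
		if err != nil {
			log.Println("read:", err)
			continue
		}
		// ...and pick a backend per packet, so a deleted Pod simply stops
		// being chosen; no kernel NAT state can go stale.
		backend, err := net.ResolveUDPAddr("udp", currentEndpoint())
		if err != nil {
			continue
		}
		if _, err := local.WriteToUDP(buf[:n], backend); err != nil {
			log.Println("write:", err)
		}
	}
}
```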

Sorry for the long bug report, but I think it should be pretty clear by now if you've made it here. :)

thockin commented Dec 23, 2015

With a UDP "connection" you can be sending packets to the old IP to your heart's content and they will go nowhere, and that is just part of using UDP.

Take the proxy out of the picture: if you net.DialUDP to a non-existent port and send data, it will happily "succeed" but there's nobody listening. If you have a UDP "connection" to an IP:port and are sending data and the remote side dies, your app doesn't really know (unless the protocol you define atop UDP detects it).

I guess we could use conntrack -D -p udp, probably with -r (but I'd have to try it), to kill NAT entries. Seems like a pretty easy project for someone to tackle.
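
A sketch of what that could look like, assuming kube-proxy shells out to the conntrack binary whenever a UDP Service's endpoints change. It matches on the original destination with -d (as in the conntrack -L -d SERVICE_VIP command above); whether -r/--reply-src is also needed to catch every stale entry is exactly the "I'd have to try it" part.

```go
// Rough sketch, not kube-proxy's actual implementation: clear stale UDP
// NAT entries by shelling out to the conntrack tool.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// clearUDPConntrack deletes conntrack entries whose original destination is
// the given Service VIP. conntrack exits non-zero when nothing matched
// (printing something like "0 flow entries have been deleted"; exact wording
// may vary by version), so that case is treated as success here.
func clearUDPConntrack(serviceVIP string) error {
	out, err := exec.Command("conntrack", "-D", "-p", "udp", "-d", serviceVIP).CombinedOutput()
	if err != nil && !strings.Contains(string(out), "0 flow entries") {
		return fmt.Errorf("conntrack -D -p udp -d %s failed: %v: %s", serviceVIP, err, out)
	}
	return nil
}

func main() {
	// Hypothetical Service VIP whose UDP endpoints just changed.
	if err := clearUDPConntrack("10.0.0.153"); err != nil {
		fmt.Println(err)
	}
}
```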

joshk0 commented Dec 23, 2015

> I guess we could use conntrack -D -p udp probably with -r (but I'd have to try it) to kill NAT entries. Seems like a pretty easy project for someone to tackle.

Yeah, this is basically my suggestion here. With the userspace kube-proxy in this scenario, as long as the proxy itself stays up, the endpoints can rotate out without affecting reachability.

With iptables, when an endpoint rotates out, the socket will keep sending to a dead endpoint. That's a concrete downside of iptables mode, and one that is hard to figure out without an involved debug session like the one I did for the OP, so I think kube-proxy should try to do something about it.

@fabioy added the kind/bug, sig/network, and sig/node labels on Jan 6, 2016
@dchen1107 added the team/cluster label and removed the sig/node label on Jan 7, 2016

thockin commented Jan 19, 2016

Renaming to better reflect issue

@thockin thockin changed the title iptables kube-proxy, UDP, long lived socket issues iptables kube-proxy could handle UDP backend changes better Jan 19, 2016
erimatnor commented

Running into this issue with pretty serious consequences for DNS resolution. Some nodes in our cluster needed a reboot to update to a newer CoreOS version. One of the nodes was running a DNS pod. To minimize the effect on services, I first scaled the DNS replication controller to two instances. Then I rebooted the nodes one by one. After the update, I noticed that some services on nodes that didn't need a reboot had trouble resolving the new addresses of services on rebooted nodes. The problem appeared to be related to some stale connection tracking state on the node that routed DNS packets to the wrong/old DNS pod IP.

The result was that services couldn't find the new addresses of pods on rebooted nodes since they were still querying an old DNS pod IP.

thockin commented Mar 1, 2016

Yeah. I'd love to get a patch to handle this. I'm personally buried right now and there's no way I will get to this in the immediate future.

This is a great community project - someone out there must be interested in networking stuff and want to contribute....

I'll also tag @freehan in case he has cycles, but this is not as high prio as the myriad other things I know you have going on, too.

freehan commented Mar 1, 2016

I have cycles. I can take a look.

@freehan freehan self-assigned this Mar 1, 2016

thockin commented Mar 4, 2016

Minhan, is this something you're still hoping to look at, or overflowed?

freehan commented Mar 4, 2016

I will submit a PR shortly.

thockin commented Mar 4, 2016

oh, fantastic. Way better answer than I expected.

k8s-github-robot pushed a commit that referenced this issue Apr 20, 2016
Automatic merge from submit-queue

Flush conntrack state for removed/changed UDP Services

fixes: #19029

shamil commented Sep 4, 2016

Still happens to me, at least with Node.js. Each time I recreate the Pods that are part of a UDP Service, I also have to restart the Node.js Pods.

Using k8s v1.3.6 provisioned by kops

thockin commented Sep 5, 2016

@shamil Can you please open a new issue and post a repro case, as simple as you can make it?

Thanks

@girishkalele @kubernetes/sig-network

dlouzan commented Nov 7, 2016

@thockin @shamil
Hello guys, sorry for the necro-bump. I think I am facing the same issue (#26309 (comment)). I see that this ticket is closed and @thockin asked @shamil to open a new issue, but I couldn't find one. What is the status? Thank you.
