Implement a kube-proxy conntrack reconciler #126130
Comments
/sig network |
It is important to understand the performance implications, as listing all the conntrack entries is expensive. I remember talking with Florian Westphal (a kernel developer) about some way of doing server-side filtering in the kernel, to avoid having to dump the entire conntrack table and filter flow by flow, but I don't know what the state of that is. |
The netlink interface to conntrack has the ability to get specific flows and dump all flows that match a given mark and mask. |
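As a rough illustration of that capability (not kube-proxy code; the 0x4000 mark/mask values are placeholders), dumping the table with the vishvananda/netlink package and selecting flows by mark could look like the sketch below. Here the filtering happens in userspace after a full dump, which is exactly the expensive pattern discussed above; kernel-side filtering would avoid transferring every flow.

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Dump the whole IPv4 conntrack table (the expensive full dump).
	flows, err := netlink.ConntrackTableList(netlink.ConntrackTable, unix.AF_INET)
	if err != nil {
		panic(err)
	}

	// Select flows whose conntrack mark matches a given mark/mask.
	// 0x4000 is a placeholder, not the mark kube-proxy actually sets.
	const mark, mask = 0x4000, 0x4000
	for _, flow := range flows {
		if flow.Mark&mask == mark {
			fmt.Printf("marked flow: %s:%d -> %s:%d proto=%d\n",
				flow.Forward.SrcIP, flow.Forward.SrcPort,
				flow.Forward.DstIP, flow.Forward.DstPort,
				flow.Forward.Protocol)
		}
	}
}
```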
/triage accepted |
/kind bug Although I added the feature label, this is basically rewriting the existing logic to fix the limitations of the current implementation |
Hi @aroradaman! Just checking in, as we talked about this one on the SIG network sync today: Have any questions or need any support with this one? |
Was waiting for vishvananda/netlink#989 to be merged, will resume the work on this now. |
Prior to Kubernetes 1.28 we were bitten by a conntrack issue on UDP services with iTP set to Local. We fixed this by switching iTP to Cluster. So I appreciate you working on the reconciler. I'm looking forward to this problem going away. |
I've been trying to get my head around this problem some more. In AWS (with the AWS CNI) on Kubernetes v1.29.6 (installed by kOps) I seem to be seeing a behaviour where a conntrack entry is being left behind by the CNI. Our situation is that we have multiple pods: many clients and a single server (the server is a DaemonSet). When everything is working, I see a conntrack entry:
When the server pod is deleted, I see these kube-proxy logs:
After this time, I have the following conntrack entry:
(notice the mark=128) Later on, when the new server Pod has started, I see the following:
I'm not too familiar with this entire process, but it seems like the AWS-CNI is adding a conntrack entry, which isn't being cleaned up by kube-proxy. I'm not sure if the new reconciler will take this situation into account, so I thought I'd mention it. |
Same setup as before, but in GCP with GKE v1.29.6-gke.1326000: The clients are streaming UDP over 8125 to the server. When everything is working, I see a conntrack entry:
When the server pod is deleted, I see these kube-proxy logs:
Once the Pod starts up again, I have the incorrect conntrack entry:
(notice the …) If I delete it, a new one appears:
I'm unsure what is causing that incorrect conntrack entry; I assume this is the blackhole route problem that has been plaguing kube-proxy. |
in 1.28 there was a refactor of the conntrack logic #116104 but I think it is correct, and I couldn't find anything suspicious
no NAT is performed so it keeps the original IP
that is the client. We have a test that exercises this scenario, at the moment that the new pod is created, in your case the one with …
@adrianmoisey is this a reproducer or a hypothesis? If you have a repro in GKE let me know so I can take a look |
We noticed a behaviour change between 1.28 and 1.29, which I suspect may be related to #119394
I can reproduce in GKE v1.29.6-gke.1326000 |
Linking the PR to remove dependency from conntrack binary here - #126847 |
With vishvananda/netlink#989 we can delete all the conntrack entries in a single GET/DUMP conntrack table command, reducing the number of netlink calls. That will require a change in the conntrack.Interface to have a single generic method to clear all entries; should I include that in this PR or in a new one after this gets in? |
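For discussion, a hypothetical sketch of what that single generic method could look like. The names and signatures below are illustrative, not the actual kube-proxy conntrack.Interface, and it assumes a netlink library version that includes vishvananda/netlink#989:

```go
package conntrack

import "github.com/vishvananda/netlink"

// Interface abstracts conntrack operations so they can be faked in tests.
// This is an illustrative shape, not the real kube-proxy interface.
type Interface interface {
	// ClearEntries dumps the conntrack table once and deletes every flow
	// matched by any of the given filters, returning how many were removed.
	ClearEntries(ipFamily uint8, filters ...netlink.CustomConntrackFilter) (uint, error)
}

type conntracker struct{}

func (conntracker) ClearEntries(ipFamily uint8, filters ...netlink.CustomConntrackFilter) (uint, error) {
	// ConntrackDeleteFilters walks the table in a single dump and issues a
	// delete for each flow matched by one of the filters.
	return netlink.ConntrackDeleteFilters(netlink.ConntrackTable, netlink.InetFamily(ipFamily), filters...)
}

var _ Interface = conntracker{}
```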
My opinion is to include it in a new PR. |
I don't think we handle the conntrack cleanup for services that transition from eTP to iTP. Ideally, after the transition, all the entries DNATted to cluster endpoints should be removed; I will also add an e2e for the same. |
Currently, the service changes (traffic policy and topology) only apply to new connections; with the reconciler we have the opportunity to apply the changes to existing connections as well. |
with the reconciler we can fix all of that, as we will no longer depend on the events, so we don't care about transitions anymore. The reconciler will check the desired state by parsing all the Services and their characteristics, dump the conntrack table, and delete the conntrack entries we know will blackhole traffic ... it is UDP so we don't need to react immediately |
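A very rough shape of that loop, for illustration only. desiredStateFilter is a hypothetical callback standing in for the "parse all Services and endpoints, decide what blackholes traffic" step; the actual implementation may structure this differently:

```go
package conntrack

import (
	"time"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog/v2"
)

// RunReconciler periodically dumps the conntrack table and deletes the
// entries the proxy knows will blackhole traffic, independent of any
// Service/Endpoint events. Since this only matters for UDP, reacting once
// per sync period is enough; a failed pass is simply retried on the next one.
func RunReconciler(stopCh <-chan struct{}, syncPeriod time.Duration,
	desiredStateFilter func() netlink.CustomConntrackFilter) {
	wait.Until(func() {
		filter := desiredStateFilter() // built from the current Services and endpoints
		if n, err := netlink.ConntrackDeleteFilters(netlink.ConntrackTable, unix.AF_INET, filter); err != nil {
			klog.ErrorS(err, "Conntrack reconciliation failed, will retry on next sync")
		} else if n > 0 {
			klog.V(4).InfoS("Deleted stale conntrack entries", "count", n)
		}
	}, syncPeriod, stopCh)
}
```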
I'm not sure there is really a good use case for switching a live service between eTP and iTP, but if we do care about that, it seems more consistent to me to preserve existing connections rather than killing them. EDIT: I think by "transition from eTP to iTP", Daman meant "transition either eTP or iTP between Local and Cluster", and that's certainly what I meant... |
Speaking about changing iTP from Cluster to Local here: In this particular use case, I'd expect the blackhole conntrack entries to be flushed at some point after the iTP changes, in order to send traffic to the local Pod. I can't say for certain if my expectation matches other use cases though. |
Currently, when a service is deleted, we delete all existing DNAT records for its ClusterIP (…). Then again, I'm not actually sure why we delete those entries... the original UDP conntrack problem bug report (#19029) only described the problem that is fixed by the … (I guess it has the effect of immediately killing UDP Service connections when you delete the Service, even if the endpoints still exist, but that behavior is the opposite of how TCP Services behave, so that doesn't seem like a good argument for keeping it...) |
I'm sorry for being pedantic, but there are no connections in UDP; we delete a conntrack entry, and that does not cause any problem, we are not breaking traffic .... an application that is using UDP cannot make the assumption that it has a connection and is always going to hit the same backend |
I'm not talking about "same backend", I'm talking about "at all". Deleting a TCP Service while its endpoints are still running results in graceful termination of its clients; you can't create new connections, but existing connections continue to function for as long as the endpoints are there. Deleting a UDP Service while its endpoints are still running results in immediate termination of all clients; further packets using previously-established 5-tuples will be dropped. And this is something the code does deliberately; it could implement immediate termination in the TCP case (by deleting conntrack records for deleted TCP ClusterIPs too), or graceful termination in the UDP case (by not deleting conntrack records for deleted UDP ClusterIPs), but it doesn't. And this doesn't seem to have been an intentional decision. (And it only applies to ClusterIPs! ExternalIPs, LB IPs, and NodePorts get graceful termination for both TCP and UDP.) |
Akshually... I recall having this discussion in the SIG a couple years ago. I'm on mobile now but I can try to find the issue later. We talked about whether it is correct to leave TCP connections open when the service has been deleted, or whether we should break it from the middle. Deleting the conntrack entry does NOT break TCP sessions the way you think it might. I was pretty surprised. I tried but I did not find an obvious way to break a TCP session from the middle, short of a black hole. IIRC we didn't really come to a conclusion about whether it's right or wrong, but it didn't matter because we didn't have a mechanism. Maybe ebpf has a better way of doing this, or maybe using the netlink library, rather than the conntrack tool? |
doh, I see it now, I was not getting the point, sorry ... but we never need to delete entries if there are no backends; we only need to delete entries when there is at least one backend, because the other entries may blackhole the legit traffic. If there are no backends we let the packets flow. @thockin I think you mean this #108523
yeah, it does |
@thockin in this case we don't want to leave the connection hanging, since TCP has long timers; even if we manage to delete the entry, I think we want to signal both ends by sending a RST so they don't leave sockets open |
ebpf programs can read arbitrary conntrack entries, while netfilter rules can only see the results of applying the current conntrack entries to the current connection... but it doesn't seem like reading arbitrary entries would really be helpful. For nftables kube-proxy, "Drop any packets to IPs in the ServiceCIDR range that aren't currently-active ClusterIPs" is just a recombination of the two behaviors in #122692, and should be doable with one or two new chains with the right hook+priority, with a single rule.
The big thing the netlink library gives us is that we can delete arbitrary sets of entries (in particular, now we can delete blackhole entries for a given IP without also deleting DNAT entries for that IP, which we couldn't do before, #122741), so if the problem before had been "there's no way to delete just the conntrack entries we want to delete", then the library should help. But if the problem was "I deleted all the conntrack entries related to the cluster IP and open connections still stayed open" then the netlink library probably won't help. |
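As a concrete illustration of that distinction, here is a sketch assuming the vishvananda/netlink CustomConntrackFilter interface (the ClusterIP is a placeholder): an entry that was never DNATed still has its reply source equal to its original destination, which is what lets us target only the blackhole entries for a given IP while leaving its DNAT entries alone.

```go
package main

import (
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
	"golang.org/x/sys/unix"
)

// blackholeFilter matches UDP entries for a given Service IP that were never
// DNATed (reply source still equals the original destination), leaving the
// DNATed entries for that same IP untouched.
type blackholeFilter struct {
	ip net.IP
}

func (f blackholeFilter) MatchConntrackFlow(flow *netlink.ConntrackFlow) bool {
	return flow.Forward.Protocol == unix.IPPROTO_UDP &&
		flow.Forward.DstIP.Equal(f.ip) &&
		flow.Reverse.SrcIP.Equal(flow.Forward.DstIP) // no DNAT was applied
}

func main() {
	// 10.96.0.10 is a placeholder ClusterIP.
	n, err := netlink.ConntrackDeleteFilters(netlink.ConntrackTable, unix.AF_INET,
		blackholeFilter{ip: net.ParseIP("10.96.0.10")})
	if err != nil {
		panic(err)
	}
	fmt.Printf("deleted %d blackhole entries\n", n)
}
```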
FTR: #108523 (comment) and down was an interesting refresher on the state of things.
I repeated the experiment in #108523 (comment) and got the same results, whether client and server were on the same node or on different nodes (through a clusterIP). Perhaps … So again, if the intention is to break a TCP connection "from the middle", I don't think conntrack is the mechanism, or at least I am not using it right. |
Can we set custom marks on conntrack entries that have been DNATed by the proxy? Currently, we use marking at the packet level, but this is limited to masquerade hairpin traffic, and we also remove the mark in post-routing. By using …
ClusterIPs can be filtered out using ServiceCIDRs; CONNMARK will help to distinguish DNAT flows for LoadBalancerIP, ExternalIP and NodePorts that are owned by the proxy. The current approach I'm experimenting with is clearing flow entries which have a ClusterIP/ExternalIP/LoadBalancerIP [--orig-dst] or NodePort [--orig-port-dst] as origin and whose DNATed destination [--reply-src] doesn't belong to any of the serving endpoints. |
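In terms of the netlink flow structure, that predicate maps roughly to the sketch below (--orig-dst corresponds to Forward.DstIP, --orig-port-dst to Forward.DstPort, --reply-src to Reverse.SrcIP). The serviceIPs/nodePorts/servingEndpoints sets are hypothetical stand-ins for state the proxier already tracks:

```go
package conntrack

import "github.com/vishvananda/netlink"

// isStale reports whether a conntrack entry was DNATed by the proxy (its
// original destination is a Service IP or NodePort) but its DNATed
// destination (reply source) is no longer one of the serving endpoints.
// The three sets are placeholders for state kube-proxy already has.
func isStale(flow *netlink.ConntrackFlow,
	serviceIPs map[string]bool, // ClusterIP / ExternalIP / LoadBalancerIP
	nodePorts map[uint16]bool,
	servingEndpoints map[string]bool,
) bool {
	ownedByProxy := serviceIPs[flow.Forward.DstIP.String()] || // --orig-dst
		nodePorts[flow.Forward.DstPort] // --orig-port-dst
	if !ownedByProxy {
		return false
	}
	// --reply-src: the backend this entry was DNATed to.
	return !servingEndpoints[flow.Reverse.SrcIP.String()]
}
```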
I have two concerns with this approach:
I think it is correct, it is simple and solves all the problems: any UDP entry that is stale on an active Service (by active I mean it has at least one endpoint) should be deleted on any sync period |
Yes, although the situation when talking about conntrack is a little bit better than with the skb mark; we can use … |
I disagree. If you, as an admin, don't want to break kubernetes networking then don't break kubernetes networking. There are a million different ridiculous things someone could do as … |
Just one thought: given that not every UDP service is bidirectional (things like UDP programs that listen for heartbeats don't do request/reply back on the same tuple, they're just one-way traffic), for UDP services, if we move to deleting the flow entry (even if it's perceived as "stale" because there was no reply for the first packet) from conntrack, that just causes churn in the conntrack table. When the next packet comes, it will be NATted to a new port tuple. Also, what about scalability: given a large number of UDP flows, how scalable is it to delete all the stale flows? If you move to the other kube-proxy backends like IPVS, or even use external load balancers, this expectation of bidirectional UDP traffic doesn't exist. As such, to be consistent, expecting bidirectional UDP traffic should be optional per service, or at least stated as a limitation of the iptables backend: "all UDP services deployed in Kubernetes should be bidirectional, and every first packet requires a reply packet or traffic on a specific flow may blackhole". |
if there is a rule that matches the conntrack entry then it is not perceived as stale; the logic is not based on whether there is a response or not, it is about whether there is a conntrack entry that does not have a rule associated with it |
kubernetes/pkg/proxy/iptables/proxier.go Line 1527 in 16f9fdc
I've seen this and other cases where packets are only forwarded if they are in "related" or "established" state in conntrack and dropped if they're in "invalid" state. In practice, for UDP services this means that the first packet must have a reply or the flow never leaves the "new" state and all subsequent inbound traffic is dropped, as the flow never enters "established". What I'm trying to state, but using better words, is here: https://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#UDPCONNECTIONS And the code in netfilter: https://github.com/torvalds/linux/blob/master/net/netfilter/nf_conntrack_proto_udp.c#L104 |
Now that this has landed in master, should it also be cherry-picked into previous versions? |
no, it's too big to cherry-pick. |
What would you like to be added?
We'd like to have a conntrack reconciler for Services in kube-proxy: for each Service or Endpoint change it should reconcile the Service/Endpoint table with the existing conntrack table and remove stale UDP entries (we have concluded in other issues that UDP is the only protocol that requires this)
Why is this needed?
Although a lot of work and testing has been added to solve the problem of stale UDP entries that can blackhole traffic, we still have reports of issues related to stale UDP entries
#125467
Also, the conntrack logic is event-based and, as expected, if something fails it never reconciles
#112604
The reconciler will only delete the entries that are known to be stale; other entries will time out.
References
Some WIP I had https://github.com/kubernetes/kubernetes/compare/master...aojea:kubernetes:conntrack_done_right?expand=1
Related issues that will be fixed