[WIP] KEP-4963: Kube-proxy Services Acceleration #128392
base: master
Conversation
/cc @npinaeva @danwinship
nice - now compare with eBPF?
It is not really about eBPF vs nftables, it is about the kernel path and how to avoid the "slow paths". I expect an eBPF implementation that shortcuts the kernel to have the same performance. Both netfilter/flowtables and eBPF (at least in Cilium) hook into the tc subsystem and redirect the traffic directly to the output interface, shortcutting the kernel, so AFAIK they do the same thing. The problem with eBPF is that, to do just Services, you need to maintain a considerable amount of code, duplicate existing kernel features, consume more resources, ... whereas this is barely 20 lines of code. To recap, we had different slow paths in kube-proxy
Much of Cilium's code is based on the tc hooks, but the standard eBPF shortcutting trick involves directly splicing two sockets together. In particular, that means the packet bypasses the network stack entirely, whereas the approach here only bypasses the network stack in the host netns. (The packets still traverse the network stack normally inside both containers.) OTOH, the socket splicing trick, by definition, only works for traffic within a single node. Doing shortcutting for node-ingress/node-egress traffic requires a separate trick (which should be more-or-less equivalent to this one).
For completeness on what Dan said, the way of achieving performance in the datapath is just avoiding overhead when processing the packets in the OS. The most interesting case in network performance I've seen so far is AI/ML workloads, which don't even use TCP/IP because it is "slow"; they use RDMA to bypass the OS and CPU, transferring the data between application memories. But coming back to the TCP/IP world, AFAIK there are two common techniques to shortcut the path through the OS network stack by bypassing the kernel (adding some links there
@aojea nice! I was actually just prototyping this in Calico last week as well, coincidentally. Changes look rather similar, although I've scoped the Calico offload rule to include non-service traffic as well so long as it matches a Calico owned interface. My understanding is that I should just be able to adjust the Calico table's priority to be slightly after kube-proxy's, which would ensure that kube-proxy flow offload is checked first for Service traffic and Calico can still offload other non-service traffic.
🥳
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
It doesn't make sense to add it to the iptables
Let's take the angle of "Kubernetes Service acceleration": this feature is just "accelerating the Kubernetes Services traffic", and for that it needs access to the ClusterIPs set to avoid interfering with any other network components on the nodes. The kernel infrastructure only exposes this functionality by using interface names, but I can see how it could have been exposed via connections, with nftables/netfilter/the kernel doing the heavy lifting of identifying the network interfaces and adding the corresponding flows to the tables, since the skbuff IIRC will have both interfaces associated ... if it had been exposed that way we would have the same functionality but without the different requirements ... Anyway, I think it is worth discussing, let's talk more at Kubecon and try to get more feedback ...
/triage accepted
Oh, so, not exactly the same thing, but our perf team here had done some testing with OVS offload to Mellanox NICs, and decided that it only makes sense for services with long-lived connections. If your service has lots of short-lived connections, then you end up spending more time configuring offload than you save by having offloaded the connection. (In the limiting case, if the connection closes immediately after you set up the offload, then you just completely wasted your time.) It's possible that this tradeoff applies more to hardware offload, where there are additional setup/cleanup steps required beyond what's required for the purely software offload case, but anyway, this feature could potentially benefit from having a hint on the Service that it has long-lived connections (and then it makes more sense at the kube-proxy level too...)
It brings the flowtables functionality:
The kernel implements a flowtable infrastructure that allows accelerating packet forwarding using nftables. Since connections with a small number of packets will not benefit from the offloading, add a new option to the nftables kube-proxy so users can define the minimum number of packets required to offload a connection to the datapath. It also allows users to completely disable the behavior by setting this option to 0. By default, connections with more than 20 packets are considered large connections and are offloaded to the fastpath.
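For reference, here is a minimal sketch of what such a flowtable ruleset looks like, loosely following the example in the kernel's nf_flowtable documentation; the table/chain names, the device list and the threshold match are illustrative, not the exact rules generated by kube-proxy:

```
table inet fastpath-example {
    flowtable f {
        # Software fast path: offloaded flows are forwarded from the
        # ingress hook, skipping the rest of the forwarding path.
        hook ingress priority 0; devices = { eth0 };
    }
    chain forward {
        type filter hook forward priority 0; policy accept;
        # Offload a connection once it has seen more than 20 packets
        # in the original direction (illustrative threshold).
        ct original packets > 20 flow add @f
    }
}
```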
@aojea: The following test failed.
Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
PR needs rebase.
From the first look, I thought we would have to explicitly clear the fast path when the destination pod is removed, just like the conntrack UDP cleanup. But it turns out it stays in sync with the conntrack table. I experimented with UDP services to test it; after conntrack cleanup, traffic was not blackholed 😄
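For anyone who wants to reproduce that check, a rough sketch (run on the node, assuming the conntrack tool is installed there; the ClusterIP is a placeholder for the UDP Service under test):

```sh
# Delete the conntrack entries (and with them the offloaded flows)
# for the UDP Service; 10.96.0.50 is a placeholder ClusterIP.
conntrack -D -p udp -d 10.96.0.50
# Send traffic to the Service again and confirm it is not blackholed;
# the entries should reappear:
conntrack -L -p udp -d 10.96.0.50
```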
/kind feature
Use the netfilter flowtable architecture to offload all the Services
traffic that has been established.
This presents some really interesting wins, in a kind cluster using
kube-proxy nftables:
https://docs.kernel.org/networking/nf_flowtable.html
How to test it
Edit the kube-proxy config to use flowtables on all interfaces
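For example, on a kind cluster (a sketch; the exact name and shape of the new flowtable option come from this PR's config changes and are not spelled out here):

```sh
# kube-proxy reads its configuration from the kube-proxy ConfigMap in
# kube-system (key "config.conf" on kind clusters). Make sure mode is
# set to "nftables" and enable the flowtable option added by this PR.
kubectl -n kube-system edit configmap kube-proxy
```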
Apply the changes
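One way to do that, assuming the default kube-proxy DaemonSet name, is to restart the pods so they pick up the updated ConfigMap:

```sh
kubectl -n kube-system rollout restart daemonset kube-proxy
```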
Development workflow
For those using kind or custom kube-proxy images, you don't need to rebuild the kind cluster; just build the kube-proxy image from the kubernetes repo.
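A sketch of one way to do that, assuming the `quick-release-images` make target and the usual `_output/release-images` layout:

```sh
# Build the release image tarballs (kube-proxy included) for the local platform.
make quick-release-images
ls _output/release-images/amd64/kube-proxy.tar   # adjust the arch if needed
```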
You'll have the tarball with the kube-proxy image that you can load directly into kind
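For example (the path and arch depend on your build):

```sh
# Load the image archive into every node of the kind cluster.
kind load image-archive _output/release-images/amd64/kube-proxy.tar
```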
or load it locally to retag it and replace the existing image
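A sketch of that route, with placeholder image names and tags:

```sh
docker load -i _output/release-images/amd64/kube-proxy.tar
# Retag the freshly built image over the tag the kube-proxy DaemonSet
# already references, then push it into the kind nodes.
docker tag registry.k8s.io/kube-proxy-amd64:<built-tag> registry.k8s.io/kube-proxy:<existing-tag>
kind load docker-image registry.k8s.io/kube-proxy:<existing-tag>
```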