
[WIP] KEP-4963: Kube-proxy Services Acceleration #128392

Open
aojea wants to merge 2 commits into base: master

Conversation

aojea
Member

@aojea aojea commented Oct 28, 2024

/kind feature

kube-proxy, in nftables mode, can offload Service traffic to the fastpath provided by the kernel flowtable infrastructure. Users can define a threshold, in packets per connection, after which a connection is offloaded to the datapath. Because short connections may not benefit, or may even regress in performance, the default threshold is 20 packets. The feature can be disabled by setting the threshold to 0.

Use the netfilter flowtable architecture to offload Service traffic once
the connection has been established.

This shows some really interesting wins in a kind cluster using
kube-proxy in nftables mode:

  • no offloading
[  1] local 10.244.0.8 port 52080 connected with 10.96.117.116 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.00 sec  4.64 GBytes  39.9 Gbits/sec
[  1] 1.00-2.00 sec  4.88 GBytes  41.9 Gbits/sec
[  1] 2.00-3.00 sec  4.81 GBytes  41.3 Gbits/sec
[  1] 3.00-4.00 sec  4.75 GBytes  40.8 Gbits/sec
[  1] 4.00-5.00 sec  4.86 GBytes  41.7 Gbits/sec
[  1] 5.00-6.00 sec  4.85 GBytes  41.7 Gbits/sec
[  1] 6.00-7.00 sec  4.91 GBytes  42.2 Gbits/sec
[  1] 7.00-8.00 sec  4.84 GBytes  41.6 Gbits/sec
[  1] 8.00-9.00 sec  4.80 GBytes  41.2 Gbits/sec
[  1] 9.00-10.00 sec  4.75 GBytes  40.8 Gbits/sec
[  1] 0.00-10.01 sec  48.1 GBytes  41.3 Gbits/sec
  • offloading
[  1] local 10.244.0.8 port 53332 connected with 10.96.117.116 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.00 sec  5.57 GBytes  47.8 Gbits/sec
[  1] 1.00-2.00 sec  5.52 GBytes  47.4 Gbits/sec
[  1] 2.00-3.00 sec  5.63 GBytes  48.4 Gbits/sec
[  1] 3.00-4.00 sec  5.50 GBytes  47.3 Gbits/sec
[  1] 4.00-5.00 sec  5.53 GBytes  47.5 Gbits/sec
[  1] 5.00-6.00 sec  5.61 GBytes  48.2 Gbits/sec
[  1] 6.00-7.00 sec  5.57 GBytes  47.9 Gbits/sec
[  1] 7.00-8.00 sec  5.65 GBytes  48.5 Gbits/sec
[  1] 8.00-9.00 sec  5.61 GBytes  48.2 Gbits/sec
[  1] 9.00-10.00 sec  5.57 GBytes  47.9 Gbits/sec
[  1] 0.00-10.01 sec  55.8 GBytes  47.9 Gbits/sec

https://docs.kernel.org/networking/nf_flowtable.html
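
For reference, the mechanism boils down to nftables rules of roughly this shape. This is a minimal sketch following the kernel flowtable documentation; the table, flowtable, and device names are illustrative and not the exact rules this PR generates:

    table inet example {
            # flowtable holding the offloaded (fast-path) connections
            flowtable fastpath {
                    hook ingress priority 0; devices = { eth0 };
            }
            chain forward {
                    type filter hook forward priority 0; policy accept;
                    # once a connection has seen more than 20 packets,
                    # add it to the flowtable so later packets skip the slow path
                    meta l4proto { tcp, udp } ct original packets > 20 flow add @fastpath
            }
    }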

How to test it

  1. Build a kind node image
kind build node-image
  2. Create a cluster with the new image using the nftables proxy
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  kubeProxyMode: "nftables"
nodes:
- role: control-plane
kind create cluster --image kindest/node:latest  --config src/yamls/kind-nftables.yaml

Edit the kube-proxy config to use flowtables on all interfaces

 kubectl edit cm kube-proxy -n kube-system
    kind: KubeProxyConfiguration
    mode: nftables
    nftables:
      acceleratedInterfaceExpression: "interface.name != \"\""

Apply the changes

kubectl rollout restart ds kube-proxy -n kube-system
  3. Deploy two pods, a client and a server
 kubectl run client --image=registry.k8s.io/e2e-test-images/agnhost:2.53
 kubectl run server --image=registry.k8s.io/e2e-test-images/agnhost:2.53
  4. Expose port 5001 of the server pod through a Service
 kubectl expose pod server --port=5001
  5. In one window, set up an iperf server in the server pod
 kubectl exec -it server -- iperf -s
  6. In another window, run an iperf client against the Pod IP (IMPORTANT: use the Pod IP here; this path is not accelerated)
 kubectl exec -it client -- iperf -c 10.244.0.5
------------------------------------------------------------
Client connecting to 10.244.0.5, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.244.0.6 port 41054 connected with 10.244.0.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.00 sec  49.5 GBytes  42.5 Gbits/sec
  7. Now run the same test against the ClusterIP
kubectl exec -it client -- iperf -c 10.96.157.37
------------------------------------------------------------
Client connecting to 10.96.157.37, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.244.0.6 port 44274 connected with 10.96.157.37 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.01 sec  56.1 GBytes  48.2 Gbits/sec
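
To check whether the ClusterIP connection is actually being offloaded, you can inspect the flowtables and the conntrack table while the iperf test runs; offloaded entries carry the [OFFLOAD] flag. A sketch, assuming the nft and conntrack tools are available on the kind node (kind-control-plane is the default container name for a single-node cluster):

 docker exec kind-control-plane nft list flowtables
 docker exec kind-control-plane conntrack -L | grep OFFLOAD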

Development workflow

For those iterating on the kube-proxy image with kind, you don't need to rebuild the kind node image; just build the release images from the kubernetes repo:

make quick-release-images

This produces a tarball with the kube-proxy image that you can load directly into kind:

$ kind load image-archive _output/release-images/amd64/kube-proxy.tar 
$ kubectl rollout restart ds kube-proxy -n kube-system

or load it locally, retag it, and replace the existing image:

$ docker load < _output/release-images/amd64/kube-proxy.tar
Loaded image: registry.k8s.io/kube-proxy-amd64:v1.32.0-alpha.2.328_c4102eb7359925
$ docker tag registry.k8s.io/kube-proxy-amd64:v1.32.0-alpha.2.328_c4102eb7359925 registry.k8s.io/kube-proxy-amd64:test
$ kubectl -n kube-system set image ds kube-proxy kube-proxy=registry.k8s.io/kube-proxy-amd64:test

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/code-generation area/dependency Issues or PRs related to dependency changes area/kube-proxy sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 28, 2024
@k8s-ci-robot k8s-ci-robot requested review from enj, logicalhan and a team October 28, 2024 17:31
@aojea
Member Author

aojea commented Oct 28, 2024

/cc @npinaeva @danwinship

@thockin
Member

thockin commented Oct 28, 2024

nice - now compare with eBPF?

@aojea
Member Author

aojea commented Oct 28, 2024

nice - now compare with eBPF?

It is not really about eBPF or nftables, it is about the kernel path and how to avoid the "slow paths"; I expect an eBPF implementation that shortcuts the kernel to have the same performance. Both netfilter flowtables and eBPF (at least in Cilium) hook into the tc subsystem and redirect the traffic directly to the output interface, shortcutting the kernel; AFAIK they do the same thing. The problem with eBPF is that, just for doing Services, you need to maintain a considerable amount of code, duplicate existing kernel features, consume more resources, ... this is barely 20 lines of code

To recap, we had several slow paths in kube-proxy:

  1. Data plane programming: sending the whole blob of rules was slow. @danwinship implemented partial syncs, and I think there were some other userspace improvements that @npinaeva helped to discover; IIRC we are now in a good state in this area ...

  2. The Service match was a linear search; @danwinship fixed that using a map in nftables, so we are now O(1) instead of O(n):

    map service-ips {
                type ipv4_addr . inet_proto . inet_service : verdict
                comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
                elements = { 10.96.0.10 . tcp . 53 : goto service-NWBZK7IH-kube-system/kube-dns/tcp/dns-tcp,
                             10.96.0.10 . udp . 53 : goto service-FY5PMXPG-kube-system/kube-dns/udp/dns,
                             10.96.117.116 . tcp . 80 : goto service-6THF4ISO-default/service/tcp/http,
                             10.96.221.118 . tcp . 80 : goto service-WRR5O5HH-default/test3/tcp/,
                             10.96.210.227 . tcp . 90 : goto service-6CKWZ3LT-default/test/tcp/,
                             10.96.117.116 . tcp . 5001 : goto service-LYULNVLC-default/service/tcp/iperf,
                             10.96.0.1 . tcp . 443 : goto service-2QRHZV4L-default/kubernetes/tcp/https,
                             10.96.0.10 . tcp . 9153 : goto service-AS2KJYAD-kube-system/kube-dns/tcp/metrics }
        }
  1. The "performance" as throughput is fixed with this PR, is not likely you can go faster, bits need to travel from one place to another :)

@danwinship
Contributor

Both netfilter flowtables and eBPF (at least in Cilium) hook into the tc subsystem and redirect the traffic directly to the output interface, shortcutting the kernel; AFAIK they do the same thing.

Much of Cilium's code is based on the tc hooks, but the standard eBPF shortcutting trick involves directly splicing two sockets together. In particular, that means the packet bypasses the network stack entirely, whereas the approach here only bypasses the network stack in the host netns. (The packets still traverse the network stack normally inside both containers.)

OTOH, the socket splicing trick, by definition, only works for traffic within a single node. Doing shortcutting for node-ingress/node-egress traffic requires a separate trick (which should be more-or-less equivalent to this one).

@aojea
Member Author

aojea commented Oct 29, 2024

For completeness on what Dan said, the way to achieve performance in the datapath is simply to avoid the overhead of processing packets in the OS.

The most interesting case in network performance I've seen so far is AI/ML workloads, which don't even use TCP/IP because it is "slow"; they use RDMA to bypass the OS and CPU, transferring the data directly between application memories.

But coming back to the TCP/IP world, AFAIK there are two common techniques to shortcut the path through the OS network stack by bypassing the kernel (adding some links there).

@caseydavenport
Member

caseydavenport commented Oct 29, 2024

@aojea nice! I was actually just prototyping this in Calico last week as well, coincidentally.

Changes look rather similar, although I've scoped the Calico offload rule to include non-service traffic as well, so long as it matches a Calico-owned interface. My understanding is that I should just be able to adjust the Calico table's priority to be slightly after kube-proxy's, which would ensure that kube-proxy flow offload is checked first for Service traffic and Calico can still offload other non-service traffic.

The problem with eBPF is that, just for doing Services, you need to maintain a considerable amount of code, duplicate existing kernel features, consume more resources, ... this is barely 20 lines of code

🥳
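
To illustrate the priority ordering mentioned above: base chains from different tables that hook the same point are evaluated in ascending priority order, so a layout roughly like the following (purely illustrative, not the actual kube-proxy or Calico rules) lets kube-proxy claim Service connections before a second table offloads the remaining traffic:

 table inet kube-proxy {
         chain filter-forward {
                 type filter hook forward priority filter; policy accept;
                 # ... Service match + flow add into kube-proxy's flowtable ...
         }
 }
 table inet calico {
         chain filter-forward {
                 # filter + 10 means this chain runs after kube-proxy's
                 type filter hook forward priority filter + 10; policy accept;
                 # ... flow add for the remaining (non-Service) traffic ...
         }
 }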

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@danwinship
Contributor

(Related: the nftables flowtable rule does not depend on you using the nftables kube-proxy backend. It would work with iptables or ipvs too.)

you need nftables to program the dataplane AFAIK, so adding the nftables interface to the iptables proxy sounds a bit :/

It doesn't make sense to add it to the iptables proxy.Provider, but my point is that it doesn't make sense to add it to the nftables proxy.Provider either. The point of proxy.Provider is to watch for services and endpoints changing, and update the backend rules for those services and endpoints. The flowtable code has no dependencies on services or endpoints (or node topology labels). The nftables rules it adds do not interact with any of the other rules that the nftables proxier adds. The flowtable needs to be updated on an entirely separate schedule from when the service/endpoint rules need to be updated. Architecturally, it would make much more sense as a separate controller within kube-proxy (or outside of kube-proxy) that maintains its own table with its own rules and flowtable.

@aojea
Member Author

aojea commented Oct 31, 2024

The flowtable code has no dependencies on services or endpoints (or node topology labels). The nftables rules it adds do not interact with any of the other rules that the nftables proxier adds

Let's take the angle of "Kubernetes Service acceleration": this feature is just "accelerating Kubernetes Service traffic", and for that it needs access to the ClusterIPs set to avoid interfering with any other network components on the nodes.

The kernel infrastructure only exposes this functionality via interface names, but I can see how it could have been exposed via connections, with nftables/netfilter/the kernel doing the heavy lifting of identifying the network interfaces and adding the corresponding flows to the tables, since the skbuff IIRC has both interfaces associated ... if it had been exposed this way we would have the same functionality but without the extra requirements ...

Anyway, I think it is worth discussing; let's talk more at KubeCon and try to get more feedback ...

@cici37
Contributor

cici37 commented Oct 31, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 31, 2024
@danwinship
Contributor

This shows some really interesting wins

Oh, so, not exactly the same thing, but our perf team here had done some testing with OVS offload to Mellanox NICs, and decided that it only makes sense for services with long-lived connections. If your service has lots of short-lived connections, then you end up spending more time configuring offload than you save by having offloaded the connection. (In the limiting case, if the connection closes immediately after you set up the offload, then you just completely wasted your time.) It's possible that this tradeoff applies more to hardware offload, where there are additional setup/cleanup steps required beyond what's required for the purely software offload case, but anyway, this feature could potentially benefit from having a hint on the Service that it has long-lived connections (and then it makes more sense at the kube-proxy level too...)

@aojea aojea changed the title Flowtables [WIP] Flowtables Dec 6, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2024
It brings the flowtables functionality
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2024
The kernel implements a flowtable infrastructure that allows
accelerating packet forwarding using nftables.

Since connections with a small number of packets will not benefit from
offloading, add a new option to the nftables kube-proxy so users can
define the minimum number of packets required to offload a connection to
the datapath. It also allows users to completely disable the behavior by
setting this option to 0.

By default, connections with more than 20 packets are considered large
connections and are offloaded to the fastpath.
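
As a configuration sketch, the threshold would sit next to the interface expression used in the test instructions above; note that acceleratedConnectionThreshold is a hypothetical field name used here only for illustration, not necessarily the name this PR introduces:

 kind: KubeProxyConfiguration
 mode: nftables
 nftables:
   acceleratedInterfaceExpression: "interface.name != \"\""
   # hypothetical field name: minimum packets before a connection is offloaded;
   # 0 disables the fastpath entirely
   acceleratedConnectionThreshold: 20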
@aojea aojea changed the title [WIP] Flowtables KEP-4963: Kube-proxy Services Acceleration Dec 9, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2024
@k8s-ci-robot
Contributor

@aojea: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-unit
Commit: a211e4b
Required: true
Rerun command: /test pull-kubernetes-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@aojea aojea changed the title KEP-4963: Kube-proxy Services Acceleration [WIP] KEP-4963: Kube-proxy Services Acceleration Dec 9, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 9, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Member

@aroradaman aroradaman left a comment


At first look, I thought we would have to explicitly clear the fast path when the destination pod is removed, just like the conntrack UDP cleanup. But it turns out it stays in sync with the conntrack table; I experimented with UDP services to test it, and after the conntrack cleanup the traffic was not blackholed 😄
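
A rough way to reproduce that check, assuming the conntrack tool is available on the kind node (the Service IP below is illustrative):

 # while UDP traffic to the Service is flowing, the offloaded entry shows [OFFLOAD]
 docker exec kind-control-plane conntrack -L -p udp | grep OFFLOAD
 # delete the entry, as kube-proxy's stale-endpoint cleanup would,
 # and verify the traffic recovers instead of being blackholed
 docker exec kind-control-plane conntrack -D -p udp --orig-dst 10.96.117.116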
