
[WIP] KEP-4963: Kube-proxy Services Acceleration #128392

Open
aojea wants to merge 2 commits into base: master

Conversation

aojea
Member

@aojea aojea commented Oct 28, 2024

/kind feature

kube-proxy, in nftables mode, can offload Service traffic to the fastpath provided by the kernel flowtable infrastructure. Users can define a threshold, in packets per connection, after which a connection is offloaded to the datapath. Because short connections may not benefit, or may even regress in performance, the default threshold is 20 packets. The feature can be disabled by setting the threshold to 0.

Use the netfilter flowtable architecture to offload Service traffic once
the connection has been established.

This shows some really interesting wins in a kind cluster using
kube-proxy in nftables mode:

  • no offloading
[  1] local 10.244.0.8 port 52080 connected with 10.96.117.116 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.00 sec  4.64 GBytes  39.9 Gbits/sec
[  1] 1.00-2.00 sec  4.88 GBytes  41.9 Gbits/sec
[  1] 2.00-3.00 sec  4.81 GBytes  41.3 Gbits/sec
[  1] 3.00-4.00 sec  4.75 GBytes  40.8 Gbits/sec
[  1] 4.00-5.00 sec  4.86 GBytes  41.7 Gbits/sec
[  1] 5.00-6.00 sec  4.85 GBytes  41.7 Gbits/sec
[  1] 6.00-7.00 sec  4.91 GBytes  42.2 Gbits/sec
[  1] 7.00-8.00 sec  4.84 GBytes  41.6 Gbits/sec
[  1] 8.00-9.00 sec  4.80 GBytes  41.2 Gbits/sec
[  1] 9.00-10.00 sec  4.75 GBytes  40.8 Gbits/sec
[  1] 0.00-10.01 sec  48.1 GBytes  41.3 Gbits/sec
  • offloading
[  1] local 10.244.0.8 port 53332 connected with 10.96.117.116 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-1.00 sec  5.57 GBytes  47.8 Gbits/sec
[  1] 1.00-2.00 sec  5.52 GBytes  47.4 Gbits/sec
[  1] 2.00-3.00 sec  5.63 GBytes  48.4 Gbits/sec
[  1] 3.00-4.00 sec  5.50 GBytes  47.3 Gbits/sec
[  1] 4.00-5.00 sec  5.53 GBytes  47.5 Gbits/sec
[  1] 5.00-6.00 sec  5.61 GBytes  48.2 Gbits/sec
[  1] 6.00-7.00 sec  5.57 GBytes  47.9 Gbits/sec
[  1] 7.00-8.00 sec  5.65 GBytes  48.5 Gbits/sec
[  1] 8.00-9.00 sec  5.61 GBytes  48.2 Gbits/sec
[  1] 9.00-10.00 sec  5.57 GBytes  47.9 Gbits/sec
[  1] 0.00-10.01 sec  55.8 GBytes  47.9 Gbits/sec

https://docs.kernel.org/networking/nf_flowtable.html
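
For reference, the mechanism boils down to nftables rules of roughly this shape. This is a minimal sketch following the kernel flowtable documentation; the table, flowtable, and device names are illustrative and not the exact rules this PR generates:

    table inet example {
            # flowtable holding the offloaded (fast-path) connections
            flowtable fastpath {
                    hook ingress priority 0; devices = { eth0 };
            }
            chain forward {
                    type filter hook forward priority 0; policy accept;
                    # once a connection has seen more than 20 packets,
                    # add it to the flowtable so later packets skip the slow path
                    meta l4proto { tcp, udp } ct original packets > 20 flow add @fastpath
            }
    }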

How to test it

  1. Build a kind node image
kind build node-image
  2. Create a cluster with the new image using the nftables proxy
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  kubeProxyMode: "nftables"
nodes:
- role: control-plane
kind create cluster --image kindest/node:latest  --config src/yamls/kind-nftables.yaml

Edit the kube-proxy config to use flowtables on all interfaces

 kubectl edit cm kube-proxy -n kube-system
    kind: KubeProxyConfiguration
    mode: nftables
    nftables:
      acceleratedInterfaceExpression: "interface.name != \"\""

Apply the changes

kubectl rollout restart ds kube-proxy -n kube-system
  3. Deploy two pods, a client and a server
 kubectl run client --image=registry.k8s.io/e2e-test-images/agnhost:2.53
 kubectl run server --image=registry.k8s.io/e2e-test-images/agnhost:2.53
  4. Expose port 5001 of the server pod through a Service
 kubectl expose pod server --port=5001
  5. In one window, set up an iperf server in the server pod
 kubectl exec -it server -- iperf -s
  6. In another window, run an iperf client against the Pod IP (IMPORTANT: use the Pod IP here; this path is not accelerated)
 kubectl exec -it client -- iperf -c 10.244.0.5
------------------------------------------------------------
Client connecting to 10.244.0.5, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.244.0.6 port 41054 connected with 10.244.0.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.00 sec  49.5 GBytes  42.5 Gbits/sec
  7. Now run the same test against the ClusterIP
kubectl exec -it client -- iperf -c 10.96.157.37
------------------------------------------------------------
Client connecting to 10.96.157.37, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 10.244.0.6 port 44274 connected with 10.96.157.37 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.01 sec  56.1 GBytes  48.2 Gbits/sec
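
To check whether the ClusterIP connection is actually being offloaded, you can inspect the flowtables and the conntrack table while the iperf test runs; offloaded entries carry the [OFFLOAD] flag. A sketch, assuming the nft and conntrack tools are available on the kind node (kind-control-plane is the default container name for a single-node cluster):

 docker exec kind-control-plane nft list flowtables
 docker exec kind-control-plane conntrack -L | grep OFFLOAD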

Development workflow

For those iterating on the kube-proxy image with kind, you don't need to rebuild the kind node image; just build the release images from the kubernetes repo:

make quick-release-images

This produces a tarball with the kube-proxy image that you can load directly into kind:

$ kind load image-archive _output/release-images/amd64/kube-proxy.tar 
$ kubectl rollout restart ds kube-proxy -n kube-system

or load it locally, retag it, and replace the existing image:

$ docker load < _output/release-images/amd64/kube-proxy.tar
Loaded image: registry.k8s.io/kube-proxy-amd64:v1.32.0-alpha.2.328_c4102eb7359925
$ docker tag registry.k8s.io/kube-proxy-amd64:v1.32.0-alpha.2.328_c4102eb7359925 registry.k8s.io/kube-proxy-amd64:test
$ kubectl -n kube-system set image ds kube-proxy kube-proxy=registry.k8s.io/kube-proxy-amd64:test

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/code-generation area/dependency Issues or PRs related to dependency changes area/kube-proxy sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 28, 2024
@k8s-ci-robot k8s-ci-robot requested review from enj, logicalhan and a team October 28, 2024 17:31
@aojea
Member Author

aojea commented Oct 28, 2024

/cc @npinaeva @danwinship

@thockin
Member

thockin commented Oct 28, 2024

nice - now compare with eBPF?

@aojea
Member Author

aojea commented Oct 28, 2024

nice - now compare with eBPF?

It is not really about eBPF or nftables, it is about the kernel path and how to avoid the "slow paths"; I expect an eBPF implementation that shortcuts the kernel to have the same performance. Both netfilter flowtables and eBPF (at least in Cilium) hook into the tc subsystem and redirect the traffic directly to the output interface, shortcutting the kernel; AFAIK they do the same thing. The problem with eBPF is that, just for doing Services, you need to maintain a considerable amount of code, duplicate existing kernel features, consume more resources, ... this is barely 20 lines of code

To recap, we had several slow paths in kube-proxy:

  1. Data plane programming: sending the whole blob of rules was slow. @danwinship implemented partial syncs, and I think there were some other userspace improvements that @npinaeva helped to discover; IIRC we are now in a good state in this area ...

  2. The Service match was a linear search; @danwinship fixed that using a map in nftables, so we are now O(1) instead of O(n):

    map service-ips {
                type ipv4_addr . inet_proto . inet_service : verdict
                comment "ClusterIP, ExternalIP and LoadBalancer IP traffic"
                elements = { 10.96.0.10 . tcp . 53 : goto service-NWBZK7IH-kube-system/kube-dns/tcp/dns-tcp,
                             10.96.0.10 . udp . 53 : goto service-FY5PMXPG-kube-system/kube-dns/udp/dns,
                             10.96.117.116 . tcp . 80 : goto service-6THF4ISO-default/service/tcp/http,
                             10.96.221.118 . tcp . 80 : goto service-WRR5O5HH-default/test3/tcp/,
                             10.96.210.227 . tcp . 90 : goto service-6CKWZ3LT-default/test/tcp/,
                             10.96.117.116 . tcp . 5001 : goto service-LYULNVLC-default/service/tcp/iperf,
                             10.96.0.1 . tcp . 443 : goto service-2QRHZV4L-default/kubernetes/tcp/https,
                             10.96.0.10 . tcp . 9153 : goto service-AS2KJYAD-kube-system/kube-dns/tcp/metrics }
        }
  1. The "performance" as throughput is fixed with this PR, is not likely you can go faster, bits need to travel from one place to another :)

@danwinship
Contributor

Both netfilter flowtables and eBPF (at least in Cilium) hook into the tc subsystem and redirect the traffic directly to the output interface, shortcutting the kernel; AFAIK they do the same thing.

Much of Cilium's code is based on the tc hooks, but the standard eBPF shortcutting trick involves directly splicing two sockets together. In particular, that means the packet bypasses the network stack entirely, whereas the approach here only bypasses the network stack in the host netns. (The packets still traverse the network stack normally inside both containers.)

OTOH, the socket splicing trick, by definition, only works for traffic within a single node. Doing shortcutting for node-ingress/node-egress traffic requires a separate trick (which should be more-or-less equivalent to this one).

@aojea
Member Author

aojea commented Oct 29, 2024

For completeness on what Dan said, the way to achieve performance in the datapath is simply to avoid the overhead of processing packets in the OS.

The most interesting case in network performance I've seen so far is AI/ML workloads, which don't even use TCP/IP because it is "slow"; they use RDMA to bypass the OS and CPU, transferring the data directly between application memories.

But coming back to the TCP/IP world, AFAIK there are two common techniques to shortcut the path through the OS network stack by bypassing the kernel (adding some links there).

@caseydavenport
Member

caseydavenport commented Oct 29, 2024

@aojea nice! I was actually just prototyping this in Calico last week as well, coincidentally.

Changes look rather similar, although I've scoped the Calico offload rule to include non-service traffic as well, so long as it matches a Calico-owned interface. My understanding is that I should just be able to adjust the Calico table's priority to be slightly after kube-proxy's, which would ensure that kube-proxy flow offload is checked first for Service traffic and Calico can still offload other non-service traffic.

The problem with eBPF is that, just for doing Services, you need to maintain a considerable amount of code, duplicate existing kernel features, consume more resources, ... this is barely 20 lines of code

🥳
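
To illustrate the priority ordering mentioned above: base chains from different tables that hook the same point are evaluated in ascending priority order, so a layout roughly like the following (purely illustrative, not the actual kube-proxy or Calico rules) lets kube-proxy claim Service connections before a second table offloads the remaining traffic:

 table inet kube-proxy {
         chain filter-forward {
                 type filter hook forward priority filter; policy accept;
                 # ... Service match + flow add into kube-proxy's flowtable ...
         }
 }
 table inet calico {
         chain filter-forward {
                 # filter + 10 means this chain runs after kube-proxy's
                 type filter hook forward priority filter + 10; policy accept;
                 # ... flow add for the remaining (non-Service) traffic ...
         }
 }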

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@danwinship
Contributor

(Related: the nftables flowtable rule does not depend on you using the nftables kube-proxy backend. It would work with iptables or ipvs too.)

you need nftables to program the dataplane AFAIK, so adding the nftables interface to the iptables proxy sounds a bit :/

It doesn't make sense to add it to the iptables proxy.Provider, but my point is that it doesn't make sense to add it to the nftables proxy.Provider either. The point of proxy.Provider is to watch for services and endpoints changing, and update the backend rules for those services and endpoints. The flowtable code has no dependencies on services or endpoints (or node topology labels). The nftables rules it adds do not interact with any of the other rules that the nftables proxier adds. The flowtable needs to be updated on an entirely separate schedule from when the service/endpoint rules need to be updated. Architecturally, it would make much more sense as a separate controller within kube-proxy (or outside of kube-proxy) that maintains its own table with its own rules and flowtable.

@aojea
Member Author

aojea commented Oct 31, 2024

The flowtable code has no dependencies on services or endpoints (or node topology labels). The nftables rules it adds do not interact with any of the other rules that the nftables proxier adds

Let's take the angle of "Kubernetes Service acceleration": this feature is just "accelerating Kubernetes Service traffic", and for that it needs access to the ClusterIPs set to avoid interfering with any other network components on the nodes.

The kernel infrastructure only exposes this functionality via interface names, but I can see how it could have been exposed via connections, with nftables/netfilter/the kernel doing the heavy lifting of identifying the network interfaces and adding the corresponding flows to the tables, since the skbuff IIRC has both interfaces associated ... if it had been exposed this way we would have the same functionality but without the extra requirements ...

Anyway, I think it is worth discussing; let's talk more at KubeCon and try to get more feedback ...

@cici37
Contributor

cici37 commented Oct 31, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 31, 2024
@danwinship
Contributor

This shows some really interesting wins

Oh, so, not exactly the same thing, but our perf team here had done some testing with OVS offload to Mellanox NICs, and decided that it only makes sense for services with long-lived connections. If your service has lots of short-lived connections, then you end up spending more time configuring offload than you save by having offloaded the connection. (In the limiting case, if the connection closes immediately after you set up the offload, then you just completely wasted your time.) It's possible that this tradeoff applies more to hardware offload, where there are additional setup/cleanup steps required beyond what's required for the purely software offload case, but anyway, this feature could potentially benefit from having a hint on the Service that it has long-lived connections (and then it makes more sense at the kube-proxy level too...)

@aojea aojea changed the title Flowtables [WIP] Flowtables Dec 6, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 6, 2024
It brings the flowtables functionality
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2024
The kernel implements a flowtable infrastructure that allows
accelerating packet forwarding using nftables.

Since connections with a small number of packets will not benefit from
offloading, add a new option to the nftables kube-proxy so users can
define the minimum number of packets required to offload a connection to
the datapath. It also allows users to completely disable the behavior by
setting this option to 0.

By default, connections with more than 20 packets are considered large
connections and are offloaded to the fastpath.
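
As a configuration sketch, the threshold would sit next to the interface expression used in the test instructions above; note that acceleratedConnectionThreshold is a hypothetical field name used here only for illustration, not necessarily the name this PR introduces:

 kind: KubeProxyConfiguration
 mode: nftables
 nftables:
   acceleratedInterfaceExpression: "interface.name != \"\""
   # hypothetical field name: minimum packets before a connection is offloaded;
   # 0 disables the fastpath entirely
   acceleratedConnectionThreshold: 20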
@aojea aojea changed the title [WIP] Flowtables KEP-4963: Kube-proxy Services Acceleration Dec 9, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2024
@k8s-ci-robot
Contributor

@aojea: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-unit
Commit: a211e4b
Required: true
Rerun command: /test pull-kubernetes-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@aojea aojea changed the title KEP-4963: Kube-proxy Services Acceleration [WIP] KEP-4963: Kube-proxy Services Acceleration Dec 9, 2024
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 9, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Member

@aroradaman aroradaman left a comment


At first look, I thought we would have to explicitly clear the fast path when the destination pod is removed, just like the conntrack UDP cleanup. But it turns out it stays in sync with the conntrack table; I experimented with UDP services to test it, and after the conntrack cleanup the traffic was not blackholed 😄
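
A rough way to reproduce that check, assuming the conntrack tool is available on the kind node (the Service IP below is illustrative):

 # while UDP traffic to the Service is flowing, the offloaded entry shows [OFFLOAD]
 docker exec kind-control-plane conntrack -L -p udp | grep OFFLOAD
 # delete the entry, as kube-proxy's stale-endpoint cleanup would,
 # and verify the traffic recovers instead of being blackholed
 docker exec kind-control-plane conntrack -D -p udp --orig-dst 10.96.117.116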
