Try kube-proxy via ipvs instead of iptables or userspace #17470
Comments
Pretty busy right now: last round of midterms and final projects, and soon-ish I have finals. We're out for the holidays in about 3 weeks though (done for sure by December 12th). I'll be sure to take a look if/when I can find the time! |
I was kidding :)
|
Ah, whizzed right over my head. :) I very much enjoy working in OSS though, If I don't get wrapped up in I'll stop cluttering this issue for now though :) On Wed, Nov 18, 2015 at 9:55 PM, Tim Hockin notifications@github.com
|
@thockin I like this idea. I noticed that Andrey Sibiryov, who is from Uber, also gave a session "Kernel load-balancing for Docker containers using IPVS" at DockerCon 2015 EU. Please see the DockerCon 2015 EU agenda. |
Yeah, I think this is actually not a very hard project, but I'd want to see On Sun, Nov 22, 2015 at 11:04 PM, qiaolei notifications@github.com wrote:
|
Video: DockerCon 2015 EU, "Kernel load-balancing for Docker containers using IPVS" |
yeah, IPVS works. I tried it out a few months back, but I was missing a On Thu, Dec 3, 2015 at 8:09 PM, Manuel Alejandro de Brito Fontes <
|
Also, it would be cool for k8s Services to utilize the IPVS features, like persistence and selecting the balance strategy (and even weights?); see the sketch just below for how those map onto plain ipvsadm.
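For reference, a rough sketch (with made-up addresses) of how those knobs map onto plain ipvsadm; this is only an illustration, not anything kube-proxy would emit:

```sh
# -s picks the scheduler (rr, wrr, lc, sh, ...); -p <seconds> turns on
# client-IP persistence for the virtual service.
ipvsadm -A -t 10.0.0.100:80 -s wrr -p 300

# -w sets a per-backend weight; -m selects masquerade (NAT) forwarding.
ipvsadm -a -t 10.0.0.100:80 -r 172.17.0.3:80 -m -w 2
ipvsadm -a -t 10.0.0.100:80 -r 172.17.0.4:80 -m -w 1
```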
|
Reminds me of something :) #3760 (comment) |
Interesting. |
Whilst poking around for other threads, I found this... moby/libnetwork#852 |
@kubernetes/huawei |
Spoke with some folks internally, we think this is a good path of investigation (although expensive). |
Expensive in what regard?
|
Developer time.
|
Maybe we could make kube-proxy pluggable, so everyone can integrate their own implementation as needed. |
That's sort of the idea. We will build a few implementations into it, but
|
@thockin I gave this a spin with the goal of understanding how it might interact with iptables policy rules such as those used by Calico.
I found that:
I also found that the requirement to have the service IP on a local dummy interface on the host was a bit of a pain (something like the sketch below). Running a command such as... If I remove Calico's... |
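For context, a rough sketch of what putting a service VIP on a local dummy interface looks like; the interface name and address are made up for illustration:

```sh
# Hedged sketch: create a dummy interface and attach a service VIP to it so
# that traffic to the VIP is accepted locally and handed to IPVS.
ip link add kube-dummy type dummy
ip link set kube-dummy up
ip addr add 10.0.0.10/32 dev kube-dummy
```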
you need to enable connection tracking using "sysctl -w net.ipv4.vs.conntrack=1" |
@aledbf Thanks for the tip, I'll give that a try. |
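As an aside, a minimal sketch for applying that setting and keeping it across reboots (the file name is arbitrary):

```sh
# Enable connection tracking for IPVS-handled traffic now...
sudo sysctl -w net.ipv4.vs.conntrack=1
# ...and persist it across reboots.
echo 'net.ipv4.vs.conntrack = 1' | sudo tee /etc/sysctl.d/99-ipvs-conntrack.conf
```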
I retested with the conntrack setting enabled.
All seemed to work as expected and policy was being applied. I did not try connecting from pod to its own service IP. There still might be wrinkles there due to the need to SNAT those packets. |
Update: while I had the rig set up, I checked the latter case and it also seems to work as expected. I manually inserted an iptables rule that masqueraded looped-back traffic. |
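For anyone reproducing the hairpin case, a hedged sketch of the kind of rule meant here; the pod address is an assumption, not the rule from the actual test:

```sh
# Hedged sketch: when a pod connects to its own service VIP and IPVS picks that
# same pod as the backend, masquerade the looped-back leg so the reply returns
# via the host instead of arriving with the pod's own address as the source.
# 10.244.1.27/32 is an assumed pod address; a real setup would generate this
# per endpoint or use a broader hairpin match.
iptables -t nat -A POSTROUTING -s 10.244.1.27/32 -d 10.244.1.27/32 -j MASQUERADE
```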
Having to create a dummy IP for each service would be unfortunate. Is On Mon, Jun 6, 2016 at 10:07 AM, Shaun Crampton notifications@github.com
|
It seems Docker 1.12 will ship Services based on IPVS. |
fasaxc: Thanks a lot |
@hhzguo I did a manual test that matched on the specific IP addresses I was expecting, nothing that was production ready, I'm afraid. |
This only works for requests from a container, but does not work for requests from the host, because IPVS does not support client and director on the same machine. Is there any workaround? |
Does this really work? It seems that IPVS does not support client and director on the same machine. |
Where do you see that this is not supported? That is a transcript of my test - yeah, it worked (NB this is in masquerade mode)
…On Wed, Dec 21, 2016 at 1:24 AM, starsdeep ***@***.***> wrote:
@thockin <https://github.com/thockin>
***@***.***:/home/thockin# ipvsadm -A -t 10.9.8.7:12345 -s rr
***@***.***:/home/thockin# ipvsadm -a -t 10.9.8.7:12345 -m -r 10.244.1.27:9376
***@***.***:/home/thockin# ipvsadm -a -t 10.9.8.7:12345 -m -r 10.244.1.28:9376
***@***.***:/home/thockin# ip addr add 10.9.8.7/32 dev eth0
***@***.***:/home/thockin# curl 10.9.8.7:12345
hostB
***@***.***:/home/thockin# curl 10.9.8.7:12345
hostA
Does this really work? It seems that IPVS does not support client and
director on the same machine.
|
I manually tested this:
More information: After clearing ALL iptables and restarting docker and flanneld, iptables on my host is as follows:
test pods and svc:
ipvs rules:
dummy interface
curl : in container works
curl : on host does not work
|
IPVS definitely supports client and director on the same machine. I just tried re-running your example @starsdeep and it worked:
[root@aws ~]# ip link add eth1 type dummy
[root@aws ~]# ip addr add 10.8.8.8/32 dev eth1
[root@aws ~]# ipvsadm -A -t 10.8.8.8:10000 -s rr
[root@aws ~]# docker run -d nginx
b57c0a31491efa19f1820100ad123952f6d2ec6f60eaf923e0e220dcb5b69578
[root@aws ~]# ipvsadm -a -t 10.8.8.8:10000 -m -r $(docker inspect -f '{{ (index .NetworkSettings.Networks "bridge").IPAddress }}' b57c):80
[root@aws ~]# curl -sS --head 10.8.8.8:10000
HTTP/1.1 200 OK
Server: nginx/1.11.8
Date: Wed, 04 Jan 2017 17:16:26 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 27 Dec 2016 14:23:08 GMT
Connection: keep-alive
ETag: "5862794c-264"
Accept-Ranges: bytes
Maybe there are some other third-party networking toolkits running on that box that might interfere with the experiment? |
It seems there is more concrete work ongoing for this: #38969, #38817. Related mail thread: https://groups.google.com/forum/#!topic/kubernetes-sig-network/59HG9SlypBc |
There have been two talks at KubeCon Berlin regarding IPVS:
|
We have a tested implementation of IPVS kubeproxy in #44063 |
@kobolog |
@ChenLingPeng Normally you'd want IPVS running either on your origin or your destination host in this kind of setup. This is because by default IPVS NAT only does DNAT, so if you have an IPVS-in-the-middle, the response will not hit it on the way back and, from the origin's point of view, it's going to be a martian packet. There are a few ways to have IPVS-in-the-middle if you want it anyway; e.g. you can use a variation of source-based routing where each backend has multiple IPs, one per IPVS host, and then configure static routes on them so that traffic is routed back to the corresponding IPVS host. Example: you have two IPVS hosts (see the sketch below). |
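A hedged sketch of that source-based-routing idea, with made-up addresses (two IPVS directors at 192.0.2.1 and 192.0.2.2, and a backend carrying one extra IP per director); this illustrates the technique, it is not a transcript from the thread:

```sh
# On the backend host: one local IP per IPVS director.
ip addr add 10.10.0.1/32 dev eth0   # flows that arrived via director 192.0.2.1
ip addr add 10.10.0.2/32 dev eth0   # flows that arrived via director 192.0.2.2

# Route replies based on which local IP the flow used, so each response
# goes back through the director that DNAT'ed it.
ip route add default via 192.0.2.1 table 101
ip route add default via 192.0.2.2 table 102
ip rule add from 10.10.0.1 lookup 101
ip rule add from 10.10.0.2 lookup 102
```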
I have been testing the IPVS-based service proxy in Kube-router [1] for Kubernetes over the last couple of days. Here are my observations.

First, it seems hard to use direct routing. We would need to assign the VIP (cluster IP/node IP) to the pods, and pods across nodes (and the nodes themselves) would have to be in the same L2 domain, since IPVS rewrites the MAC and the packet has to be sent directly to the pod. So we have to use IPVS masquerade mode for a viable solution. But that requires reverse traffic from the pods to go back through the node so that the source IP is replaced with the cluster IP/node IP, whichever was used. One solution is to do both SNAT (to replace the source IP with the node's IP) and DNAT, in which case we still need iptables rules to do the SNAT. But doing SNAT breaks network policy enforcement, because we lose the source IP. However, if you are using host-gateway [2] or cloud-gateway [2] based routing for pod-to-pod connectivity, then things just fall into place without needing SNAT.

With NodePort-based services, when the client is outside the cluster, the reverse traffic will not hit the IPVS node. So traffic to a node port needs both SNAT and DNAT (see the sketch below for the kind of rule involved). The source IP itself has no significance for non-pod clients, so network policies are not an issue in this case. Of course this is not a problem unique to IPVS; even for an IPVS proxier we will still need iptables rules to support the --cluster-cidr and masquerade-all flags.

[1] https://github.com/cloudnativelabs/kube-router/blob/master/app/controllers/network_services_controller.go |
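To make the NodePort point concrete, a hedged sketch of the kind of iptables SNAT that has to accompany the IPVS DNAT for out-of-cluster clients; the pod CIDR is a made-up value and a real implementation would match more narrowly (roughly what --cluster-cidr / masquerade-all control):

```sh
# Hedged sketch: 10.244.0.0/16 is an assumed pod CIDR.
# After IPVS DNATs a NodePort connection to a pod, SNAT it to this node's
# address so the pod's reply comes back through the node rather than being
# sent straight to the external client (which would not recognize it).
iptables -t nat -A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE
```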
Continuing @starsdeep's experiment, but it works both in a container and on the host (only one host). Is there anything specific in your network environment, @starsdeep?
# prepare local kubernetes cluster
$ sudo ./hack/local-up-cluster.sh
$ sudo kill -9 $KUBE_PROXY_PID
# run two nginx pods
$ kubectl run --image nginx --replicas=2 nginx
# expose deployment
$ kubectl expose deployment nginx --port=80 --target-port=80
$ kubectl get services
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes 10.0.0.1 <none> 443/TCP 3m
nginx 10.0.0.185 <none> 80/TCP 4s
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx-348975970-7x18g 1/1 Running 0 49s 172.17.0.3 127.0.0.1
nginx-348975970-rtqrz 1/1 Running 0 49s 172.17.0.4 127.0.0.1
# Add dummy link
$ sudo ip link add type dummy
$ sudo ip addr add 10.0.0.185 dev dummy0
# Add ipvs rules; real server should use nat mode, since host is essentially
# the gateway.
$ sudo ipvsadm -A -t 10.0.0.185:80
$ sudo ipvsadm -a -t 10.0.0.185:80 -r 172.17.0.3:80 -m
$ sudo ipvsadm -a -t 10.0.0.185:80 -r 172.17.0.4:80 -m
$ sudo ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.0.0.185:80 wlc
-> 172.17.0.3:80 Masq 1 0 1
-> 172.17.0.4:80 Masq 1 0 1
# Works in container
$ docker run -ti busybox wget -qO- 10.0.0.185:80
<!DOCTYPE html>
// truncated
# Works in host
$ curl 10.0.0.185:80
<!DOCTYPE html>
// truncated
To use dr mode, I've created another dummy interface in the pod as well...
# continue above setup;
$ PID=$(docker inspect -f '{{.State.Pid}}' k8s_nginx_nginx-348975970-rtqrz_default_b1661284-2eeb-11e7-924d-8825937fa049_0)
$ sudo mkdir -p /var/run/netns
$ sudo ln -s /proc/$PID/ns/net /var/run/netns/$PID
$ sudo ip link add type dummy
$ sudo ip link set dummy1 netns $PID
$ sudo ip netns exec $PID ip addr add 10.0.0.185 dev dummy1
$ sudo ip netns exec $PID ip link set dummy1 up
# same for the other pod
$ sudo ipvsadm -D -t 10.0.0.185:80
$ sudo ipvsadm -A -t 10.0.0.185:80
$ sudo ipvsadm -a -t 10.0.0.185:80 -r 172.17.0.3:80 -g
$ sudo ipvsadm -a -t 10.0.0.185:80 -r 172.17.0.4:80 -g
$ docker run -ti busybox wget -qO- 10.0.0.185:80
<!DOCTYPE html>
// truncated
// ignored setting arp_ignore/arp_announce
Just a quick and dirty experiment; see below for the arp_ignore/arp_announce settings I skipped. Also, the links from @guybrush are outdated; reposting here |
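For completeness, a hedged sketch of the ARP settings a proper DR-mode setup would normally apply inside each pod; the interface name dummy1 matches the transcript above, while the sysctl values follow the usual LVS-DR guidance and were not part of the experiment:

```sh
# Hedged sketch: stop the pod from answering ARP for the shared VIP in DR mode,
# so only the director's interface responds for 10.0.0.185.
sudo ip netns exec $PID sysctl -w net.ipv4.conf.all.arp_ignore=1
sudo ip netns exec $PID sysctl -w net.ipv4.conf.all.arp_announce=2
sudo ip netns exec $PID sysctl -w net.ipv4.conf.dummy1.arp_ignore=1
sudo ip netns exec $PID sysctl -w net.ipv4.conf.dummy1.arp_announce=2
```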
I assume you need at least 2 hosts, and try to visit a container (on the other host) from the host via the VIP. In my observation, the response won't come back. |
@m1093782566 Yeah, I'd think so, but I haven't had time to look at it yet. I'm playing with a single host since @starsdeep only uses |
nice job! |
Automatic merge from submit-queue (batch tested with PRs 51377, 46580, 50998, 51466, 49749)

Implement IPVS-based in-cluster service load balancing

**What this PR does / why we need it**: Implement IPVS-based in-cluster service load balancing. It can provide some performance enhancement and some other benefits to kube-proxy compared with the iptables and userspace modes. Besides, it also supports more sophisticated load-balancing algorithms than iptables (least conns, weighted, hash, and so on).

**Which issue this PR fixes**: #17470 #44063

**Special notes for your reviewer**:
* Since the PR is a bit large, I split it and moved the commits related to the ipvs util pkg to PR #48994. Hopefully that makes it easier to review.

@thockin @quinton-hoole @kevin-wangzefeng @deepak-vij @haibinxie @dhilipkumars @fisherxu

**Release note**:
```release-note
Implement IPVS-based in-cluster service load balancing
```
/close IPVS is now in alpha form |
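For readers arriving here later: a hedged sketch of how the alpha IPVS mode was typically switched on for kube-proxy around that time. The flag and feature-gate names are to the best of my recollection of the 1.8-era alpha; check kube-proxy --help for your release.

```sh
# Hedged sketch (verify the flags against your kube-proxy version):
# run kube-proxy with the alpha IPVS backend and round-robin scheduling.
kube-proxy \
  --proxy-mode=ipvs \
  --feature-gates=SupportIPVSProxyMode=true \
  --ipvs-scheduler=rr
```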
We should see if we can make ipvs do everything we need - it should perform even better than iptables. A benchmark is in order.
Notes:
- "masq" mode is DNAT, not SNAT; the src IP is preserved.
- We have to assign the VIP to some interface in the root NS. This is a bit ugly in that ports NOT exposed by the VIP get sent to the host (e.g. 22). I think we can fix that by adding another catchall for the VIP. I don't know if there are limits on local IPs.
- Not sure if there is an atomic batch update command, but it does handle batch invocation at least (see the restore sketch after these notes).
- Several scheduling policies, but `rr` seems sufficient, maybe `lc`. `sh` seems to give us client affinity. We can configure timeouts.
- We'll need to do something for node-ports, probably still iptables. I think this (and the other tricks we pull for load-balancers) will be the biggest challenge.
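On the batch-update note above: ipvsadm has a save/restore pair that can apply a whole rule set in a single invocation. A minimal sketch, with made-up VIP and backend addresses:

```sh
# Hedged sketch: apply a full IPVS configuration in one ipvsadm call.
# ipvsadm -R (--restore) reads rules in ipvsadm-save format from stdin;
# the VIP and backends below are illustrative values only.
cat <<'EOF' | sudo ipvsadm -R
-A -t 10.0.0.185:80 -s rr
-a -t 10.0.0.185:80 -r 172.17.0.3:80 -m -w 1
-a -t 10.0.0.185:80 -r 172.17.0.4:80 -m -w 1
EOF

# Dump the current table back out in the same format:
sudo ipvsadm -S -n
```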
@BenTheElder busy? :)