Implement IPVS-based in-cluster service load balancing #44063
Comments
Even though Kubernetes 1.6 supports 5000 nodes, kube-proxy with iptables is actually a bottleneck to scaling the cluster to 5000 nodes. One example: with a NodePort service in a 5000-node cluster, if I have 2000 services and each service has 10 pods, this will cause 20000 iptables records on each worker node, and this can make the kernel pretty busy. Using IPVS-based in-cluster service load balancing can help a lot in such cases. |
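A rough illustration of the arithmetic above and why the rule count matters (a sketch only, not code from this thread): iptables evaluates its rule list linearly per packet and rewrites the table on every change, while IPVS keys virtual services in an in-kernel hash table.

```go
package main

import "fmt"

func main() {
	// Illustrative numbers from the comment above: 2,000 Services with 10 Pods each.
	services, podsPerService := 2000, 10
	endpoints := services * podsPerService
	fmt.Println("endpoint rules per node:", endpoints) // 20000

	// iptables evaluates its rule list linearly per packet and reloads the
	// whole table on churn, so cost grows with the rule count computed above.
	// IPVS keys virtual services in an in-kernel hash table, so per-packet
	// lookup cost stays roughly constant as services and endpoints grow.
}
```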
@quinton-hoole is your implementation using IPVS in nat mode or Direct routing mode? |
@ravilr it's NAT mode. |
gr8 work, the iptables issues have been a problem for a while. re: flow-based scheduling, happy to help get the Firmament scheduler in place. ;-) |
@timothysc I paid attention to Firmament for a period of time, but didn't quite get its value add to Kubernetes. Would you mind explaining what problem flow-based scheduling can solve in the current Kubernetes scheduler? |
@resouer speed at scale and rescheduling. From @quinton-hoole's talk, linked above, it looks like Huawei has been prototyping this. |
@resouer @timothysc Yes, I can confirm that we're working on a Firmament scheduler, and will upstream it as soon as it's in good enough shape. We might have an initial implementation in the next few weeks. |
Hi folks, we are currently working on implementing the Firmament scheduler as part of the Kubernetes scheduling environment. We will create a separate issue to track progress and provide updates. Thanks. |
@quinton-hoole thanks for sharing. Waiting to see the design proposal. In terms of health checking, does every worker do health checks across all pods to keep the table up to date? How are you planning to handle this at scale? |
cc @haibinxie |
To be clear, @haibinxie did all the hard work here. Please direct questions to him. |
IPVS only deals with IP, not transport protocols, right? A k8s service can include a port transformation. A Service object has a potential distinction between |
@MikeSpreitzer Port transformation is well supported. |
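For readers unfamiliar with the port transformation being discussed, a minimal sketch using the k8s.io/api types (port numbers are illustrative): a Service exposes `port` on its cluster IP, and the proxy remaps traffic to each endpoint's `targetPort`.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// svc sketches the remapping under discussion: clients reach the cluster IP
// on port 80 and the proxy forwards to each backend pod's port 8080.
var svc = v1.Service{
	Spec: v1.ServiceSpec{
		Ports: []v1.ServicePort{{
			Protocol:   v1.ProtocolTCP,
			Port:       80,                   // virtual (cluster IP) port
			TargetPort: intstr.FromInt(8080), // pod port the proxy maps to
		}},
	},
}

func main() {
	p := svc.Spec.Ports[0]
	fmt.Printf("service port %d -> target port %d\n", p.Port, p.TargetPort.IntValue())
}
```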
Note that even with IPVS, NodePorts, firewalls, and so on still have to be handled, and the obvious place to handle much of it is iptables. |
Port mapping is handled in NAT mode (called masquerade by IPVS, sadly). As an optimization, a future followup could enable direct-return mode for environments that support it for services that do not do remapping. We'd have to add service IPs as local addresses in pods, which we may want to do anyway. |
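To make the NAT/masquerade terminology concrete, here is a hedged sketch of what programming one such virtual service could look like, shelling out to `ipvsadm` and `ip` rather than using any particular Go netlink package. The VIP, backend addresses, and the `kube-ipvs0` dummy-device name are all illustrative, not the actual kube-proxy implementation.

```go
package main

import (
	"log"
	"os/exec"
)

// run is a tiny helper for this sketch; the real kube-proxy work discussed
// here programs IPVS via netlink rather than shelling out.
func run(args ...string) {
	if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
		log.Fatalf("%v failed: %v (%s)", args, err, out)
	}
}

func main() {
	// Hypothetical service VIP 10.96.0.10:80 with two pod backends on 8080.
	// Bind the VIP to a local dummy device so the kernel accepts traffic for it.
	run("ip", "link", "add", "kube-ipvs0", "type", "dummy")
	run("ip", "addr", "add", "10.96.0.10/32", "dev", "kube-ipvs0")

	// Create the virtual service (round robin) and add masquerade (-m, i.e. NAT)
	// destinations; NAT is what allows the 80 -> 8080 port remapping.
	run("ipvsadm", "-A", "-t", "10.96.0.10:80", "-s", "rr")
	run("ipvsadm", "-a", "-t", "10.96.0.10:80", "-r", "10.244.1.5:8080", "-m")
	run("ipvsadm", "-a", "-t", "10.96.0.10:80", "-r", "10.244.2.7:8080", "-m")
}
```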
Last comment for the record here, though I have said it elsewhere. I am very much in favor of an IPVS implementation. We have somewhat more than JUST load-balancing in our iptables (session affinity, firewalls, hairpin-masquerade tricks), but I believe those can all be overcome. We also have been asked, several times, to add support for port ranges to Services, up to and including a whole IP. The obvious way to add this would also support remapping, though it is not at all clear how NodePorts would work. IPVS, as far as I know, has no facility for exposing ranges of ports. |
In IPVS mode, we have to add all the service addresses to a host device like lo or ethX, am I right? |
Hi all, I put together a proposal for the alpha version of the IPVS implementation, hoping to get it into Kubernetes 1.7, and need your feedback. https://docs.google.com/document/d/1YEBWR4EWeCEWwxufXzRM0e82l_lYYzIXQiSayGaVQ8M/edit?usp=sharing @kubernetes/sig-network-feature-requests |
FYI |
Can kube-router help with this? |
Have we seen code for this yet? At this point, there's simply no way it is making v1.7, but I'd love to queue up something here for v1.8. |
Code is done and going through final e2e testing. The plan is to get it into v1.8. |
But nobody has seen it yet? Publish early and often. You *know* there's going to be a lot of feedback - get that code out now now now. Don't sit on it. Don't polish it. PUBLISH. Or risk that it doesn't get merged because of the depth of feedback, or that someone else implements it while you're busy polishing. We have a fairly robust e2e suite. You should be able to just boot with this enabled, and if it passes e2e, that's a strong signal. |
Initial PR #46580 is already sent out. PTAL. |
@thockin originally we were relying on the seesaw library and had a plan to update it to a pure Go implementation as phase 2 (probably in 1.8). Because of the complexities introduced by libnl.so dependencies, last week we decided to move away from |
Automatic merge from submit-queue (batch tested with PRs 51377, 46580, 50998, 51466, 49749)

Implement IPVS-based in-cluster service load balancing

**What this PR does / why we need it**: Implement IPVS-based in-cluster service load balancing. It provides a performance improvement and some other benefits to kube-proxy compared with the iptables and userspace modes. It also supports more sophisticated load balancing algorithms than iptables (least connections, weighted, hash, and so on).

**Which issue this PR fixes**: #17470 #44063

**Special notes for your reviewer**:
* Since the PR is a bit large, I split it and moved the commits related to the ipvs util pkg to PR #48994. Hopefully that makes it easier to review.

@thockin @quinton-hoole @kevin-wangzefeng @deepak-vij @haibinxie @dhilipkumars @fisherxu

**Release note**:
```release-note
Implement IPVS-based in-cluster service load balancing
```
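As background on the load-balancing algorithms the PR mentions, these are common IPVS scheduler short names (as used by `ipvsadm -s`). Exactly which of them kube-proxy exposes, and how, is governed by its own flags and config, so treat this as a reference list rather than a kube-proxy API.

```go
package main

import "fmt"

// ipvsSchedulers maps common IPVS scheduler short names to descriptions.
var ipvsSchedulers = map[string]string{
	"rr":  "round robin",
	"wrr": "weighted round robin",
	"lc":  "least connections",
	"wlc": "weighted least connections",
	"sh":  "source hashing",
	"dh":  "destination hashing",
}

func main() {
	for name, desc := range ipvsSchedulers {
		fmt.Printf("%s\t%s\n", name, desc)
	}
}
```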
/area ipvs |
IPVS-based kube-proxy is in beta phase now. |
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
With this issue closed as stale, is there a better issue to follow progress of adding ipvs scheduling algorithms to individual kubernetes services? I couldn't find another issue that explicitly covers this part of the ipvs roadmap. |
It's mentioned here as a future possibility, but like you, I haven't been able to find an issue where this is being actively discussed/worked on. Does anyone know whether there's ongoing work to make service-specific load-balancing algorithms possible? |
At KubeCon Europe in Berlin last week I presented some work we've done at Huawei scaling Kubernetes in-cluster load balancing to 50,000+ services and beyond, the challenges associated with doing this using the current iptables approach, and what we've achieved using an alternative IPVS-based approach. iptables is designed for firewalling, and based on in-kernel rule lists, while IPVS is designed for load balancing and based on in-kernel hash tables. IPVS also supports more sophisticated load balancing algorithms than iptables (least load, least conns, locality, weighted) as well as other useful features (e.g. health checking, retries etc).
After the presentation, there was strong support (a.k.a. a riot :-) ) for us to open source this work, which we are happy to do. We can use this issue to track that.
For those who were not able to be there, here is the video:
https://youtu.be/c7d_kD2eH4w
And the slides:
https://docs.google.com/presentation/d/1BaIAywY2qqeHtyGZtlyAp89JIZs59MZLKcFLxKE6LyM/edit?usp=sharing
We will follow up on this with a more formal design proposal and a set of PRs, but in summary we added about 680 lines of code to the existing 12,000 lines of kube-proxy (~5%), and added a third mode to its command-line flag (mode=IPVS, alongside the existing mode=userspace and mode=iptables).
The performance improvement for load balancer updates is dramatic (update latency reduced from hours per rule to 2 ms per rule). Network latency and variability are also reduced dramatically for large numbers of services.
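To make the "third mode flag" concrete, here is a simplified sketch (names and wiring invented for illustration, not the actual kube-proxy code) of how a proxy-mode switch might branch:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Simplified stand-in for kube-proxy's proxy-mode handling; the real flag
	// plumbing and proxier constructors live in the kube-proxy command.
	mode := flag.String("proxy-mode", "iptables", "userspace | iptables | ipvs")
	flag.Parse()

	switch *mode {
	case "userspace":
		fmt.Println("using userspace proxier")
	case "iptables":
		fmt.Println("using iptables proxier")
	case "ipvs":
		fmt.Println("using IPVS proxier") // the new mode this issue proposes
	default:
		fmt.Printf("unknown proxy mode %q\n", *mode)
	}
}
```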
@kubernetes/sig-network-feature-requests
@kubernetes/sig-scalability-feature-requests
@thockin
@wojtek-t