
kube-proxy loadbalancing: Use local pods only #7433

Closed
evpapp opened this issue Apr 28, 2015 · 47 comments

Labels
priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
triage/unresolved: Indicates an issue that can not or will not be resolved.

Comments

@evpapp

evpapp commented Apr 28, 2015

As far as I can see, kube-proxy uses round robin to schedule requests across the pods within the controller.

How about an option to use only pods that are running on the same minion, if available? That would save a lot of unnecessary network traffic.

@a-robinson
Contributor

It's worth doing some thinking about the potential failure cases, but if it checks out we'll happily accept PRs :)

@a-robinson a-robinson added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. team/cluster labels Apr 28, 2015
@a-robinson a-robinson modified the milestone: v1.0-post Apr 28, 2015
@thockin
Member

thockin commented Apr 28, 2015

This is actually something I want, though I described it a bit differently.

If pods that back this service exist on this node, always route to them (round robin if > 1). Else route to a pod on another node.

That said, this is going to be tricky to test and prove correct, and we're looking at a bunch of other changes in kube-proxy right now, so it might be best to hold off on this. If you want to proof-of-concept it and see how hard it is, I'm happy to take a look. :)
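
A minimal sketch of that preference in Go (illustrative only; `Endpoint` and `pickEndpoint` are hypothetical names, and the random pick stands in for round robin):

```go
package localpreference

import "math/rand"

// Endpoint is a hypothetical representation of a service backend.
type Endpoint struct {
	IP    string
	Local bool // true if the backing pod runs on this node
}

// pickEndpoint prefers node-local backends and only falls back to the
// full endpoint list when no local backend exists.
func pickEndpoint(endpoints []Endpoint) (Endpoint, bool) {
	var local []Endpoint
	for _, ep := range endpoints {
		if ep.Local {
			local = append(local, ep)
		}
	}
	candidates := endpoints
	if len(local) > 0 {
		candidates = local
	}
	if len(candidates) == 0 {
		return Endpoint{}, false
	}
	// Stand-in for round robin: pick uniformly at random.
	return candidates[rand.Intn(len(candidates))], true
}
```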

@justinsb


@jaygorrell

This could be slightly more relevant now with Ingress and such. Especially if the ingress controller is running in a DaemonSet and traffic was already round-robined before reaching kube-proxy, it would be nice to just use the pods on that node.

@simonswine
Contributor

I think this is a pretty important feature, especially if you have a lot of hops and larger replica counts for your pods.

I'd like to look into creating a PR for this. I am not 100% sure if this is the right approach, but I am planning to:

  • Add flags to kube-proxy
    • --local-pod-cidr=X
    • --local-preferred=X
  • If enabled (--local-preferred=true), determine the local pod CIDR (either through the flag or NodeSpec.podCIDR)
  • If there is at least one local endpoint, filter the list of endpoints to only contain the local addresses. (needs to be done for the userspace and iptables proxies)

Btw: I just figured out that 1.4's EndpointAddress contains a nodeName. Maybe that is the better way to determine local pods (as it should work for hostNetwork=true as well).
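
A rough sketch of that NodeName-based filtering, written against today's k8s.io/api/core/v1 types rather than the actual kube-proxy code (the helper name and fallback behaviour are assumptions):

```go
package localpreference

import v1 "k8s.io/api/core/v1"

// localAddresses returns the endpoint addresses whose NodeName matches
// this node. If none match, it falls back to all addresses so traffic
// can still leave the node when there are no local backends.
func localAddresses(subset v1.EndpointSubset, nodeName string) []v1.EndpointAddress {
	var local []v1.EndpointAddress
	for _, addr := range subset.Addresses {
		// NodeName is a *string and may be unset on older clusters.
		if addr.NodeName != nil && *addr.NodeName == nodeName {
			local = append(local, addr)
		}
	}
	if len(local) == 0 {
		return subset.Addresses
	}
	return local
}
```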

@thockin
Member

thockin commented Sep 8, 2016

This was merged for 1.4, but ONLY for external LB VIP traffic (i.e. not for NodePorts and not for internal service IPs). This will give us a chance to get some miles on the idea, though. You already found some of the implementation.

@girishkalele

@girishkalele

@simonswine NodeName was added to EndpointAddress in v1.4 precisely to allow kube-proxy to filter local endpoints for various future purposes. In 1.4 it only uses this data to create a new KUBE-XLB- iptables chain for specially annotated services, and only if the alpha feature gate is enabled.

BTW, I would prefer not having a kube-proxy command line flag (which would make all services behave this way), but rather making it an annotation on the service, letting all service flavors work.

@bluecmd

bluecmd commented Nov 21, 2016

@thockin If I want this for NodePort and ClusterIPs today, am I looking at complex surgery in the internals of kube-proxy or a flag flip to enable it?

Background:

Our setup right now is that we have all pod IPs routed and accessible from everywhere within our network (i.e. far outside of Kubernetes' area of control) via BGP. We're also exporting Cluster IPs ("A.B.C.D via pod-abcef") from the Service + Endpoint objects. This works well to route the packets to a node that is able to serve the traffic -- but AFAICT the probabilistic load balancing in place today will most likely route the packet away from the node, even though it has already been routed to a perfectly good node (and balancing is done away from K8s).

@thockin
Member

thockin commented Nov 21, 2016

@bprashanth added NodePort support for v1.5. ClusterIP is more complicated. See, with NodePort and LBVIP, we can assume (or do assume) that some frontend already made a choice of which node to target, so staying node-local is legit. With ClusterIP, traffic is mostly assumed to be originating locally, so there WILL be clients on nodes that don't have backends for the services they are requesting, so we MUST sometimes leave the node. But for local clients it is not such a big deal - that is the first hop and needs no additional SNAT.

Finally, we have this other case where traffic for a clusterIP is routed to a node acting as a gateway. In that case we do SNAT. It's unclear whether it is a safe assumption that there was an initial routing decision or not. If you are BGP advertising specific services as /32 and routing them to nodes with backends, "only local" would be OK. If you're routing a blanket /16 to semi-random nodes, it is not.


@bluecmd

bluecmd commented Nov 21, 2016

@thockin I see - it makes sense. We're running the latest and greatest 1.5 so I guess we could use NodePort. It seems a bit less "clean" as there is another resource (the node port) mixed into the game, but for now that will work for us.

FWIW, for Cluster IPs we're doing /32 announces.

@thockin
Member

thockin commented Nov 21, 2016

I am open to proposals of how to design this wrt clusterIPs, for cases such as yours.


@bluecmd

bluecmd commented Nov 22, 2016

I've been giving this some thought, and for us we would like:

Function ForwardTowardsClusterIP():
  If traffic source is non-local:
    If count of local endpoints > 0:
      Select RoundRobin(local endpoints)
  Select RoundRobin(all endpoints)

I'll write up a bigger document with our setup and the different cases (node going down, pod becoming unhealthy, etc.) later, but I hope the above is enough to get the basic discussion started.

@bluecmd

bluecmd commented Nov 24, 2016

@thockin The more I think about this the more I realize that maybe it would be a good idea to offer something like CNI but for internal traffic flows as well. We had even more discussions about this and we have a bunch of ideas we want to try out - like using ECMP on the node itself through normal routing - but as far as I can tell there is no pluggable way of totally ripping out the forwarding logic.

@thockin
Member

thockin commented Nov 24, 2016 via email

@bluecmd

bluecmd commented Nov 25, 2016

Maybe I'm misunderstanding how kube-proxy works, but AFAICT it is in the packet forwarding path. We would like to experiment with doing things like IPIP / GRE forwarding and real L3 routing which requires quite some intelligence and integration with both the pods running the services and the nodes themselves.

@bprashanth
Contributor

bprashanth commented Nov 25, 2016 via email

@thockin
Member

thockin commented Nov 25, 2016 via email

@bluecmd

bluecmd commented Nov 25, 2016

Interesting, I'll give that a try. Thanks!

@hubt

hubt commented Dec 1, 2016

Another thing to consider in this area is that it may make sense to generalize a solution so that kube-proxy supports other routing preferences. We are talking about node-local, but I could easily see AZ-local and cluster-local (in federated clusters). More dynamically, someone could even build support in kube-proxy for latency-based service routing, which might take care of all of these cases at once. Likely that's more difficult, but it'd be good to at least consider putting it on the roadmap.

@bluecmd

bluecmd commented Dec 20, 2016

I see that this bug is marked as "awaiting more evidence" and "team/cluster (deprecated)". Is there a way to breathe more life into this bug? While I think I can work around it for my use case by implementing GRE encapsulation, it would be nice to have this feature to offer very simple load-balanced IP-level ingress.

@thockin
Member

thockin commented Dec 27, 2016

@bluecmd This is open for proposals. It's not something I or my team is working on at the moment. I feel like we have laid out some options for exploration, and some of the likely hazards - enough to get a motivated person or team to investigate, I hope.

@bluecmd

bluecmd commented Dec 27, 2016

@thockin Fair enough. I did some thinking about this after our proof-of-concept and I agree the corner cases are not easily solved. FWIW, the approach I'm going forward with is tunnels using https://github.com/bluecmd/kube-service-tunnel

@joshk0

joshk0 commented Jan 27, 2017

This was merged for 1.4, but ONLY for external LB VIP traffic (i.e. not for NodePorts and not for internal service IPs). This will give us a chance to get some miles on the idea, though. You already found some of the implementation.

@thockin, could you please clarify? Does this mean if I have an Ingress (GCE L7) pointing to a Service, the Ingress is configured to directly hit the pods in that Service without incurring kube-proxy overhead and round-robin rerouting?

@joshk0

joshk0 commented Jan 27, 2017

or, sorry: "the Ingress is configured to directly hit nodes, which have Pods that are members of the Service, without incurring kube-proxy overhead and round-robin routing?"

@thockin
Member

thockin commented Jan 28, 2017 via email

@0xmichalis 0xmichalis added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed team/cluster (deprecated - do not use) labels Mar 20, 2017
@m1093782566
Contributor

m1093782566 commented Dec 30, 2017

There is a proposal trying to resolve this issue, see kubernetes/community#1551

We need more use cases to refine the API, so please feel free to add your comments there :) Thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2018
@m1093782566
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2018
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2018
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2018
@steebchen

We are currently building a traditional CDN, using anycast to route our users automatically to the nearest node (we have PoPs all around the world). However, with the current services, Kubernetes just load balances across all pods, which basically makes them unusable for our use case, since a given node should serve content itself and not proxy the request to a node on the other side of the world. That's probably a special use case which most users don't have, but since we're using anycast this problem is really driving us crazy.

@steven-sheehy

@steebchen You can try swapping kube-proxy with kube-router as it supports other load-balancing algorithms besides round-robin.

@steebchen

steebchen commented Oct 30, 2018

I checked that, but I'm not sure it is the right approach. First of all, this requires us to switch to kube-router (we use kube-proxy), and if I understood correctly we would have to get rid of Weave Net, which we need for the encryption between nodes. Additionally, the only algorithm which is probably correct is "Locality-Based Least-Connection Scheduling", but I'm not so sure even about that:

If the server is overloaded (its active connection numbers is larger than its weight) and there is a server in its half load, then allocate the weighted least-connection server to this IP address.

Correct me if I'm wrong, but this does not seem to always prefer the local server. Plus, I already handle overloaded nodes by stopping the BGP announcements for their IPs.

On top of that, this algorithm is not even listed in the kube-router LB options (only in the IPVS docs).

Maybe I will just use hostPort on my pods directly, even if I don't really like that solution (I probably can't restart the pods without a short downtime).

Since we use anycast with a bare-metal cluster, I think I'll check out https://github.com/google/metallb and see whether we could use MetalLB load balancers.

@m1093782566
Contributor

m1093782566 commented Oct 31, 2018

@steven-sheehy @steebchen

You can also look into kube-proxy's IPVS mode (https://github.com/kubernetes/kubernetes/tree/master/pkg/proxy/ipvs), which supports all of the IPVS algorithms (kube-router does the same thing) without getting rid of Weave Net or any other network plugin.

@m1093782566
Contributor

For node-local service routing and other topology-aware service routing, there is a proposal in progress; see kubernetes/community#2846

@steebchen

@m1093782566 The problem with the IPVS mode in kube-proxy is that you can only enable it globally. I would need to enable it per service using annotations, but that's not supported yet. On top of that, I'm not sure any of the load-balancing algorithms are suitable, because I want to route traffic to local pods all (100%) of the time. First of all, it seems not all IPVS algorithms are supported (at least I couldn't find the options for them), and even the algorithm that I think could work is probably not suited for my use case:

If the server is overloaded (its active connection numbers is larger than its weight) and there is a server in its half load, then allocate the weighted least-connection server to this IP address.

I basically want to always route to local pods. I will probably have to wait until services have an option to only route to local pods, or use google/metallb which I am currently checking out.

@m1093782566
Contributor

m1093782566 commented Nov 1, 2018

@steebchen

First of all, it seems not all IPVS algorithms are supported (at least I couldn't find the options for them), and even the algorithm I think could work is probably not suited for my use case:

Please check https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-proxy/app/server.go#L174; all of its valid values can be found here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/apis/config/types.go#L172-L207

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2019
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2019
@mcginne

mcginne commented Feb 8, 2019

Hi guys, I notice this issue is pretty old, but I saw this comment -> #7433 (comment) which implies that in v1.5 NodePorts should prefer local pods. I'm running kube 1.12 and seeing my NodePort services not preferring local pods. Is there a flag required, or has something changed between 1.5 and 1.12? I'm using iptables.

I've also noted this enhancement -> https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0033-service-topology.md so I just want to confirm the expected behaviour on kube 1.12.

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@thockin
Member

thockin commented May 9, 2019

This is basically the same as #28610 so I am closing this one.

@mcginne look at the externalTrafficPolicy field.
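
For reference, a hedged sketch of what that looks like using the k8s.io/api/core/v1 Go types (the constant below corresponds to `externalTrafficPolicy: Local` on a NodePort or LoadBalancer Service; the service name and selector are made up):

```go
package localpreference

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeLocalService builds a NodePort Service that keeps external traffic
// on the node it arrives at instead of re-balancing across the cluster.
func nodeLocalService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-app"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app": "my-app"},
			Ports:    []corev1.ServicePort{{Port: 80}},
			// Only route to endpoints on the receiving node; traffic arriving
			// on a node with no local endpoints is not forwarded elsewhere.
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
		},
	}
}
```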
