
kube-proxy loadbalancing: Use local pods only #7433

Closed
evpapp opened this issue Apr 28, 2015 · 47 comments

Labels
priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
sig/network: Categorizes an issue or PR as relevant to SIG Network.
triage/unresolved: Indicates an issue that can not or will not be resolved.

Comments

@evpapp

evpapp commented Apr 28, 2015

As far as I can see, kube-proxy uses round robin to schedule requests across the pods within the controller.

How about an option to use only pods that are running on the same minion, if available? That would save a lot of unnecessary network traffic.

@a-robinson
Contributor

It's worth doing some thinking about the potential failure cases, but if it checks out we'll happily accept PRs :)

@a-robinson a-robinson added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. team/cluster labels Apr 28, 2015
@a-robinson a-robinson modified the milestone: v1.0-post Apr 28, 2015
@thockin
Member

thockin commented Apr 28, 2015

This is actually something I want, though I described it a bit differently.

If pods that back this service exist on this node, always route to them (round robin if > 1). Else route to a pod on another node.

That said, this is going to be tricky to test and prove correct, and we're looking at a bunch of other changes in kube-proxy right now, so it might be best to hold off on this. If you want to proof-of-concept it and see how hard it is, I'm happy to take a look. :)
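
A minimal sketch of that preference in Go (illustrative only; `Endpoint` and `pickEndpoint` are hypothetical names, and the random pick stands in for round robin):

```go
package localpreference

import "math/rand"

// Endpoint is a hypothetical representation of a service backend.
type Endpoint struct {
	IP    string
	Local bool // true if the backing pod runs on this node
}

// pickEndpoint prefers node-local backends and only falls back to the
// full endpoint list when no local backend exists.
func pickEndpoint(endpoints []Endpoint) (Endpoint, bool) {
	var local []Endpoint
	for _, ep := range endpoints {
		if ep.Local {
			local = append(local, ep)
		}
	}
	candidates := endpoints
	if len(local) > 0 {
		candidates = local
	}
	if len(candidates) == 0 {
		return Endpoint{}, false
	}
	// Stand-in for round robin: pick uniformly at random.
	return candidates[rand.Intn(len(candidates))], true
}
```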

@justinsb


@jaygorrell

This could be slightly more relevant now with Ingress and such. Especially if the ingress controller is running in a DaemonSet and traffic was already round-robined before reaching kube-proxy, it would be nice to just use the pods on that node.

@simonswine
Contributor

I think this is a pretty important feature, especially if you have a lot of hops and larger replica counts for your pods.

I'd like to look into creating a PR for this. I am not 100% sure if this is the right approach, but I am planning to:

  • Add flags to kube-proxy
    • --local-pod-cidr=X
    • --local-preferred=X
  • If enabled (--local-preferred=true), determine the local pod CIDR (either through the flag or NodeSpec.podCIDR)
  • If there is at least one local endpoint, filter the list of endpoints to only contain the local addresses. (needs to be done for the userspace and iptables proxies)

Btw: I just figured out that 1.4's EndpointAddress contains a nodeName. Maybe that is the better way to determine local pods (as it should work for hostNetwork=true as well).
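
A rough sketch of that NodeName-based filtering, written against today's k8s.io/api/core/v1 types rather than the actual kube-proxy code (the helper name and fallback behaviour are assumptions):

```go
package localpreference

import v1 "k8s.io/api/core/v1"

// localAddresses returns the endpoint addresses whose NodeName matches
// this node. If none match, it falls back to all addresses so traffic
// can still leave the node when there are no local backends.
func localAddresses(subset v1.EndpointSubset, nodeName string) []v1.EndpointAddress {
	var local []v1.EndpointAddress
	for _, addr := range subset.Addresses {
		// NodeName is a *string and may be unset on older clusters.
		if addr.NodeName != nil && *addr.NodeName == nodeName {
			local = append(local, addr)
		}
	}
	if len(local) == 0 {
		return subset.Addresses
	}
	return local
}
```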

@thockin
Member

thockin commented Sep 8, 2016

This was merged for 1.4, but ONLY for external LB VIP traffic (i.e. not for NodePorts and not for internal service IPs). This will give us a chance to get some miles on the idea, though. You already found some of the implementation.

@girishkalele

@girishkalele

@simonswine NodeName was added to EndpointAddress in v1.4 precisely to allow kube-proxy to filter local endpoints for various future purposes. In 1.4 it only uses this data to create a new KUBE-XLB- iptables chain for specially annotated services, and only if the alpha feature gate is enabled.

BTW, I would prefer not having a kube-proxy command line flag (which would make all services behave this way), but rather making it an annotation on the service, letting all service flavors work.

@bluecmd

bluecmd commented Nov 21, 2016

@thockin If I want this for NodePort and ClusterIPs today, am I looking at complex surgery in the internals of kube-proxy or a flag flip to enable it?

Background:

Our setup right now is that we have all pod IPs routed and accessible from everywhere within our network (i.e. far outside of Kubernetes' area of control) via BGP. We're also exporting Cluster IPs ("A.B.C.D via pod-abcef") from the Service + Endpoint objects. This works well to route the packets to a node that is able to serve the traffic -- but AFAICT the probabilistic load balancing in place today will most likely route the packet away from the node, even though it has already been routed to a perfectly good node (and balancing is done away from K8s).

@thockin
Member

thockin commented Nov 21, 2016

@bprashanth added NodePort support for v1.5. ClusterIP is more complicated. See, with NodePort and LBVIP, we can assume (or do assume) that some frontend already made a choice of which node to target, so staying node-local is legit. With ClusterIP, traffic is mostly assumed to be originating locally, so there WILL be clients on nodes that don't have backends for the services they are requesting, so we MUST sometimes leave the node. But for local clients it is not such a big deal - that is the first hop and needs no additional SNAT.

Finally, we have this other case where traffic for a clusterIP is routed to a node acting as a gateway. In that case we do SNAT. It's unclear whether it is a safe assumption that there was an initial routing decision or not. If you are BGP advertising specific services as /32 and routing them to nodes with backends, "only local" would be OK. If you're routing a blanket /16 to semi-random nodes, it is not.


@bluecmd

bluecmd commented Nov 21, 2016

@thockin I see - it makes sense. We're running the latest and greatest 1.5 so I guess we could use NodePort. It seems a bit less "clean" as there is another resource (the node port) mixed into the game, but for now that will work for us.

FWIW, for Cluster IPs we're doing /32 announces.

@thockin
Member

thockin commented Nov 21, 2016

I am open to proposals of how to design this wrt clusterIPs, for cases such as yours.


@bluecmd

bluecmd commented Nov 22, 2016

I've been giving this some thought, and for us we would like:

Function ForwardTowardsClusterIP():
  If traffic source is non-local:
    If count of local endpoints > 0:
      Select RoundRobin(local endpoints)
  Select RoundRobin(all endpoints)

I'll write up a bigger document with our setup and the different cases (node going down, pod becoming unhealthy, etc.) later, but I hope the above is enough to get the basic discussion started.

@bluecmd

bluecmd commented Nov 24, 2016

@thockin The more I think about this the more I realize that maybe it would be a good idea to offer something like CNI but for internal traffic flows as well. We had even more discussions about this and we have a bunch of ideas we want to try out - like using ECMP on the node itself through normal routing - but as far as I can tell there is no pluggable way of totally ripping out the forwarding logic.

@thockin
Member

thockin commented Nov 24, 2016 via email

@bluecmd

bluecmd commented Nov 25, 2016

Maybe I'm misunderstanding how kube-proxy works, but AFAICT it is in the packet forwarding path. We would like to experiment with doing things like IPIP / GRE forwarding and real L3 routing which requires quite some intelligence and integration with both the pods running the services and the nodes themselves.

@bprashanth
Contributor

bprashanth commented Nov 25, 2016 via email

@thockin
Member

thockin commented Nov 25, 2016 via email

@bluecmd

bluecmd commented Nov 25, 2016

Interesting, I'll give that a try. Thanks!

@hubt

hubt commented Dec 1, 2016

Another thing to consider in this area is that it may make sense to generalize a solution so that kube-proxy supports other routing preferences. We are talking about node-local, but I could easily see AZ-local and cluster-local (in federated clusters). More dynamically, someone could even build support in kube-proxy for latency-based service routing, which might take care of all of these cases at once. Likely that's more difficult, but it'd be good to at least consider putting it on the roadmap.

@bluecmd

bluecmd commented Dec 20, 2016

I see that this bug is marked as "awaiting more evidence" and "team/cluster (deprecated)". Is there a way to breathe more life into this bug? While I think I can work around it for my use case by implementing GRE encapsulation, it would be nice to have this feature to offer very simple load-balanced IP-level ingress.

@thockin
Member

thockin commented Dec 27, 2016

@bluecmd This is open for proposals. It's not something I or my team is working on at the moment. I feel like we have laid out some options for exploration, and some of the likely hazards - enough to get a motivated person or team to investigate, I hope.

@bluecmd

bluecmd commented Dec 27, 2016

@thockin Fair enough. I did some thinking about this after our proof-of-concept and I agree the corner cases are not easily solved. FWIW, the approach I'm going forward with is tunnels using https://github.com/bluecmd/kube-service-tunnel

@joshk0

joshk0 commented Jan 27, 2017

This was merged for 1.4, but ONLY for external LB VIP traffic (i.e. not for NodePorts and not for internal service IPs). This will give us a chance to get some miles on the idea, though. You already found some of the implementation.

@thockin, could you please clarify? Does this mean if I have an Ingress (GCE L7) pointing to a Service, the Ingress is configured to directly hit the pods in that Service without incurring kube-proxy overhead and round-robin rerouting?

@joshk0

joshk0 commented Jan 27, 2017

or, sorry: "the Ingress is configured to directly hit nodes, which have Pods that are members of the Service, without incurring kube-proxy overhead and round-robin routing?"

@thockin
Member

thockin commented Jan 28, 2017 via email

@0xmichalis 0xmichalis added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed team/cluster (deprecated - do not use) labels Mar 20, 2017
@m1093782566
Contributor

m1093782566 commented Dec 30, 2017

There is a proposal trying to resolve this issue, see kubernetes/community#1551

We need more use cases to refine the API, so please feel free to add your comments there :) Thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2018
@m1093782566
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2018
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2018
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2018
@steebchen

We are currently building a traditional CDN, using anycast to route our users automatically to the nearest node (we have PoPs all around the world). However, with the current services, Kubernetes just load balances across all pods, which basically makes them unusable for our use case, since a given node should serve content itself and not proxy the request to a node on the other side of the world. That's probably a special use case which most users don't have, but since we're using anycast this problem is really driving us crazy.

@steven-sheehy

@steebchen You can try swapping kube-proxy with kube-router as it supports other load-balancing algorithms besides round-robin.

@steebchen

steebchen commented Oct 30, 2018

I checked that, but I'm not sure it is the right approach. First of all, this requires us to switch to kube-router (we use kube-proxy), and if I understood correctly we would have to get rid of Weave Net, which we need for the encryption between nodes. Additionally, the only algorithm which is probably correct is "Locality-Based Least-Connection Scheduling", but I'm not so sure even about that:

If the server is overloaded (its active connection numbers is larger than its weight) and there is a server in its half load, then allocate the weighted least-connection server to this IP address.

Correct me if I'm wrong, but this does not seem to always prefer the local server. Plus, I already handle overloaded nodes by stopping the BGP announcements for their IPs.

On top of that, this algorithm is not even listed in the kube-router LB options (only in the IPVS docs).

Maybe I will just use hostPort on my pods directly, even if I don't really like that solution (I probably can't restart the pods without a short downtime).

Since we use anycast with a bare-metal cluster, I think I'll check out https://github.com/google/metallb and see whether we could use MetalLB load balancers.

@m1093782566
Contributor

m1093782566 commented Oct 31, 2018

@steven-sheehy @steebchen

You can also look into kube-proxy's IPVS mode (https://github.com/kubernetes/kubernetes/tree/master/pkg/proxy/ipvs), which supports all of the IPVS algorithms (kube-router does the same thing) without getting rid of Weave Net or any other network plugin.

@m1093782566
Contributor

For node-local service routing and other topology-aware service routing, there is a proposal in progress; see kubernetes/community#2846

@steebchen

@m1093782566 The problem with the IPVS mode in kube-proxy is that you can only enable it globally. I would need to enable it per service using annotations, but that's not supported yet. On top of that, I'm not sure any of the load-balancing algorithms are suitable, because I want to route traffic to local pods all (100%) of the time. First of all, it seems not all IPVS algorithms are supported (at least I couldn't find the options for them), and even the algorithm that I think could work is probably not suited for my use case:

If the server is overloaded (its active connection numbers is larger than its weight) and there is a server in its half load, then allocate the weighted least-connection server to this IP address.

I basically want to always route to local pods. I will probably have to wait until services have an option to only route to local pods, or use google/metallb which I am currently checking out.

@m1093782566
Contributor

m1093782566 commented Nov 1, 2018

@steebchen

First of all, it seems not all IPVS algorithms are supported (at least I couldn't find the options for them), and even the algorithm I think could work is probably not suited for my use case:

Please check https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-proxy/app/server.go#L174; all of its valid values can be found here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/apis/config/types.go#L172-L207

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2019
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2019
@mcginne

mcginne commented Feb 8, 2019

Hi guys, I notice this issue is pretty old, but I saw this comment -> #7433 (comment) which implies that in v1.5 NodePorts should prefer local pods. I'm running kube 1.12 and seeing my NodePort services not preferring local pods. Is there a flag required, or has something changed between 1.5 and 1.12? I'm using iptables.

I've also noted this enhancement -> https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0033-service-topology.md so I just want to confirm the expected behaviour on kube 1.12.

@thockin thockin added the triage/unresolved Indicates an issue that can not or will not be resolved. label Mar 8, 2019
@thockin
Member

thockin commented May 9, 2019

This is basically the same as #28610 so I am closing this one.

@mcginne look at the externalTrafficPolicy field.
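
For reference, a hedged sketch of what that looks like using the k8s.io/api/core/v1 Go types (the constant below corresponds to `externalTrafficPolicy: Local` on a NodePort or LoadBalancer Service; the service name and selector are made up):

```go
package localpreference

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeLocalService builds a NodePort Service that keeps external traffic
// on the node it arrives at instead of re-balancing across the cluster.
func nodeLocalService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "my-app"},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeNodePort,
			Selector: map[string]string{"app": "my-app"},
			Ports:    []corev1.ServicePort{{Port: 80}},
			// Only route to endpoints on the receiving node; traffic arriving
			// on a node with no local endpoints is not forwarded elsewhere.
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
		},
	}
}
```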
