kube-proxy loadbalancing: Use local pods only #7433
Comments
It's worth doing some thinking about the potential failure cases, but if it checks out we'll happily accept PRs :) |
This is actually something I want, though I described it a bit differently: if pods that back this service exist on this node, always route to them. That said, this is going to be tricky to test and prove correct. |
This could be slightly more relevant now with Ingress and such: especially if the ingress controller is running in a DaemonSet and traffic is already round-robined before getting to kube-proxy, it would be nice to just use that node. |
I think this is a pretty important feature, especially if you have a lot of hops and larger replica counts for your pods. I'd like to look into creating a PR for this. I am not 100% sure if this is the right approach, but I am planning to do:
Btw: I just figured out that 1.4's EndpointAddress contains a nodeName. Maybe that is the better way to determine local pods (As it should work for hostNetwork=true as well) |
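As a rough illustration of that idea (a minimal sketch, assuming the current `k8s.io/api/core/v1` types rather than the 1.4-era import paths, and not actual kube-proxy code), this is how `EndpointAddress.NodeName` could be used to pick out the endpoints backed by pods on the local node:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// localAddresses keeps only the addresses whose NodeName matches the node the
// proxy runs on. NodeName is a *string and may be unset, in which case the
// address is treated as not local.
func localAddresses(eps *v1.Endpoints, nodeName string) []v1.EndpointAddress {
	var local []v1.EndpointAddress
	for _, subset := range eps.Subsets {
		for _, addr := range subset.Addresses {
			if addr.NodeName != nil && *addr.NodeName == nodeName {
				local = append(local, addr)
			}
		}
	}
	return local
}

func main() {
	node := "node-a"
	eps := &v1.Endpoints{Subsets: []v1.EndpointSubset{{
		Addresses: []v1.EndpointAddress{
			{IP: "10.0.0.1", NodeName: &node},
			{IP: "10.0.0.2"}, // no NodeName recorded, so never considered local
		},
	}}}
	fmt.Println(localAddresses(eps, "node-a")) // only 10.0.0.1
}
```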
This was merged for 1.4, but ONLY for external LB VIP traffic (i.e. not for NodePorts and not for internal service IPs). This will give us a chance to get some miles on the idea, though. You already found some of the implementation. |
@simonswine NodeName was added to EndpointAddress in v1.4 precisely to allow kube-proxy to filter local endpoints for various future purposes. In 1.4 it only uses this data to create a new KUBE-XLB- iptables chain for specially annotated services, and only if the alpha feature gate is enabled. BTW, I would prefer not having a kube-proxy cmd line flag (which would make all services behave this way), but rather make it an annotation on the service, letting all service flavors work. |
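For readers landing here later, a hedged sketch of what that per-service opt-in looked like: in the 1.4/1.5 timeframe it was an annotation (the alpha key below is an assumption of the era-specific spelling, shown for illustration only); on current clusters the supported knob is `spec.externalTrafficPolicy: Local`. Expressed with the Go client types:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// markOnlyLocal opts a Service into "send external traffic only to endpoints
// on the receiving node". The annotation key is the 1.4-era alpha spelling
// (assumed here for illustration); newer releases replaced it with the
// spec.externalTrafficPolicy: Local field.
func markOnlyLocal(svc *v1.Service) {
	if svc.Annotations == nil {
		svc.Annotations = map[string]string{}
	}
	svc.Annotations["service.alpha.kubernetes.io/external-traffic"] = "OnlyLocal"
}

func main() {
	svc := &v1.Service{}
	markOnlyLocal(svc)
	fmt.Println(svc.Annotations)
}
```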
@thockin If I want this for NodePort and ClusterIPs today, am I looking at complex surgery in the internals of kube-proxy or a flag flip to enable it? Background: Our setup right now is that we have all pods IPs routed and accessible from everywhere within our network (i.e. far outside of Kubernetes' area of control) via BGP. We're also exporting Cluster IPs ("A.B.C.D via pod-abcef") from the Service + Endpoint objects. This works well to route the packets to a node that is able to serve the traffic -- but AFAICT the probabilistic loadbalancing in place today will most likely route the packet away from the node, even though it has already been routed to a perfectly good node (and balancing is done away from K8s). |
@bprashanth added NodePort support for v1.5. ClusterIP is more complicated; there is also the case where traffic for a clusterIP is routed to a node from outside the cluster. |
@thockin I see - it makes sense. We're running the latest and greatest 1.5 so I guess we could use NodePort. It seems a bit less "clean" as there is another resource (the node port) mixed into the game, but for now that will work for us. FWIW, for Cluster IPs we're doing /32 announces. |
I am open to proposals of how to design this wrt clusterIPs, for cases such as yours. |
I've been giving this some thought, and for us we would like:
I'll put together a bigger document with our setup and the different cases (node going down, pod becoming unhealthy, etc.) later, but the above should be enough to get the basic discussion started, I hope. |
@thockin The more I think about this the more I realize that maybe it would be a good idea to offer something like CNI but for internal traffic flows as well. We had even more discussions about this and we have a bunch of ideas we want to try out - like using ECMP on the node itself through normal routing - but as far as I can tell there is no pluggable way of totally ripping out the forwarding logic. |
Maybe I misunderstand. If you just want to change the way Services work, you can replace kube-proxy with something else...
|
Maybe I'm misunderstanding how kube-proxy works, but AFAICT it is in the packet forwarding path. We would like to experiment with doing things like IPIP / GRE forwarding and real L3 routing which requires quite some intelligence and integration with both the pods running the services and the nodes themselves. |
What exactly are you missing in today's config, assuming you replace the static pod kube-proxy runs in with your own hostNetwork daemon to manage service VIPs, and replace the kubenet binaries (kubenet being the default CNI plugin on the node) with your own to manage pod networking? You might need to turn off the route controller through a flag to the controller-manager (I'm blanking on the name) and write your own, but it sounds like what you want is already within reach. You can't manage service IP or node port allocation, but you can specify the block of IPs/ports to choose from.
|
Prashanth's answer is good, I think. If you stop kube-proxy and clean up the iptables rules, you can create an IP interface and use some other form of route discovery.
|
Interesting, I'll give that a try. Thanks! |
Another thing to consider in this area is that it may make sense to generalize a solution so that kube-proxy supports other routing preferences. We are talking about node-local here, but I could easily see AZ-local and cluster-local (in federated clusters). More dynamically, someone could even build support in kube-proxy for latency-based service routing, which might take care of all of these cases at once. Likely that's more difficult, but it'd be good to at least consider putting it on the roadmap. |
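To make that generalization concrete, here is a small hypothetical Go sketch (the tiers and helper are assumptions for discussion, not an existing kube-proxy API) that walks a preference list from most to least local and uses the first tier that has endpoints:

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for a service endpoint plus the topology
// of the node backing it.
type Endpoint struct {
	IP   string
	Node string
	Zone string
}

// pickByTopology returns the endpoints in the most-preferred non-empty tier:
// same node, then same zone (AZ), then anything in the cluster.
func pickByTopology(eps []Endpoint, localNode, localZone string) []Endpoint {
	tiers := []func(Endpoint) bool{
		func(e Endpoint) bool { return e.Node == localNode },
		func(e Endpoint) bool { return e.Zone == localZone },
		func(e Endpoint) bool { return true },
	}
	for _, match := range tiers {
		var out []Endpoint
		for _, e := range eps {
			if match(e) {
				out = append(out, e)
			}
		}
		if len(out) > 0 {
			return out
		}
	}
	return nil
}

func main() {
	eps := []Endpoint{
		{IP: "10.0.0.1", Node: "node-b", Zone: "us-east1-a"},
		{IP: "10.0.0.2", Node: "node-c", Zone: "us-east1-b"},
	}
	// Nothing runs on node-a, so the same-zone tier wins.
	fmt.Println(pickByTopology(eps, "node-a", "us-east1-a"))
}
```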
I see that this bug is marked as "awaiting more evidence" and "team/cluster (deprecated)". Is there a way to breathe more life into this bug? While I think for my use case I can work around it by implementing GRE encapsulation, it would be nice to have this feature to offer very simple load-balanced IP-level ingress. |
@bluecmd This is open for proposals. It's not something I or my team is working on at the moment. I feel like we have laid out some options for exploration, and some of the likely hazards - enough to get a motivated person or team to investigate, I hope. |
@thockin Fair enough. I did some thinking about this after our proof-of-concept and I agree the corner cases are not easily solved. FWIW, the approach I'm going forward with is tunnels using https://github.com/bluecmd/kube-service-tunnel |
@thockin, could you please clarify? Does this mean if I have an Ingress (GCE L7) pointing to a Service, the Ingress is configured to directly hit the pods in that Service without incurring kube-proxy overhead and round-robin rerouting? |
or, sorry: "the Ingress is configured to directly hit nodes, which have Pods that are members of the Service, without incurring kube-proxy overhead and round-robin routing?" |
Ingress is not affected by this. GCE's L7 saves the client IP in the X-Forwarded-For header.
|
There is a proposal trying to resolve this issue; see kubernetes/community#1551. We need more use cases to refine the API, so please feel free to add your comments there :) Thanks! |
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
We currently build a traditional CDN, using anycast to route our users automatically to the nearest node (we have PoPs all around the world). However, with the current Services, Kubernetes just load balances across all pods, which basically makes them unusable for our use case, since a given node should serve content itself rather than proxy the request to a node on the other side of the world. That's probably a special use case which most users don't have, but since we're using anycast this problem is really driving us crazy. |
@steebchen You can try swapping kube-proxy with kube-router as it supports other load-balancing algorithms besides round-robin. |
I checked that, but I'm not sure if this is the right approach. First of all, it requires us to switch to kube-router (we use kube-proxy), and if I understood correctly we would have to get rid of Weave Net, which we need for the encryption between nodes. Additionally, the only algorithm which is probably correct is "Locality-Based Least-Connection Scheduling", but I'm already not so sure here:
Correct me if I'm wrong, but this does not seem to always prefer the local server. Plus, I already handle overloaded nodes by stopping announcing IPs via BGP. On top of that, this algorithm is not even listed in the kube-router LB options (only in the IPVS docs). Maybe I will just use hostPort on my pods directly, even though I don't really like that solution (I probably can't restart the pods without a short downtime). Since we use anycast with a bare metal cluster, I think I'll check out https://github.com/google/metallb and see whether we could use MetalLB LoadBalancers. |
You can also look into kube-proxy IPVS mode (https://github.com/kubernetes/kubernetes/tree/master/pkg/proxy/ipvs), which supports all the IPVS algorithms (kube-router does the same thing) without getting rid of Weave Net or any other network plugin. |
For node-local service routing and other topology-aware service routing, there is a proposal in progress; see kubernetes/community#2846 |
@m1093782566 The problem with the IPVS mode in kube-proxy is that you can only enable it globally. I would need to enable it via annotations on my service, but that's not supported yet. On top of that, I'm not sure if any of the load balancing algorithms are suitable, because I want to route traffic to local pods all (100%) of the time. First of all, it seems not all IPVS algorithms are supported (at least I couldn't find the options for them), and even the algorithm I think could work is probably not suited for my use case:
I basically want to always route to local pods. I will probably have to wait until services have an option to only route to local pods, or use google/metallb which I am currently checking out. |
Please check https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-proxy/app/server.go#L174; all of its valid values can be found here: https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/apis/config/types.go#L172-L207 |
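For convenience, here is an illustrative (not authoritative) list of the standard IPVS scheduler names that kube-proxy's `--ipvs-scheduler` option is expected to accept; these are the well-known ipvsadm scheduler abbreviations, and the linked types.go is the source of truth for what a given release actually supports:

```go
package main

import "fmt"

// Standard IPVS scheduling algorithms by their ipvsadm abbreviations. This is
// a reference table for the discussion above, not code from kube-proxy.
var ipvsSchedulers = map[string]string{
	"rr":   "round robin",
	"wrr":  "weighted round robin",
	"lc":   "least connection",
	"wlc":  "weighted least connection",
	"lblc": "locality-based least connection",
	"sh":   "source hashing",
	"dh":   "destination hashing",
	"sed":  "shortest expected delay",
	"nq":   "never queue",
}

func main() {
	for abbrev, desc := range ipvsSchedulers {
		fmt.Printf("%-5s %s\n", abbrev, desc)
	}
}
```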
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Hi guys, I notice this issue is pretty old, but I saw this comment -> I've also noted this enhancement -> https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0033-service-topology.md so I just want to confirm the expected behaviour on kube 1.12? |
As far as I can see kube-proxy uses round robin to schedule requests to pods within the controller.
How about an option to use only pods that are running on the same minion, if available? That would save a lot of unnecessary network traffic.
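A minimal sketch of the behavior being requested (assumed pseudologic, not kube-proxy internals): keep the round-robin scheduling, but run it over the node-local endpoints whenever any exist, falling back to the full endpoint list otherwise:

```go
package main

import "fmt"

// roundRobin cycles over a preferred (node-local) endpoint list when one
// exists and falls back to all endpoints otherwise: "use only pods on the
// same node, if available".
type roundRobin struct {
	next int
}

func (r *roundRobin) pick(local, all []string) string {
	candidates := all
	if len(local) > 0 {
		candidates = local
	}
	ep := candidates[r.next%len(candidates)]
	r.next++
	return ep
}

func main() {
	rr := &roundRobin{}
	local := []string{"10.0.0.1:8080"}
	all := []string{"10.0.0.1:8080", "10.1.0.7:8080", "10.2.0.3:8080"}
	for i := 0; i < 3; i++ {
		fmt.Println(rr.pick(local, all)) // always the single local endpoint here
	}
}
```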