iptables proxier: route local traffic to LB IPs to service chain #77523
Conversation
@m1093782566 @Lion-Wei any ideas if we need this for IPVS proxier?
Force-pushed from 3a98e8c to cfa210f
/priority important-soon
Force-pushed from cfa210f to 4c1f5d5
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
Force-pushed from 0cf0983 to 8dfd4de
Validated and tested this on a Kind cluster with metallb (thanks @mauilion!) and on GKE by applying the rules manually. @jcodybaker can you test this on DOKS please (re: #66607)?
@andrewsykim - I've tested this and unfortunately it doesn't seem to have changed the behavior seen in #66607. I've left my test cluster up and can provide any debug info that's helpful--just message me on Slack. I've started debugging by looking through the iptables counters. I didn't get the full picture tonight, but it does seem to be taking a different path than we were seeing before.
@jcodybaker can you share the output for
@andrewsykim
@kinolaev I think I agree with you but not sure yet. Traffic originating from the cluster cidr AND directed at the LB IP goes through the service chain if
For
Agreed that it shouldn't just be for consistency's sake. The question I have boils down to: if we know the destination IPs for a given external LB IP, do we route back to the external IP or route directly to the backend? We route directly to the service backend for the pod cidr case; wondering if there was a valid reason for that which also applies to nodes. Need to test all the scenarios first.
I think routing directly to the backend is a good solution (and personally I prefer it). But if the maintainers decide that the source IP must be preserved, they will already have a solution for this. Please see my comment above about moving the routing to SVC after the check that the node doesn't have local endpoints, because otherwise traffic must be routed locally. Do you agree?
I'm not sure, need to put more thought into this because the existing local endpoint logic only applies for external traffic coming into the cluster (hence
I see now the root of our discussion: we have different boundaries for "internal". For me, "internal" means pods and services, but you also include nodes in "internal". So for me, nodes' outgoing traffic is "external" traffic and must be routed to local endpoints or the load balancer. But if nodes' outgoing traffic is "internal" traffic, then your solution is the most logical.
Err sorry, forgot to put a milestone on this one. Putting the milestone label so it's on the radar at least. It's a bit last minute, so feel free to bump it to the next release (or we can cherry-pick it later). /milestone v1.15
Thanks. I'll approve now, but can you please send a minor fixup?
writeLine(proxier.natRules, append(args,
	"-m", "comment", "--comment", fmt.Sprintf(`"route LOCAL traffic for %s LB IP to service chain"`, svcNameString),
	"-m", "addrtype", "--src-type", "LOCAL", "-j", string(svcChain))...)
// First rule in the chain redirects all pod -> external VIP traffic to the |
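For context, here is a standalone sketch (not proxier code) of roughly what that call emits. The chain name, LB IP, port, and service name below are made-up placeholders; `-m addrtype --src-type LOCAL` matches packets whose source address is local to the node, which is how host-originated traffic gets diverted to the service chain:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Hypothetical values standing in for what the proxier computes per service.
	svcNameString := "default/my-svc:http"  // assumed service name string
	svcChain := "KUBE-SVC-XXXXXXXXXXXXXXXX" // assumed per-service chain
	chain := "KUBE-XLB-EXAMPLE"             // placeholder for the chain the rule is appended to
	args := []string{
		"-A", chain,
		"-d", "203.0.113.10/32", // assumed load balancer ingress IP
		"-p", "tcp", "--dport", "80",
	}

	// Mirrors the arguments built in the snippet above.
	rule := append(args,
		"-m", "comment", "--comment", fmt.Sprintf(`"route LOCAL traffic for %s LB IP to service chain"`, svcNameString),
		"-m", "addrtype", "--src-type", "LOCAL", "-j", svcChain)

	fmt.Println(strings.Join(rule, " "))
}
```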
Minor point - the comment here says "first rule", but you broke that :)
Can you put the new block after this block, and word the comments in similar fashion?
Will do, thanks Tim!
Thanks! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andrewsykim, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@andrewsykim @kinolaev It's not just preserving the source IP that is the issue; the current behavior for routing traffic to LoadBalancer services within the cluster makes a false assumption that the LoadBalancer is always transparent. In the case of DigitalOcean, as an example, this is not the case - the LoadBalancer supports the PROXY protocol as well as TLS termination - so sending traffic directly to a pod will often result in unexpected behavior. This is only an issue when LoadBalancers provide an IP address in the status spec, since k8s does not attempt to magic the LoadBalancer away when a hostname is provided instead, as is the case with ELBs from AWS, for example. If your cloud provider supports network-transparent load balancing then I don't think the optimization needs to be thrown away, but cloud controllers should have a way to indicate to the kubelet that a LoadBalancer service is not transparent and shouldn't be routed internally.
/cc @RainbowMango
@kevin-wangzefeng: GitHub didn't allow me to request PR reviews from the following users: RainbowMango. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Kubernetes does not currently support routing packets from the host network interface to LoadBalancer Service external IPs. Although such routing tends to work on AWS with our current topology, it only works by coincidence. On other platforms like Azure and GCP, packets will not route to the load balancer except when the node is part of the load balancer target pool. At best, the level of support is undefined and all behavior in this context is considered unintentional by upstream. The practical implication is that connectivity to Routes on cloud platforms is only supported from public and container network clients. Refactor the e2e tests to exercise Route connectivity only through the container network. This eliminates a class of test flake whereby tests pass when the test pod (e.g. running curl against a Route) lands on a node hosting an ingresscontroller but fails if the pod happens to land on a node that's not hosting an ingresscontroller (which means the node is not part of the LB target pool). kubernetes/kubernetes#77523 kubernetes/kubernetes#65387
Signed-off-by: Andrew Sy Kim kiman@vmware.com
What type of PR is this?
/kind bug
What this PR does / why we need it:
For any traffic to an LB IP that originates from the local node, re-route that traffic to the Kubernetes service chain. This makes an external LB IP reachable from inside the cluster. The implication of this is that internal traffic to an LB IP will need to go through SNAT. This is likely okay since source IP preservation with externalTrafficPolicy=Local only applies to external traffic anyway. The fix was spelled out in more detail by Tim in #65387.
I think the correct behavior is to actually route the traffic to the LB instead of intercepting it with iptables, but with the current set of rules I'm not sure this is possible. We also already have rules that route pods in the cluster cidr that want to reach LB IPs to the service chain; a rough sketch of both the existing rule and the new one follows.
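This is a sketch under assumptions only, not the actual proxier source: the chain name, cluster cidr, and LB IP are placeholders, and `writeLine` here is a stand-in for the proxier helper. It shows the contrast between the existing cluster-cidr match and the new LOCAL source match:

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// writeLine is a stand-in for the proxier helper: join the words and append a newline.
func writeLine(buf *bytes.Buffer, words ...string) {
	buf.WriteString(strings.Join(words, " ") + "\n")
}

func main() {
	var natRules bytes.Buffer
	clusterCIDR := "10.244.0.0/16"          // assumed pod/cluster cidr
	svcChain := "KUBE-SVC-XXXXXXXXXXXXXXXX" // assumed per-service chain
	args := []string{
		"-A", "KUBE-XLB-EXAMPLE", // placeholder chain
		"-d", "203.0.113.10/32",  // assumed LB ingress IP
		"-p", "tcp", "--dport", "80",
	}

	// Existing behavior: pod traffic (source in the cluster cidr) destined for
	// the LB IP is short-circuited to the service chain.
	writeLine(&natRules, append(args, "-s", clusterCIDR, "-j", svcChain)...)

	// This PR: traffic originating from the node itself (LOCAL source address)
	// destined for the LB IP takes the same short circuit.
	writeLine(&natRules, append(args, "-m", "addrtype", "--src-type", "LOCAL", "-j", svcChain)...)

	fmt.Print(natRules.String())
}
```

In both cases the packet is handled by the service chain on the node instead of being sent out to the load balancer, which is why the SNAT note above applies to node-local clients as well.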
Allowing traffic with --src-type LOCAL to do the same makes sense to me.
Which issue(s) this PR fixes:
Fixes #65387 #66607
Special notes for your reviewer:
Does this PR introduce a user-facing change?: