
iptables proxier: route local traffic to LB IPs to service chain #77523

Merged
merged 2 commits into kubernetes:master from andrewsykim:fix-xlb-from-local on May 31, 2019

Conversation

andrewsykim
Member

@andrewsykim andrewsykim commented May 6, 2019

Signed-off-by: Andrew Sy Kim kiman@vmware.com

What type of PR is this?
/kind bug

What this PR does / why we need it:
For any traffic to an LB IP that originates from the local node, re-route that traffic to the Kubernetes service chain. This makes an external LB reachable from inside the cluster. The implication is that internal traffic to an LB IP will need to go through SNAT. This is likely okay, since source IP preservation with externalTrafficPolicy=Local only applies to external traffic anyway. The fix was spelled out in more detail by Tim in #65387.

I think the correct behavior would be to actually route the traffic to the LB instead of intercepting it with iptables, but I'm not sure that's possible with the current set of rules. We also already have rules that route pods in the cluster CIDR that want to reach LB IPs to the service chain:

-A KUBE-XLB-ECF5TUORC5E2ZCRD -s 10.8.0.0/14 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-ECF5TUORC5E2ZCRD

Allowing traffic with --src-type LOCAL to do the same makes sense to me.
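For illustration, the emitted rule would look roughly like this (a sketch reconstructed from the Go snippet in the review below; the chain names reuse the example rule above, and the service name default/my-svc:http is hypothetical). For an externalTrafficPolicy=Local service it lands in the KUBE-XLB chain alongside the cluster-CIDR rule:

-A KUBE-XLB-ECF5TUORC5E2ZCRD -m comment --comment "route LOCAL traffic for default/my-svc:http LB IP to service chain" -m addrtype --src-type LOCAL -j KUBE-SVC-ECF5TUORC5E2ZCRD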

Which issue(s) this PR fixes:
Fixes #65387 #66607

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

iptables proxier: route local traffic to LB IPs to service chain

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 6, 2019
@k8s-ci-robot k8s-ci-robot requested review from justinsb and thockin May 6, 2019 22:26
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2019
@andrewsykim
Member Author

@m1093782566 @Lion-Wei any ideas on whether we need this for the IPVS proxier?

@andrewsykim andrewsykim force-pushed the fix-xlb-from-local branch from 3a98e8c to cfa210f Compare May 7, 2019 16:30
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2019
@andrewsykim
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 7, 2019
@andrewsykim andrewsykim force-pushed the fix-xlb-from-local branch from cfa210f to 4c1f5d5 Compare May 7, 2019 16:36
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2019
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
@andrewsykim andrewsykim force-pushed the fix-xlb-from-local branch from 0cf0983 to 8dfd4de Compare May 7, 2019 19:22
@andrewsykim
Member Author

Validated and tested this on a Kind cluster with metallb (thanks @mauilion!) and on GKE by applying the rules manually. @jcodybaker can you test this on DOKS please (re: #66607)?

@jcodybaker
Copy link

@andrewsykim - I've tested this and unfortunately it doesn't seem to have changed the behavior seen in #66607. I've left my test cluster up and can provide any debugging that's helpful; just message me on Slack. I've started debugging by looking through the iptables counters. I didn't get the full picture tonight, but it does seem to be taking a different path than we were seeing before.

@andrewsykim
Member Author

@jcodybaker can you share the output of sudo iptables-save please?

@mauilion

mauilion commented May 8, 2019

@andrewsykim
In my testing I was able to reproduce this case (a sketch of the setup follows below):

1. Bring up a 3-node cluster and schedule pods on 2 of them.
2. Define a Service of type LoadBalancer and ensure that the load balancer provides an IP address.
3. Modify the Service and set externalTrafficPolicy to Local.
4. Bring up a pod with hostNetwork: true on the node where there are no endpoints.

Before your change I was not able to connect to the service; after your change I can.
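A minimal sketch of that setup (the Service name, labels, image, and node name are all hypothetical; the LoadBalancer IP is assumed to come from something like metallb):

apiVersion: v1
kind: Service
metadata:
  name: echo                         # hypothetical Service name
spec:
  type: LoadBalancer                 # LB implementation must assign an IP
  externalTrafficPolicy: Local
  selector:
    app: echo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: client                       # hypothetical host-network client
spec:
  hostNetwork: true
  nodeName: node-without-endpoints   # a node with no endpoints for the Service
  containers:
  - name: curl
    image: curlimages/curl
    command: ["sleep", "3600"]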

@andrewsykim
Member Author

@kinolaev I think I agree with you, but I'm not sure yet. Traffic originating from the cluster CIDR AND directed at the LB IP goes through the service chain if externalTrafficPolicy=Local. Shouldn't the same policy apply to nodes then? Either way, let me test your PR to make sure there are no edge cases missing there. I wonder if the src-type check can lead to interesting behavior if we go back out to the external LB and back into the node pool.

@kinolaev

kinolaev commented May 29, 2019

Shouldn't the same policy apply for nodes then?

For externalTrafficPolicy=Local we could remove the FW chain on nodes that don't have local endpoints, so that all traffic is sent to the LB. This solution looks more consistent, but it requires network access from node to LB for traffic from pods and has no benefit (the source IP will be lost anyway). I don't think we need to choose this solution just for consistency.

@andrewsykim
Member Author

I don't think we need to choose this solution just for consistency.

Agreed that it shouldn't be just for consistency's sake. The question I have boils down to: if we know the destination IPs for a given external LB IP, do we route back out to the external IP or route directly to the backend? We route directly to the service backend in the pod CIDR case; I'm wondering if there was a valid reason for that which also applies to nodes. Need to test all the scenarios first.

@kinolaev

I think routing directly to the backend is a good solution (and personally I prefer it). But if the maintainers decide that the source IP must be preserved, they will already have a solution for this.

Please see my comment above about moving the routing to SVC after the check that the node doesn't have local endpoints, because otherwise traffic must be routed locally. Do you agree?

@andrewsykim
Member Author

Please see my comment above about moving the routing to SVC after the check that the node doesn't have local endpoints, because otherwise traffic must be routed locally. Do you agree?

I'm not sure; I need to put more thought into this, because the existing local-endpoint logic only applies to external traffic coming into the cluster (hence externalTrafficPolicy=Local). Preferring local endpoints if they exist and falling back to the service chain sounds right, but I don't think it's that simple: we would possibly be changing the expectations for internal traffic in ways that can be unexpected. This would be addressed by ongoing work on service topology, which I would say is out of scope for what we're trying to fix here. I'm not the authoritative source on this though :)

@kinolaev

I see now the root of our discussion: we have different boundaries for "internal" :) For me, "internal" means pods and services, but you also include nodes in "internal". So for me, nodes' outgoing traffic is "external" traffic and must be routed to local endpoints or to the load balancer. But if nodes' outgoing traffic is "internal" traffic, then your solution is the most logical.
@thockin, waiting for you :)

@andrewsykim
Member Author

andrewsykim commented May 30, 2019

Err, sorry - I forgot to put a milestone on this one. Adding the milestone label so it's on the radar at least. It's a bit last minute, so feel free to bump it to the next release (or we can cherry-pick it later).

/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 30, 2019
Member

@thockin thockin left a comment


Thanks. I'll approve now, but can you please send a minor fixup?

writeLine(proxier.natRules, append(args,
"-m", "comment", "--comment", fmt.Sprintf(`"route LOCAL traffic for %s LB IP to service chain"`, svcNameString),
"-m", "addrtype", "--src-type", "LOCAL", "-j", string(svcChain))...)

// First rule in the chain redirects all pod -> external VIP traffic to the
Member


Minor point - the comment here says "first rule", but you broke that :)

Can you put the new block after this block, and word the comments in similar fashion?
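For reference, the reordering being asked for would look roughly like this (a sketch, not the exact diff; args, svcNameString, svcChain, and proxier.clusterCIDR are assumed from the surrounding proxier code):

// Existing rule stays first: redirect pod -> external VIP traffic to the
// Service's ClusterIP via the service chain.
writeLine(proxier.natRules, append(args,
	"-m", "comment", "--comment", `"Redirect pods trying to reach external loadbalancer VIP to clusterIP"`,
	"-s", proxier.clusterCIDR, "-j", string(svcChain))...)

// New rule goes second, worded in similar fashion: traffic originating on
// the node itself is also sent to the service chain.
writeLine(proxier.natRules, append(args,
	"-m", "comment", "--comment", fmt.Sprintf(`"route LOCAL traffic for %s LB IP to service chain"`, svcNameString),
	"-m", "addrtype", "--src-type", "LOCAL", "-j", string(svcChain))...)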

Member Author


Will do, thanks Tim!

@thockin
Member

thockin commented May 31, 2019

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2019
@k8s-ci-robot k8s-ci-robot merged commit bdf3d24 into kubernetes:master May 31, 2019
@snuxoll

snuxoll commented Jun 20, 2019

@andrewsykim @kinolaev It's not just preserving the source IP that is the issue; the current behavior for routing traffic to LoadBalancer services within the cluster makes a false assumption that the LoadBalancer is always transparent. In the case of DigitalOcean, as an example, this is not the case - the LoadBalancer supports the PROXY protocol as well as TLS termination - so sending traffic directly to a pod will often result in unexpected behavior.

This is only an issue when LoadBalancers provide an IP address in the status spec, since k8s does not attempt to magic the LoadBalancer away when a hostname is provided instead, as is the case with ELBs from AWS, for example. If your cloud provider supports network-transparent load balancing then I don't think the optimization needs to be thrown away, but cloud controllers should have a way to indicate to the kubelet that a LoadBalancer service is not transparent and shouldn't be routed internally.
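Concretely, the distinction is visible in the Service status; kube-proxy only short-circuits the first form (the addresses below are hypothetical):

status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10             # e.g. DigitalOcean: an IP, intercepted in-cluster

status:
  loadBalancer:
    ingress:
    - hostname: my-lb.example.com  # e.g. AWS ELB: a hostname, not intercepted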

@kevin-wangzefeng
Member

/cc @RainbowMango

@k8s-ci-robot
Contributor

@kevin-wangzefeng: GitHub didn't allow me to request PR reviews from the following users: RainbowMango.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @RainbowMango

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ironcladlou added a commit to ironcladlou/origin that referenced this pull request Aug 28, 2019 (and a reworded follow-up commit the same day); the commit message:

Kubernetes does not currently support routing packets from the host network
interface to LoadBalancer Service external IPs. Although such routing tends to
work on AWS with our current topology, it only works by coincidence. On other
platforms like Azure and GCP, packets will not route to the load balancer except
when the node is part of the load balancer target pool. At best, the level of
support is undefined and all behavior in this context is considered
unintentional by upstream.

The practical implication is that connectivity to Routes on cloud platforms is
only supported from public and container network clients.

Refactor the e2e tests to exercise Route connectivity only through the container
network. This eliminates a class of test flake whereby tests pass when the test
pod (e.g. running curl against a Route) lands on a node hosting an
ingresscontroller but fails if the pod happens to land on a node that's not
hosting an ingresscontroller (which means the node is not part of the LB target
pool).

kubernetes/kubernetes#77523
kubernetes/kubernetes#65387

Successfully merging this pull request may close these issues.

externalTrafficPolicy: Local breaks internal reachability