
iptables proxier: route local traffic to LB IPs to service chain #77523

Merged
merged 2 commits into kubernetes:master from andrewsykim:fix-xlb-from-local on May 31, 2019

Conversation

andrewsykim
Member

@andrewsykim andrewsykim commented May 6, 2019

Signed-off-by: Andrew Sy Kim kiman@vmware.com

What type of PR is this?
/kind bug

What this PR does / why we need it:
For any traffic to an LB IP that originates from the local node, re-route that traffic to the Kubernetes service chain. This makes an external LB reachable from inside the cluster. The implication is that internal traffic to an LB IP will need to go through SNAT. This is likely okay, since source IP preservation with externalTrafficPolicy=Local only applies to external traffic anyway. The fix was spelled out in more detail by Tim in #65387.

I think the correct behavior would be to actually route the traffic to the LB instead of intercepting it with iptables, but I'm not sure that's possible with the current set of rules. We also already have rules that route pods in the cluster CIDR that want to reach LB IPs to the service chain:

-A KUBE-XLB-ECF5TUORC5E2ZCRD -s 10.8.0.0/14 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-ECF5TUORC5E2ZCRD

Allowing traffic with --src-type LOCAL to do the same makes sense to me.
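For illustration, the emitted rule would look roughly like this (a sketch reconstructed from the Go snippet in the review below; the chain names reuse the example rule above, and the service name default/my-svc:http is hypothetical). For an externalTrafficPolicy=Local service it lands in the KUBE-XLB chain alongside the cluster-CIDR rule:

-A KUBE-XLB-ECF5TUORC5E2ZCRD -m comment --comment "route LOCAL traffic for default/my-svc:http LB IP to service chain" -m addrtype --src-type LOCAL -j KUBE-SVC-ECF5TUORC5E2ZCRD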

Which issue(s) this PR fixes:
Fixes #65387 #66607

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

iptables proxier: route local traffic to LB IPs to service chain

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 6, 2019
@k8s-ci-robot k8s-ci-robot requested review from justinsb and thockin May 6, 2019 22:26
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 6, 2019
@andrewsykim
Member Author

@m1093782566 @Lion-Wei any ideas on whether we need this for the IPVS proxier?

@andrewsykim andrewsykim force-pushed the fix-xlb-from-local branch from 3a98e8c to cfa210f Compare May 7, 2019 16:30
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2019
@andrewsykim
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 7, 2019
@andrewsykim andrewsykim force-pushed the fix-xlb-from-local branch from cfa210f to 4c1f5d5 Compare May 7, 2019 16:36
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2019
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
Signed-off-by: Andrew Sy Kim <kiman@vmware.com>
@andrewsykim andrewsykim force-pushed the fix-xlb-from-local branch from 0cf0983 to 8dfd4de Compare May 7, 2019 19:22
@andrewsykim
Member Author

Validated and tested this on a Kind cluster with metallb (thanks @mauilion!) and on GKE by applying the rules manually. @jcodybaker can you test this on DOKS please (re: #66607)?

@jcodybaker
Copy link

@andrewsykim - I've tested this and unfortunately it doesn't seem to have changed the behavior seen in #66607. I've left my test cluster up and can provide any debugging that's helpful; just message me on Slack. I've started debugging by looking through the iptables counters. I didn't get the full picture tonight, but it does seem to be taking a different path than we were seeing before.

@andrewsykim
Member Author

@jcodybaker can you share the output of sudo iptables-save please?

@mauilion

mauilion commented May 8, 2019

@andrewsykim
In my testing I was able to reproduce this case (a sketch of the setup follows below):

1. Bring up a 3-node cluster and schedule pods on 2 of them.
2. Define a Service of type LoadBalancer and ensure that the load balancer provides an IP address.
3. Modify the Service and set externalTrafficPolicy to Local.
4. Bring up a pod with hostNetwork: true on the node where there are no endpoints.

Before your change I was not able to connect to the service; after your change I can.
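A minimal sketch of that setup (the Service name, labels, image, and node name are all hypothetical; the LoadBalancer IP is assumed to come from something like metallb):

apiVersion: v1
kind: Service
metadata:
  name: echo                         # hypothetical Service name
spec:
  type: LoadBalancer                 # LB implementation must assign an IP
  externalTrafficPolicy: Local
  selector:
    app: echo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: client                       # hypothetical host-network client
spec:
  hostNetwork: true
  nodeName: node-without-endpoints   # a node with no endpoints for the Service
  containers:
  - name: curl
    image: curlimages/curl
    command: ["sleep", "3600"]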

@andrewsykim
Member Author

@kinolaev I think I agree with you, but I'm not sure yet. Traffic originating from the cluster CIDR AND directed at the LB IP goes through the service chain if externalTrafficPolicy=Local. Shouldn't the same policy apply to nodes then? Either way, let me test your PR to make sure there are no edge cases missing there. I wonder if the src-type check can lead to interesting behavior if we go back out to the external LB and back into the node pool.

@kinolaev

kinolaev commented May 29, 2019

Shouldn't the same policy apply for nodes then?

For externalTrafficPolicy=Local we could remove the FW chain on nodes that don't have local endpoints, so that all traffic is sent to the LB. This solution looks more consistent, but it requires network access from node to LB for traffic from pods and has no benefit (the source IP will be lost anyway). I don't think we need to choose this solution just for consistency.

@andrewsykim
Member Author

I don't think we need to choose this solution just for consistency.

Agreed that it shouldn't be just for consistency's sake. The question I have boils down to: if we know the destination IPs for a given external LB IP, do we route back out to the external IP or route directly to the backend? We route directly to the service backend in the pod CIDR case; I'm wondering if there was a valid reason for that which also applies to nodes. Need to test all the scenarios first.

@kinolaev

I think routing directly to the backend is a good solution (and personally I prefer it). But if the maintainers decide that the source IP must be preserved, they will already have a solution for this.

Please see my comment above about moving the routing to SVC after the check that the node doesn't have local endpoints, because otherwise traffic must be routed locally. Do you agree?

@andrewsykim
Member Author

Please see my comment above about moving the routing to SVC after the check that the node doesn't have local endpoints, because otherwise traffic must be routed locally. Do you agree?

I'm not sure; I need to put more thought into this, because the existing local-endpoint logic only applies to external traffic coming into the cluster (hence externalTrafficPolicy=Local). Preferring local endpoints if they exist and falling back to the service chain sounds right, but I don't think it's that simple: we would possibly be changing the expectations for internal traffic in ways that can be unexpected. This would be addressed by ongoing work on service topology, which I would say is out of scope for what we're trying to fix here. I'm not the authoritative source on this though :)

@kinolaev

I see now the root of our discussion: we have different boundaries for "internal" :) For me, "internal" means pods and services, but you also include nodes in "internal". So for me, nodes' outgoing traffic is "external" traffic and must be routed to local endpoints or to the load balancer. But if nodes' outgoing traffic is "internal" traffic, then your solution is the most logical.
@thockin, waiting for you :)

@andrewsykim
Member Author

andrewsykim commented May 30, 2019

Err, sorry - I forgot to put a milestone on this one. Adding the milestone label so it's on the radar at least. It's a bit last minute, so feel free to bump it to the next release (or we can cherry-pick it later).

/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 30, 2019
Member

@thockin thockin left a comment


Thanks. I'll approve now, but can you please send a minor fixup?

writeLine(proxier.natRules, append(args,
"-m", "comment", "--comment", fmt.Sprintf(`"route LOCAL traffic for %s LB IP to service chain"`, svcNameString),
"-m", "addrtype", "--src-type", "LOCAL", "-j", string(svcChain))...)

// First rule in the chain redirects all pod -> external VIP traffic to the
Member


Minor point - the comment here says "first rule", but you broke that :)

Can you put the new block after this block, and word the comments in similar fashion?
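For reference, the reordering being asked for would look roughly like this (a sketch, not the exact diff; args, svcNameString, svcChain, and proxier.clusterCIDR are assumed from the surrounding proxier code):

// Existing rule stays first: redirect pod -> external VIP traffic to the
// Service's ClusterIP via the service chain.
writeLine(proxier.natRules, append(args,
	"-m", "comment", "--comment", `"Redirect pods trying to reach external loadbalancer VIP to clusterIP"`,
	"-s", proxier.clusterCIDR, "-j", string(svcChain))...)

// New rule goes second, worded in similar fashion: traffic originating on
// the node itself is also sent to the service chain.
writeLine(proxier.natRules, append(args,
	"-m", "comment", "--comment", fmt.Sprintf(`"route LOCAL traffic for %s LB IP to service chain"`, svcNameString),
	"-m", "addrtype", "--src-type", "LOCAL", "-j", string(svcChain))...)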

Member Author


Will do, thanks Tim!

@thockin
Member

thockin commented May 31, 2019

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2019
@k8s-ci-robot k8s-ci-robot merged commit bdf3d24 into kubernetes:master May 31, 2019
@snuxoll

snuxoll commented Jun 20, 2019

@andrewsykim @kinolaev It's not just preserving the source IP that is the issue; the current behavior for routing traffic to LoadBalancer services within the cluster makes a false assumption that the LoadBalancer is always transparent. In the case of DigitalOcean, as an example, this is not the case - the LoadBalancer supports the PROXY protocol as well as TLS termination - so sending traffic directly to a pod will often result in unexpected behavior.

This is only an issue when LoadBalancers provide an IP address in the status spec, since k8s does not attempt to magic the LoadBalancer away when a hostname is provided instead, as is the case with ELBs from AWS, for example. If your cloud provider supports network-transparent load balancing then I don't think the optimization needs to be thrown away, but cloud controllers should have a way to indicate to the kubelet that a LoadBalancer service is not transparent and shouldn't be routed internally.
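Concretely, the distinction is visible in the Service status; kube-proxy only short-circuits the first form (the addresses below are hypothetical):

status:
  loadBalancer:
    ingress:
    - ip: 203.0.113.10             # e.g. DigitalOcean: an IP, intercepted in-cluster

status:
  loadBalancer:
    ingress:
    - hostname: my-lb.example.com  # e.g. AWS ELB: a hostname, not intercepted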

@kevin-wangzefeng
Member

/cc @RainbowMango

@k8s-ci-robot
Contributor

@kevin-wangzefeng: GitHub didn't allow me to request PR reviews from the following users: RainbowMango.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @RainbowMango

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ironcladlou added a commit to ironcladlou/origin that referenced this pull request Aug 28, 2019 (and a reworded follow-up commit the same day); the commit message:

Kubernetes does not currently support routing packets from the host network
interface to LoadBalancer Service external IPs. Although such routing tends to
work on AWS with our current topology, it only works by coincidence. On other
platforms like Azure and GCP, packets will not route to the load balancer except
when the node is part of the load balancer target pool. At best, the level of
support is undefined and all behavior in this context is considered
unintentional by upstream.

The practical implication is that connectivity to Routes on cloud platforms is
only supported from public and container network clients.

Refactor the e2e tests to exercise Route connectivity only through the container
network. This eliminates a class of test flake whereby tests pass when the test
pod (e.g. running curl against a Route) lands on a node hosting an
ingresscontroller but fails if the pod happens to land on a node that's not
hosting an ingresscontroller (which means the node is not part of the LB target
pool).

kubernetes/kubernetes#77523
kubernetes/kubernetes#65387

Successfully merging this pull request may close these issues.

externalTrafficPolicy: Local breaks internal reachability