Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly add iptables rule to allow healthcheck nodeport #97824

Merged
merged 1 commit into from
Feb 4, 2021

Conversation

hanlins
Copy link
Member

@hanlins hanlins commented Jan 8, 2021

What type of PR is this?
/kind bug

What this PR does / why we need it:
Explicitly add iptables rule to expose healthcheck nodeport in filter table.

Which issue(s) this PR fixes:

Fixes #97225

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixed a bug that on k8s nodes, when the policy of INPUT chain in filter table is not ACCEPT, healthcheck nodeport would not work.
Added iptables rules to allow healthcheck nodeport traffic.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 8, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @hanlins. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 8, 2021
@pacoxu
Copy link
Member

pacoxu commented Jan 8, 2021

As the document says

It only has an effect when type is set to LoadBalancer and externalTrafficPolicy is set to Local.

Need we check this before adding iptables rule ?
(Just a question, I'm not quite sure about the behavior here. 😄)

@hanlins
Copy link
Member Author

hanlins commented Jan 8, 2021

Hi @pacoxu, api-server will ensure healthCheckNodePort field is only set when the service is of type LoadBalancer and externalTrafficPolicy is Local. This is a spec I'm using:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment3
  labels:
    app: nginx3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx3
        #image: nginx:1.14.2
        image: us.gcr.io/google-containers/nginx
        ports:
        - containerPort: 80
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Equal"
        effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: nginx3
spec:
  type: NodePort
  healthCheckNodePort: 30010
  externalTrafficPolicy: Local
  selector:
    app: nginx3
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80

And this is the error I got:

➜  ~ kubectl apply -f svc3.yaml
deployment.apps/nginx-deployment3 created
The Service "nginx3" is invalid: spec.healthCheckNodePort: Invalid value: 30010: HealthCheckNodePort can only be set on LoadBalancer service with ExternalTrafficPolicy=Local

That said I think we can safely assume that LB type and external traffic policy are set correctly if healthCheckNodePort is not zero.

@hanlins
Copy link
Member Author

hanlins commented Jan 8, 2021

Hi @andrewsykim, in filter table here're the k8s related chains:

:KUBE-EXTERNAL-SERVICES - [0:0]
:KUBE-FIREWALL - [0:0]
:KUBE-FORWARD - [0:0]
:KUBE-KUBELET-CANARY - [0:0]
:KUBE-PROXY-CANARY - [0:0]
:KUBE-SERVICES - [0:0]

currently, I'm adding rules that accept inbound health check traffic in KUBE-EXTERNAL-SERVICES chain, which might not be appropriate. Which chain is suitable for the rule that accepts health check traffic? Or do we need to add a new chain in that table?

@nicolehanjing
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 9, 2021
Copy link
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanlins can you also update IPVS proxier please? Similar iptables rules allowing the health check noder port should be sufficient

pkg/proxy/iptables/proxier.go Outdated Show resolved Hide resolved
// no matter if node has local endpoints, healthCheckEndpoints
// need to be exposed
writeLine(proxier.filterRules,
"-A", string(kubeExternalServicesChain),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KUBE-NODE-PORTS might make more sense here actually because healthCheckNodePort is technically still a "node port"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah originally there's no KUBE-NODE-PORT chain in the filter table for both iptables and ipvs versions of the kube-proxy. Just FYI that this change will add the node port chain to the filter table.

@hanlins hanlins force-pushed the fix/97225/hc-rules branch from 50a4c90 to aca369f Compare January 14, 2021 18:26
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 14, 2021
@hanlins hanlins force-pushed the fix/97225/hc-rules branch from aca369f to 8df8a4c Compare January 16, 2021 08:59
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 16, 2021
@hanlins hanlins changed the title [WIP] Explicitly add iptables rule to expose healthcheck nodeport Explicitly add iptables rule to expose healthcheck nodeport Jan 16, 2021
@k8s-ci-robot k8s-ci-robot added area/ipvs and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 16, 2021
@hanlins hanlins requested a review from andrewsykim February 1, 2021 23:11
@hanlins
Copy link
Member Author

hanlins commented Feb 1, 2021

/retest

@maplain
Copy link

maplain commented Feb 2, 2021

LGTM and thx @hanlins for adding the missing piece!

@aojea
Copy link
Member

aojea commented Feb 2, 2021

/lgtm
Thanks

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2021
@aojea
Copy link
Member

aojea commented Feb 2, 2021

/assign @danwinship
We need approval for the iptables part

@@ -395,6 +395,7 @@ type iptablesJumpChain struct {
var iptablesJumpChains = []iptablesJumpChain{
{utiliptables.TableFilter, kubeExternalServicesChain, utiliptables.ChainInput, "kubernetes externally-visible service portals", []string{"-m", "conntrack", "--ctstate", "NEW"}},
{utiliptables.TableFilter, kubeExternalServicesChain, utiliptables.ChainForward, "kubernetes externally-visible service portals", []string{"-m", "conntrack", "--ctstate", "NEW"}},
{utiliptables.TableFilter, kubeNodePortsChain, utiliptables.ChainInput, "kubernetes health check service portals", nil},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to add this; inbound connections already eventually pass through kubeNodePortsChain via the "tail-call" created at the end of syncProxyRules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that still true when the default INPUT chain policy is DROP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would work without this rule in my early attempts but turned out it didn't. The fact is for NodePort services, the traffic was DNATted in nat table, basically in this order:

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.102.149.66/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 8080 -j KUBE-SVC-2CMXP7HKUVJN7L6M
-A KUBE-SVC-2CMXP7HKUVJN7L6M -m comment --comment "default/nginx" -j KUBE-SEP-V6JOQ423IM4BM4HR
-A KUBE-SEP-V6JOQ423IM4BM4HR -s 100.96.0.11/32 -m comment --comment "default/nginx" -j KUBE-MARK-MASQ
-A KUBE-SEP-V6JOQ423IM4BM4HR -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 100.96.0.11:80

In the rules above, 100.96.0.11 is the nginx pod IP.
In filter table, the traffic is accepted by the KUBE-FORWARD chain, see the third rule:

➜  ~ iptables -L KUBE-FORWARD --line-n -v
Chain KUBE-FORWARD (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 DROP       all  --  any    any     anywhere             anywhere             ctstate INVALID
2        0     0 ACCEPT     all  --  any    any     anywhere             anywhere             /* kubernetes forwarding rules */ mark match 0x4000/0x4000
3      235 42692 ACCEPT     all  --  any    any     anywhere             anywhere             /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
4        0     0 ACCEPT     all  --  any    any     anywhere             anywhere             /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

So, different from healthcheck nodeports, the traffic towards node ports got DNATted and instead of hitting INPUT chain, it will hit FORWARD chain. However health check node ports are listening on local addresses so the incoming packets will hit INPUT chain.

Before this change, there's no KUBE-NODEPORTS chain in the filter table at all, so whether the traffic is accepted or not is purely depending on the policy of INPUT chain. That's why we need to add a rule from INPUT chain to the newly added KUBE-NODEPORTS chain.

Speaking of the difference of node port traffic that is DNATted and the traffic towards health check endpoints that listens on local ports, I realized that if we have pods on hostNetwork, then node ports towards them won't work just like for health check node ports. I just validated that in my environment and the issue is really there, and I'm confident that I could fix that using the same approach just like for health check node port. I don't want to make more changes in the PR so maybe open an issue and fix it in another one, what do you guys think? @andrewsykim @danwinship ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that better to do that in a follow up PR,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... there are two completely separate KUBE-NODEPORTS chains now... one in -t nat and one in -t filter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think this is right, but don't use the word "portals" in the description. That's archaic pre-kube-1.0 terminology. Actually, just changing it to "ports" is probably what you meant anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, changed here.

@@ -1337,6 +1338,19 @@ func (proxier *Proxier) syncProxyRules() {
}
}

// Capture healthCheckNodePorts.
if svcInfo.HealthCheckNodePort() != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should go inside the if svcInfo.NodePort() != 0 section

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node ports are now optional for load balancers (see spec.allocateLoadBalancerNodePorts), so a separate check still makes sense since healthCheckNodePorts is always set regardless of whether node ports is enabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second to @andrewsykim, I think these are effectively the same, for now, HealthCheckNodePort will be non-zero only if the svc is of LoadBalancer type. And if the service is of LoadBalancer type, then it should have a non-zero NodePort. To me it's api-server's job to validate the spec and ensure all the svc proxy sees are valid (see my previous comment). If couple the logic here then if someday the validation logic changed, then we also need to fix it here.

@@ -913,6 +913,50 @@ func TestNodePort(t *testing.T) {
}
}

func TestHealthCheckNodePort(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is if I just modify the INPUT chain policy to DROP then we might lose ssh connection to that VM instantly

You don't need to change the default INPUT chain policy, you can just do

iptables -A INPUT -p tcp --dport <PORT> -j DROP

or whatever

@@ -106,6 +106,7 @@ var iptablesJumpChain = []struct {
{utiliptables.TableNAT, utiliptables.ChainPrerouting, kubeServicesChain, "kubernetes service portals"},
{utiliptables.TableNAT, utiliptables.ChainPostrouting, kubePostroutingChain, "kubernetes postrouting rules"},
{utiliptables.TableFilter, utiliptables.ChainForward, KubeForwardChain, "kubernetes forwarding rules"},
{utiliptables.TableFilter, utiliptables.ChainInput, KubeNodePortChain, "kubernetes health check rules"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm less familiar with the IPVS rules but I believe the comment I made about iptables should apply here too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah same reason as my previous comment why we need to keep this rule.

@@ -161,6 +163,7 @@ var ipsetInfo = []struct {
{kubeNodePortLocalSetUDP, utilipset.BitmapPort, kubeNodePortLocalSetUDPComment},
{kubeNodePortSetSCTP, utilipset.HashIPPort, kubeNodePortSetSCTPComment},
{kubeNodePortLocalSetSCTP, utilipset.HashIPPort, kubeNodePortLocalSetSCTPComment},
{kubeHealthCheckNodePortSet, utilipset.BitmapPort, kubeHealthCheckNodePortSetComment},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if you also need to add something to ipsetWithIptablesChain...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha that's a tricky one, at first I thought I should add it here, but then I realized that ipsetWithIptablesChain is for nat table where we need to add it in filter table. I've already added the rule below:

       // Add rule to accept traffic towards health check node port
       utilproxy.WriteLine(proxier.filterRules,
               "-A", string(KubeNodePortChain),
               "-m", "comment", "--comment", proxier.ipsetList[kubeHealthCheckNodePortSet].getComment(),
               "-m", "set", "--match-set", proxier.ipsetList[kubeHealthCheckNodePortSet].Name, "dst",
               "-j", "ACCEPT",
       )

1. For iptables mode, add KUBE-NODEPORTS chain in filter table. Add
   rules to allow healthcheck node port traffic.
2. For ipvs mode, add KUBE-NODE-PORT chain in filter table. Add
   KUBE-HEALTH-CHECK-NODE-PORT ipset to allow traffic to healthcheck
   node port.
@hanlins hanlins force-pushed the fix/97225/hc-rules branch from 7c78180 to 4cd1eac Compare February 3, 2021 15:23
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2021
@hanlins
Copy link
Member Author

hanlins commented Feb 3, 2021

/retest

@danwinship
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 4, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, danwinship, hanlins

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Feb 4, 2021
@hanlins
Copy link
Member Author

hanlins commented Feb 4, 2021

/test pull-kubernetes-dependencies

@k8s-ci-robot k8s-ci-robot merged commit c1b3797 into kubernetes:master Feb 4, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 4, 2021
@andrewsykim
Copy link
Member

@hanlins can you include in the release notes that this fix is only required when the default INPUT chain policy is not ACCEPT?

@hanlins
Copy link
Member Author

hanlins commented Feb 5, 2021

@andrewsykim just rephrased, feel free to update it if needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

healthCheckNodePort depends on the default iptables chain policy to be accessible externally
9 participants