Stop trying to "preserve" iptables counters that are always 0 #110328

danwinship · 2022-06-01T15:36:51Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

@thockin had mentioned a while back that he was pretty sure that all iptables counters are always 0 and we're doing pointless busywork pretending otherwise, and this turns out to be mostly correct.

Specifically, the chain-level counters that our code deals with (as opposed to the rule-level counters that we ignore) are only used for the built-in chains (INPUT, PREROUTING, etc) that have a default policy, and they count the number of packets/bytes that hit that default policy (ie, that fall off the end of the chain rather than matching a terminal rule). For user-created chains, these counters will always be 0, and can't even manually be set to any other value (ie, if you try to set them in the iptables-restore input, the values you provide will be ignored). This is true for both iptables-legacy and iptables-nft.

We never pass iptables-restore any input that refers to the built-in chains, so all of our efforts to preserve the counters are pointless. And we don't have to worry about "but what if we wanted to change things in the future so that we did refer to those chains?" because there's no way we could refer to the built-in chains in our iptables-restore input without creating race conditions that might accidentally overwrite concurrent modifications to those chains by other users. (This is why the code still uses ipt.EnsureChain/ipt.EnsureRule for some things rather than doing everything with ipt.Restore.)

Special notes for your reviewer:

(After this PR, the only thing we use the iptables-save output for is to figure out which service/endpoint chains we need to delete. We could theoretically figure that out without using iptables-save, but that would require us to more-or-less keep a copy of the current known iptables state in memory at all times. And we'd still need the iptables-save version at startup time anyway. A better optimization would be to change it to only do stale-chain cleanup periodically (eg, once a minute) rather than on every resync; the stale chains wouldn't have any rules pointing to them, so they wouldn't cause any problems beyond memory usage. #110334 does that.)

Does this PR introduce a user-facing change?

NONE

/sig network
/priority important-longterm

k8s-ci-robot · 2022-06-01T15:36:59Z

@danwinship: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-06-01T15:37:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/proxy/OWNERS~~ [danwinship]
~~pkg/util/iptables/OWNERS~~ [danwinship]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aojea · 2022-06-02T10:52:51Z

I have the feeling that the parsing will be "less solid" now 😬

danwinship · 2022-06-02T16:14:44Z

It seems strange to worry that the new code will be less correct than the old code, given that the old code was not even being tested against valid data...

danwinship · 2022-06-03T12:13:27Z

/retest-required

aojea · 2022-06-03T13:41:55Z

It seems strange to worry that the new code will be less correct than the old code, given that the old code was not even being tested against valid data...

ah, is a conclusion based on observation, we didn't have any bug with the old code ... at least, not any that we know :)

danwinship · 2022-06-13T17:17:43Z

Discussion from https://bugzilla.redhat.com/show_bug.cgi?id=2058444 which is unfortunately not public:

Dan Winship 2022-05-31 10:45:54 EDT Comment 45

Phil, can you confirm the results of these experiments against the actual code to make sure I'm not missing anything subtle?

When using either iptables-legacy or iptables-nft, the system does not keep track of chain-level counts for user-created chains

    # iptables -t filter -N FOO
    # iptables -t filter -A OUTPUT -j FOO
    # iptables -t filter -A FOO -d 1.2.3.4 -j REJECT

    # curl http://1.2.3.4/

    # iptables-save -t filter -c | grep FOO
    :FOO - [0:0]
    [53:9070] -A OUTPUT -j FOO
    [1:60] -A FOO -d 1.2.3.4/32 -j REJECT --reject-with icmp-port-unreachable

    # printf "*filter\n:FOO - [9:9999]\nCOMMIT\n" | iptables-restore -n -c
    # iptables-save -t filter -c | grep :FOO
    :FOO - [0:0]

ie, the rule has non-0 counters, but the chain always shows [0:0].

When using iptables-legacy, doing an "iptables-restore -n" (with or without "-c") will reset the counters of all chains in restored tables, including chains that are not mentioned in the restore:

    # iptables-legacy-save -t filter|grep INPUT
    :INPUT ACCEPT [1626:374227]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-legacy-restore -n

    # iptables-legacy-save -t filter|grep INPUT
    :INPUT ACCEPT [13:1626]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-legacy-restore -n -c

    # iptables-legacy-save -t filter|grep INPUT
    :INPUT ACCEPT [11:1078]

When using iptables-nft, doing an "iptables-restore -n" (with or without "-c") WILL NOT reset the counters of chains that are not mentioned in the restore:

    # iptables-nft-save -t filter|grep INPUT
    :INPUT ACCEPT [10313:2631314]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-nft-restore -n

    # iptables-nft-save -t filter|grep INPUT
    :INPUT ACCEPT [10378:2638441]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-nft-restore -n -c

    # iptables-nft-save -t filter|grep INPUT
    :INPUT ACCEPT [10386:2639973]

(kube-proxy does "iptables-save" for two reasons: (a) so it can set counters correctly on restore, (b) so it can see if there are stale KUBE chains lying around that need to be deleted. But it seems like (a) actually never really worked, and we don't need to do (b) on every sync...)

Phil Sutter 2022-05-31 16:30:39 EDT Comment 46

Phil, can you confirm the results of these experiments against the actual code to make sure I'm not missing anything subtle?

When using either iptables-legacy or iptables-nft, the system does not keep track of chain-level counts for user-created chains

There is no such thing as "chain-level counts". In iptables nomenclature, these values are "policy counters" and are therefore relevant only with chains that support a policy - user-defined ones of course do not. The iptables-restore code is clear there: counter values are parsed only if '-c' was given and policy string is not '-'.

# iptables -t filter -N FOO
# iptables -t filter -A OUTPUT -j FOO
# iptables -t filter -A FOO -d 1.2.3.4 -j REJECT

# curl http://1.2.3.4/

# iptables-save -t filter -c | grep FOO
:FOO - [0:0]
[53:9070] -A OUTPUT -j FOO
[1:60] -A FOO -d 1.2.3.4/32 -j REJECT --reject-with icmp-port-unreachable

# printf "*filter\n:FOO - [9:9999]\nCOMMIT\n" | iptables-restore -n -c
# iptables-save -t filter -c | grep :FOO
:FOO - [0:0]

ie, the rule has non-0 counters, but the chain always shows [0:0].

This was true for OUTPUT chain as well if you added the rule to it instead: The chain counter covers packets "falling off the end of the chain" only - add a final ACCEPT rule and the counter will remain zero.

When using iptables-legacy, doing an "iptables-restore -n" (with or without "-c") will reset the counters of all chains in restored tables, including chains that are not mentioned in the restore:

Yes, I can confirm this and it seems to be a bug in libiptc.

When using iptables-nft, doing an "iptables-restore -n" (with or without "-c") WILL NOT reset the counters of chains that are not mentioned in the restore:

IIRC, 'iptables-nft-restore -n' caches only the bare minimum and therefore does not know what chains actually exist in kernel. Therefore it does not mangle them (for whatever reason).

(kube-proxy does "iptables-save" for two reasons: (a) so it can set counters correctly on restore, (b) so it can see if there are stale KUBE chains lying around that need to be deleted. But it seems like (a) actually never really worked, and we don't need to do (b) on every sync...)

So (a) requires non-flushing restore to leave other chains' counters alone and is broken with iptables-legacy?

Dan Winship 2022-06-01 08:44:09 EDT Comment 47

(kube-proxy does "iptables-save" for two reasons: (a) so it can set counters correctly on restore, (b) so it can see if there are stale KUBE chains lying around that need to be deleted. But it seems like (a) actually never really worked, and we don't need to do (b) on every sync...)

So (a) requires non-flushing restore to leave other chains' counters alone and is broken with iptables-legacy?

No, kube-proxy / kubernetes itself doesn't care about the counters. It's just trying to be a good citizen and not break the counters in case the user was using them for something else. At least... I think that's what it's doing? I haven't done the git archaeology to figure out exactly what the original intent of using "-c" was.

But regardless of what the intent was, the code is not actually doing anything useful; our iptables-restore input only ever contains user-defined chains, not built-in ones, so we never actually preserve any counters. We just do a bunch of extra work to keep track of the fact that all of our chains have "[0:0]" for their policy counters.

(The bug that resets the counters with iptables-legacy might be something that annoys some user somewhere, but AFAICT no one ever filed a bug report about it...)

pkg/proxy/ipvs/proxier.go

pkg/util/iptables/save_restore.go

aojea · 2022-06-28T10:33:52Z

LGTM
just the name of the function can be changed to indicate more clear the fact it assumes Tables only
/assign @thockin

The iptables and ipvs proxies have code to try to preserve certain iptables counters when modifying chains via iptables-restore, but the counters in question only actually exist for the built-in chains (eg INPUT, FORWARD, PREROUTING, etc), which we never modify via iptables-restore (and in fact, *can't* safely modify via iptables-restore), so we are really just doing a lot of unnecessary work to copy the constant string "[0:0]" over from iptables-save output to iptables-restore input. So stop doing that. Also fix a confused error message when iptables-save fails.

The test was calling GetChainLines() on invalid pseudo-iptables-save output where most of the lines were indented. GetChainLines() happened to still parse this "correctly", but it would be better to be testing it on actually-correct data.

We don't need to parse out the counter values from the iptables-save output (since they are always 0 for the chains we care about). Just parse the chain names themselves. Also, all of the callers of GetChainLines() pass it input that contains only a single table, so just assume that, rather than carefully parsing only a single table's worth of the input.

aojea · 2022-06-28T16:43:11Z

@danwinship seems we can stop passing utiliptables.RestoreCounters to the restore and use NoRestoreCounters

kubernetes/pkg/proxy/iptables/proxier.go

Line 1504 in a3da8d7

    
           err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)

danwinship · 2022-06-28T18:09:24Z

yeah... I didn't bother because it doesn't simplify the code any (it makes it two characters longer!)

I guess we could just remove the argument from the function entirely, but...

aojea · 2022-06-28T18:32:37Z

yeah... I didn't bother because it doesn't simplify the code any (it makes it two characters longer!)

I guess we could just remove the argument from the function entirely, but...

nah, I just found it , it can be a follow up

aojea · 2022-06-28T18:33:31Z

/lgtm

I think that this changes are ok to merge

aojea · 2022-06-29T14:02:13Z

/lgtm

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 1, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2022

k8s-ci-robot requested review from bowei and dcbw June 1, 2022 15:37

k8s-ci-robot added the area/ipvs label Jun 1, 2022

danwinship force-pushed the iptables-counters branch 2 times, most recently from 58ea852 to af23807 Compare June 1, 2022 17:58

danwinship mentioned this pull request Jun 1, 2022

only clean up iptables chains periodically in large clusters #110334

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2022

danwinship force-pushed the iptables-counters branch from af23807 to 7624045 Compare June 24, 2022 12:26

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2022

aojea reviewed Jun 28, 2022

View reviewed changes

pkg/proxy/ipvs/proxier.go Outdated Show resolved Hide resolved

aojea reviewed Jun 28, 2022

View reviewed changes

pkg/util/iptables/save_restore.go Outdated Show resolved Hide resolved

aojea reviewed Jun 28, 2022

View reviewed changes

pkg/util/iptables/save_restore.go Show resolved Hide resolved

k8s-ci-robot assigned thockin Jun 28, 2022

danwinship added 2 commits June 28, 2022 08:39

Use dedent to fix GetChainLines() tests

4988699

The test was calling GetChainLines() on invalid pseudo-iptables-save output where most of the lines were indented. GetChainLines() happened to still parse this "correctly", but it would be better to be testing it on actually-correct data.

danwinship force-pushed the iptables-counters branch from 7624045 to 7c27cf0 Compare June 28, 2022 12:42

k8s-ci-robot assigned aojea Jun 29, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2022

k8s-ci-robot merged commit 0d9ed2c into kubernetes:master Jun 29, 2022

k8s-ci-robot added this to the v1.25 milestone Jun 29, 2022

danwinship deleted the iptables-counters branch June 29, 2022 15:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop trying to "preserve" iptables counters that are always 0 #110328

Stop trying to "preserve" iptables counters that are always 0 #110328

danwinship commented Jun 1, 2022 •

edited

Loading

k8s-ci-robot commented Jun 1, 2022

k8s-ci-robot commented Jun 1, 2022

aojea commented Jun 2, 2022

danwinship commented Jun 2, 2022

danwinship commented Jun 3, 2022

aojea commented Jun 3, 2022 •

edited

Loading

danwinship commented Jun 13, 2022

aojea commented Jun 28, 2022

aojea commented Jun 28, 2022 •

edited

Loading

danwinship commented Jun 28, 2022

aojea commented Jun 28, 2022

aojea commented Jun 28, 2022

aojea commented Jun 29, 2022

Stop trying to "preserve" iptables counters that are always 0 #110328

Stop trying to "preserve" iptables counters that are always 0 #110328

Conversation

danwinship commented Jun 1, 2022 • edited Loading

What type of PR is this?

What this PR does / why we need it:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

k8s-ci-robot commented Jun 1, 2022

k8s-ci-robot commented Jun 1, 2022

aojea commented Jun 2, 2022

danwinship commented Jun 2, 2022

danwinship commented Jun 3, 2022

aojea commented Jun 3, 2022 • edited Loading

danwinship commented Jun 13, 2022

aojea commented Jun 28, 2022

aojea commented Jun 28, 2022 • edited Loading

danwinship commented Jun 28, 2022

aojea commented Jun 28, 2022

aojea commented Jun 28, 2022

aojea commented Jun 29, 2022

danwinship commented Jun 1, 2022 •

edited

Loading

aojea commented Jun 3, 2022 •

edited

Loading

aojea commented Jun 28, 2022 •

edited

Loading