Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop trying to "preserve" iptables counters that are always 0 #110328

Merged
merged 3 commits into from
Jun 29, 2022

Conversation

danwinship
Copy link
Contributor

@danwinship danwinship commented Jun 1, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

@thockin had mentioned a while back that he was pretty sure that all iptables counters are always 0 and we're doing pointless busywork pretending otherwise, and this turns out to be mostly correct.

Specifically, the chain-level counters that our code deals with (as opposed to the rule-level counters that we ignore) are only used for the built-in chains (INPUT, PREROUTING, etc) that have a default policy, and they count the number of packets/bytes that hit that default policy (ie, that fall off the end of the chain rather than matching a terminal rule). For user-created chains, these counters will always be 0, and can't even manually be set to any other value (ie, if you try to set them in the iptables-restore input, the values you provide will be ignored). This is true for both iptables-legacy and iptables-nft.

We never pass iptables-restore any input that refers to the built-in chains, so all of our efforts to preserve the counters are pointless. And we don't have to worry about "but what if we wanted to change things in the future so that we did refer to those chains?" because there's no way we could refer to the built-in chains in our iptables-restore input without creating race conditions that might accidentally overwrite concurrent modifications to those chains by other users. (This is why the code still uses ipt.EnsureChain/ipt.EnsureRule for some things rather than doing everything with ipt.Restore.)

Special notes for your reviewer:

(After this PR, the only thing we use the iptables-save output for is to figure out which service/endpoint chains we need to delete. We could theoretically figure that out without using iptables-save, but that would require us to more-or-less keep a copy of the current known iptables state in memory at all times. And we'd still need the iptables-save version at startup time anyway. A better optimization would be to change it to only do stale-chain cleanup periodically (eg, once a minute) rather than on every resync; the stale chains wouldn't have any rules pointing to them, so they wouldn't cause any problems beyond memory usage. #110334 does that.)

Does this PR introduce a user-facing change?

NONE

/sig network
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 1, 2022
@k8s-ci-robot
Copy link
Contributor

@danwinship: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 1, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2022
@k8s-ci-robot k8s-ci-robot requested review from bowei and dcbw June 1, 2022 15:37
@danwinship danwinship force-pushed the iptables-counters branch 2 times, most recently from 58ea852 to af23807 Compare June 1, 2022 17:58
@aojea
Copy link
Member

aojea commented Jun 2, 2022

I have the feeling that the parsing will be "less solid" now 😬

@danwinship
Copy link
Contributor Author

It seems strange to worry that the new code will be less correct than the old code, given that the old code was not even being tested against valid data...

@danwinship
Copy link
Contributor Author

/retest-required

@aojea
Copy link
Member

aojea commented Jun 3, 2022

It seems strange to worry that the new code will be less correct than the old code, given that the old code was not even being tested against valid data...

ah, is a conclusion based on observation, we didn't have any bug with the old code ... at least, not any that we know :)

@danwinship
Copy link
Contributor Author

Discussion from https://bugzilla.redhat.com/show_bug.cgi?id=2058444 which is unfortunately not public:


Dan Winship 2022-05-31 10:45:54 EDT Comment 45

Phil, can you confirm the results of these experiments against the actual code to make sure I'm not missing anything subtle?

  1. When using either iptables-legacy or iptables-nft, the system does not keep track of chain-level counts for user-created chains
    # iptables -t filter -N FOO
    # iptables -t filter -A OUTPUT -j FOO
    # iptables -t filter -A FOO -d 1.2.3.4 -j REJECT

    # curl http://1.2.3.4/

    # iptables-save -t filter -c | grep FOO
    :FOO - [0:0]
    [53:9070] -A OUTPUT -j FOO
    [1:60] -A FOO -d 1.2.3.4/32 -j REJECT --reject-with icmp-port-unreachable

    # printf "*filter\n:FOO - [9:9999]\nCOMMIT\n" | iptables-restore -n -c
    # iptables-save -t filter -c | grep :FOO
    :FOO - [0:0]

ie, the rule has non-0 counters, but the chain always shows [0:0].

  1. When using iptables-legacy, doing an "iptables-restore -n" (with or without "-c") will reset the counters of all chains in restored tables, including chains that are not mentioned in the restore:
    # iptables-legacy-save -t filter|grep INPUT
    :INPUT ACCEPT [1626:374227]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-legacy-restore -n

    # iptables-legacy-save -t filter|grep INPUT
    :INPUT ACCEPT [13:1626]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-legacy-restore -n -c

    # iptables-legacy-save -t filter|grep INPUT
    :INPUT ACCEPT [11:1078]
  1. When using iptables-nft, doing an "iptables-restore -n" (with or without "-c") WILL NOT reset the counters of chains that are not mentioned in the restore:
    # iptables-nft-save -t filter|grep INPUT
    :INPUT ACCEPT [10313:2631314]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-nft-restore -n

    # iptables-nft-save -t filter|grep INPUT
    :INPUT ACCEPT [10378:2638441]

    # printf "*filter\n:FOO - [0:0]\n-A FOO -j DROP\nCOMMIT\n" | iptables-nft-restore -n -c

    # iptables-nft-save -t filter|grep INPUT
    :INPUT ACCEPT [10386:2639973]

(kube-proxy does "iptables-save" for two reasons: (a) so it can set counters correctly on restore, (b) so it can see if there are stale KUBE chains lying around that need to be deleted. But it seems like (a) actually never really worked, and we don't need to do (b) on every sync...)


Phil Sutter 2022-05-31 16:30:39 EDT Comment 46

Phil, can you confirm the results of these experiments against the actual code to make sure I'm not missing anything subtle?

  1. When using either iptables-legacy or iptables-nft, the system does not keep track of chain-level counts for user-created chains

There is no such thing as "chain-level counts". In iptables nomenclature, these values are "policy counters" and are therefore relevant only with chains that support a policy - user-defined ones of course do not. The iptables-restore code is clear there: counter values are parsed only if '-c' was given and policy string is not '-'.

# iptables -t filter -N FOO
# iptables -t filter -A OUTPUT -j FOO
# iptables -t filter -A FOO -d 1.2.3.4 -j REJECT

# curl http://1.2.3.4/

# iptables-save -t filter -c | grep FOO
:FOO - [0:0]
[53:9070] -A OUTPUT -j FOO
[1:60] -A FOO -d 1.2.3.4/32 -j REJECT --reject-with icmp-port-unreachable

# printf "*filter\n:FOO - [9:9999]\nCOMMIT\n" | iptables-restore -n -c
# iptables-save -t filter -c | grep :FOO
:FOO - [0:0]

ie, the rule has non-0 counters, but the chain always shows [0:0].

This was true for OUTPUT chain as well if you added the rule to it instead: The chain counter covers packets "falling off the end of the chain" only - add a final ACCEPT rule and the counter will remain zero.

  1. When using iptables-legacy, doing an "iptables-restore -n" (with or without "-c") will reset the counters of all chains in restored tables, including chains that are not mentioned in the restore:

Yes, I can confirm this and it seems to be a bug in libiptc.

  1. When using iptables-nft, doing an "iptables-restore -n" (with or without "-c") WILL NOT reset the counters of chains that are not mentioned in the restore:

IIRC, 'iptables-nft-restore -n' caches only the bare minimum and therefore does not know what chains actually exist in kernel. Therefore it does not mangle them (for whatever reason).

(kube-proxy does "iptables-save" for two reasons: (a) so it can set counters correctly on restore, (b) so it can see if there are stale KUBE chains lying around that need to be deleted. But it seems like (a) actually never really worked, and we don't need to do (b) on every sync...)

So (a) requires non-flushing restore to leave other chains' counters alone and is broken with iptables-legacy?


Dan Winship 2022-06-01 08:44:09 EDT Comment 47

(kube-proxy does "iptables-save" for two reasons: (a) so it can set counters correctly on restore, (b) so it can see if there are stale KUBE chains lying around that need to be deleted. But it seems like (a) actually never really worked, and we don't need to do (b) on every sync...)

So (a) requires non-flushing restore to leave other chains' counters alone and is broken with iptables-legacy?

No, kube-proxy / kubernetes itself doesn't care about the counters. It's just trying to be a good citizen and not break the counters in case the user was using them for something else. At least... I think that's what it's doing? I haven't done the git archaeology to figure out exactly what the original intent of using "-c" was.

But regardless of what the intent was, the code is not actually doing anything useful; our iptables-restore input only ever contains user-defined chains, not built-in ones, so we never actually preserve any counters. We just do a bunch of extra work to keep track of the fact that all of our chains have "[0:0]" for their policy counters.

(The bug that resets the counters with iptables-legacy might be something that annoys some user somewhere, but AFAICT no one ever filed a bug report about it...)

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2022
pkg/proxy/ipvs/proxier.go Outdated Show resolved Hide resolved
@aojea
Copy link
Member

aojea commented Jun 28, 2022

LGTM
just the name of the function can be changed to indicate more clear the fact it assumes Tables only
/assign @thockin

The iptables and ipvs proxies have code to try to preserve certain
iptables counters when modifying chains via iptables-restore, but the
counters in question only actually exist for the built-in chains (eg
INPUT, FORWARD, PREROUTING, etc), which we never modify via
iptables-restore (and in fact, *can't* safely modify via
iptables-restore), so we are really just doing a lot of unnecessary
work to copy the constant string "[0:0]" over from iptables-save
output to iptables-restore input. So stop doing that.

Also fix a confused error message when iptables-save fails.
The test was calling GetChainLines() on invalid pseudo-iptables-save
output where most of the lines were indented. GetChainLines() happened
to still parse this "correctly", but it would be better to be testing
it on actually-correct data.
We don't need to parse out the counter values from the iptables-save
output (since they are always 0 for the chains we care about). Just
parse the chain names themselves.

Also, all of the callers of GetChainLines() pass it input that
contains only a single table, so just assume that, rather than
carefully parsing only a single table's worth of the input.
@aojea
Copy link
Member

aojea commented Jun 28, 2022

@danwinship seems we can stop passing utiliptables.RestoreCounters to the restore and use NoRestoreCounters

err = proxier.iptables.RestoreAll(proxier.iptablesData.Bytes(), utiliptables.NoFlushTables, utiliptables.RestoreCounters)

@danwinship
Copy link
Contributor Author

yeah... I didn't bother because it doesn't simplify the code any (it makes it two characters longer!)

I guess we could just remove the argument from the function entirely, but...

@aojea
Copy link
Member

aojea commented Jun 28, 2022

yeah... I didn't bother because it doesn't simplify the code any (it makes it two characters longer!)

I guess we could just remove the argument from the function entirely, but...

nah, I just found it , it can be a follow up

@aojea
Copy link
Member

aojea commented Jun 28, 2022

/lgtm

I think that this changes are ok to merge

@aojea
Copy link
Member

aojea commented Jun 29, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0d9ed2c into kubernetes:master Jun 29, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jun 29, 2022
@danwinship danwinship deleted the iptables-counters branch June 29, 2022 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants