minimize iptables-restore input #110268
Conversation
Naive question: is it not possible to do the diff of the iptables data buffer directly?
The distinction isn't quite a "diff". E.g., if 1 rule in a chain changes, we have to send the whole chain, not just the single changed rule. We could compare the old input against the new input and figure out which chains we need to include... but that would require having two copies of the complete iptables output, which would be a bunch of memory, and it wouldn't be especially efficient to do it against the stringified form because the output isn't sorted... Are there any particular things you don't like about the way the PR does it? I think the …
// IptablesPartialRestoreFailuresTotal is the number of iptables *partial* restore
// failures (resulting in a fall back to a full restore) that the proxy has seen.
IptablesPartialRestoreFailuresTotal = metrics.NewCounter(
	&metrics.CounterOpts{
		Subsystem:      kubeProxySubsystem,
		Name:           "sync_proxy_rules_iptables_partial_restore_failures_total",
		Help:           "Cumulative proxy iptables partial restore failures",
		StabilityLevel: metrics.ALPHA,
	},
)
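For reference, counters from the component-base metrics package are bumped with `Inc()`. Below is a minimal sketch of where such a counter could be incremented; everything other than the metric names is an assumed, illustrative identifier, not the PR's exact code:

```go
// Sketch only: if a restore fails, record it; if it was a partial restore,
// also bump the partial-failure counter and force a full resync next time.
// `rulesData` and `tryPartialSync` are illustrative names.
if err := proxier.iptables.RestoreAll(rulesData, utiliptables.NoFlushTables, utiliptables.RestoreCounters); err != nil {
	metrics.IptablesRestoreFailuresTotal.Inc()
	if tryPartialSync {
		metrics.IptablesPartialRestoreFailuresTotal.Inc()
	}
	proxier.needFullSync = true // fall back to a full restore on the next sync
}
```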
maybe it is easier if we label the existing metric
diff --git a/pkg/proxy/metrics/metrics.go b/pkg/proxy/metrics/metrics.go
index f61cf4fc991..6fb2089bdc7 100644
--- a/pkg/proxy/metrics/metrics.go
+++ b/pkg/proxy/metrics/metrics.go
@@ -117,13 +117,14 @@ var (
// IptablesRestoreFailuresTotal is the number of iptables restore failures that the proxy has
// seen.
- IptablesRestoreFailuresTotal = metrics.NewCounter(
+ IptablesRestoreFailuresTotal = metrics.NewCounterVec(
&metrics.CounterOpts{
Subsystem: kubeProxySubsystem,
Name: "sync_proxy_rules_iptables_restore_failures_total",
Help: "Cumulative proxy iptables restore failures",
StabilityLevel: metrics.ALPHA,
},
+ []string{"partial_sync", "full_sync"},
)
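With a `CounterVec` as in the suggestion above, call sites would pick a label value rather than calling `Inc()` on a bare counter. A rough sketch, assuming a single label distinguishing the two restore modes (the exact label layout in the diff was still under discussion, so these values are illustrative):

```go
// Sketch only; label values are assumptions, not settled API.
metrics.IptablesRestoreFailuresTotal.WithLabelValues("partial").Inc() // a partial restore failed
metrics.IptablesRestoreFailuresTotal.WithLabelValues("full").Inc()    // a full restore failed
```

Existing dashboards or alerts scraping the unlabeled series would need updating, which is the compatibility question raised below.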
isn't that an API break of the existing metric? (I mean, ok, `StabilityLevel: metrics.ALPHA`, but still)
I honestly don't know; I thought it was compatible, but better to ask an expert.
@dgrisonnet can you please advise? Is adding labels to an existing metric backwards compatible?
@logicalhan can you help us with this question? Is adding a label to a metric without labels compatible?
It is a breaking change and you should probably amend the release note accordingly. But it is allowed since the metric is only Alpha.
// Called by the iptables.Monitor, and in response to topology changes; this calls
// syncProxyRules() and tells it to resync all services, regardless of whether the
// Service or Endpoints/EndpointSlice objects themselves have changed
func (proxier *Proxier) forceSyncProxyRules() {
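The excerpt stops at the signature. Going by the doc comment, the body presumably just flags the next sync as a full one and runs it immediately; a rough sketch of that shape, not necessarily the file's actual contents:

```go
// Sketch only: mark the next sync as "full", then sync right away.
func (proxier *Proxier) forceSyncProxyRules() {
	proxier.mu.Lock()
	proxier.needFullSync = true
	proxier.mu.Unlock()

	proxier.syncProxyRules()
}
```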
`if !proxier.largeClusterMode || time.Since(proxier.lastIPTablesCleanup) > proxier.syncPeriod {`

I think line 1393 now has to switch to this `fullSync` condition, right?
It doesn't; `activeNATChains` gets filled in correctly even for the chains we don't actually update (that's much of what #110266 did), so we can compare that against the `iptables-save` data and get the right set of chains to delete.
(Though this is certainly something that we'll have to test carefully.)
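A rough illustration of that comparison, with assumed types and helper names: anything that appears in the parsed `iptables-save` output but was never added to `activeNATChains` during the sync is stale and can be flushed and deleted.

```go
// Sketch only: existingNATChains is parsed from iptables-save output;
// activeNATChains was populated during the sync, including for service
// chains whose rules we deliberately did not rewrite.
for chain := range existingNATChains {
	if !activeNATChains[chain] && isServiceChainName(string(chain)) {
		// Declare the stale chain, then delete it (-X) in the restore input.
		natChains.Write(utiliptables.MakeChainLine(chain))
		natRules.Write("-X", string(chain))
	}
}
```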
pkg/proxy/iptables/proxier.go (outdated)
defer func() {
	if !success {
		klog.InfoS("Sync failed", "retryingTime", proxier.syncPeriod)
		proxier.syncRunner.RetryAfter(proxier.syncPeriod)
		if didPartialSync {
do we need `didPartialSync`? Shouldn't it be enough to check `!proxier.needFullSync`?
Hm... if we did that, it would be the same, except that if there were 0 services, or if every service changed since the last sync (and then the sync failed), it would mistakenly call that a failed partial sync.
Anyway, I think the explicit variable makes it clearer...
But don't we want to force a full sync no matter what if `success` is false?
- If `!success && needFullSync` then we don't need to set `needFullSync=true` because it's already `true`.
- If `!success && !needFullSync && didPartialSync` then existing code sets `needFullSync=true`.
- If `!success && !needFullSync && !didPartialSync` then, as above, that implies that either there are 0 services, or else every service changed, and we don't currently set `needFullSync=true`:
  - If there were 0 services, then the next time we sync, either there will still be 0 services (in which case it doesn't matter whether we do a "full" or "partial" sync) or else there will be non-0 services (in which case all of the services will be in the `serviceChanged` set and so we will sync all of them regardless of whether we're doing a "full" or "partial" sync). So it doesn't matter whether we set `needFullSync=true` or not.
  - If there were non-0 services and they had all changed, then the next time we sync... Oof. The next time we sync, `serviceChanged` and `endpointsChanged` will both be empty because the trackers get reset on every sync whether or not the sync succeeds. So yes, in that case, we need to force `needFullSync=true`!

Although, it might make more sense to fix the service/endpoint tracking, because the conntrack-cleanup code suffers from this same bug; if a service or endpoint changes, and we need to do conntrack cleanup as a result, but then the next `iptables-restore` fails, then we will lose track of the fact that we needed to do that conntrack cleanup...
> Although, it might make more sense to fix the service/endpoint tracking, because the conntrack-cleanup code suffers from this same bug
Filed #112604. It seems non-trivial to fix. So I'll just fall back to doing a full resync in this case.
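So the deferred error handler quoted earlier ends up forcing a full resync on any failure; something along these lines, as a sketch of the intended fix rather than the exact commit:

```go
// Sketch only: the change trackers have already been flushed by the time a
// failed sync is detected, so we can no longer tell what changed.
defer func() {
	if !success {
		klog.InfoS("Sync failed", "retryingTime", proxier.syncPeriod)
		proxier.syncRunner.RetryAfter(proxier.syncPeriod)
		if didPartialSync {
			metrics.IptablesPartialRestoreFailuresTotal.Inc()
		}
		// Force a full resync regardless: a subsequent partial sync could
		// otherwise miss updates that were lost with the flushed trackers.
		proxier.needFullSync = true
	}
}()
```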
iptables-restore requires that if you change any rule in a chain, you have to rewrite the entire chain. But if you avoid mentioning a chain at all, it will leave it untouched. Take advantage of this by not rewriting the SVC, SVL, EXT, FW, and SEP chains for services that have not changed since the last sync, which should drastically cut down on the size of each iptables-restore in large clusters.
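In code terms, the heart of the change is a check like the following near the top of the per-service loop; identifier names here are illustrative, not the commit's exact ones:

```go
// Sketch only: during a partial sync, skip regenerating the per-service
// chains (KUBE-SVC-*, KUBE-SVL-*, KUBE-EXT-*, KUBE-FW-*, KUBE-SEP-*) for any
// service whose Service and EndpointSlice objects have not changed since the
// last successful sync; iptables-restore leaves unmentioned chains untouched.
// The chains are still recorded in activeNATChains earlier in the loop, so
// the cleanup pass won't delete them.
if tryPartialSync && !changedServices.Has(svcName.String()) && !changedEndpoints.Has(svcName.String()) {
	continue
}
```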
I looked this over and it all seemed reasonable to me (as someone not familiar with the existing codebase).
Thanks! /lgtm Antonio - any last words?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: danwinship, dgrisonnet, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
/hold cancel 😄
I've seen a huge improvement in network programming latency in our scalability tests. I initially thought that it would be this PR, but the timing doesn't match - the first good data point is from Nov 10th, so ~a week after this got merged. @danwinship @aojea @thockin - any thoughts why?
heh, it seems I've answered my own question. The FG was originally disabled in our tests, and it got enabled by this PR; the timing matches. So the summary is - this is a great improvement! We're roughly 2x better on all percentiles.
FTR this is the first test of the patch at scale; I did proof-of-concept testing to convince myself that it would actually speed things up, but not with an actual Kubernetes cluster with tens of thousands of actual services. I don't know exactly what the perf tests do, but the improvement should be greatest in the case where there is a steady stream of small changes, which kube-proxy is processing more-or-less immediately. E.g., a steady-state cluster where you aren't regularly creating or deleting very large numbers of pods or services; you're just trying to deal with the fact that individual pods may come and go, services get scaled up and down, etc., occasionally a whole node may go away, etc. So in particular, if the perf-test is both (a) doing lots of changes very close together, and (b) setting a "large" … Additionally, kube-proxy currently does a handful of non-…
To some extent it is what the test is doing. Or to be more specific - it's not really steady state, because we're pushing a significant churn of pods against the cluster (up to 100 pods/s), but it's a very small fraction of all pods in the cluster (150,000 pods), so given that, I would say it mostly matches your description.
That's a good point - thanks for looking into it.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Currently, kube-proxy rewrites every iptables rule on every resync. In very large clusters, this ends up being a very large number of rules, and `iptables-restore` is not very efficient.

Most of the time, most of the rules that kube-proxy rewrites have not changed since the last sync; generally speaking, some small number of services will have gained or lost one endpoint since the last sync, and the other 10,000 minus N services have exactly the same iptables rules as they did last time. `iptables-restore` doesn't require you to rewrite every chain; it just enforces the rule that if you want to update any rule in a chain, you have to update every rule in that chain. But if a chain doesn't need to change at all, you can just not include any rules for it, and `iptables-restore` will leave it untouched.

So, following up on #110266, which reorganized the internal `syncProxyRules` loop logic, this fixes the loop so that if a service hasn't changed since the last sync, then we don't rewrite its rules. (With this PR, we still rewrite all of `KUBE-SERVICES`/`KUBE-EXTERNAL-SERVICES`/`KUBE-NODEPORTS` on every sync, though in theory we could avoid doing that if only endpoints changed.)

If we somehow get out of sync and, e.g., fail to create a chain we should have created and then forget that we failed, this will be caught on the next sync. (`KUBE-SERVICES` will contain a rule to jump to the non-existent chain, which will cause the `iptables-restore` to fail, which will cause the proxier to force a full resync.)

Which issue(s) this PR fixes:
none
Special notes for your reviewer:
The branch includes both #110266 and #109844, both of which should merge (in either order) before this merges. Only the last two commits are new to this branch.
Does this PR introduce a user-facing change?
/sig network
/priority important-longterm