
Proxy min sync period #35334

Merged
merged 2 commits into from Nov 4, 2016

Conversation

timothysc
Member

@timothysc timothysc commented Oct 21, 2016

What this PR does / why we need it:
Gives the proxy the option to set a lower bound on the sync period when there are a high number of endpoint changes. This prevents excessive iptables re-writes under a number of conditions.

fixes #33693
and alleviates the symptoms of #26637

NOTE:
There are other minor fixes that I'm working on, but I'm keeping the PRs separate.

Release note:

Added iptables-min-sync-period(2) to proxy to prevent excessive iptables writes



@timothysc timothysc added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/kube-proxy release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Oct 21, 2016
@timothysc timothysc added this to the v1.5 milestone Oct 21, 2016
@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 21, 2016
@timothysc
Member Author

I'll fix up the minor stuff in a bit.

@timothysc
Member Author

Ugh, the splitting of component config + go client is a mess.

@timothysc timothysc changed the title Proxy min sync period [WIP] Proxy min sync period Oct 24, 2016
@@ -77,7 +77,8 @@ func (s *ProxyServerConfig) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.HostnameOverride, "hostname-override", s.HostnameOverride, "If non-empty, will use this string as identification instead of the actual hostname.")
fs.Var(&s.Mode, "proxy-mode", "Which proxy mode to use: 'userspace' (older) or 'iptables' (faster). If blank, look at the Node object on the Kubernetes API and respect the '"+ExperimentalProxyModeAnnotation+"' annotation if provided. Otherwise use the best-available proxy (currently iptables). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.")
fs.Int32Var(s.IPTablesMasqueradeBit, "iptables-masquerade-bit", util.Int32PtrDerefOr(s.IPTablesMasqueradeBit, 14), "If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31].")
-fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "The Maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
Member

s/Maximum/maximum

@@ -80,6 +80,9 @@ func SetDefaults_KubeProxyConfiguration(obj *KubeProxyConfiguration) {
if obj.IPTablesSyncPeriod.Duration == 0 {
obj.IPTablesSyncPeriod = unversioned.Duration{Duration: 30 * time.Second}
}
if obj.IPTablesMinSyncPeriod.Duration == 0 {
obj.IPTablesMinSyncPeriod = unversioned.Duration{Duration: 10 * time.Second}
Member

This means that changes in endpoints won't appear for 10 seconds? Too long. I was thinking 1-2 seconds max.

Member Author

So long as we have the knob to turn it down on large scale I'm ok with 2. Is there some SLO that we want to maintain @ scale (X)? /cc @gmarek @wojtek-t

@@ -77,7 +77,8 @@ func (s *ProxyServerConfig) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.HostnameOverride, "hostname-override", s.HostnameOverride, "If non-empty, will use this string as identification instead of the actual hostname.")
fs.Var(&s.Mode, "proxy-mode", "Which proxy mode to use: 'userspace' (older) or 'iptables' (faster). If blank, look at the Node object on the Kubernetes API and respect the '"+ExperimentalProxyModeAnnotation+"' annotation if provided. Otherwise use the best-available proxy (currently iptables). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.")
fs.Int32Var(s.IPTablesMasqueradeBit, "iptables-masquerade-bit", util.Int32PtrDerefOr(s.IPTablesMasqueradeBit, 14), "If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31].")
-fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "The Maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesMinSyncPeriod.Duration, "iptables-min-sync-period", s.IPTablesMinSyncPeriod.Duration, "The Minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
Member

s/Minimum/minimum

@@ -495,7 +498,10 @@ func (proxier *Proxier) OnServiceUpdate(allServices []api.Service) {
}
}
}
-proxier.syncProxyRules()
+if expired := time.Since(proxier.lastSync); expired > proxier.minsyncPeriod {
Member

This isn't right. If I am 10ms short of being able to refresh, I now have to wait for another event or for the longer sync-period. Don't you want a timer to wake up in proxier.minsyncPeriod - expired ?
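
For illustration, a minimal standalone sketch of the timer pattern being suggested here: sync immediately when the minimum period has elapsed, otherwise arm a timer that wakes up after minSyncPeriod - elapsed instead of waiting for another event. The rateLimitedSyncer type and its fields are hypothetical stand-ins for the proxier fields in the diff, not the PR's actual code.

package main

import (
	"log"
	"sync"
	"time"
)

// rateLimitedSyncer coalesces sync requests so that syncs happen at most
// once per minSyncPeriod, without dropping the last pending request.
type rateLimitedSyncer struct {
	mu            sync.Mutex
	lastSync      time.Time
	minSyncPeriod time.Duration
	timer         *time.Timer // pending deferred sync, if any
	syncFn        func()      // the expensive operation, e.g. rewriting iptables rules
}

// requestSync is called on every service/endpoints update.
func (s *rateLimitedSyncer) requestSync() {
	s.mu.Lock()
	defer s.mu.Unlock()

	elapsed := time.Since(s.lastSync)
	if elapsed >= s.minSyncPeriod {
		// Enough time has passed: sync immediately.
		s.lastSync = time.Now()
		s.syncFn()
		return
	}
	// Too soon: arm (or re-arm) a timer to fire when the minimum period
	// expires, rather than waiting for another event or the long resync.
	remaining := s.minSyncPeriod - elapsed
	if s.timer == nil {
		s.timer = time.AfterFunc(remaining, func() {
			s.mu.Lock()
			defer s.mu.Unlock()
			s.lastSync = time.Now()
			s.syncFn()
		})
	} else {
		s.timer.Reset(remaining)
	}
}

func main() {
	s := &rateLimitedSyncer{
		minSyncPeriod: 2 * time.Second,
		lastSync:      time.Now().Add(-time.Hour), // allow the first sync to run immediately
		syncFn:        func() { log.Println("syncProxyRules()") },
	}
	for i := 0; i < 5; i++ {
		s.requestSync() // a burst of updates collapses into one immediate sync plus one deferred sync
		time.Sleep(100 * time.Millisecond)
	}
	time.Sleep(3 * time.Second) // let the deferred sync fire
}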

@@ -87,6 +87,7 @@ type Proxier struct {
mu sync.Mutex // protects serviceMap
serviceMap map[proxy.ServicePortName]*serviceInfo
syncPeriod time.Duration
minsyncPeriod time.Duration
Member

s/minsync/minSync/

@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 24, 2016
@timothysc timothysc changed the title [WIP] Proxy min sync period Proxy min sync period Oct 24, 2016
@timothysc
Member Author

@thockin comments addressed, rebased, tests now passing.

@@ -77,7 +77,8 @@ func (s *ProxyServerConfig) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.HostnameOverride, "hostname-override", s.HostnameOverride, "If non-empty, will use this string as identification instead of the actual hostname.")
fs.Var(&s.Mode, "proxy-mode", "Which proxy mode to use: 'userspace' (older) or 'iptables' (faster). If blank, look at the Node object on the Kubernetes API and respect the '"+ExperimentalProxyModeAnnotation+"' annotation if provided. Otherwise use the best-available proxy (currently iptables). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.")
fs.Int32Var(s.IPTablesMasqueradeBit, "iptables-masquerade-bit", util.Int32PtrDerefOr(s.IPTablesMasqueradeBit, 14), "If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31].")
-fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "The maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesMinSyncPeriod.Duration, "iptables-min-sync-period", s.IPTablesMinSyncPeriod.Duration, "The minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
Member

What if this is > iptables-sync-period? The comment should say it can't be, and it should be validated.

Member Author

Added logic check.
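
For reference, a minimal standalone sketch of the kind of check being asked for; validateSyncPeriods is a hypothetical helper name, not the PR's actual validation code.

package main

import (
	"fmt"
	"time"
)

// validateSyncPeriods enforces the constraint discussed above: both periods
// must be positive, and the minimum sync period must not exceed the
// (maximum) sync period.
func validateSyncPeriods(syncPeriod, minSyncPeriod time.Duration) error {
	if syncPeriod <= 0 {
		return fmt.Errorf("iptables-sync-period must be greater than 0, got %v", syncPeriod)
	}
	if minSyncPeriod <= 0 {
		return fmt.Errorf("iptables-min-sync-period must be greater than 0, got %v", minSyncPeriod)
	}
	if minSyncPeriod > syncPeriod {
		return fmt.Errorf("iptables-min-sync-period (%v) must not be greater than iptables-sync-period (%v)",
			minSyncPeriod, syncPeriod)
	}
	return nil
}

func main() {
	fmt.Println(validateSyncPeriods(30*time.Second, 2*time.Second)) // <nil>
	fmt.Println(validateSyncPeriods(30*time.Second, time.Minute))   // error: min > max
}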

glog.V(6).Infof("Periodic sync")
proxier.Sync()
proxier.timer.Reset(proxier.syncPeriod)
Member

Comments say proxier.timer is mutex-protected. This doesn't take the mutex.

Member Author

fixed.

proxier.syncProxyRules()
} else if proxier.timer != nil {
remaining := proxier.minSyncPeriod - expired
glog.V(4).Infof("Service update resetting synch period %v", remaining)
Member

s/synch/sync

Member Author

derp.

@@ -179,6 +180,7 @@ func createProxier(loadBalancer LoadBalancer, listenIP net.IP, iptables iptables
serviceMap: make(map[proxy.ServicePortName]*serviceInfo),
portMap: make(map[portMapKey]*portMapValue),
syncPeriod: syncPeriod,
minSyncPeriod: minSyncPeriod,
Member

never used?

Member Author

Plumbed if needed, but I should probably add a comment to denote that.

@thockin
Member

thockin commented Oct 24, 2016

Here's what is going to happen. A Service update (create) will arrive. We sync it. Then an Endpoints update will arrive very soon thereafter. It will get stuck behind this delay. In practice, this adds at least 2 seconds to every Service creation. More if anyone tweaks that flag. Maybe not a huge deal, but worth thinking about. Can we do better?

We could do a semaphore thing and start with something like 3 tokens. Add a new token every interval up to max 3. This would give low-load systems enough freedom to enact changes right away and would bandwidth-limit high-load systems.

I'm just spitballing here...
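
A rough standalone sketch of the token-bucket idea being floated here (start with a burst of 3 tokens, refill one per interval); this only illustrates the suggestion and is not code from this PR. bprashanth's later comment in this thread describes the merged behavior in similar terms (a rate limiter with a burst).

package main

import (
	"fmt"
	"time"
)

// tokenBucket allows bursts of up to `burst` syncs, then refills one token
// per interval. A low-load system syncs immediately; a high-load system is
// bandwidth-limited to roughly one sync per interval.
type tokenBucket struct {
	tokens chan struct{}
}

func newTokenBucket(burst int, interval time.Duration) *tokenBucket {
	tb := &tokenBucket{tokens: make(chan struct{}, burst)}
	for i := 0; i < burst; i++ {
		tb.tokens <- struct{}{} // start full so initial changes apply right away
	}
	go func() {
		for range time.Tick(interval) {
			select {
			case tb.tokens <- struct{}{}: // refill one token, up to the burst cap
			default: // bucket already full
			}
		}
	}()
	return tb
}

// accept blocks until a token is available.
func (tb *tokenBucket) accept() { <-tb.tokens }

func main() {
	tb := newTokenBucket(3, 2*time.Second)
	for i := 0; i < 5; i++ {
		tb.accept() // the first three pass immediately, then roughly one every 2s
		fmt.Printf("sync %d at %s\n", i, time.Now().Format("15:04:05.000"))
	}
}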

@timothysc
Member Author

timothysc commented Oct 25, 2016

It will get stuck behind this delay. In practice, this adds at least 2 seconds to every Service creation. More if anyone tweaks that flag. Maybe not a huge deal, but worth thinking about. Can we do better?

It will add 2 seconds to the last service, not every one, as they will be batched, but I'm not terribly concerned about that. I'm far more concerned about controlling the thrashing at this point, and this puts a simple control knob on that.

We could do a semaphore thing and start with something like 3 tokens. Add a new token every interval up to max 3. This would give low-load systems enough freedom to enact changes right away and would bandwidth-limit high-load systems.

I'd like to profile before adding any more complexity. For us, the largest concern is the high amount of churn on large clusters, not the update SLO on endpoints. In profiling, the longest delay is typically the docker pull, which is orders of magnitude greater than this delay. So keeping it simple makes the most sense, unless there is a use case that I'm missing?

other comments have been addressed.

@timothysc timothysc force-pushed the proxy_min_sync branch 2 times, most recently from e76e357 to e84aa4e Compare October 25, 2016 13:17
@thockin
Member

thockin commented Oct 25, 2016

It will get stuck behind this delay. In practice, this adds at least 2 seconds to every Service creation. More if anyone tweaks that flag. Maybe not a huge deal, but worth thinking about. Can we do better?

It will add 2 seconds to the last service, not every one, as they will be batched, but I'm not terribly concerned about that. I'm far more concerned about controlling the thrashing at this point, and this puts a simple control knob on that.

Maybe I am misunderstanding the flow. The Service create will trigger OnServiceUpdate(), which will sync. That sets the timestamp. 10ms later, the Endpoints update will trigger OnEndpointsUpdate(), which will set a timer for 1990 ms later. If the normal pattern is these two operations being highly temporally coupled, it seems we should design for that case, no?

@timothysc
Member Author

timothysc commented Oct 25, 2016

10ms later, the Endpoints update will trigger OnEndpointsUpdate(), which will set a timer for 1990 ms later. If the normal pattern is these two operations being highly temporally coupled, it seems we should design for that case, no?

In a slow-churn, single-operator environment, perhaps.
Ours is a high-churn, dense, multi-tenant environment, and tuning for that case doesn't make a lot of sense, because within that window there are literally 100s or 1000s of updates, as Daniel points out here: #26637 (comment)

@dcbw
Member

dcbw commented Oct 25, 2016

@timothysc @thockin I did a patch yesterday to cut down on non-important Service/Endpoint changes, which does seem to reduce the need for resyncs somewhat. I'm currently looking into how to do an iptables diff to figure out if we really need to resync for the periodic loop, but that's a lot harder as Kube doesn't write rules in the same way as iptables-save reads them back, so we effectively have to create an iptables-save rule parser.

I know Tim said at one point that "by the time we get to iptables-restore the work is all pretty much done", but that's not the case. The Go side is fine, but it's the kernel contention of iptables-restore that is the problem for most of our customers. When I benchmarked with 4.7 kernels and 6,000 or 8,000 iptables rules, I was getting hundreds of ms of runtime for a single iptables-restore. The Go-side time for generating those rules was incidental. So I think we can get some good mileage from simply not calling iptables-restore when we don't need to.
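
As an illustration of "not calling iptables-restore when we don't need to", one simple variant is to skip the restore whenever the newly generated ruleset is byte-identical to the last one pushed; this is a weaker check than diffing against live iptables-save output, which is what is being discussed above. The restorer type below is a hypothetical sketch, not kube-proxy code.

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os/exec"
)

// restorer remembers a hash of the last ruleset it pushed and skips the
// expensive iptables-restore call when nothing has changed.
type restorer struct {
	lastHash [sha256.Size]byte
}

func (r *restorer) restoreIfChanged(rules []byte) error {
	h := sha256.Sum256(rules)
	if h == r.lastHash {
		// Identical ruleset: skip the kernel-side work entirely.
		return nil
	}
	cmd := exec.Command("iptables-restore", "--noflush")
	cmd.Stdin = bytes.NewReader(rules)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("iptables-restore failed: %v: %s", err, out)
	}
	r.lastHash = h
	return nil
}

func main() {
	r := &restorer{}
	rules := []byte("*filter\n:INPUT ACCEPT [0:0]\nCOMMIT\n")
	fmt.Println(r.restoreIfChanged(rules)) // runs iptables-restore (needs privileges)
	fmt.Println(r.restoreIfChanged(rules)) // no-op: hash unchanged
}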

@k8s-github-robot k8s-github-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 4, 2016
@thockin thockin added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2016
@k8s-ci-robot
Contributor

Jenkins GCI GKE smoke e2e failed for commit 1cb97b8. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit dd53b74 into kubernetes:master Nov 4, 2016
@wojtek-t
Member

wojtek-t commented Nov 5, 2016

@k8s-oncall @thockin @timothysc - it seems that this PR broke the gke-slow suite (it's been mostly red since then and nothing is merging)

@wojtek-t
Member

wojtek-t commented Nov 5, 2016

@saad-ali ^^

@saad-ali
Member

saad-ali commented Nov 5, 2016

@bprashanth I merged your PR #36282. Does this need to be reverted as well?

@wojtek-t
Member

wojtek-t commented Nov 5, 2016

@saad-ali - I didn't see that PR; maybe that one actually fixes the problem.

@timothysc
Member Author

Looks like #36282 just needs a timing shift, which makes sense, but I'm wondering how the assortment of other suites didn't catch this?

@bprashanth
Contributor

It's still very likely this broke the slow suite. My PR was supposed to be a short-term mitigation until we've had a chance to debug #36281 (the suite is still flaking), @MrHohn

@saad-ali
Member

saad-ali commented Nov 5, 2016

Yes, looks like PR #36282 took care of it. https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/ has been green since.

@bprashanth
Contributor

Are you sure? AFK right now, but last I checked it flaked this morning and the PR went in last night. I just don't want to cry wolf without evidence :)

@saad-ali
Member

saad-ali commented Nov 5, 2016

There are multiple slow suites. The one I looked at, kubernetes-e2e-gci-gke-slow, has been green since PR #36282 went in at 12:37 AM PDT. Same with kubernetes-e2e-gce-slow. However, kubernetes-e2e-gke-slow failed as recently as 2016-11-05 1:23 PM PDT. So it's possible PR #36282 just reduced the frequency of failure, but we'd need a smoking gun to revert this PR.

@bprashanth
Contributor

Did I just forget to read time? I see 8 flakes since 12:30 AM last night.
This PR is surely contributing to the janky performance observed in #36281.

Because of the way kube-proxy is structured, we're inserting a token into the bucket every 2 seconds for the iptables update, and sending a no-op update down the watch every second (2 every 2 seconds, from the scheduler and kube-controller-manager). So we end up filling the queue with no-op endpoint updates and blocking an interesting service update.

@timothysc if you're interested in doing this right, you should look at the watch processing pipeline. This PR calls Accept() while holding the syncProxyRules lock (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L782, https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L422), which is effectively a time.Sleep(1 or 2 seconds) given what I described above. The on-update functions receive a snapshot of endpoints/services on every update. Meaning:

t:1 (bucket empty)
no-op update
no-op snapshot
lock
accept() <- sleep(2)

next no-op update
next no-op snapshot
lock <- block

svc update
svc snapshot
lock <- block

t:3 (bucket has a token)
t:1 process no-op snapshot
next no-op update
next no-op snapshot
lock <- block

This can essentially go on forever, just processing no-op updates from the endpoint handler, because Go's locks aren't fair. Note that the "burst" will only help if you actually get 0 updates for a while, such that you accumulate enough tokens.

What we were doing previously is also bad, but the no-op updates would run ASAP, leaving the channel empty for more important updates.

I will probably send a PR to default it to the old behavior soonish. We can fix this as a bug during code freeze.
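
A tiny standalone sketch of the failure mode described above: when every update handler waits on a blocking Accept() while holding the shared lock, the rate limiter's delay is paid serially by every queued caller, and the interesting service update can land behind the no-ops. The names (blockingAccept, onUpdate) are hypothetical, not the proxier's code; moving the wait outside the critical section, or deferring the sync with a timer as in the earlier sketch, avoids the pile-up.

package main

import (
	"fmt"
	"sync"
	"time"
)

// blockingAccept stands in for a rate limiter's Accept(): it sleeps until
// the next token would be available (here, a fixed 2s interval).
func blockingAccept() { time.Sleep(2 * time.Second) }

var mu sync.Mutex // stands in for the proxier lock

// onUpdate mimics an update handler that takes the lock and then waits for
// the rate limiter *inside* the critical section.
func onUpdate(name string, wg *sync.WaitGroup) {
	defer wg.Done()
	mu.Lock()
	defer mu.Unlock()
	blockingAccept() // every caller pays the full delay while holding the lock
	fmt.Printf("%s applied at %s\n", name, time.Now().Format("15:04:05"))
}

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for _, name := range []string{"no-op endpoints 1", "no-op endpoints 2", "service update"} {
		wg.Add(1)
		go onUpdate(name, &wg)
	}
	wg.Wait()
	// With the wait inside the lock, three updates take ~6s instead of ~2s,
	// and since Go mutexes aren't fair the service update may be processed last.
	fmt.Printf("total: %v\n", time.Since(start).Round(time.Second))
}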

@bprashanth
Contributor

Ah, @MrHohn has empirical evidence #36281 (comment)

dcbw added a commit to dcbw/kubernetes that referenced this pull request Nov 9, 2016
iptables-restore is a very heavy operation and depending on the kernel,
the CPU, and the number of rules to restore, can take a very long time
(~500ms or more).

i7-5600U @ 2.6GHz: 700ms for 1000 services (2+4 cores+threads, kernel 4.6.6)
i7-4790  @ 3.6GHz: 270ms for 1000 services (4+8 cores+threads, kernel 4.6.7)

Other parts of kubernetes (eg kubenet) might be running iptables-restore
too, leading to some pileup as each iptables-restore waits for others to
complete.

Related: kubernetes#26637
Related: kubernetes#33693
Related: kubernetes#35334
dcbw added a commit to dcbw/kubernetes that referenced this pull request Nov 18, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 1, 2016
Labels
area/kube-proxy kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

bulk service or endpoint operations are computationally expensive
10 participants