
Proxy min sync period #35334

Merged
merged 2 commits into from Nov 4, 2016

Conversation

timothysc
Member

@timothysc timothysc commented Oct 21, 2016

What this PR does / why we need it:
Gives the proxy the option to set a lower bound on the sync period when there are a high number of endpoint changes. This prevents excessive iptables re-writes under a number of conditions.

fixes #33693
and alleviates the symptoms of #26637

NOTE:
There are other minor fixes that I'm working on, but I'm keeping the PRs separate.

Release note:

Added iptables-min-sync-period(2) to proxy to prevent excessive iptables writes



@timothysc timothysc added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/kube-proxy release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Oct 21, 2016
@timothysc timothysc added this to the v1.5 milestone Oct 21, 2016
@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 21, 2016
@timothysc
Member Author

I'll fix up the minor stuff in a bit.

@timothysc
Member Author

Ugh, the splitting of component config + go client is a mess.

@timothysc timothysc changed the title Proxy min sync period [WIP] Proxy min sync period Oct 24, 2016
@@ -77,7 +77,8 @@ func (s *ProxyServerConfig) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.HostnameOverride, "hostname-override", s.HostnameOverride, "If non-empty, will use this string as identification instead of the actual hostname.")
fs.Var(&s.Mode, "proxy-mode", "Which proxy mode to use: 'userspace' (older) or 'iptables' (faster). If blank, look at the Node object on the Kubernetes API and respect the '"+ExperimentalProxyModeAnnotation+"' annotation if provided. Otherwise use the best-available proxy (currently iptables). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.")
fs.Int32Var(s.IPTablesMasqueradeBit, "iptables-masquerade-bit", util.Int32PtrDerefOr(s.IPTablesMasqueradeBit, 14), "If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31].")
-fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "The Maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
Member

s/Maximum/maximum

@@ -80,6 +80,9 @@ func SetDefaults_KubeProxyConfiguration(obj *KubeProxyConfiguration) {
if obj.IPTablesSyncPeriod.Duration == 0 {
obj.IPTablesSyncPeriod = unversioned.Duration{Duration: 30 * time.Second}
}
if obj.IPTablesMinSyncPeriod.Duration == 0 {
obj.IPTablesMinSyncPeriod = unversioned.Duration{Duration: 10 * time.Second}
Member

This means that changes in endpoints won't appear for 10 seconds? Too long. I was thinking 1-2 seconds max.

Member Author

So long as we have the knob to turn it down on large scale I'm ok with 2. Is there some SLO that we want to maintain @ scale (X)? /cc @gmarek @wojtek-t

@@ -77,7 +77,8 @@ func (s *ProxyServerConfig) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.HostnameOverride, "hostname-override", s.HostnameOverride, "If non-empty, will use this string as identification instead of the actual hostname.")
fs.Var(&s.Mode, "proxy-mode", "Which proxy mode to use: 'userspace' (older) or 'iptables' (faster). If blank, look at the Node object on the Kubernetes API and respect the '"+ExperimentalProxyModeAnnotation+"' annotation if provided. Otherwise use the best-available proxy (currently iptables). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.")
fs.Int32Var(s.IPTablesMasqueradeBit, "iptables-masquerade-bit", util.Int32PtrDerefOr(s.IPTablesMasqueradeBit, 14), "If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31].")
-fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "The Maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesMinSyncPeriod.Duration, "iptables-min-sync-period", s.IPTablesMinSyncPeriod.Duration, "The Minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
Member

s/Minimum/minimum

@@ -495,7 +498,10 @@ func (proxier *Proxier) OnServiceUpdate(allServices []api.Service) {
}
}
}
-proxier.syncProxyRules()
+if expired := time.Since(proxier.lastSync); expired > proxier.minsyncPeriod {
Member

This isn't right. If I am 10ms short of being able to refresh, I now have to wait for another event or for the longer sync-period. Don't you want a timer to wake up in proxier.minsyncPeriod - expired ?
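
For illustration, a minimal standalone sketch of the timer pattern being suggested here: sync immediately when the minimum period has elapsed, otherwise arm a timer that wakes up after minSyncPeriod - elapsed instead of waiting for another event. The rateLimitedSyncer type and its fields are hypothetical stand-ins for the proxier fields in the diff, not the PR's actual code.

package main

import (
	"log"
	"sync"
	"time"
)

// rateLimitedSyncer coalesces sync requests so that syncs happen at most
// once per minSyncPeriod, without dropping the last pending request.
type rateLimitedSyncer struct {
	mu            sync.Mutex
	lastSync      time.Time
	minSyncPeriod time.Duration
	timer         *time.Timer // pending deferred sync, if any
	syncFn        func()      // the expensive operation, e.g. rewriting iptables rules
}

// requestSync is called on every service/endpoints update.
func (s *rateLimitedSyncer) requestSync() {
	s.mu.Lock()
	defer s.mu.Unlock()

	elapsed := time.Since(s.lastSync)
	if elapsed >= s.minSyncPeriod {
		// Enough time has passed: sync immediately.
		s.lastSync = time.Now()
		s.syncFn()
		return
	}
	// Too soon: arm (or re-arm) a timer to fire when the minimum period
	// expires, rather than waiting for another event or the long resync.
	remaining := s.minSyncPeriod - elapsed
	if s.timer == nil {
		s.timer = time.AfterFunc(remaining, func() {
			s.mu.Lock()
			defer s.mu.Unlock()
			s.lastSync = time.Now()
			s.syncFn()
		})
	} else {
		s.timer.Reset(remaining)
	}
}

func main() {
	s := &rateLimitedSyncer{
		minSyncPeriod: 2 * time.Second,
		lastSync:      time.Now().Add(-time.Hour), // allow the first sync to run immediately
		syncFn:        func() { log.Println("syncProxyRules()") },
	}
	for i := 0; i < 5; i++ {
		s.requestSync() // a burst of updates collapses into one immediate sync plus one deferred sync
		time.Sleep(100 * time.Millisecond)
	}
	time.Sleep(3 * time.Second) // let the deferred sync fire
}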

@@ -87,6 +87,7 @@ type Proxier struct {
mu sync.Mutex // protects serviceMap
serviceMap map[proxy.ServicePortName]*serviceInfo
syncPeriod time.Duration
minsyncPeriod time.Duration
Member

s/minsync/minSync/

@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 24, 2016
@timothysc timothysc changed the title [WIP] Proxy min sync period Proxy min sync period Oct 24, 2016
@timothysc
Member Author

@thockin comments addressed, rebased, tests now passing.

@@ -77,7 +77,8 @@ func (s *ProxyServerConfig) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&s.HostnameOverride, "hostname-override", s.HostnameOverride, "If non-empty, will use this string as identification instead of the actual hostname.")
fs.Var(&s.Mode, "proxy-mode", "Which proxy mode to use: 'userspace' (older) or 'iptables' (faster). If blank, look at the Node object on the Kubernetes API and respect the '"+ExperimentalProxyModeAnnotation+"' annotation if provided. Otherwise use the best-available proxy (currently iptables). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.")
fs.Int32Var(s.IPTablesMasqueradeBit, "iptables-masquerade-bit", util.Int32PtrDerefOr(s.IPTablesMasqueradeBit, 14), "If using the pure iptables proxy, the bit of the fwmark space to mark packets requiring SNAT with. Must be within the range [0, 31].")
-fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "How often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesSyncPeriod.Duration, "iptables-sync-period", s.IPTablesSyncPeriod.Duration, "The maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
+fs.DurationVar(&s.IPTablesMinSyncPeriod.Duration, "iptables-min-sync-period", s.IPTablesMinSyncPeriod.Duration, "The minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m'). Must be greater than 0.")
Member

What if this is > iptables-sync-period? The comment should say it can't be, and it should be validated.

Member Author

Added logic check.
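
For reference, a minimal standalone sketch of the kind of check being asked for; validateSyncPeriods is a hypothetical helper name, not the PR's actual validation code.

package main

import (
	"fmt"
	"time"
)

// validateSyncPeriods enforces the constraint discussed above: both periods
// must be positive, and the minimum sync period must not exceed the
// (maximum) sync period.
func validateSyncPeriods(syncPeriod, minSyncPeriod time.Duration) error {
	if syncPeriod <= 0 {
		return fmt.Errorf("iptables-sync-period must be greater than 0, got %v", syncPeriod)
	}
	if minSyncPeriod <= 0 {
		return fmt.Errorf("iptables-min-sync-period must be greater than 0, got %v", minSyncPeriod)
	}
	if minSyncPeriod > syncPeriod {
		return fmt.Errorf("iptables-min-sync-period (%v) must not be greater than iptables-sync-period (%v)",
			minSyncPeriod, syncPeriod)
	}
	return nil
}

func main() {
	fmt.Println(validateSyncPeriods(30*time.Second, 2*time.Second)) // <nil>
	fmt.Println(validateSyncPeriods(30*time.Second, time.Minute))   // error: min > max
}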

glog.V(6).Infof("Periodic sync")
proxier.Sync()
proxier.timer.Reset(proxier.syncPeriod)
Member

Comments say proxier.timer is mutex-protected. This doesn't take the mutex.

Member Author

fixed.

proxier.syncProxyRules()
} else if proxier.timer != nil {
remaining := proxier.minSyncPeriod - expired
glog.V(4).Infof("Service update resetting synch period %v", remaining)
Member

s/synch/sync

Member Author

derp.

@@ -179,6 +180,7 @@ func createProxier(loadBalancer LoadBalancer, listenIP net.IP, iptables iptables
serviceMap: make(map[proxy.ServicePortName]*serviceInfo),
portMap: make(map[portMapKey]*portMapValue),
syncPeriod: syncPeriod,
minSyncPeriod: minSyncPeriod,
Member

never used?

Member Author

Plumbed if needed, but I should probably add a comment to denote that.

@thockin
Member

thockin commented Oct 24, 2016

Here's what is going to happen. A Service update (create) will arrive. We sync it. Then an Endpoints update will arrive very soon thereafter. It will get stuck behind this delay. In practice, this adds at least 2 seconds to every Service creation. More if anyone tweaks that flag. Maybe not a huge deal, but worth thinking about. Can we do better?

We could do a semaphore thing and start with something like 3 tokens. Add a new token every interval up to max 3. This would give low-load systems enough freedom to enact changes right away and would bandwidth-limit high-load systems.

I'm just spitballing here...
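
A rough standalone sketch of the token-bucket idea being floated here (start with a burst of 3 tokens, refill one per interval); this only illustrates the suggestion and is not code from this PR. bprashanth's later comment in this thread describes the merged behavior in similar terms (a rate limiter with a burst).

package main

import (
	"fmt"
	"time"
)

// tokenBucket allows bursts of up to `burst` syncs, then refills one token
// per interval. A low-load system syncs immediately; a high-load system is
// bandwidth-limited to roughly one sync per interval.
type tokenBucket struct {
	tokens chan struct{}
}

func newTokenBucket(burst int, interval time.Duration) *tokenBucket {
	tb := &tokenBucket{tokens: make(chan struct{}, burst)}
	for i := 0; i < burst; i++ {
		tb.tokens <- struct{}{} // start full so initial changes apply right away
	}
	go func() {
		for range time.Tick(interval) {
			select {
			case tb.tokens <- struct{}{}: // refill one token, up to the burst cap
			default: // bucket already full
			}
		}
	}()
	return tb
}

// accept blocks until a token is available.
func (tb *tokenBucket) accept() { <-tb.tokens }

func main() {
	tb := newTokenBucket(3, 2*time.Second)
	for i := 0; i < 5; i++ {
		tb.accept() // the first three pass immediately, then roughly one every 2s
		fmt.Printf("sync %d at %s\n", i, time.Now().Format("15:04:05.000"))
	}
}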

@timothysc
Member Author

timothysc commented Oct 25, 2016

It will get stuck behind this delay. In practice, this adds at least 2 seconds to every Service creation. More if anyone tweaks that flag. Maybe not a huge deal, but worth thinking about. Can we do better?

It will add 2 seconds to the last service, not every one, as they will be batched, but I'm not terribly concerned about that. I'm far more concerned about controlling the thrashing at this point, and this puts a simple control knob on that.

We could do a semaphore thing and start with something like 3 tokens. Add a new token every interval up to max 3. This would give low-load systems enough freedom to enact changes right away and would bandwidth-limit high-load systems.

I'd like to profile before adding any more complexity. For us, the largest concern is the high amount of churn on large clusters, not the update SLO on endpoints. In profiling, the longest delay is typically the docker pull, which is orders of magnitude greater than this delay. So keeping it simple makes the most sense, unless there is a use case that I'm missing?

other comments have been addressed.

@timothysc timothysc force-pushed the proxy_min_sync branch 2 times, most recently from e76e357 to e84aa4e Compare October 25, 2016 13:17
@thockin
Member

thockin commented Oct 25, 2016

It will get stuck behind this delay. In practice, this adds at least 2 seconds to every Service creation. More if anyone tweaks that flag. Maybe not a huge deal, but worth thinking about. Can we do better?

It will add 2 seconds to the last service, not every one, as they will be batched, but I'm not terribly concerned about that. I'm far more concerned about controlling the thrashing at this point, and this puts a simple control knob on that.

Maybe I am misunderstanding the flow. The Service create will trigger OnServiceUpdate(), which will sync. That sets the timestamp. 10ms later, the Endpoints update will trigger OnEndpointsUpdate(), which will set a timer for 1990 ms later. If the normal pattern is these two operations being highly temporally coupled, it seems we should design for that case, no?

@timothysc
Member Author

timothysc commented Oct 25, 2016

10ms later, the Endpoints update will trigger OnEndpointsUpdate(), which will set a timer for 1990 ms later. If the normal pattern is these two operations being highly temporally coupled, it seems we should design for that case, no?

In a slow-churn, single-operator environment, perhaps.
Ours is a high-churn, dense, multi-tenant environment, and tuning for that case doesn't make a lot of sense, because within that window there are literally 100s or 1000s of updates, as Daniel points out here: #26637 (comment)

@dcbw
Member

dcbw commented Oct 25, 2016

@timothysc @thockin I did a patch yesterday to cut down on non-important Service/Endpoint changes, which does seem to reduce the need for resyncs somewhat. I'm currently looking into how to do an iptables diff to figure out if we really need to resync for the periodic loop, but that's a lot harder as Kube doesn't write rules in the same way as iptables-save reads them back, so we effectively have to create an iptables-save rule parser.

I know Tim said at one point that "by the time we get to iptables-restore the work is all pretty much done", but that's not the case. The Go side is fine, but it's the kernel contention of iptables-restore that is the problem for most of our customers. When I benchmarked with 4.7 kernels and 6,000 or 8,000 iptables rules, I was getting hundreds of ms of runtime for a single iptables-restore. The Go-side time for generating those rules was incidental. So I think we can get some good mileage from simply not calling iptables-restore when we don't need to.
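
As an illustration of "not calling iptables-restore when we don't need to", one simple variant is to skip the restore whenever the newly generated ruleset is byte-identical to the last one pushed; this is a weaker check than diffing against live iptables-save output, which is what is being discussed above. The restorer type below is a hypothetical sketch, not kube-proxy code.

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os/exec"
)

// restorer remembers a hash of the last ruleset it pushed and skips the
// expensive iptables-restore call when nothing has changed.
type restorer struct {
	lastHash [sha256.Size]byte
}

func (r *restorer) restoreIfChanged(rules []byte) error {
	h := sha256.Sum256(rules)
	if h == r.lastHash {
		// Identical ruleset: skip the kernel-side work entirely.
		return nil
	}
	cmd := exec.Command("iptables-restore", "--noflush")
	cmd.Stdin = bytes.NewReader(rules)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("iptables-restore failed: %v: %s", err, out)
	}
	r.lastHash = h
	return nil
}

func main() {
	r := &restorer{}
	rules := []byte("*filter\n:INPUT ACCEPT [0:0]\nCOMMIT\n")
	fmt.Println(r.restoreIfChanged(rules)) // runs iptables-restore (needs privileges)
	fmt.Println(r.restoreIfChanged(rules)) // no-op: hash unchanged
}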

@k8s-github-robot k8s-github-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 4, 2016
@thockin thockin added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2016
@k8s-ci-robot
Contributor

Jenkins GCI GKE smoke e2e failed for commit 1cb97b8. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit dd53b74 into kubernetes:master Nov 4, 2016
@wojtek-t
Member

wojtek-t commented Nov 5, 2016

@k8s-oncall @thockin @timothysc - it seems that this PR broke the gke-slow suite (it's been mostly red since then and nothing is merging)

@wojtek-t
Member

wojtek-t commented Nov 5, 2016

@saad-ali ^^

@saad-ali
Member

saad-ali commented Nov 5, 2016

@bprashanth I merged your PR #36282. Does this need to be reverted as well?

@wojtek-t
Member

wojtek-t commented Nov 5, 2016

@saad-ali - I didn't see that PR; maybe that one actually fixes the problem.

@timothysc
Member Author

Looks like #36282 just needs a timing shift, which makes sense, but I'm wondering how the assortment of other suites didn't catch this?

@bprashanth
Contributor

It's still very likely this broke the slow suite. My PR was supposed to be a short-term mitigation until we've had a chance to debug #36281 (the suite is still flaking), @MrHohn

@saad-ali
Member

saad-ali commented Nov 5, 2016

Yes, looks like PR #36282 took care of it. https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/kubernetes-e2e-gci-gke-slow/ has been green since.

@bprashanth
Contributor

Are you sure? AFK right now, but last I checked it flaked this morning and the PR went in last night. I just don't want to cry wolf without evidence :)

@saad-ali
Member

saad-ali commented Nov 5, 2016

There are multiple slow suites. The one I looked at, kubernetes-e2e-gci-gke-slow, has been green since PR #36282 went in at 12:37 AM PDT. Same with kubernetes-e2e-gce-slow. However, kubernetes-e2e-gke-slow failed as recently as 2016-11-05 1:23 PM PDT. So it's possible PR #36282 just reduced the frequency of failure, but we'd need a smoking gun to revert this PR.

@bprashanth
Contributor

Did I just forget to read time? I see 8 flakes since 12:30 AM last night.
This PR is surely contributing to the janky performance observed in #36281.

Because of the way kube-proxy is structured, we're inserting a token into the bucket every 2 seconds for the iptables update, and sending a no-op update down the watch every second (2 every 2 seconds, from the scheduler and kube-controller-manager). So we end up filling the queue with no-op endpoint updates and blocking an interesting service update.

@timothysc if you're interested in doing this right, you should look at the watch processing pipeline. This PR calls Accept() while holding the syncProxyRules lock (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L782, https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L422), which is effectively a time.Sleep(1 or 2 seconds) given what I described above. The on-update functions receive a snapshot of endpoints/services on every update. Meaning:

t:1 (bucket empty)
no-op update
no-op snapshot
lock
accept() <- sleep(2)

next no-op update
next no-op snapshot
lock <- block

svc update
svc snapshot
lock <- block

t:3 (bucket has a token)
t:1 process no-op snapshot
next no-op update
next no-op snapshot
lock <- block

This can essentially go on forever, just processing no-op updates from the endpoint handler, because Go's locks aren't fair. Note that the "burst" will only help if you actually get 0 updates for a while, such that you accumulate enough tokens.

What we were doing previously is also bad, but the no-op updates would run ASAP, leaving the channel empty for more important updates.

I will probably send a PR to default it to the old behavior soonish. We can fix this as a bug during code freeze.
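
A tiny standalone sketch of the failure mode described above: when every update handler waits on a blocking Accept() while holding the shared lock, the rate limiter's delay is paid serially by every queued caller, and the interesting service update can land behind the no-ops. The names (blockingAccept, onUpdate) are hypothetical, not the proxier's code; moving the wait outside the critical section, or deferring the sync with a timer as in the earlier sketch, avoids the pile-up.

package main

import (
	"fmt"
	"sync"
	"time"
)

// blockingAccept stands in for a rate limiter's Accept(): it sleeps until
// the next token would be available (here, a fixed 2s interval).
func blockingAccept() { time.Sleep(2 * time.Second) }

var mu sync.Mutex // stands in for the proxier lock

// onUpdate mimics an update handler that takes the lock and then waits for
// the rate limiter *inside* the critical section.
func onUpdate(name string, wg *sync.WaitGroup) {
	defer wg.Done()
	mu.Lock()
	defer mu.Unlock()
	blockingAccept() // every caller pays the full delay while holding the lock
	fmt.Printf("%s applied at %s\n", name, time.Now().Format("15:04:05"))
}

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for _, name := range []string{"no-op endpoints 1", "no-op endpoints 2", "service update"} {
		wg.Add(1)
		go onUpdate(name, &wg)
	}
	wg.Wait()
	// With the wait inside the lock, three updates take ~6s instead of ~2s,
	// and since Go mutexes aren't fair the service update may be processed last.
	fmt.Printf("total: %v\n", time.Since(start).Round(time.Second))
}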

@bprashanth
Contributor

Ah, @MrHohn has empirical evidence #36281 (comment)

dcbw added a commit to dcbw/kubernetes that referenced this pull request Nov 9, 2016
iptables-restore is a very heavy operation and depending on the kernel,
the CPU, and the number of rules to restore, can take a very long time
(~500ms or more).

i7-5600U @ 2.6GHz: 700ms for 1000 services (2+4 cores+threads, kernel 4.6.6)
i7-4790  @ 3.6GHz: 270ms for 1000 services (4+8 cores+threads, kernel 4.6.7)

Other parts of kubernetes (eg kubenet) might be running iptables-restore
too, leading to some pileup as each iptables-restore waits for others to
complete.

Related: kubernetes#26637
Related: kubernetes#33693
Related: kubernetes#35334
dcbw added a commit to dcbw/kubernetes that referenced this pull request Nov 18, 2016
dcbw added a commit to dcbw/kubernetes that referenced this pull request Dec 1, 2016
Labels
area/kube-proxy kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

bulk service or endpoint operations are computationally expensive
10 participants