Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed integer overflow bug in rate limiter. #31396

Merged
merged 1 commit into from
Aug 25, 2016

Conversation

foxish
Copy link
Contributor

@foxish foxish commented Aug 25, 2016

Fix overflow issue in controller-manager rate limiter

This PR fixes a bug in the delayed work-queue used by some controllers.
The integer overflow bug would previously cause hotlooping behavior after a few failures
as time.Duration(..) on values larger than MaxInt64 behaves unpredictably, and
after a certain value returns 0 always.

cc @bprashanth @pwittrock


This change is Reviewable

@foxish foxish added the kind/bug Categorizes issue or PR as related to a bug. label Aug 25, 2016
@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-label-needed labels Aug 25, 2016
// The maximum exponent maxExp is chosen such that the 'calculated' value never
// exceeds MaxInt64.
maxExp := int(math.Log10(float64(math.MaxInt64)/float64(r.baseDelay.Nanoseconds())) - 1)
if exp > maxExp {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe just cap it? the probelm is that time.Duration can't handle the float right

backoff := math.Pod10(r.failures[items])
if backoff >= math.MaxInt64 {
  return r.maxDelay
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, simplified with an early return.

@bprashanth
Copy link
Contributor

Just the one nit, thanks

@k8s-bot
Copy link

k8s-bot commented Aug 25, 2016

GCE e2e build/test passed for commit 00e4188.

@pwittrock
Copy link
Member

Review status: 0 of 2 files reviewed at latest revision, 3 unresolved discussions.


pkg/util/workqueue/default_rate_limiters.go, line 97 [r2] (raw file):

  // The backoff is capped such that 'calculated' value never overflows.
  backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow10(exp)

I am surprised the exponent base is not 2 at the most. 10 seems really really steep (100 ms -> 1 s -> 10 s -> 100 s -> 1000s).


pkg/util/workqueue/default_rate_limiters.go, line 98 [r2] (raw file):

  // The backoff is capped such that 'calculated' value never overflows.
  backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow10(exp)
  if backoff > math.MaxInt64 {

I wonder what we should do here. I suspect the item probably shouldn't even be enqueued at this point. FWIW MaxInt64 in ns is ~500 Years. Will probably do a master upgrade before the duration expires :)

FWIW, I always liked this rater limiter implementation https://github.com/google/guava/blob/master/guava/src/com/google/common/util/concurrent/RateLimiter.java


Comments from Reviewable

@pwittrock
Copy link
Member

Review status: 0 of 2 files reviewed at latest revision, 4 unresolved discussions.


pkg/util/workqueue/default_rate_limiters.go, line 97 [r2] (raw file):

  // The backoff is capped such that 'calculated' value never overflows.
  backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow10(exp)

We could potentially just short circuit this whole thing and say if the exp > 20, return or something.


Comments from Reviewable

@foxish
Copy link
Contributor Author

foxish commented Aug 25, 2016

pkg/util/workqueue/default_rate_limiters.go, line 98 [r2] (raw file):

Previously, pwittrock (Phillip Wittrock) wrote…

I wonder what we should do here. I suspect the item probably shouldn't even be enqueued at this point. FWIW MaxInt64 in ns is ~500 Years. Will probably do a master upgrade before the duration expires :)

FWIW, I always liked this rater limiter implementation https://github.com/google/guava/blob/master/guava/src/com/google/common/util/concurrent/RateLimiter.java

We cap the wait duration to maxDelay, which is typically 1000s (16 mins), so, we don't ever wait for that long, so, I don't think this is hit because of an error condition.

Comments from Reviewable

@foxish
Copy link
Contributor Author

foxish commented Aug 25, 2016

pkg/util/workqueue/default_rate_limiters.go, line 97 [r2] (raw file):

Previously, pwittrock (Phillip Wittrock) wrote…

We could potentially just short circuit this whole thing and say if the exp > 20, return or something.

I feel that is more brittle because it's the product of the baseDelay with the exp that we're checking. If the baseDelay is bigger, we might overflow before we hit 20 for example.

Comments from Reviewable

@deads2k
Copy link
Contributor

deads2k commented Aug 25, 2016

I don't have a 2 vs 10 preference.

lgtm. Seems like this should be 1.4.

@smarterclayton
Copy link
Contributor

This may be cherry pick candidate?

@pwittrock
Copy link
Member

@deads2k I agree that this should be in 1.4. Conviniently I am the release czar that gets to make those decisions :)
@smarterclayton Do you mean cherry pick into 1.3? I am hoping we can get this into 1.4 during the code slush.

@pwittrock
Copy link
Member

Reviewed 1 of 2 files at r1.
Review status: 0 of 2 files reviewed at latest revision, 4 unresolved discussions.


Comments from Reviewable

@smarterclayton
Copy link
Contributor

smarterclayton commented Aug 25, 2016 via email

@pwittrock
Copy link
Member

+lgtm


Reviewed 1 of 2 files at r1.
Review status: 1 of 2 files reviewed at latest revision, 4 unresolved discussions.


Comments from Reviewable

@pwittrock pwittrock added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 25, 2016
@pwittrock pwittrock added this to the v1.4 milestone Aug 25, 2016
@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Aug 25, 2016
@pwittrock pwittrock added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. labels Aug 25, 2016
@pwittrock pwittrock added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 25, 2016
@pwittrock
Copy link
Member

@foxish Thanks

@k8s-github-robot
Copy link

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

calculated := r.baseDelay * time.Duration(math.Pow10(r.failures[item]-1))
// The backoff is capped such that 'calculated' value never overflows.
backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow10(exp)
if backoff > math.MaxInt64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please cap exp before the Pow10 instead of doing a float64 comparison.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry didn't catch this last night. Youre implicitly converting maxint64 into a float larger than maxint64. You can early return with something like exp >= int64(math.Log10(math.MaxInt64)) instaed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Will follow up with another PR fixing this.

@k8s-bot
Copy link

k8s-bot commented Aug 25, 2016

GCE e2e build/test passed for commit 00e4188.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit c63cd8f into kubernetes:master Aug 25, 2016
k8s-github-robot pushed a commit that referenced this pull request Aug 26, 2016
Automatic merge from submit-queue

Petset hotlooping issue.

Part of the fix for #27634
It completely fixes it when we also get #31396 merged.
k8s-github-robot pushed a commit to kubernetes-retired/contrib that referenced this pull request Sep 27, 2016
Automatic merge from submit-queue

Bump glbc to 0.8.0

Don't think this hits the bar for 1.4.0, but hopefully it can make 1.4.1. The version bump is for the godep update that fixes an issue with the throttling workqueue (kubernetes/kubernetes#31396). I should've done this sooner, dropped it.

Also fixes #1776 and #1783
aledbf pushed a commit to aledbf/contrib that referenced this pull request Nov 10, 2016
Automatic merge from submit-queue

Bump glbc to 0.8.0

Don't think this hits the bar for 1.4.0, but hopefully it can make 1.4.1. The version bump is for the godep update that fixes an issue with the throttling workqueue (kubernetes/kubernetes#31396). I should've done this sooner, dropped it.

Also fixes kubernetes-retired#1776 and kubernetes-retired#1783
@foxish foxish deleted the integer-overflow branch May 16, 2019 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller-manager kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants