GCE: Fix operation polling and error handling #64630

Merged 2 commits on Jun 14, 2018

Conversation

nicksardo
Contributor

@nicksardo nicksardo commented Jun 1, 2018

Cloud functions using the generated API burst operation GET calls because the poller does not wait a minimum amount of time between polls.

Fixes #64712
Fixes #64858

Changes

  • operationPollInterval is now 1 second instead of 3 seconds.
  • operationPollRateLimiter is now configured with 5 QPS / 5 burst instead of 10 QPS / 10 burst.
  • gceRateLimiter is now configured with a MinimumRateLimiter to wait the above operationPollInterval duration before waiting on the token rate limiter.
  • Operations are now rate limited on the very first GET call.
  • Operations are polled until DONE or context times out (even if operations.get fails continuously).
  • Compute operations are checked for errors when they're recognized as DONE.
  • All "wrapper" funcs now generate a context with an hour timeout.

ingress-gce will need to update its vendored copy and use the MinimumRateLimiter as well. Because ingress creates rate limiters from flags, we'll need to check the resource type and operation while parsing the flags and wrap the appropriate limiter.

Special notes for your reviewer:
/assign bowei
/cc bowei

Fix Example
Creating an external load balancer

without fix: https://pastebin.com/raw/NNkeNWS3
with fix: https://pastebin.com/raw/x2iMLW5S (a difference of about 200 GET calls)

Release note:

GCE: Fixes operation polling to adhere to the specified interval. Furthermore, operation errors are now returned instead of ignored.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 1, 2018
@k8s-ci-robot k8s-ci-robot requested a review from bowei June 1, 2018 20:24
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 1, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 8, 2018
@verult
Contributor

verult commented Jun 11, 2018

/cc

@k8s-ci-robot k8s-ci-robot requested a review from verult June 11, 2018 17:20
@nicksardo nicksardo force-pushed the fix-op-rate branch 5 times, most recently from 9210eaa to 72f68e2 Compare June 11, 2018 21:56
@nicksardo
Contributor Author

/cc @rramkumar1

@k8s-ci-robot k8s-ci-robot requested a review from rramkumar1 June 11, 2018 22:04
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jun 12, 2018
@nicksardo nicksardo force-pushed the fix-op-rate branch 2 times, most recently from f69e15c to 20b3bcc Compare June 12, 2018 16:14
@@ -27,6 +27,10 @@ import (
ga "google.golang.org/api/compute/v1"
)

const (
maxOperationGetErrorStreak = 50
Contributor

I would call this maxConsecutiveOperationGetErrors

Contributor Author

Sure

// If an error occurs retrieving the operation, the loop will continue for `maxOpGetRetries` then
// finally error. This is to prevent a transient error from bubbling up to controller-level logic.
func (s *Service) pollOperation(ctx context.Context, op operation) error {
var errorStreak int
Contributor

consecutiveErrors

@nicksardo nicksardo force-pushed the fix-op-rate branch 2 times, most recently from 156bff2 to 891080a Compare June 12, 2018 16:42
@nicksardo nicksardo changed the title GCE: Wait a minimum amount of time for polling operations GCE: Fix operation polling and error handling Jun 12, 2018
@nicksardo nicksardo force-pushed the fix-op-rate branch 2 times, most recently from 3632b39 to 20e662b Compare June 12, 2018 16:52
@rramkumar1
Contributor

/retest

@k8s-ci-robot
Contributor

@nicksardo: You must be a member of the kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/kind bug
/priority critical-urgent
/milestone v1.11
/sig gcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/gcp labels Jun 13, 2018
@nicksardo
Contributor Author

nicksardo commented Jun 13, 2018

PTAL @bowei


// Error returns a string representation including the HTTP Status code, GCE's error code
// and a human readable message.
func (e GCEOperationError) Error() string {
Member

(e *GCEOperationError)

}

// Error returns a string representation including the last poll error encountered.
func (e OperationPollingError) Error() string {
Member

pointer

// acceptor is an object which blocks within Accept until a call is allowed to run.
// Accept is a behavior of the flowcontrol.RateLimiter interface.
type acceptor interface {
Accept()
Member

golint


// AcceptRateLimiter wraps an Acceptor with RateLimiter parameters.
type AcceptRateLimiter struct {
Acceptor acceptor
Member

golint

// MinimumRateLimiter wraps a RateLimiter and will only call its Accept until the minimum
// duration has been met or the context is cancelled.
type MinimumRateLimiter struct {
RateLimiter RateLimiter
Member

golint

// returning ctx.Err().
select {
case <-ctx.Done():
glog.V(5).Infof("op.pollOperation(%v, %v) not completed, poll count = %v, ctx.Err = %v", ctx, op, pollCount, ctx.Err())
Member

use %d for ints

glog.V(5).Infof("op.isDone(%v) complete; op = %v", ctx, op)
return nil

glog.V(5).Infof("op.isDone(%v) complete; op = %v, poll count = %v, op.err = %v", ctx, op, pollCount, op.error())
Member

use %d for ints

)

func TestPollOperation(t *testing.T) {
var attempts int
Member

const totalAttempts = 11

func TestPollOperation(t *testing.T) {
var attempts int
fo := &fakeOperation{isDoneFunc: func(ctx context.Context) (bool, error) {
if attempts <= 10 {
Member

< totalAttempts

t.Errorf("pollOperation() = %v, want nil", err)
}
if attempts != 11 {
t.Errorf("`attempts` = %v, want 11", attempts)
Member

want %d", ..., totalAttempts

@bowei
Member

bowei commented Jun 14, 2018

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, nicksardo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bowei
Member

bowei commented Jun 14, 2018

/milestone v1.11
/status approved-for-milestone

@k8s-ci-robot k8s-ci-robot added this to the v1.11 milestone Jun 14, 2018
@k8s-ci-robot k8s-ci-robot added the ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯ label Jun 14, 2018
@k8s-ci-robot
Contributor

@bowei: dog image

In response to this:

/woof


@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@bowei @nicksardo

Pull Request Labels
  • sig/gcp: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@dims
Member

dims commented Jun 14, 2018

/test pull-kubernetes-bazel-build

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 64272, 64630). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Jun 14, 2018

@nicksardo: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| pull-kubernetes-e2e-kops-aws | 787f3a6 | link | /test pull-kubernetes-e2e-kops-aws |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


k8s-github-robot pushed a commit that referenced this pull request Jun 18, 2018
…630-upstream-release-1.10

Automatic merge from submit-queue.

Automated cherry pick of #64630: Wait a minimum amount of time for polling operations

Cherry pick of #64630 on release-1.10.

#64630: Wait a minimum amount of time for polling operations
e := op.Error.Errors[0]
o.err = &GCEOperationError{HTTPStatusCode: op.HTTPStatusCode, Code: e.Code, Message: e.Message}
}
return true, nil
Member

A lot of this code is very similar. You could actually embed a struct with an error method. While writing this, I realized the same applies to String() and isDone(). Feel free to ignore; this is totally optional. Maybe I will submit a PR later.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯
Development

Successfully merging this pull request may close these issues.

  • GCE: Call operations are not checked for errors
  • GCE: Operation ratelimiting does not poll on an interval
8 participants