
Avoid deadlock in gc resync if available resources change during sync #64235

Merged
merged 2 commits, Jun 5, 2018

Conversation

@liggitt liggitt commented May 23, 2018

Retry the GC sync if waiting for cache sync times out, without unpausing workers.

Viewing the diff with whitespace changes hidden reveals the actual change:
https://github.com/kubernetes/kubernetes/pull/64235/files?w=1

xref #61057 #56446 (comment)

Fixes a potential deadlock in the garbage collection controller.
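
At a high level, the change makes the sync pass retry in place while workers stay paused. The following is a minimal sketch of that shape, not the exact diff: workerLock is the GC's worker-pausing lock discussed in the review threads below, and syncWithRetry/attemptToSyncOnce are hypothetical names standing in for the real sync body.

// Sketch only. attemptToSyncOnce is a hypothetical helper standing in for the
// rediscovery + resyncMonitors + WaitForCacheSync body shown in the review
// threads below.
func (gc *GarbageCollector) syncWithRetry(stopCh <-chan struct{}) {
	// Workers stay paused for as long as this lock is held, i.e. across every
	// retry, until one attempt fully succeeds.
	gc.workerLock.Lock()
	defer gc.workerLock.Unlock()

	wait.PollImmediateInfinite(100*time.Millisecond, func() (bool, error) {
		// Returning false (with a nil error) makes the poll try again after the
		// interval; returning true ends the retry loop and unpauses workers.
		return gc.attemptToSyncOnce(stopCh), nil
	})
}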

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 23, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 24, 2018
@roycaihw
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 24, 2018
@lavalamp
Member

/assign @jennybuckley

@@ -288,15 +288,18 @@ func (gb *GraphBuilder) IsSynced() bool {
	defer gb.monitorLock.Unlock()

	if len(gb.monitors) == 0 {
		glog.V(2).Info("garbage controller monitor not synced: no monitors")

Just curious: why isn't an empty set of monitors considered synced? It makes more sense to me to say that it is synced, because all 0 of the monitors are synced.


@jennybuckley jennybuckley May 24, 2018

Looks like it used to do just that,

func (gb *GraphBuilder) HasSynced() bool {
	for _, monitor := range gb.monitors {
		if !monitor.HasSynced() {
			return false
		}
	}
	return true
}

But it was changed by d08dfb9#diff-b26daf763d446f92d20456ca698ce237R303

Member Author

I'd be open to revisiting this, but I wanted to keep the current behavior in this PR.
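
For reference, keeping the current behavior means IsSynced looks roughly like the sketch below. Only the zero-monitors branch is visible in the diff above; the loop body follows the older HasSynced quoted earlier and is an assumption about the current code.

func (gb *GraphBuilder) IsSynced() bool {
	gb.monitorLock.Lock()
	defer gb.monitorLock.Unlock()

	// Zero monitors is deliberately treated as "not synced" (current behavior).
	if len(gb.monitors) == 0 {
		glog.V(2).Info("garbage controller monitor not synced: no monitors")
		return false
	}

	// Assumed to mirror the older HasSynced loop quoted above.
	for _, monitor := range gb.monitors {
		if !monitor.HasSynced() {
			return false
		}
	}
	return true
}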

liggitt commented May 25, 2018

/retest

for attempt := 0; ; attempt++ {
	// If this is a re-attempt, check if available resources have changed
	if attempt > 0 {
		newResources = GetDeletableResources(discoveryClient)

We may want to check if the list is empty again, like here: https://github.com/kubernetes/kubernetes/pull/64235/files#diff-1481f32cd1a60ce2a66e5d8b5e6cf383R175, but instead of returning we should probably continue.

If GetDeletableResources encounters any errors it will return an empty list of resources. If we allow it to go forward with the sync with an empty list of resources, resyncMonitors will remove all the work that has been done.

Member Author

Good idea. Probably also add protection against a hot loop here, and in the resync failure case as well.

Member Author

Checked for zero-length rediscovery results, and used wait.PollImmediateInfinite instead of a plain for loop to get hot-loop and panic protection.
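
A minimal sketch of that zero-length guard inside the poll condition; the attempt counter, logging, and placement are illustrative rather than the exact diff.

// Inside the wait.PollImmediateInfinite condition func:
if attempt > 1 {
	// Re-discover on reattempts in case available resources changed.
	newResources = GetDeletableResources(discoveryClient)
	if len(newResources) == 0 {
		// Discovery failed and returned nothing; resyncing the monitors against
		// an empty set would tear down every existing monitor, so skip this
		// tick and let the poll retry after its interval.
		glog.V(2).Infof("no resources reported by discovery, retrying garbage collector sync")
		return false, nil
	}
}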

	close(waitForSyncStop)
}()

if controller.WaitForCacheSync("garbage collector", waitForSyncStop, gc.dependencyGraphBuilder.IsSynced) {
Member

nit: the monitor cache sync progress (for resources that aren't added or removed) won't get reset when we time out, so we don't need an exponential backoff timeout.

The "timeout" is not limiting the time spent syncing the cache. Not sure if that's a pro or a con.

Member Author

the "timeout" is not limiting the time resource used for syncing cache.

Can you clarify what you mean by this

Member Author

nit: the monitor cache sync progress (for resources that aren't added or removed) won't get reset when we time out, so we don't need an exponential backoff timeout.

Agreed, we don't need exponential backoff; a simple timeout before we attempt rediscovery/resync seems sufficient.

Member

E.g. we discover a new resource and start a new monitor for it with gc.resyncMonitors(). We set a timeout of 30s and wait for the monitor cache to get synced. A long sync time (like 50s) won't violate the timeout; we don't need to increase the timeout, we just wait until the second timeout period. It was confusing to me that the timeout is not guarding the sync time; it's the period after which we decide to poll discovery again.

Member Author

Got it... that's what this comment was trying to explain, though I'm completely open to better wording (I understood what I meant, but I didn't get that understanding from reading my own comment).

// wait for caches to fill for a while (our sync period) before attempting to rediscover resources and retry syncing.
// this protects us from deadlocks where available resources changed and one of our informer caches will never fill.
// informers keep attempting to sync in the background, so retrying doesn't interrupt them.
// the call to resyncMonitors on the reattempt will no-op for resources that still exist.
// note that workers stay paused until we successfully resync.
waitForSyncStop := make(chan struct{})
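
The combined stop/timeout signal referenced in that comment can be produced by a small helper. The final diff uses a helper named waitForStopOrTimeout (see further below), whose body isn't shown in this excerpt; this is a sketch of one way to write it.

// waitForStopOrTimeout returns a channel that closes when stopCh closes or when
// the timeout elapses, whichever happens first.
func waitForStopOrTimeout(stopCh <-chan struct{}, timeout time.Duration) <-chan struct{} {
	stopChWithTimeout := make(chan struct{})
	go func() {
		defer close(stopChWithTimeout)
		select {
		case <-stopCh:
		case <-time.After(timeout):
		}
	}()
	return stopChWithTimeout
}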

liggitt commented May 31, 2018

/test pull-kubernetes-e2e-gce

@liggitt liggitt changed the title WIP - Avoid deadlock in gc resync if available resources change during sync Avoid deadlock in gc resync if available resources change during sync May 31, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 31, 2018
@liggitt liggitt mentioned this pull request May 31, 2018
@liggitt liggitt force-pushed the gc-resync-reattempt branch from 77f0563 to 62ea336 Compare May 31, 2018 03:51
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 31, 2018
liggitt commented May 31, 2018

@roycaihw @jennybuckley I think this is ready. There are still other issues going on (we're seeing timeouts on watch-fed things, including garbage collection, just like on master), but this resolves the deadlock and the unsafe worker restart (and adds logging to help us track down additional master flakes).

@deads2k can you take a look at this as well?

		removed.Insert(fmt.Sprintf("%+v", old))
	}
}
for new := range newResources {
Contributor

nit: don't use new (a Go builtin) as a variable name

Member Author

fixed
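
For illustration, the renamed loop (alongside the matching removed-resources loop from the fragment above) might look like this sketch; it assumes oldResources/newResources are maps keyed by GroupVersionResource and uses sets.String only to build a printable diff.

// Determine which resources disappeared and which appeared, so the log message
// and resync only mention what actually changed.
removed := sets.NewString()
for oldResource := range oldResources {
	if _, ok := newResources[oldResource]; !ok {
		removed.Insert(fmt.Sprintf("%+v", oldResource))
	}
}
added := sets.NewString()
for newResource := range newResources {
	if _, ok := oldResources[newResource]; !ok {
		added.Insert(fmt.Sprintf("%+v", newResource))
	}
}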

@liggitt liggitt force-pushed the gc-resync-reattempt branch from 9e67f06 to 565d96e Compare June 1, 2018 23:08
roycaihw commented Jun 2, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2018
@liggitt liggitt force-pushed the gc-resync-reattempt branch from 565d96e to 7da3d65 Compare June 2, 2018 01:54
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2018
liggitt commented Jun 2, 2018

This is a pretty big change to not come with a test attached :)

ask and you shall receive

on master:

$ go test ./pkg/controller/garbagecollector -v -run TestGarbageCollectorSync
=== RUN   TestGarbageCollectorSync
E0601 21:04:55.934575   72279 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: the server could not find the requested resource (get secrets)
E0601 21:04:56.939939   72279 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: the server could not find the requested resource (get secrets)
E0601 21:04:57.928377   72279 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:04:57.928412   72279 garbagecollector.go:216] timed out waiting for dependency graph builder sync during GC sync
--- FAIL: TestGarbageCollectorSync (6.01s)
	garbagecollector_test.go:903: Expected garbagecollector.Sync to still be running but it is blocked: discoveryClient.ServerPreferredResources() called 0 times over 1s

with this change:

$ go test ./pkg/controller/garbagecollector -v -run TestGarbageCollectorSync
=== RUN   TestGarbageCollectorSync
E0601 21:06:28.509748   72353 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: the server could not find the requested resource (get secrets)
E0601 21:06:28.520515   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:28.520547   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 1)
E0601 21:06:28.634141   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:28.634222   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 2)
E0601 21:06:28.735250   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:28.735286   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 3)
E0601 21:06:28.835046   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:28.835081   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 4)
E0601 21:06:28.935800   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:28.935836   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 5)
E0601 21:06:29.036501   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:29.036535   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 6)
E0601 21:06:29.134059   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:29.134115   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 7)
E0601 21:06:29.235803   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:29.235837   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 8)
E0601 21:06:29.333922   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:29.334009   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 9)
E0601 21:06:29.436461   72353 controller_utils.go:1028] Unable to sync caches for garbage collector controller
E0601 21:06:29.436510   72353 garbagecollector.go:233] timed out waiting for dependency graph builder sync during GC sync (attempt 10)
E0601 21:06:29.517583   72353 reflector.go:205] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: the server could not find the requested resource (get secrets)
--- PASS: TestGarbageCollectorSync (6.02s)
PASS
ok  	k8s.io/kubernetes/pkg/controller/garbagecollector	6.073s

liggitt commented Jun 2, 2018

/milestone v1.11
/kind bug
/priority critical-urgent
/status approved-for-milestone

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. status/approved-for-milestone priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Jun 2, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.11 milestone Jun 2, 2018
@jennybuckley

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2018
liggitt commented Jun 4, 2018

/assign caesarxuchao lavalamp

for approval

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@caesarxuchao @jennybuckley @lavalamp @liggitt @roycaihw

Pull Request Labels
  • sig/api-machinery: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

Member

@caesarxuchao caesarxuchao left a comment

/approve

// informers keep attempting to sync in the background, so retrying doesn't interrupt them.
// the call to resyncMonitors on the reattempt will no-op for resources that still exist.
// note that workers stay paused until we successfully resync.
if !controller.WaitForCacheSync("garbage collector", waitForStopOrTimeout(stopCh, period), gc.dependencyGraphBuilder.IsSynced) {
Member

A timeout when waiting for cache sync seems to be useful in general; consider adding a WaitForCacheSyncUntil to client-go.
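
No such helper exists in client-go at the time of this PR; sketched purely as an illustration of the suggestion, a WaitForCacheSyncUntil could wrap cache.WaitForCacheSync with a deadline-bounded stop channel.

// WaitForCacheSyncUntil waits for the given InformerSynced funcs to report
// synced, giving up when the deadline passes or stopCh closes.
func WaitForCacheSyncUntil(deadline time.Time, stopCh <-chan struct{}, cacheSyncs ...cache.InformerSynced) bool {
	stopChWithDeadline := make(chan struct{})
	go func() {
		defer close(stopChWithDeadline)
		select {
		case <-stopCh:
		case <-time.After(time.Until(deadline)):
		}
	}()
	return cache.WaitForCacheSync(stopChWithDeadline, cacheSyncs...)
}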

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caesarxuchao, jennybuckley, liggitt, roycaihw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2018
@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 62266, 64351, 64366, 64235, 64560). If you want to cherry-pick this change to another branch, please follow the instructions here.
