Avoid deadlock in gc resync if available resources change during sync #64235
Conversation
/sig api-machinery
/assign @jennybuckley
force-pushed 9daa867 to 03d55fc
@@ -288,15 +288,18 @@ func (gb *GraphBuilder) IsSynced() bool {
	defer gb.monitorLock.Unlock()

	if len(gb.monitors) == 0 {
		glog.V(2).Info("garbage controller monitor not synced: no monitors")
Just curious, why isn't the set of all monitors considered synced? It makes more sense to me to say that it is synced, because all 0 of the monitors are synced.
Looks like it used to do just that: kubernetes/pkg/controller/garbagecollector/graph_builder.go, lines 216 to 223 in 3d3d392
func (gb *GraphBuilder) HasSynced() bool {
	for _, monitor := range gb.monitors {
		if !monitor.HasSynced() {
			return false
		}
	}
	return true
}
But it was changed by d08dfb9#diff-b26daf763d446f92d20456ca698ce237R303
I'd be open to revisiting this, but I wanted to keep the current behavior in this PR.
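The startup-ordering concern behind treating "no monitors" as not-synced can be illustrated with a stdlib-only sketch. The `graphBuilder` type here is a simplified stand-in for the real GraphBuilder, not the PR's code: before anything has populated the monitor map, vacuously reporting "synced" would let callers proceed against an empty graph.

```go
package main

import (
	"fmt"
	"sync"
)

// graphBuilder is a simplified stand-in for the GC's GraphBuilder.
type graphBuilder struct {
	monitorLock sync.Mutex
	monitors    map[string]func() bool // resource name -> HasSynced
}

// isSynced treats an empty monitor set as NOT synced: during startup
// (or after a rediscovery that returned nothing) the graph holds no
// data yet, so reporting "synced" would be misleading.
func (gb *graphBuilder) isSynced() bool {
	gb.monitorLock.Lock()
	defer gb.monitorLock.Unlock()
	if len(gb.monitors) == 0 {
		return false // no monitors started yet: not ready
	}
	for _, hasSynced := range gb.monitors {
		if !hasSynced() {
			return false
		}
	}
	return true
}

func main() {
	gb := &graphBuilder{monitors: map[string]func() bool{}}
	fmt.Println(gb.isSynced()) // false: no monitors yet
	gb.monitors["pods"] = func() bool { return true }
	fmt.Println(gb.isSynced()) // true: every monitor reports synced
}
```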
/retest
for attempt := 0; ; attempt++ {
	// If this is a re-attempt, check if available resources have changed
	if attempt > 0 {
		newResources = GetDeletableResources(discoveryClient)
We may want to check if the list is empty again, like here: https://github.com/kubernetes/kubernetes/pull/64235/files#diff-1481f32cd1a60ce2a66e5d8b5e6cf383R175, but instead of returning we should probably continue.
If GetDeletableResources encounters any errors it will return an empty list of resources. If we allow the sync to go forward with an empty list, resyncMonitors will remove all the work that has been done.
Good idea. Probably add a protection against a hotloop here and in the resyncFailure case as well
Checked for zero-length rediscovery results, and used wait.PollImmediateInfinite instead of a plain for loop to get hotloop and panic protection.
	close(waitForSyncStop)
}()

if controller.WaitForCacheSync("garbage collector", waitForSyncStop, gc.dependencyGraphBuilder.IsSynced) {
nit: the monitor-cache-sync progress (for resources that aren't added or removed) won't get reset when we time out, so we don't need an exponential backoff timeout.
Also, the "timeout" is not limiting the time/resources used for syncing the cache; not sure if that's a pro or a con.
> the "timeout" is not limiting the time resource used for syncing cache.

Can you clarify what you mean by this?
> nit: the monitor-cache-sync progress (for resources that aren't added or removed) won't get reset when we timeout, so we don't need exponential backoff timeout

Agreed, we don't need exponential backoff; a simple timeout before we attempt rediscovery/resync seems sufficient.
E.g. we discover a new resource and start a new monitor for it with gc.resyncMonitors(). We set a timeout of 30s and wait for the monitor cache to get synced. A long sync time (like 50s) won't violate the timeout; we don't need to increase the timeout, we just wait until the second timeout period. It was confusing to me that the timeout is not guarding the sync time; it's the period after which we decide to poll discovery again.
got it... that's what this comment was trying to explain, though I'm completely open to better wording (I understood what I meant, but... I didn't get that understanding from reading my own comment)
// wait for caches to fill for a while (our sync period) before attempting to rediscover resources and retry syncing.
// this protects us from deadlocks where available resources changed and one of our informer caches will never fill.
// informers keep attempting to sync in the background, so retrying doesn't interrupt them.
// the call to resyncMonitors on the reattempt will no-op for resources that still exist.
// note that workers stay paused until we successfully resync.
waitForSyncStop := make(chan struct{})
force-pushed 9dad2d5 to 77f0563
/test pull-kubernetes-e2e-gce
force-pushed 77f0563 to 62ea336
@roycaihw @jennybuckley I think this is ready. There are still other issues going on (we're seeing timeouts on watch-fed things, including garbage collection, just like on master), but this resolves the deadlock and the unsafe worker restart (and adds logging to help us track down additional master flakes). @deads2k, can you take a look at this as well?
		removed.Insert(fmt.Sprintf("%+v", old))
	}
}
for new := range newResources {
nit: don't use new (a predeclared Go identifier) as a variable name
fixed
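For illustration only (this is not the PR's code): shadowing the predeclared identifier new is legal, but it makes the builtin unusable in that scope, which is why reviewers flag it:

```go
package main

import "fmt"

func main() {
	for new := range []string{"pods", "nodes"} { // "new" now shadows the builtin
		// p := new(int) // would not compile here: new is the int loop index
		fmt.Println(new) // prints 0, then 1
	}
	p := new(int) // outside the loop the builtin is visible again
	fmt.Println(*p) // prints 0 (zero value)
}
```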
force-pushed 9e67f06 to 565d96e
/lgtm
force-pushed 565d96e to 7da3d65
ask and you shall receive on master:
with this change:
/milestone v1.11
/lgtm
/assign caesarxuchao lavalamp for approval
[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process. @caesarxuchao @jennybuckley @lavalamp @liggitt @roycaihw
/approve
// informers keep attempting to sync in the background, so retrying doesn't interrupt them.
// the call to resyncMonitors on the reattempt will no-op for resources that still exist.
// note that workers stay paused until we successfully resync.
if !controller.WaitForCacheSync("garbage collector", waitForStopOrTimeout(stopCh, period), gc.dependencyGraphBuilder.IsSynced) {
A timeout when waiting for cache sync seems to be useful in general; consider adding a WaitForCacheSyncUntil to client-go.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: caesarxuchao, jennybuckley, liggitt, roycaihw. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Automatic merge from submit-queue (batch tested with PRs 62266, 64351, 64366, 64235, 64560). If you want to cherry-pick this change to another branch, please follow the instructions here.
retry GC sync if waiting for cache sync times out, without unpausing workers
Viewing the diff while ignoring whitespace reveals the actual change:
https://github.com/kubernetes/kubernetes/pull/64235/files?w=1
xref #61057 #56446 (comment)