Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent garbage collector from attempting to sync with 0 resources #61201

Merged
merged 2 commits into from
Mar 16, 2018

Conversation

jennybuckley
Copy link

What this PR does / why we need it:
As of #55259 we enabled garbagecollector.GetDeletableResources to return partial discovery results (including an empty set of discovery results).
This had the unintended consequence of allowing the garbage collector to enter a blocked state that can only be fixed by restarting.

From this comment:

  1. The Sync function periodically calls GetDeletableResources

  2. According to the comment above GetDeletableResources, All discovery errors are considered temporary. Upon encountering any error, GetDeletableResources will log and return any discovered resources it was able to process (which may be none)., an error in discovery causes the discovery client to no longer discover resources in the cluster, but instead of failing and returning an error, it simply logs the error as garbagecollector.go:601] failed to discover preferred resources: %vthe server was unable to return a response in the time allotted, but may still be processing the request and returns an empty list of resources

  3. The Sync function, upon recieving an empty resource list from discovery, detects that the resources have changed, and calls resyncMonitors, which calls dependencyGraphBuilder.syncMonitors with map[] as the argument as shown in the log as garbagecollector.go:189] syncing garbage collector with updated resources from discovery: map[], which sets the list of monitors to an empty list because it thinks there are no resources to monitor.

  4. Lastly the Sync function calls controller.WaitForCacheSync, which calls cache.WaitForCacheSync, which will continually retry the garbagecollector.IsSynced function until it returns true, but it will always return false because len(gb.monitors) is 0.

This PR prevents that specific race condition from arising.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60037

Release note:

Fix bug allowing garbage collector to enter a broken state that could only be fixed by restarting the controller-manager.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 14, 2018
@jennybuckley
Copy link
Author

/sig api-machinery
/kind bug
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Mar 14, 2018
@jennybuckley
Copy link
Author

@lavalamp
Copy link
Member

Looks awesome, and given the failure mode, looks like an equally awesome unit test shouldn't be too hard to add? :)

@jennybuckley
Copy link
Author

jennybuckley commented Mar 14, 2018

@lavalamp I have a unit test running locally which fails before this commit and passes after, but it only works by modifying the code for the monitors so they always say they are synced. Once I figure out how to fake that part, or get them to actually sync, I will add it here.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 15, 2018
@jennybuckley jennybuckley force-pushed the fix-gc-empty-map branch 2 times, most recently from 50cc566 to 532872b Compare March 15, 2018 22:29
@jennybuckley
Copy link
Author

Resubmitting with just the test, to show that it fails, after that I'll submit the fix to show that it passes.

@jennybuckley
Copy link
Author

@lavalamp
Copy link
Member

I am a little worried that the test will be racy due to the time.Sleep()s, but I really want this fix. Maybe you can work on a followup?

/label status/approved-for-milestone

/lgtm
/approved

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 16, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jennybuckley, lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 16, 2018
@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request: Up-to-date for process

@jennybuckley @lavalamp

Pull Request Labels
  • sig/api-machinery: Pull Request will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move pull request out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 61284, 61119, 61201). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit f8f67da into kubernetes:master Mar 16, 2018
@janetkuo
Copy link
Member

This is awesome, thanks @jennybuckley! Since this is introduced in 1.9, can we cherrypick this to 1.9?

@jennybuckley
Copy link
Author

Yes, also the bug causing PR has been cherrypicked to 1.8 as well, so I think we should cherrypick this to 1.8 and 1.9

jpbetz added a commit that referenced this pull request Mar 20, 2018
…#61201-upstream-release-1.9

Automated cherry pick of #61201: Prevent garbage collector from attempting to sync with 0 resources
k8s-github-robot pushed a commit that referenced this pull request Mar 29, 2018
…#61201-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #61201: Prevent garbage collector from attempting to sync with 0 resources

Cherry pick of #61201 on release-1.8.

#61201: Prevent garbage collector from attempting to sync with 0 resources
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Garbage collector in controller manager stops working
5 participants