Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix quota controller worker deadlock #58107

Merged

Conversation

ironcladlou
Copy link
Contributor

@ironcladlou ironcladlou commented Jan 10, 2018

The resource quota controller worker pool can deadlock when:

  • Worker goroutines are idle waiting for work from queues
  • The Sync() method detects discovery updates to apply

The problem is workers acquire a read lock while idle, making write lock
acquisition dependent upon the presence of work in the queues.

The Sync() method blocks on a pending write lock acquisition and won't unblock
until every existing worker processes one item from their queue and releases
their read lock. While the Sync() method's lock is pending, all new read lock
acquisitions will block; if a worker does process work and release its lock, it
will then become blocked on a read lock acquisition; they become blocked on
Sync(). This can easily deadlock all the workers processing from one queue while
any workers on the other queue remain blocked waiting for work.

Fix the deadlock by refactoring workers to acquire a read lock after work is
popped from the queue. This allows writers to get locks while workers are idle,
while preserving the worker pause semantics necessary to allow safe sync.

Fixes a possible deadlock preventing quota from being recalculated

/cc @kubernetes/sig-api-machinery-bugs

The resource quota controller worker pool can deadlock when:

* Worker goroutines are idle waiting for work from queues
* The Sync() method detects discovery updates to apply

The problem is workers acquire a read lock while idle, making write lock
acquisition dependent upon the presence of work in the queues.

The Sync() method blocks on a pending write lock acquisition and won't unblock
until every existing worker processes one item from their queue and releases
their read lock. While the Sync() method's lock is pending, all new read lock
acquisitions will block; if a worker does process work and release its lock, it
will then become blocked on a read lock acquisition; they become blocked on
Sync(). This can easily deadlock all the workers processing from one queue while
any workers on the other queue remain blocked waiting for work.

Fix the deadlock by refactoring workers to acquire a read lock *after* work is
popped from the queue. This allows writers to get locks while workers are idle,
while preserving the worker pause semantics necessary to allow safe sync.
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 10, 2018
@liggitt
Copy link
Member

liggitt commented Jan 10, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2018
@liggitt liggitt added this to the v1.9 milestone Jan 10, 2018
@liggitt liggitt added cherrypick-candidate priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jan 10, 2018
@k8s-github-robot
Copy link

[MILESTONENOTIFIER] Milestone Pull Request Needs Approval

@deads2k @derekwaynecarr @ironcladlou @liggitt @kubernetes/sig-api-machinery-misc

Action required: This pull request must have the status/approved-for-milestone label applied by a SIG maintainer. If the label is not applied within 6 days, the pull request will be moved out of the v1.9 milestone.

Pull Request Labels
  • sig/api-machinery: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.
Help

@ironcladlou
Copy link
Contributor Author

Something @liggitt mentioned which I should add here: this change does introduce the possibility of Sync blocking workers which have popped from the queue, but Sync is periodic and short lived (and is a single place that could be easily made more strictly timebound). Lock-after-pop is the pattern employed in the garbage collector to avoid this same deadlock (and in fact the sync/worker pause code here seems to be based on the GC code).

Going forward I wonder if a worker pool drain/refill would be an improvement, but that would be a much more invasive change.

@derekwaynecarr
Copy link
Member

/approve no-issue

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, ironcladlou, liggitt

Associated issue requirement bypassed by: derekwaynecarr

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 10, 2018
@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 15b1d16 into kubernetes:master Jan 10, 2018
@deads2k
Copy link
Contributor

deads2k commented Jan 11, 2018

@ironcladlou good catch

@k8s-cherrypick-bot
Copy link

Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@jennybuckley
Copy link

cc @jennybuckley

k8s-github-robot pushed a commit that referenced this pull request Jan 15, 2018
…58107-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #58107: Fix quota controller worker deadlock

Cherry pick of #58107 on release-1.9.

#58107: Fix quota controller worker deadlock
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. milestone/needs-approval priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants