Fix quota controller worker deadlock #58107
The resource quota controller worker pool can deadlock when:

* Worker goroutines are idle waiting for work from queues
* The Sync() method detects discovery updates to apply

The problem is that workers acquire a read lock while idle, making write lock acquisition dependent upon the presence of work in the queues.

The Sync() method blocks on a pending write lock acquisition and won't unblock until every existing worker processes one item from its queue and releases its read lock. While the Sync() method's lock is pending, all new read lock acquisitions will block; if a worker does process work and release its lock, it then blocks on its next read lock acquisition, so it too ends up waiting on Sync(). This can easily deadlock all the workers processing from one queue while any workers on the other queue remain blocked waiting for work.

Fix the deadlock by refactoring workers to acquire a read lock *after* work is popped from the queue. This allows writers to get locks while workers are idle, while preserving the worker pause semantics necessary to allow safe sync.
/lgtm
[MILESTONENOTIFIER] Milestone Pull Request Needs Approval @deads2k @derekwaynecarr @ironcladlou @liggitt @kubernetes/sig-api-machinery-misc Action required: This pull request must have the Pull Request Labels
Something @liggitt mentioned which I should add here: this change does introduce the possibility of Going forward I wonder if a worker pool drain/refill would be an improvement, but that would be a much more invasive change.
/approve no-issue
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: derekwaynecarr, ironcladlou, liggitt Associated issue requirement bypassed by: derekwaynecarr The full list of commands accepted by this bot can be found here.
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.
@ironcladlou good catch
Commit found in the "release-1.9" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.
/cc @kubernetes/sig-api-machinery-bugs