Fix deadlocks and race conditions in mesos master election notifier #11298
Conversation
/cc @tsenart for internal review
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
n.lock.Unlock()

if newMaster {
	n.changed <- struct{}{}
Wondering whether this blocking call is really intended. I think we only have to make sure that the last "changed" event is not dropped.
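One way to get that guarantee (a sketch, not the PR's code): make the channel buffered with capacity 1 and send non-blockingly, so bursts of changes coalesce into a single pending notification. The last change is never lost, because a pending wakeup forces the watcher to re-read the lock-protected state anyway. The names changed and notify here are illustrative.

package main

import "fmt"

func main() {
	// Capacity 1: at most one notification is ever pending.
	changed := make(chan struct{}, 1)

	notify := func() {
		select {
		case changed <- struct{}{}:
			// queued a wakeup for the watcher
		default:
			// a wakeup is already pending; coalesce with it
		}
	}

	notify()
	notify() // does not block; coalesced with the pending wakeup
	<-changed
	fmt.Println("watcher woke up and can now re-read the protected state")
}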
@tsenart ptal
@@ -46,8 +46,8 @@ type Service interface {
 }

 type notifier struct {
-	lock sync.Mutex
-	cond *sync.Cond
+	lock sync.Mutex // to protect the desired variable
Can we place the lock next to the variable it protects?
LGTM
Squash the touchups, please. Shippable is failing due to Go 1.3 gofmt failures: empty range for loops aren't supported by 1.3, and Go 1.4's gofmt fails if you use an unused _ in the range clause.
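For context, a sketch of the two range forms at issue; ch is a hypothetical channel, nothing here is from the PR itself:

package example

func drainNew(ch chan struct{}) {
	// Bare range clause without assignment: only valid from Go 1.4 on.
	for range ch {
	}
}

func drainOld(ch chan struct{}) {
	// Compiles under Go 1.3, but Go 1.4's gofmt -s rewrites it to the
	// bare form above, so a gofmt-based CI check fails on the diff.
	for _ = range ch {
	}
}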
LGTM. @davidopp This is a contrib/mesos-only bug fix for a flaky test. Does that mean post-v1, or does it count as a test?
LGTM. IMO we can merge this, it just won't get cherry-picked into 1.0. But I'll let @erictune make the call on whether to cherry-pick.
Fix deadlocks and race conditions in mesos master election notifier

- n.node used n.lock as the underlying locker. The service loop initially
  locked it, and the Notify function tried to lock it before calling
  n.node.Signal, leading to a deadlock.
- The goroutine calling ChangeMaster was not synchronized with the Notify
  method. The former triggered change events that the latter never saw when
  the former's startup was faster than that of Notify. Hence, not even a
  single event was noticed, and not even a single start/stop call of the
  slow service was triggered.

This patch replaces the n.node condition object with a simple channel,
n.changed, which the service loop watches. Updating the notifier's private
variables is still protected by n.lock against races, but independently of
the n.changed channel. Hence, the deadlock is gone. Moreover, the startup of
the Notify loop is synchronized with the goroutine that changes the master,
so the Notify loop will see the master changes.

Fixes #10776
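To make the commit message concrete, here is a minimal, self-contained sketch of the described pattern. The names notifier, ChangeMaster, and Notify follow the message; the method bodies, the buffered coalescing send (a variant of the PR's blocking send, per the review discussion above), and the simulated service are assumptions for illustration only.

package main

import (
	"fmt"
	"sync"
	"time"
)

// notifier protects the desired state with a mutex, while the n.changed
// channel wakes the service loop independently of the lock. Signaling
// never happens while the lock is held, so the old sync.Cond deadlock
// cannot occur.
type notifier struct {
	lock    sync.Mutex // protects desired
	desired bool
	changed chan struct{}
}

func newNotifier() *notifier {
	// Capacity 1 also covers the startup race: a change made before the
	// service loop starts stays buffered until the loop reads it.
	return &notifier{changed: make(chan struct{}, 1)}
}

// ChangeMaster records the new desired state and signals the loop.
func (n *notifier) ChangeMaster(isMaster bool) {
	n.lock.Lock()
	newMaster := n.desired != isMaster
	n.desired = isMaster
	n.lock.Unlock() // signal outside the critical section

	if newMaster {
		select {
		case n.changed <- struct{}{}: // wake the service loop
		default: // a wakeup is already pending; coalesce
		}
	}
}

// Notify runs the service loop: on every change it re-reads the protected
// state and starts or stops the (here: simulated) service.
func (n *notifier) Notify(done <-chan struct{}) {
	for {
		select {
		case <-done:
			return
		case <-n.changed:
			n.lock.Lock()
			master := n.desired
			n.lock.Unlock()
			if master {
				fmt.Println("starting service")
			} else {
				fmt.Println("stopping service")
			}
		}
	}
}

func main() {
	n := newNotifier()
	done := make(chan struct{})
	go n.Notify(done)

	n.ChangeMaster(true)
	n.ChangeMaster(false)
	time.Sleep(100 * time.Millisecond)
	close(done)
}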