
Fix deadlocks and race conditions in mesos master election notifier #11298

Merged
erictune merged 1 commit into kubernetes:master from sttts:fix-10776 on Jul 15, 2015

Conversation


@sttts sttts commented Jul 15, 2015

  • n.node used n.lock as its underlying locker. The service loop initially
    locked it, and the Notify function then tried to lock it again before
    calling n.node.Signal, leading to a deadlock.
  • The goroutine calling ChangeMaster was not synchronized with the Notify
    method. The former triggered change events that the latter never saw
    when the former started up faster than Notify. Hence, not a single
    event was noticed, and not a single start/stop call of the slow service
    was triggered.

This patch replaces the n.node condition object with a simple channel n.changed.
The service loop watches it.

Updating the notifier's private variables is still protected against races
by n.lock, but independently of the n.changed channel. Hence, the deadlock is gone.

Moreover, the startup of the Notify loop is synchronized with the goroutine
that changes the master, so the Notify loop will see the master changes.

Fixes #10776


sttts commented Jul 15, 2015

/cc @tsenart for internal review


k8s-bot commented Jul 15, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

n.lock.Unlock()

if newMaster {
n.changed <- struct{}{}
sttts (author) commented on the diff:

Wondering whether this blocking call is really intended. I think we only have to make sure that the last "changed" event is not dropped.


sttts commented Jul 15, 2015

@tsenart ptal

@@ -46,8 +46,8 @@ type Service interface {
 }
 
 type notifier struct {
-	lock sync.Mutex
-	cond *sync.Cond
+	lock sync.Mutex // to protect the desired variable
Can we place the lock next to the variable it protects?


tsenart commented Jul 15, 2015

LGTM

@karlkfi karlkfi self-assigned this Jul 15, 2015

karlkfi commented Jul 15, 2015

Squash the touchups, please.

Shippable is failing due to Go 1.3 gofmt failures:

./hack/verify-gofmt.sh
./contrib/mesos/pkg/election/master_test.go:86:6: expected operand, found 'range'
./contrib/mesos/pkg/election/master_test.go:89:2: expected '{', found 'go'
./contrib/mesos/pkg/election/master_test.go:105:3: expected '}', found 'EOF'

Empty range for loops aren't supported by Go 1.3, and Go 1.4's gofmt fails if you use an unused _ range return. So you have to use an int/len loop instead (lame, I know).

@sttts sttts changed the title WIP/Mesos: Fix deadlocks and race conditions in mesos master election notifier Fix deadlocks and race conditions in mesos master election notifier Jul 15, 2015

karlkfi commented Jul 15, 2015

LGTM

@davidopp This is a contrib/mesos-only bug fix for a flaky test. Does that mean post-v1, or does it count as a test?

@karlkfi karlkfi added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 15, 2015
@davidopp
Member

LGTM

IMO we can merge this, it just won't get cherrypicked into 1.0. But I'll let @erictune make the call on whether to cherrypick.

erictune added a commit that referenced this pull request Jul 15, 2015
Fix deadlocks and race conditions in mesos master election notifier
@erictune erictune merged commit f5e6161 into kubernetes:master Jul 15, 2015
@sttts sttts deleted the fix-10776 branch September 14, 2015 09:13