Fix Map Race by Moving Locking closer to the Write #2476

Merged · 4 commits · Apr 7, 2017

Conversation

gouthamve
Member

Fixes #2444

I think we are now protecting only the map, as mentioned in the issue above. This is much cleaner and avoids the race when there are multiple providers (we launch a goroutine per provider and do not serialize access there).

The only question I have is whether this line needs to be protected: https://github.com/prometheus/prometheus/blob/master/discovery/discovery.go#L236
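To make the idea concrete, here is a minimal sketch of "locking close to the write". The names and types are illustrative only (a stand-in `TargetGroup` instead of Prometheus' `config.TargetGroup`), not the actual discovery.go code: each provider's update goroutine takes the mutex only around the map write.

```go
package discovery

import "sync"

// TargetGroup is a stand-in for Prometheus' config.TargetGroup; only the
// Source field matters for this sketch.
type TargetGroup struct {
	Source string
}

// targetSet is a simplified version of the structure touched by this PR.
type targetSet struct {
	mtx     sync.Mutex
	tgroups map[string]*TargetGroup
}

// update is called from one goroutine per provider. The mutex is held only
// around the map write, so concurrent providers cannot corrupt tgroups.
func (ts *targetSet) update(provider string, tgs []*TargetGroup) {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	for _, tg := range tgs {
		if tg == nil {
			continue
		}
		ts.tgroups[provider+"/"+tg.Source] = tg
	}
}
```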

@beorn7
Member

beorn7 commented Mar 8, 2017

@fabxc Could you have a look?

@fabxc
Contributor

fabxc commented Mar 8, 2017

Hm, okay – this is a tricky section to touch, but it seems generally sound.
However, in updateProviders() we reset tgroups to an empty map. Updating goroutines might not have exited yet in response to the context cancellation and may still write into it, so at the very least the reset has to be guarded as well.

@gouthamve
Member Author

Hmm, true @fabxc. But I decided to ignore it, as that was the behaviour before this PR too: a goroutine from a previous loop could update the current set. I was hoping it would even out in the next run of the update.

In case we want to do it right, the context would need to cancel the update midway too. Let me see if I can patch that.

* This will fix any old update routines writing into new updates,
@fabxc
Contributor

fabxc commented Mar 9, 2017

This introduces another place where we check the context. However, there is no guarantee about how the updating goroutines are scheduled relative to the cancellation; a goroutine may pass the check just before the context is canceled. So a possible interleaving is:

  1. The update function checks the context, finds it not canceled, and gets unscheduled right after the select statement.
  2. reload is called; it cancels the context and proceeds to overwrite tgroups with an empty map.

At any time during 2., the goroutine from 1. may be rescheduled, think the context is still alive, and continue modifying tgroups – now concurrently with the reloading goroutine resetting it.

So line 321 must be mutex-protected either way. The additional context check reduces the likelihood of the race but does not remove it.
Adding more context checks usually only helps when the processing time between checks may be long, which I don't think is the case here. So we might want to drop it again.
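To make the interleaving concrete, here is a hedged continuation of the sketch from the PR description (illustrative names, `context` import assumed; not the actual discovery.go code). The context check in the select cannot prevent the race, because the goroutine can be descheduled between the check and the map write:

```go
// runProvider consumes updates for one provider until the context is canceled.
func (ts *targetSet) runProvider(ctx context.Context, provider string, updates <-chan []*TargetGroup) {
	for {
		select {
		case <-ctx.Done():
			return
		case tgs := <-updates:
			// The goroutine can be descheduled here: reload() may cancel the
			// context and reset ts.tgroups to an empty map, and only then does
			// this write run, racing with the reset unless both hold the mutex.
			ts.update(provider, tgs)
		}
	}
}
```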

@gouthamve
Member Author

gouthamve commented Mar 9, 2017

Oh wow, I never thought of that! What do you suggest? Drop the latest commit and proceed with the first one? I will roll back this commit then.

Also, I am not able to come up with a good idea of how to actually fix this. Just writing it down for (my own) clarity: the update goroutine keeps running in the background, and when a reset is requested it needs to exit. But by the time it learns about the reset, it might be midway through updating the map!

But any ill effect caused by a leftover entry will fix itself in the next update run.

@fabxc
Contributor

fabxc commented Mar 12, 2017

Yes, as you summarized.

So additionally guarding line 231/236 with the mutex should solve the issue.
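As a hedged illustration of that fix (continuing the earlier sketch, not the actual discovery.go lines), the reset simply takes the same mutex as the writes:

```go
// reset replaces tgroups with an empty map under the same mutex that guards
// the per-provider writes, so a late write from an old update goroutine can
// never race with the reset.
func (ts *targetSet) reset() {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	ts.tgroups = map[string]*TargetGroup{}
}
```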

@gouthamve
Member Author

gouthamve commented Mar 31, 2017

ping @fabxc, I made the changes you requested.

@fabxc
Contributor

fabxc commented Apr 7, 2017

LGTM

@fabxc merged commit 0f48d07 into prometheus:master on Apr 7, 2017
@gouthamve deleted the discovery-race-2444 branch on Apr 7, 2017 at 07:03