Fix Map Race by Moving Locking closer to the Write #2476

Merged · 4 commits · Apr 7, 2017

Conversation

gouthamve
Member

Fixes #2444

I think we are now protecting only the map, as mentioned in the issue above. This is much cleaner and avoids the race when there are multiple providers (we launch a goroutine per provider and do not serialize access there).

The only question I have is whether this line needs to be protected: https://github.com/prometheus/prometheus/blob/master/discovery/discovery.go#L236
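To make the idea concrete, here is a minimal sketch of "locking close to the write". The names and types are illustrative only (a stand-in `TargetGroup` instead of Prometheus' `config.TargetGroup`), not the actual discovery.go code: each provider's update goroutine takes the mutex only around the map write.

```go
package discovery

import "sync"

// TargetGroup is a stand-in for Prometheus' config.TargetGroup; only the
// Source field matters for this sketch.
type TargetGroup struct {
	Source string
}

// targetSet is a simplified version of the structure touched by this PR.
type targetSet struct {
	mtx     sync.Mutex
	tgroups map[string]*TargetGroup
}

// update is called from one goroutine per provider. The mutex is held only
// around the map write, so concurrent providers cannot corrupt tgroups.
func (ts *targetSet) update(provider string, tgs []*TargetGroup) {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	for _, tg := range tgs {
		if tg == nil {
			continue
		}
		ts.tgroups[provider+"/"+tg.Source] = tg
	}
}
```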

@beorn7
Member

beorn7 commented Mar 8, 2017

@fabxc Could you have a look?

@fabxc
Contributor

fabxc commented Mar 8, 2017

Hm, okay – this is a tricky section to touch, but it seems generally sound.
However, in updateProviders() we reset tgroups to an empty map. Updating goroutines might not have exited yet in response to the context cancellation and may still write into it, so at the very least the reset has to be guarded as well.

@gouthamve
Member Author

Hmm, true @fabxc. But I decided to ignore it, as that was the behaviour before this PR too: a goroutine from a previous loop could update the current set. I was hoping it would even out in the next run of the update.

In case we want to do it right, the context would need to cancel the update midway too. Let me see if I can patch that.

* This will fix any old update routines writing into new updates,
@fabxc
Contributor

fabxc commented Mar 9, 2017

This introduces another place where we check the context. However, there is no guarantee about how the updating goroutines are scheduled relative to the cancellation; a goroutine may pass the check just before the context is canceled. So a possible interleaving is:

  1. The update function checks the context, finds it not canceled, and gets unscheduled right after the select statement.
  2. reload is called; it cancels the context and proceeds to overwrite tgroups with an empty map.

At any time during 2., the goroutine from 1. may be rescheduled, think the context is still alive, and continue modifying tgroups – now concurrently with the reloading goroutine resetting it.

So line 321 must be mutex-protected either way. The additional context check reduces the likelihood of the race but does not remove it.
Adding more context checks usually only helps when the processing time between checks may be long, which I don't think is the case here. So we might want to drop it again.
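To make the interleaving concrete, here is a hedged continuation of the sketch from the PR description (illustrative names, `context` import assumed; not the actual discovery.go code). The context check in the select cannot prevent the race, because the goroutine can be descheduled between the check and the map write:

```go
// runProvider consumes updates for one provider until the context is canceled.
func (ts *targetSet) runProvider(ctx context.Context, provider string, updates <-chan []*TargetGroup) {
	for {
		select {
		case <-ctx.Done():
			return
		case tgs := <-updates:
			// The goroutine can be descheduled here: reload() may cancel the
			// context and reset ts.tgroups to an empty map, and only then does
			// this write run, racing with the reset unless both hold the mutex.
			ts.update(provider, tgs)
		}
	}
}
```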

@gouthamve
Member Author

gouthamve commented Mar 9, 2017

Oh wow, I never thought of that! What do you suggest? Drop the latest commit and proceed with the first one? I will roll back this commit then.

Also, I am not able to come up with a good idea of how to actually fix this. Just writing it down for (my own) clarity: the update goroutine keeps running in the background, and when a reset is requested it needs to exit. But by the time it learns about the reset, it might be midway through updating the map!

But any ill effect caused by a leftover entry will fix itself in the next update run.

@fabxc
Contributor

fabxc commented Mar 12, 2017

Yes, as you summarized.

So additionally guarding line 231/236 with the mutex should solve the issue.
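As a hedged illustration of that fix (continuing the earlier sketch, not the actual discovery.go lines), the reset simply takes the same mutex as the writes:

```go
// reset replaces tgroups with an empty map under the same mutex that guards
// the per-provider writes, so a late write from an old update goroutine can
// never race with the reset.
func (ts *targetSet) reset() {
	ts.mtx.Lock()
	defer ts.mtx.Unlock()
	ts.tgroups = map[string]*TargetGroup{}
}
```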

@gouthamve
Member Author

gouthamve commented Mar 31, 2017

ping @fabxc, I made the changes you requested.

@fabxc
Contributor

fabxc commented Apr 7, 2017

LGTM

@fabxc merged commit 0f48d07 into prometheus:master on Apr 7, 2017
@gouthamve deleted the discovery-race-2444 branch on Apr 7, 2017 at 07:03