-
Notifications
You must be signed in to change notification settings - Fork 9.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Map Race by Moving Locking closer to the Write #2476
Conversation
@fabxc Could you have a look? |
Mh, okay – tricky section to touch but seems generally sound. |
Hmm, true @fabxc. But I decided to ignore it as it was the behaviour before this PR too. A go routine in a previous loop could update the current set. I was hoping it would even out in the next run of the update. Incase we want to do it right, the context would need to cancel the update midway too. Let me see if I can patch that. |
* This will fix any old update routines writing into new updates,
This introduces another place where we check the context. However, we are not guaranteed at all that the updating goroutines even run after the context gets canceled or are scheduled before the check is happening. So a possible interleaving could be that:
At any time during 2., the goroutine from 1. may be resurrected, thinking the context is still alive and continuing with modifying So line 321 must be mutex protected either way. The additional checking of the context reduces the likelihood of a race but does not remove it. |
Oh wow! Never thought of that! What do you suggest? Drop the latest commit and proceed with the first commit? I will rollback this commit then. Also, I am not able to get a good idea of how we can actually fix this. Just writing it down for (my) clarity: The update go routine keeps running in the background and when a reset is requested, it needs to exit. But by the time it learns about the reset, it might be midway updating the map! But any ill effect caused by having a previous entry will fix itself in the next update run. |
Yes, as you summarized. So guarding line 231/236 with another locking of the mutex in addition should solve the issue. |
This reverts commit 02918b6.
ping @fabxc I made the changes you requested. |
LGTM |
Fixes #2444
I think we are protecting only the map as mentioned in the above issue. This is much cleaner and will avoid the race when there are multiple providers (We are launching a go routine per provider and not serializing access there).
The only question I have is whether this line needs to be protected: https://github.com/prometheus/prometheus/blob/master/discovery/discovery.go#L236