-
Notifications
You must be signed in to change notification settings - Fork 801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix incorrect use of reentrant locks #1076
Conversation
a19124d
to
b0210f5
Compare
Giving this some more thought, generally I agree that a deadlock or error would be better than incorrect data. Does this only happen for methods like @btimby, I would also like your thoughts as you are the one who needs the RLock behavior. Would having normal locks in values.py still be sufficient for your use case? |
Certainly, a deadlock is preferable to corrupt data. It may be the case that the change I proposed was too heavy-handed. In other words, my particular deadlock did not involve the changing of data, but the changing of the metrics registry. In fact, my problem occurred at startup. Manipulation of metric values may require a different locking strategy than registry access. I am actually using |
Great, that is what I was thinking, keep the reentrant lock where possible in the registry and use regular locks elsewhere. Do you have a link to a reproducible example? Otherwise I can try to set something up soon. |
I am sorry, I did not record anything at the time. This was just a small issue that I ran into when setting up a new project. I debugged the code and proposed a fix and moved on. This is what I described in the initial issue:
I did a bit of digging just now to refresh my memory, and here is what I think triggered my initial problem. Keep in mind that I am using Basically, I was adding a custom So I think you could recreate my issue simply by creating a Here is the code I believe caused the issue: from typing import Generator
from django.db.models import OuterRef, Subquery
from django.utils import timezone
from django.core.cache import cache
from prometheus_client.core import GaugeMetricFamily, Metric
from prometheus_client.registry import Collector
from schedule.models import MatchTimeSlot
from facility.models import Facility
METRIC_IN_USE = GaugeMetricFamily(
'two47dink_schedule_courts_in_use',
'Number of courts reserved for matches',
labels=['facility']
)
class MatchCollector(Collector):
"""
Collect court metrics for each facility
"""
def collect(self) -> Generator[Metric, None, None]:
now = timezone.now().replace(
minute=0, second=0, microsecond=0,
)
key = f'{__name__}.metrics-results'
cached_results = cache.get(key)
if not cached_results:
match_count = MatchTimeSlot.objects \
.filter(facility_id=OuterRef('id'), time=now) \
.values('spanning')
facilities = Facility.objects.all().annotate(
match_count=Subquery(match_count),
)
cached_results = {
facility.name: {'match_count': facility.match_count}
for facility in facilities
}
cache.set(key, cached_results, 60)
for facility_name, results in cached_results.items():
METRIC_IN_USE.add_metric(
labels=[facility_name], value=results.get('match_count') or 0.0
)
yield METRIC_IN_USE
def describe(self) -> Generator[Metric, None, None]:
yield METRIC_IN_USE I added the optional https://github.com/korfuri/django-prometheus/blob/master/django_prometheus/cache/backends/redis.py And now that I look at that code, it is indeed modifying counters related to cache access, so it may be modifying data (and not just the registry). OTOH, django does a lot of lazy loading and imports, so it may be the case that the cache backend is not loaded (and thus metrics are not initialized) until the cache is accessed. I know my original description of the problem specifically states that the registry is accessed in a nested manner, and I know that I did run this code under a debugger so I would tend to trust past me on this assessment. I hope this helps, I can try to provide more information if you need it. |
Oh, I think/hope the I'd expect that in practice people may have been running into deadlocks here (metric updates) mostly in case when they're handling signals (signal handlers invoked during the execution of the critical section held by the lock) and during signal handling they're updating some metrics. This is what we've seen and that's how I stumbled upon the RLock PR. In that case I think the proper fix is keeping the application's signal handling code very simple and not allowing it to update prometheus metrics (at least not in the multiprocess mode).
Hmm, the code protected by CollectorRegistry._lock does not look reentrant safe to me either. Same with locking in metrics.py. There's nothing special about the access happening from within the same thread that is currently holding a lock that would make it safe to re-enter these contexts. I think if anything, allowing that will make it harder to spot/debug the resulting issues?
This makes sense indeed if your custom Collector's .collect() resulted in lazy loading the django-prometheus' custom cache module which imports the module that registers the cache metrics in the first place. Perhaps a solution in your case would be to make sure the cache module is already loaded properly somehow. Or like you say, use the .describe() method which does not access the cache. Thank you both for looking at this! |
Yep, I agree. What I was thinking is that the same thread should be able to instantiate the same metric lazily which is why a reentrant lock sounded like a good idea, but yeah it is not safe in that case either. If your issue can be solved by implementing For this PR let's start simple and just move the RLocks back to Locks. I somewhat like the idea of raising an error instead, but I don't think value is the right place since it sounds like the use case is for a custom collector. |
b0210f5
to
eb874cf
Compare
Sure thing and done.
I think it can happen in all cases, and we have certainly been affected by these elusive deadlocks in production caused by our gunicorn wrapper instrumenting the gunicorn main process' logging with sentry and related prometheus metrics (recording http metrics for requests sent to sentry). So this is a single threaded context in which the main process was sometimes in the middle of incrementing a prometheus counter as a result of sending an event to sentry as a result of gunicorn main process logging an error log about eg. terminating an unresponsive worker. In the middle of that when the global multiprocessing mode lock in values.py was held, a signal handler for SIGCHLD was invoked in that main process in response to some other worker terminating. During that signal handling the main process also logs an error message which caused the sentry event and a corresponding prometheus counter update -> deadlock. But the proper fix in this case is for us to be more careful/conservative with what we do during signal handling, or in this case what we instrumented gunicorn to do. We'd certainly find this issue quicker if there was an error instead. If people think it makes sense I can propose a separate PR with a change that adds this for all locks. |
This fixes a correctness bug introduced in 0014e97 resulting in lost updates during some scenarios. The code being locked is not reentrant safe. It's preferable to deadlock in these situations instead of silently loosing updates for example. Signed-off-by: Przemysław Suliga <mail@suligap.net>
eb874cf
to
19561ae
Compare
This fixes a correctness bug introduced in 0014e97 resulting in lost updates during some scenarios. The code being locked is not reentrant safe. It's preferable to deadlock in these situations instead of silently loosing updates for example. Signed-off-by: Przemysław Suliga <mail@suligap.net>
Beyond merging, I have backported this and released v0.21.1 with just this fix. |
I also do think a separate PR that raises an error is a good idea, or at least worth discussing further in that PR. |
Would it be overly hard to allow the user to select or override the locking object? Perhaps lock object is parameterized. Then raising could be achieved using a specialized lock. |
Detect deadlocks during the library misuse by injecting code into the critical sections that itself might want to obtain the lock. A follow up to prometheus#1076. Signed-off-by: Przemysław Suliga <mail@suligap.net>
Detect deadlocks during the library misuse, eg. by injecting code into the critical sections that itself might want to obtain the relevant lock. A follow up to prometheus#1076. Signed-off-by: Przemysław Suliga <mail@suligap.net>
We have been affected by deadlocks caused by talisker instrumenting the gunicorn main process' logging with sentry and related prometheus metrics. The deadlocks were on a lock in prometheus_client. So this is a single threaded context in which the main process was sometimes in the middle of incrementing a prometheus counter as a result of sending an event to sentry as a result of gunicorn main process logging an error log about eg. terminating an unresponsive worker. In the middle of that when the global multiprocessing mode lock in values.py was held, a signal handler for SIGCHLD was invoked in that main process in response to some other worker terminating. During that signal handling the main process also logs an error message which caused the sentry event and a corresponding prometheus counter update -> deadlock. So in order to be more careful/conservative with what we do during signal handling, or in this case what we instrument gunicorn to do, this change sets up all of the requests related metrics to not be emitted from the process they were created in (which for these metrics is the gunicorn arbiter if we're running gunicorn). More details on prometheus client behavior in prometheus/client_python#1076
We have been affected by deadlocks caused by talisker instrumenting the gunicorn main process' logging with sentry and related prometheus metrics. The deadlocks were on a lock in prometheus_client. So this is a single threaded context in which the main process was sometimes in the middle of incrementing a prometheus counter as a result of sending an event to sentry as a result of gunicorn main process logging an error log about eg. terminating an unresponsive worker. In the middle of that when the global multiprocessing mode prometheus client lock in values.py was held, a signal handler for SIGCHLD was invoked in that main process in response to some other worker terminating. During that signal handling the main process also logs an error message which caused the sentry event and a corresponding prometheus counter update -> deadlock. So in order to be more careful/conservative with what we do during signal handling, or in this case what we instrument gunicorn to do, this change sets up all of the requests related metrics to not be emitted from the process they were created in (which for these metrics is the gunicorn arbiter if we're running gunicorn). More details on prometheus client behavior in prometheus/client_python#1076
At least for now stop using the TaliskerRequestsTransport. In practice, for flask apps it was not used in the apps/workers because FlaskSentry was explicitly setting a None 'transport' in the client kwargs. This change makes sure that the same default raven (threaded) transport is used by logging in the main process. This will mitigate the deadlocks described in prometheus/client_python#1076 (comment) because there will be no prometheus instrumentation for sentry requests.
Detects and prevents deadlocks during the library misuse, eg. by injecting code into the critical sections that itself might want to obtain the relevant lock. A follow up to prometheus#1076. Signed-off-by: Przemysław Suliga <mail@suligap.net>
The first commit fixes a correctness bug introduced in
0014e97 resulting in eg. lost updates during some scenarios.
The code being locked is not reentrant safe. It's preferable to deadlock in these situations instead of silently loosing updates for example or allowing data corruption (mmap_dict).
Consider this example
Executing this with reentrant locks in place results in (both multiprocess and single process mode):
We would expect the value of the bug=race counter to be 110 however.
With this fix, this code will:
It's also likely to happen if someone tries to write metrics during signal handling. Again, it's preferable to deadlock instead of being incorrect IMHO.
I'm open to dropping the second commit (it does add some complexity and maybe not everyone will like the overhead of an extra lock), but it may be a useful quality of life improvement (crashing is usually better than a deadlock).
Let me know what you think!