-
-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor metrics to be more accurate and usefull #68
Comments
Hello! Thanks for detailed proposal! I agree with almost everything, the only thing: at work we send metrics directly to Graphite and just want to see actual values aggregated over one minute interval, so for us it's important to keep current behaviour with counters. So it seems we need to support both ways And I especially agree with you thoughts about current timer behaviour - absolutely useless:) |
Interesting, how does that work - I haven't seen options to configure output to graphite in the code? Keeping optional time-based counters seems reasonable - I did originally suggest that but my ticket got too long so I simplified ;) |
Hm I see go-metrics supports graphite output but I don't see any code that configures that, plus if you did it at go-metrics level it wouldn't use the existing metrics interval stuff anyway... |
Our main site written in Django and we use Celery to do async or periodic jobs, so we just ask @app.task
def export_centrifugo_metrics():
client = Client(settings.CENTRIFUGO_ADDRESS, settings.CENTRIFUGO_SECRET, timeout=2)
stats, error = client.stats()
if error:
raise error
fields = (
"cpu_usage", "num_goroutine", "num_msg_published",
"time_api_max", "num_api_requests", "time_client_max",
"time_client_mean", "num_unique_clients", "num_msg_queued",
"memory_sys", "bytes_client_in", "num_cpu", "num_clients",
"time_api_mean", "num_client_requests", "num_msg_sent",
"bytes_client_out", "gomaxprocs", "num_channels"
)
nodes = stats["nodes"]
for node_info in nodes:
name = node_info["name"]
for field in fields:
value = node_info.get(field)
graphite_key = "centrifugo.%s.%s" % (name, field)
send_to_graphite([graphite_key, value, time.time()]) |
lol, just thought that we have a problem with counters with this approach:) UPD: no problem - forgot how metrics in Centrifugo work)) |
Yeah that is equivalent to what Diamond does although diamond already supports remembering the last value of a counter and only submitting the differnce each 10 seconds (or however often it runs). In your setup, sync between when celery job runs and when Centrifugo aggregates could cause races where same values get submitted fro 2 minutes and one real minute get lost. This is why plain counters are more general and the external tool should manage figuring out rate of change base don when it actually executes. |
Anyway we can keep the 1 min summaries if you want - it does save you from keeping persistent state elsewhere in your setup. |
From gitter discussion: First option would be to just keep single
But this has one unpleasant drawback: raw counts from non-local nodes will be subtly different semantically from the local node since they will only update once every ping interval. It's also generally quite messy. The alternative which I think is cleaner and I prefer is to add a new method
Either way we should also:
|
@banks created separate issue for timers, closing this one, thanks a lot! |
I'm looking into the metrics that Centrifugo exposes via
stats
api call as we move to put it into production.I have a couple of issues that are not show-stoppers but I think can be better:
name
to _, That's not just fiddly, it feels like a hack to rely on knowing how centrifugo chooses to name nodes, and if--name
is given at runtime in someone's setup it will break. Relying on--name
is marginally better but requires that your centrifugo config and diamond config agree on the name to look for, and that you coordinate unique names across the cluster somehow. All of this is possible with good config management but it seems way to hard for something so simple.node_metrics_interval
). Which is OK if you are consuming them directly, but for counters it makes it impossible for external process like diamond to track actual number of requests in between polls. I.e. you can't get better than configured granularity.node_metrics_interval
of a moving decaying sample... So what you actually see is the "recent weighted" mean/max value as of up to 1 minute ago.As well as that, there is also a minor race condition:
updateMetricsOnce
locks the metrics mutex and copies values from counters and histogramsProposal
Here is one option that I think solves all my issues, it's quite a big change and there are certainly variations that would still work for our case if there are downsides I don't see with this solution.
go-metrics
. It's useful and has some interesting properties, but we are already losing almost all the value by adding the interval based update to publically visible values. In addition, sampled histograms are wrong way to measure latency, and without those, the rest of what we use can just be replaced with integer counters.local_node_only
param tostats
OR create new api endpointlocalstats
node_metrics_interval
and periodic updating - just return current counter values on each call tostats
(or each ping message sent to other servers)The text was updated successfully, but these errors were encountered: