Node statistics improvements #222

Sannis · 2018-05-25T14:52:17Z

During setup centrifugo on production I faced some lack of statistics that we usually use for monitoring and telemetry.

Most important of them are:

running binary version
cpu rusage stats (currently there is only node_cpu_usage without users/system split and also it show ps output so is not very accurate)

Are you have any plans about adding such functionality in 1.x version? In not, I can volunteer for that.

P.S. Anyway thanks @FZambia for great product.

FZambia · 2018-05-25T16:58:34Z

Hello @Sannis , thanks!

Regarding to CPU usage - Iooks like measuring this inside application itself is hacky, not accurate and not a convenient practice. This was done using ps because it was the simplest cross-platform way I found without using CGO at that moment. Also in today's container world CPU must be measured per-container from outside as there is no such info inside container. CPU usage stats will be removed in v2 (at least until someone knows how to do this right).

Binary version - this is useful, at moment stats look like this:

"body": {
    "data": {
        "nodes": [
            {
                "uid": "8a7cf241-62c3-4ef5-8a20-83fbadb74c6f",
                "name": "Alexanders-MacBook-Pro.local_8000",
                "started_at": 1480764329,
                "metrics": {
                    "client_api_15_count": 0,
                    ...
                    "node_uptime_seconds": 120
                }
            }
        ],
        "metrics_interval": 60
    }
}

I think the correct way for Centrifugo v1.x will be adding version field on top level of each node info i.e.:

"body": {
    "data": {
        "nodes": [
            {
                "uid": "8a7cf241-62c3-4ef5-8a20-83fbadb74c6f",
                "version": "1.7.10",
                "name": "Alexanders-MacBook-Pro.local_8000",
                "started_at": 1480764329,
                "metrics": {
                    "client_api_15_count": 0,
                    ...
                    "node_uptime_seconds": 120
                }
            }
        ],
        "metrics_interval": 60
    }
}

In version 2 Centrifugo will have different metrics mechanics - Prometheus and automatic Graphite exporter - so this must be ported to v2 in a bit different way - using labels, sth like in Prometheus itself:

prometheus_build_info{branch="HEAD",goversion="go1.9.2",revision="cd5e2fe68741a60ba90b80fbc161d0581be9e8bd",version="2.2.0-rc.1"} 1

What do you think?

Sannis · 2018-05-30T10:51:52Z

Regarding to CPU usage - Iooks like measuring this inside application itself is hacky, not accurate and not a convenient practice. This was done using ps because it was the simplest cross-platform way I found without using CGO at that moment. Also in today's container world CPU must be measured per-container from outside as there is no such info inside container. CPU usage stats will be removed in v2 (at least until someone knows how to do this right).

The problem with ps described in man:

CPU usage is currently expressed as the percentage of time spent
running during the entire lifetime of a process. This is not ideal,
and it does not conform to the standards that ps otherwise conforms
to. CPU usage is unlikely to add up to exactly 100%.

So actually I don't know how this metric can be used for real-time monitoring.
This is how ps aux cpu usage percent matches getrusage utime/stime derivative:

FZambia · 2018-06-04T21:13:05Z

@Sannis very sorry for a long reply time - was attending Gophercon EU. Thanks for pointing out how ps behaves. As I said above I think Centrifugo should avoid measuring CPU stats itself so I won't backport this to v2 but we can add this to v1 I suppose. Do you know how this syscall behaves on Windows btw?

mkevac · 2018-06-05T18:46:13Z

Do you know how this syscall behaves on Windows btw?

It does not work on Windows AFAIK. But ps does not work either...

As I said above I think Centrifugo should avoid measuring CPU stats itself

Also in today's container world CPU must be measured per-container from outside as there is no such info inside container.

This is very limiting.
Even if we were using containers for centrifugo, we would have to put only centrifugo inside and we would need separate tools for measuring CPU usage. It makes whole infrastucture more complex.

Maybe it's not important for centrifugo and Go, but imagine some service uses several processes or forks periodically. Measuring CPU usage from outside would not be precise enough. It would be per container, not per process. I don't even mention that sometimes it could be useful to measure CPU usage per thread (not for Go though).

FZambia · 2018-06-05T21:21:51Z

I always tried to make Centrifugo production ready out of the box. @mkevac @Sannis so you think that porting rusage utime and stime to v2 makes sense right?

mkevac · 2018-06-05T21:38:34Z

Adding this code in v1 makes sense. I see no reason not to. It is useful and @Sannis is prepared to support it.
Porting this code to v2 is not so useful, because you don't need it and we at Badoo don't use it. It would be a dead code. I would wait until someone asks for it or @Sannis makes a patch :-)

Sannis · 2018-06-06T09:19:13Z

@FZambia thanks!

I always tried to make Centrifugo production ready out of the box. @mkevac @Sannis so you think that porting rusage utime and stime to v2 makes sense right?

There is gopsutil package that provide such stats not only for *nix systems, if you need it.

FZambia · 2018-06-07T20:25:48Z

@Sannis thanks for improvements! I will try to make new release soon, there are several small changes in master.

FZambia · 2018-06-24T15:10:14Z

Also pushed a fix for Windows in 991fae6 - syscalls not defined:

libcentrifugo/node/cpu.go:57:9: undefined: syscall.Getrusage
libcentrifugo/node/cpu.go:57:27: undefined: syscall.RUSAGE_SELF
libcentrifugo/node/cpu.go:62:9: undefined: syscall.TimevalToNsec
libcentrifugo/node/cpu.go:62:37: rusage.Utime undefined (type *syscall.Rusage has no field or method Utime)
libcentrifugo/node/cpu.go:62:46: undefined: syscall.TimevalToNsec
libcentrifugo/node/cpu.go:62:74: rusage.Stime undefined (type *syscall.Rusage has no field or method Stime)

Sannis added a commit to badoo/centrifugo that referenced this issue May 28, 2018

add version to node info, refs centrifugal#222

10c95cf

FZambia pushed a commit that referenced this issue May 28, 2018

add version to node info, refs #222 (#224)

51537f5

Sannis added a commit to badoo/centrifugo that referenced this issue May 30, 2018

use getrusage for cpu stats, refs centrifugal#222

78e2435

Sannis mentioned this issue May 30, 2018

Add getrusage to get separate user/system CPU usage times #225

Merged

FZambia pushed a commit that referenced this issue Jun 5, 2018

use getrusage for cpu stats, refs #222 (#225)

afa1127

Sannis closed this as completed Jun 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Node statistics improvements #222

Node statistics improvements #222

Sannis commented May 25, 2018

FZambia commented May 25, 2018 •

edited

Loading

Sannis commented May 30, 2018

FZambia commented Jun 4, 2018

mkevac commented Jun 5, 2018

FZambia commented Jun 5, 2018

mkevac commented Jun 5, 2018

Sannis commented Jun 6, 2018 •

edited

Loading

FZambia commented Jun 7, 2018

FZambia commented Jun 24, 2018

Node statistics improvements #222

Node statistics improvements #222

Comments

Sannis commented May 25, 2018

FZambia commented May 25, 2018 • edited Loading

Sannis commented May 30, 2018

FZambia commented Jun 4, 2018

mkevac commented Jun 5, 2018

FZambia commented Jun 5, 2018

mkevac commented Jun 5, 2018

Sannis commented Jun 6, 2018 • edited Loading

FZambia commented Jun 7, 2018

FZambia commented Jun 24, 2018

FZambia commented May 25, 2018 •

edited

Loading

Sannis commented Jun 6, 2018 •

edited

Loading