Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node statistics improvements #222

Closed
Sannis opened this issue May 25, 2018 · 9 comments
Closed

Node statistics improvements #222

Sannis opened this issue May 25, 2018 · 9 comments

Comments

@Sannis
Copy link
Contributor

Sannis commented May 25, 2018

During setup centrifugo on production I faced some lack of statistics that we usually use for monitoring and telemetry.

Most important of them are:

  • running binary version
  • cpu rusage stats (currently there is only node_cpu_usage without users/system split and also it show ps output so is not very accurate)

Are you have any plans about adding such functionality in 1.x version? In not, I can volunteer for that.

P.S. Anyway thanks @FZambia for great product.

@FZambia
Copy link
Member

FZambia commented May 25, 2018

Hello @Sannis , thanks!

Regarding to CPU usage - Iooks like measuring this inside application itself is hacky, not accurate and not a convenient practice. This was done using ps because it was the simplest cross-platform way I found without using CGO at that moment. Also in today's container world CPU must be measured per-container from outside as there is no such info inside container. CPU usage stats will be removed in v2 (at least until someone knows how to do this right).

Binary version - this is useful, at moment stats look like this:

"body": {
    "data": {
        "nodes": [
            {
                "uid": "8a7cf241-62c3-4ef5-8a20-83fbadb74c6f",
                "name": "Alexanders-MacBook-Pro.local_8000",
                "started_at": 1480764329,
                "metrics": {
                    "client_api_15_count": 0,
                    ...
                    "node_uptime_seconds": 120
                }
            }
        ],
        "metrics_interval": 60
    }
}

I think the correct way for Centrifugo v1.x will be adding version field on top level of each node info i.e.:

"body": {
    "data": {
        "nodes": [
            {
                "uid": "8a7cf241-62c3-4ef5-8a20-83fbadb74c6f",
                "version": "1.7.10",
                "name": "Alexanders-MacBook-Pro.local_8000",
                "started_at": 1480764329,
                "metrics": {
                    "client_api_15_count": 0,
                    ...
                    "node_uptime_seconds": 120
                }
            }
        ],
        "metrics_interval": 60
    }
}

In version 2 Centrifugo will have different metrics mechanics - Prometheus and automatic Graphite exporter - so this must be ported to v2 in a bit different way - using labels, sth like in Prometheus itself:

prometheus_build_info{branch="HEAD",goversion="go1.9.2",revision="cd5e2fe68741a60ba90b80fbc161d0581be9e8bd",version="2.2.0-rc.1"} 1

What do you think?

Sannis added a commit to badoo/centrifugo that referenced this issue May 28, 2018
Sannis added a commit to badoo/centrifugo that referenced this issue May 30, 2018
@Sannis
Copy link
Contributor Author

Sannis commented May 30, 2018

Regarding to CPU usage - Iooks like measuring this inside application itself is hacky, not accurate and not a convenient practice. This was done using ps because it was the simplest cross-platform way I found without using CGO at that moment. Also in today's container world CPU must be measured per-container from outside as there is no such info inside container. CPU usage stats will be removed in v2 (at least until someone knows how to do this right).

The problem with ps described in man:

CPU usage is currently expressed as the percentage of time spent
running during the entire lifetime of a process. This is not ideal,
and it does not conform to the standards that ps otherwise conforms
to. CPU usage is unlikely to add up to exactly 100%.

So actually I don't know how this metric can be used for real-time monitoring.
This is how ps aux cpu usage percent matches getrusage utime/stime derivative:
centrifugolsstat_requests_for_all_services_ miami _201805292001_201805300458
centrifugolsstat_rusage_for_all_services_ miami _201805292001_201805300458
centrifugolsstat_cpu_usage_for_all_services_ miami _201805292001_201805300458

@FZambia
Copy link
Member

FZambia commented Jun 4, 2018

@Sannis very sorry for a long reply time - was attending Gophercon EU. Thanks for pointing out how ps behaves. As I said above I think Centrifugo should avoid measuring CPU stats itself so I won't backport this to v2 but we can add this to v1 I suppose. Do you know how this syscall behaves on Windows btw?

@mkevac
Copy link

mkevac commented Jun 5, 2018

Do you know how this syscall behaves on Windows btw?

It does not work on Windows AFAIK. But ps does not work either...

As I said above I think Centrifugo should avoid measuring CPU stats itself

Also in today's container world CPU must be measured per-container from outside as there is no such info inside container.

This is very limiting.
Even if we were using containers for centrifugo, we would have to put only centrifugo inside and we would need separate tools for measuring CPU usage. It makes whole infrastucture more complex.

Maybe it's not important for centrifugo and Go, but imagine some service uses several processes or forks periodically. Measuring CPU usage from outside would not be precise enough. It would be per container, not per process. I don't even mention that sometimes it could be useful to measure CPU usage per thread (not for Go though).

@FZambia
Copy link
Member

FZambia commented Jun 5, 2018

I always tried to make Centrifugo production ready out of the box. @mkevac @Sannis so you think that porting rusage utime and stime to v2 makes sense right?

@mkevac
Copy link

mkevac commented Jun 5, 2018

Adding this code in v1 makes sense. I see no reason not to. It is useful and @Sannis is prepared to support it.
Porting this code to v2 is not so useful, because you don't need it and we at Badoo don't use it. It would be a dead code. I would wait until someone asks for it or @Sannis makes a patch :-)

@Sannis
Copy link
Contributor Author

Sannis commented Jun 6, 2018

@FZambia thanks!

I always tried to make Centrifugo production ready out of the box. @mkevac @Sannis so you think that porting rusage utime and stime to v2 makes sense right?

There is gopsutil package that provide such stats not only for *nix systems, if you need it.

@Sannis Sannis closed this as completed Jun 7, 2018
@FZambia
Copy link
Member

FZambia commented Jun 7, 2018

@Sannis thanks for improvements! I will try to make new release soon, there are several small changes in master.

@FZambia
Copy link
Member

FZambia commented Jun 24, 2018

Also pushed a fix for Windows in 991fae6 - syscalls not defined:

libcentrifugo/node/cpu.go:57:9: undefined: syscall.Getrusage
libcentrifugo/node/cpu.go:57:27: undefined: syscall.RUSAGE_SELF
libcentrifugo/node/cpu.go:62:9: undefined: syscall.TimevalToNsec
libcentrifugo/node/cpu.go:62:37: rusage.Utime undefined (type *syscall.Rusage has no field or method Utime)
libcentrifugo/node/cpu.go:62:46: undefined: syscall.TimevalToNsec
libcentrifugo/node/cpu.go:62:74: rusage.Stime undefined (type *syscall.Rusage has no field or method Stime)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants