Basic initial instrumentation of the apiserver #4272
Conversation
The prometheus guys conveniently just merged a fix for the build issue tonight. I'm pulling it into our Godeps in #4275. After that gets merged, I can revert the golang.sh change and this should be good to go from a build health perspective.
Although I'd still like to know if it's really intended behavior that our e2e/integration tests pull in most of our server code and are expected to run on all client platforms :)
Merged, feel free to rebase :)
func init() {
	// All Prometheus metrics have to be explicitly registered to appear on the metrics endpoint.
	prometheus.MustRegister(apiserverLatencies)
	prometheus.MustRegister(redirectCounter)
}
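For context, `redirectCounter` is registered here but not declared in this excerpt. A minimal sketch of what such a declaration might look like with client_golang; the metric name, help text, and label set are assumptions, not the PR's actual code:

```go
import "github.com/prometheus/client_golang/prometheus"

// Hypothetical declaration; name, help text, and labels are assumptions.
var redirectCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_redirect_count",
		Help: "Counter of redirect requests served by the apiserver, broken down by HTTP response code.",
	},
	[]string{"code"},
)
```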
nit: Init each of these in the file they're declared in. Or declare them here? We don't use apiserverLatencies here anyways.
Doh, for some reason I thought there could only be one init function per package, not per file. I was keeping the latencies one here because it's going to be shared across most of the files in the package and this seemed like a central place to put it.
Looks good, minor nits. If we do the defer-monitor a lot we may want to make a pattern out of it.
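As a rough illustration of the defer-monitor idea, the instrumentation could be factored into a small helper like the one below. The helper name, the metric variables, and their label sets are hypothetical, not the PR's code:

```go
import "time"

// monitor is a hypothetical helper: called via defer at the top of a
// handler, it bumps a request counter and records the handler's latency
// in microseconds. requestCounter and apiserverLatencies are assumed to
// be a CounterVec and a SummaryVec labeled by handler and verb.
func monitor(handler, verb string, start time.Time) {
	requestCounter.WithLabelValues(handler, verb).Inc()
	apiserverLatencies.WithLabelValues(handler, verb).Observe(
		float64(time.Since(start) / time.Microsecond))
}

// Usage:
//	func (r *RedirectHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {
//		defer monitor("redirect", req.Method, time.Now())
//		// ... handle the request ...
//	}
```

Because defer evaluates its arguments immediately, time.Now() captures the start of the handler while monitor itself only runs when the handler returns.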
Thanks for the feedback, PTAL. I'm not sold that the breakdown of metrics (one common one for latency then separate ones with more detail) or labels (do we really care about latency per response code but not per resource?) makes the most sense, so I'd be happy if there's feedback there as well. Perhaps @nikhiljindal has thoughts on what's most useful to monitor? Otherwise I'll just run forward with this to make progress.
LGTM. Is it possible to have unit tests for Prometheus metrics?
Name: "apiserver_request_latencies", | ||
Help: "Response latency summary in microseconds for each request handler, verb, and HTTP response code.", | ||
}, | ||
[]string{"handler", "verb", "code"}, |
What do you envision being the handlers? Today I see redirect and rest. I'm not sure what information you're interested in gathering; would breaking down by resource be more useful?
There are also watch, proxy, and validate handlers, which I was going to add in a follow-up PR (although validate isn't too important). I can add them to this one if it would make things easier.
But yeah, I think including resource would be reasonable as well. I'll add it in. I'm just worried about growing the cardinality of time series too far, since a Summary metric actually gets converted into 5 exported metrics (0.5, 0.9, and 0.99 quantiles, plus a sum and a count).
What about breaking down the latency summary just by handler and verb, and then having a separate CounterVec metric with handler, verb, resource, and code?
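Concretely, that split might look something like the sketch below; the metric names, help strings, and exact label sets are assumptions drawn from this thread rather than the final code:

```go
var (
	// Low-cardinality latency summary: each label combination costs five
	// exported series (three quantiles plus a sum and a count), so it is
	// kept to handler and verb only.
	requestLatencies = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{
			Name: "apiserver_request_latencies",
			Help: "Response latency summary in microseconds, broken down by handler and verb.",
		},
		[]string{"handler", "verb"},
	)

	// Finer-grained request counter: a counter is a single series per label
	// combination, so it can afford resource and response code as well.
	requestCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "apiserver_request_count",
			Help: "Counter of apiserver requests, broken down by handler, verb, resource, and HTTP response code.",
		},
		[]string{"handler", "verb", "resource", "code"},
	)
)
```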
That SGTM. Don't worry about adding the other handlers in this PR.
@vishh It doesn't look like it judging by the lack of any public accessor methods, but I'm asking on their email list - https://groups.google.com/forum/#!topic/prometheus-developers/fOBc22XnNrY
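For what it's worth, later releases of client_golang added a testutil package that makes this kind of assertion possible. A minimal sketch, assuming that package and the hypothetical redirectCounter from the earlier sketch:

```go
import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

// TestRedirectCounter checks that serving a redirect bumps the counter.
// It assumes the hypothetical redirectCounter CounterVec labeled by code.
func TestRedirectCounter(t *testing.T) {
	before := testutil.ToFloat64(redirectCounter.WithLabelValues("302"))

	// Exercise the code under test; here we just increment directly as a stand-in.
	redirectCounter.WithLabelValues("302").Inc()

	if got := testutil.ToFloat64(redirectCounter.WithLabelValues("302")) - before; got != 1 {
		t.Errorf("redirect counter for code 302 changed by %v, want 1", got)
	}
}
```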
Should be looking better now. Thanks for the feedback!
LGTM. Build is failing on Shippable, but clicking on the link takes me to their main page, so no idea what's failing :-\
So are 3 of the 4 most recently filed PRs. Shippable hasn't been very dependable so far...
@a-robinson How about adding a TODO to add unit tests once that's possible?
the "code" label from the latencies SummaryVec.
Done.
Minus the CIs, I think this is ready to merge. Or do you think it is still a WIP, @a-robinson?
It's ready to merge.
Basic initial instrumentation of the apiserver
This links in the prometheus library for monitoring, which exports some basic resource usage metrics by default, like number of goroutines, open file descriptors, and resident and virtual memory. I've also started adding in request counters and latency histograms, but have only added them to two of our HTTP handlers. If this looks reasonable, I'll add them to the rest of the handlers in a second PR.
Unfortunately, the prometheus client library seems to break the build for two of the client platforms (linux/386 and linux/arm). This shouldn't be merged until I can get that fixed (and remove the workaround from the PR), but I'm sending this out now so that hopefully someone can explain why in the world we're requiring our apiserver package to be compilable on all client platforms. It seems really weird that we pull it into tests that we require to run on all of them, when the server binary itself only has to run on linux/x86.
Issue #1625
cc @vishh @vmarmol
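For completeness, exposing the registered metrics (including the default process metrics mentioned above) only requires mounting the library's HTTP handler. A minimal sketch; the function and mux below are assumptions about how the apiserver wires up its handlers, and newer client_golang releases serve this handler from the promhttp package instead:

```go
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// installMetricsEndpoint is a hypothetical helper showing where the
// Prometheus handler might be mounted. prometheus.Handler() serves every
// registered metric plus the default process metrics (goroutines, open
// file descriptors, memory usage) on /metrics.
func installMetricsEndpoint(mux *http.ServeMux) {
	mux.Handle("/metrics", prometheus.Handler())
}
```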