Add a proposal for monitoring cluster performance #18020
This is a doc which I hope to fill up with details gathered from feedback.
Labelling this PR as size/L
GCE e2e test build/test passed for commit ce10566b30f7e282a1d7ebd4fc1f57792e1b444e.
GCE e2e test build/test passed for commit 250b218cf931f69fae22a7d50647b472b8051942.
Thanks @gmarek for raising this issue. This is super helpful for performance and scalability improvement work! IMHO, there are a couple of metrics that would be very useful in our performance debugging:
Thanks again!
Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in 1.1 branch in comparison to `old` 1.0
cut. In the end it turned out to be caused by `--v=4` (instead of `--v=2`) flag in the scheduler together with the flag `--alsologtostderr` which disables batching of log
lines. This caused wired behavior of the whole component.
s/wired/weird/
Two comments:
- while it was initially due to running with `--v=4`, it came up again due to some logging statements that had no V level applied, and couldn't be filtered away even with `--v=1` (hence Ensure Logging Conventions Are Implemented #17449)
- we were running with the default settings (equivalent to `--logtostderr=true`); I notice everyone keeps talking about `--alsologtostderr=true`, which is a non-default setting that only applies if you're running with a non-default setting of `--log-dir=some-nonempty-string`
... point being, running with the defaults is what led us here, but everyone seems to be implying they use non-default settings; should the defaults be changed?
You're right - I'm just used to the `--alsologtostderr` flag.
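To illustrate the first point in the thread above (statements with no V level cannot be filtered by `--v`), here is a minimal sketch. It is a toy model written for this discussion, not glog's implementation: `LeveledLogger`, `schedule_pods`, and `lines_logged` are hypothetical names, with `info` standing in for an unguarded call such as `glog.Infof(...)` and `v` standing in for a guarded call such as `glog.V(4).Infof(...)`.

```python
import io


class LeveledLogger:
    """Toy model of verbosity filtering (hypothetical, not glog itself)."""

    def __init__(self, verbosity, sink):
        self.verbosity = verbosity  # plays the role of the --v=N flag
        self.sink = sink

    def info(self, msg):
        # No V level applied: always written, whatever the verbosity is.
        self.sink.write(msg + "\n")

    def v(self, level, msg):
        # Guarded statement: written only if level <= configured verbosity.
        if level <= self.verbosity:
            self.sink.write(msg + "\n")


def schedule_pods(log, pods):
    """Hypothetical hot loop that logs once per pod at two different levels."""
    for pod in pods:
        log.info(f"attempting to schedule {pod}")          # unguarded
        log.v(4, f"detailed predicate results for {pod}")  # needs --v>=4


def lines_logged(verbosity, pods):
    """Run the loop at the given verbosity and count the emitted log lines."""
    sink = io.StringIO()
    schedule_pods(LeveledLogger(verbosity, sink), pods)
    return len(sink.getvalue().splitlines())
```

In this model, lowering the verbosity from 4 to 1 halves the output for the loop, but the unguarded message is still written once per pod, which is the situation described in the comment.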
cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability
@hongchaodeng - thanks for the comment. I incorporated the first two suggestions. As for the scheduler - I completely agree that it's probably the most problematic component now (I don't know if @wojtek-t agrees, but that's my observation from running experiments on big clusters), but I want to be a bit more specific about what to measure. The scheduler being the biggest problem is a new situation for us, which means we're getting better :) Please let me know what bottlenecks you find and what metrics should be in place to avoid them in the future.
GCE e2e build/test failed for commit e8b32b1b3f0def9d4364f22db022614fd300d938.
### Rate limit monitoring
Reverse of REST call monitoring done in the API server. We need to know when a given component increases the pressure it puts on the API server. As a proxy for number of
We've added backoff metrics as well.
What do you mean by the backoff metrics?
Is this really a doc we're going to merge, or more of a hit list of action items?
cc @jimmidyson
@timothysc I think we should merge it with the intent of changing it to a comprehensive monitoring doc when we have most of it done. I'd also want to leave the 'postmortem' part of it.
GCE e2e test build/test passed for commit bb82299.
This generally LGTM.
@k8s-bot test this
Tests are more than 48 hours old. Re-running tests.
GCE e2e build/test failed for commit bb82299.
@k8s-bot test this
Automatic merge from submit-queue
Auto commit by PR queue bot
GCE e2e test build/test passed for commit bb82299.
cc @wojtek-t @fgrzadkowski @lavalamp @davidopp @dchen1107 @yujuhong @timothysc @spiffxp @xiang90 @hongchaodeng @bgrant0607