
Add a proposal for monitoring cluster performance #18020

Merged 1 commit into kubernetes:master on Dec 14, 2015

Conversation

@gmarek (Contributor) commented Dec 1, 2015

@gmarek added kind/documentation and sig/scalability labels on Dec 1, 2015
@gmarek (Contributor, Author) commented Dec 1, 2015

This is a doc which I hope to fill in with details gathered from feedback.

@k8s-github-robot added kind/design and size/L labels on Dec 1, 2015
@k8s-github-robot

Labelling this PR as size/L

@k8s-bot commented Dec 1, 2015

GCE e2e test build/test passed for commit ce10566b30f7e282a1d7ebd4fc1f57792e1b444e.

@k8s-bot commented Dec 1, 2015

GCE e2e test build/test passed for commit 250b218cf931f69fae22a7d50647b472b8051942.

@hongchaodeng (Contributor) commented:
Thanks @gmarek for raising this issue. This is super helpful for performance and scalability improvement work!

IMHO, there are a couple of metrics that would be very useful in our performance debugging:

  • Queue length. There are queues (cache.FIFO) used in a few places, mostly as buffers. The length of such a buffer gives great insight: if it keeps accumulating items, the downstream consumer is saturated. For example, in our scheduler testing we added a Len() to cache.FIFO and printed the PodQueue length in `NextPod: func() *api.Pod {` (see the sketch after this list).
  • WaitGroup wait time. It waits! If there is a long tail, a metric on how long it waits could show you something. For example, while debugging performance we printed out how long it waits.
  • Scheduler metrics. We currently have metrics for scheduler e2e latency. Since we found the scheduler taking 60-200ms in a 1k-node cluster, we need more insight into what takes so long. @xiang90 did some benchmarking and profiling and found a few things. Meanwhile, we expect more fine-grained scheduler metrics to prevent regressions.
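To make the first bullet concrete, here is a minimal sketch (Go with Prometheus, not actual Kubernetes code) of exposing a FIFO queue's length as a scrapeable gauge rather than a log line. The metric name `scheduler_pending_pods`, the `lenQueue` interface, and `recordQueueLength` are illustrative assumptions; cache.FIFO did not expose a `Len()` at the time, which is why one had to be added.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pendingPods mirrors the idea of printing the PodQueue length in NextPod,
// but as a gauge a monitoring system can scrape.
var pendingPods = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "scheduler_pending_pods", // hypothetical metric name
	Help: "Number of pods waiting in the scheduler's FIFO queue.",
})

// lenQueue stands in for a cache.FIFO with an added Len() method.
type lenQueue interface {
	Len() int
}

// recordQueueLength would be called wherever the queue is popped (e.g. in
// NextPod); a steadily growing value means the downstream consumer is saturated.
func recordQueueLength(q lenQueue) {
	pendingPods.Set(float64(q.Len()))
}

func main() {
	prometheus.MustRegister(pendingPods)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```

A gauge like this makes the "buffer keeps growing" signal visible on a dashboard instead of only in test logs.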

Thanks again!


Review thread on the following excerpt from the proposal:

Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in the 1.1 branch in comparison to the `old` 1.0 cut. In the end it turned out to be caused by the `--v=4` (instead of `--v=2`) flag in the scheduler together with the flag `--alsologtostderr`, which disables batching of log lines. This caused wired behavior of the whole component.
Review comment (Member):
s/wired/weird/

Review comment (Member):
Two comments:

  • while it was initially due to running with --v=4, it came up again due to some logging statements that had no V level applied and so couldn't be filtered away even with --v=1 (hence Ensure Logging Conventions Are Implemented #17449); see the sketch after this list
  • we were running with the default settings (equivalent to --logtostderr=true); I notice everyone keeps talking about --alsologtostderr=true, which is a non-default setting that only applies if you're also running with a non-default --log-dir=some-nonempty-string... point being, running with the defaults is what led us here, but everyone seems to be implying they use non-default settings; should the defaults be changed?
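As a small illustration of the first point, here is a minimal sketch using the github.com/golang/glog package that Kubernetes components used at the time: an unleveled glog.Infof call is emitted regardless of --v, while a glog.V(4) call is dropped unless verbosity is at least 4, which is why statements without a V level cannot be filtered away. The message strings are invented for the example.

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	// glog registers -v, -logtostderr, -alsologtostderr, -log_dir, etc. on the
	// default flag set.
	flag.Parse()

	// Always emitted, whatever -v is set to; statements like this are what
	// #17449 asks to convert to leveled calls.
	glog.Infof("scheduled pod %q", "example-pod")

	// Emitted only when run with -v=4 or higher.
	glog.V(4).Infof("detailed scheduling trace for pod %q", "example-pod")

	glog.Flush()
}
```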

Review comment from @gmarek (Contributor, Author):

You're right - I'm just used to the --alsologtostderr flag.

@bgrant0607 assigned a-robinson and unassigned bgrant0607 on Dec 1, 2015
@ncdc (Member) commented Dec 2, 2015

cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability

@gmarek (Contributor, Author) commented Dec 2, 2015

@hongchaodeng - thanks for the comment. I incorporated the first two suggestions.

As for the scheduler - I completely agree that it's probably the most problematic component right now (I don't know if @wojtek-t agrees, but that's my observation from running experiments on big clusters), but I want to be more specific about what to measure. The scheduler being the biggest problem is a new situation for us, which means we're getting better :) Please let me know what bottlenecks you find and what metrics should be in place to avoid them in the future.

@k8s-bot commented Dec 2, 2015

GCE e2e build/test failed for commit e8b32b1b3f0def9d4364f22db022614fd300d938.


Review thread on the following excerpt from the proposal:

### Rate limit monitoring

Reverse of REST call monitoring done in the API server. We need to know when a given component increases the pressure it puts on the API server. As a proxy for number of
Review comment (Member):

We've also added backoff metrics.

Review comment from @gmarek (Contributor, Author):

What do you mean by the backoff metrics?
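To ground the excerpt and the backoff question, here is a minimal sketch of what client-side rate-limit monitoring could look like: a histogram of how long outgoing requests spend blocked in a client-side rate limiter before being sent, so that a component putting growing pressure on the API server shows up as a rising wait-time tail. The metric name, `throttledDo`, and the use of golang.org/x/time/rate are illustrative assumptions, not the actual Kubernetes client code.

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"golang.org/x/time/rate"
)

var rateLimitWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "client_rate_limiter_wait_seconds", // hypothetical metric name
	Help:    "Time outgoing requests spend blocked in the client-side rate limiter.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 12),
})

// throttledDo blocks until the limiter admits the request and records how long
// that took; a saturated limiter (the "pressure" the proposal wants to observe)
// shows up as a growing tail in the histogram.
func throttledDo(ctx context.Context, limiter *rate.Limiter, req *http.Request) (*http.Response, error) {
	start := time.Now()
	if err := limiter.Wait(ctx); err != nil {
		return nil, err
	}
	rateLimitWait.Observe(time.Since(start).Seconds())
	return http.DefaultClient.Do(req)
}

func main() {
	prometheus.MustRegister(rateLimitWait)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9091", nil)
}
```

Backoff on failed requests could be measured the same way, by observing the delay applied before each retry.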

@timothysc (Member) commented:

Is this really a doc we're going to merge, or more of a hit list of action items?

@jayunit100 (Member) commented:

cc @jimmidyson

@gmarek (Contributor, Author) commented Dec 11, 2015

@timothysc I think we should merge it, with the intent of turning it into a comprehensive monitoring doc once most of the work is done. I'd also want to keep the 'postmortem' part of it.

@k8s-bot commented Dec 11, 2015

GCE e2e test build/test passed for commit bb82299.

@wojtek-t (Member) commented:
This generally LGTM.
Let's merge it and we can update it if needed later.

@wojtek-t added lgtm and e2e-not-required labels on Dec 14, 2015
@k8s-github-robot commented:

@k8s-bot test this

Tests are more than 48 hours old. Re-running tests.

@k8s-bot commented Dec 14, 2015

GCE e2e build/test failed for commit bb82299.

@gmarek (Contributor, Author) commented Dec 14, 2015

@k8s-bot test this

@k8s-github-robot commented Dec 14, 2015

Automatic merge from submit-queue

@k8s-github-robot pushed a commit referencing this pull request on Dec 14, 2015: "Auto commit by PR queue bot"
@k8s-github-robot merged commit 0fba3e4 into kubernetes:master on Dec 14, 2015
@k8s-bot commented Dec 14, 2015

GCE e2e test build/test passed for commit bb82299.

@gmarek deleted the doc branch on March 17, 2016