
Add a proposal for monitoring cluster performance #18020

Merged 1 commit into kubernetes:master on Dec 14, 2015

Conversation

@gmarek (Contributor) commented Dec 1, 2015

@gmarek added kind/documentation and sig/scalability labels on Dec 1, 2015
@gmarek (Contributor, Author) commented Dec 1, 2015

This is a doc which I hope to fill in with details gathered from feedback.

@k8s-github-robot added kind/design and size/L labels on Dec 1, 2015
@k8s-github-robot

Labelling this PR as size/L

@k8s-bot commented Dec 1, 2015

GCE e2e test build/test passed for commit ce10566b30f7e282a1d7ebd4fc1f57792e1b444e.

@k8s-bot commented Dec 1, 2015

GCE e2e test build/test passed for commit 250b218cf931f69fae22a7d50647b472b8051942.

@hongchaodeng (Contributor) commented:
Thanks @gmarek for raising this issue. This is super helpful for performance and scalability improvement work!

IMHO, there are a couple of metrics that would be very useful in our performance debugging:

  • Queue length. There are queues (cache.FIFO) used in a few places, mostly as buffers. The length of such a buffer gives great insight: if it keeps accumulating items, the downstream consumer is saturated. For example, in our scheduler testing we added a Len() to cache.FIFO and printed the PodQueue length in `NextPod: func() *api.Pod {` (see the sketch after this list).
  • WaitGroup wait time. It waits! If there is a long tail, a metric on how long it waits could show you something. For example, while debugging performance we printed out how long it waits.
  • Scheduler metrics. We currently have metrics for scheduler e2e latency. Since we found the scheduler taking 60-200ms in a 1k-node cluster, we need more insight into what takes so long. @xiang90 did some benchmarking and profiling and found a few things. Meanwhile, we expect more fine-grained scheduler metrics to prevent regressions.
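To make the first bullet concrete, here is a minimal sketch (Go with Prometheus, not actual Kubernetes code) of exposing a FIFO queue's length as a scrapeable gauge rather than a log line. The metric name `scheduler_pending_pods`, the `lenQueue` interface, and `recordQueueLength` are illustrative assumptions; cache.FIFO did not expose a `Len()` at the time, which is why one had to be added.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pendingPods mirrors the idea of printing the PodQueue length in NextPod,
// but as a gauge a monitoring system can scrape.
var pendingPods = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "scheduler_pending_pods", // hypothetical metric name
	Help: "Number of pods waiting in the scheduler's FIFO queue.",
})

// lenQueue stands in for a cache.FIFO with an added Len() method.
type lenQueue interface {
	Len() int
}

// recordQueueLength would be called wherever the queue is popped (e.g. in
// NextPod); a steadily growing value means the downstream consumer is saturated.
func recordQueueLength(q lenQueue) {
	pendingPods.Set(float64(q.Len()))
}

func main() {
	prometheus.MustRegister(pendingPods)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```

A gauge like this makes the "buffer keeps growing" signal visible on a dashboard instead of only in test logs.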

Thanks again!


Review thread on the following excerpt from the proposal:

Issue https://github.com/kubernetes/kubernetes/issues/14216 was opened because @spiffxp observed a regression in scheduler performance in the 1.1 branch in comparison to the `old` 1.0 cut. In the end it turned out to be caused by the `--v=4` (instead of `--v=2`) flag in the scheduler together with the flag `--alsologtostderr`, which disables batching of log lines. This caused wired behavior of the whole component.
Review comment (Member):
s/wired/weird/

Review comment (Member):
Two comments:

  • while it was initially due to running with --v=4, it came up again due to some logging statements that had no V level applied and so couldn't be filtered away even with --v=1 (hence Ensure Logging Conventions Are Implemented #17449); see the sketch after this list
  • we were running with the default settings (equivalent to --logtostderr=true); I notice everyone keeps talking about --alsologtostderr=true, which is a non-default setting that only applies if you're also running with a non-default --log-dir=some-nonempty-string... point being, running with the defaults is what led us here, but everyone seems to be implying they use non-default settings; should the defaults be changed?
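As a small illustration of the first point, here is a minimal sketch using the github.com/golang/glog package that Kubernetes components used at the time: an unleveled glog.Infof call is emitted regardless of --v, while a glog.V(4) call is dropped unless verbosity is at least 4, which is why statements without a V level cannot be filtered away. The message strings are invented for the example.

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	// glog registers -v, -logtostderr, -alsologtostderr, -log_dir, etc. on the
	// default flag set.
	flag.Parse()

	// Always emitted, whatever -v is set to; statements like this are what
	// #17449 asks to convert to leveled calls.
	glog.Infof("scheduled pod %q", "example-pod")

	// Emitted only when run with -v=4 or higher.
	glog.V(4).Infof("detailed scheduling trace for pod %q", "example-pod")

	glog.Flush()
}
```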

Review comment from @gmarek (Contributor, Author):

You're right - I'm just used to the --alsologtostderr flag.

@bgrant0607 assigned a-robinson and unassigned bgrant0607 on Dec 1, 2015
@ncdc (Member) commented Dec 2, 2015

cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability

@gmarek (Contributor, Author) commented Dec 2, 2015

@hongchaodeng - thanks for the comment. I incorporated the first two suggestions.

As for the scheduler - I completely agree that it's probably the most problematic component right now (I don't know if @wojtek-t agrees, but that's my observation from running experiments on big clusters), but I want to be more specific about what to measure. The scheduler being the biggest problem is a new situation for us, which means we're getting better :) Please let me know what bottlenecks you find and what metrics should be in place to avoid them in the future.

@k8s-bot commented Dec 2, 2015

GCE e2e build/test failed for commit e8b32b1b3f0def9d4364f22db022614fd300d938.


Review thread on the following excerpt from the proposal:

### Rate limit monitoring

Reverse of REST call monitoring done in the API server. We need to know when a given component increases the pressure it puts on the API server. As a proxy for number of
Review comment (Member):

We've also added backoff metrics.

Review comment from @gmarek (Contributor, Author):

What do you mean by the backoff metrics?
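To ground the excerpt and the backoff question, here is a minimal sketch of what client-side rate-limit monitoring could look like: a histogram of how long outgoing requests spend blocked in a client-side rate limiter before being sent, so that a component putting growing pressure on the API server shows up as a rising wait-time tail. The metric name, `throttledDo`, and the use of golang.org/x/time/rate are illustrative assumptions, not the actual Kubernetes client code.

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"golang.org/x/time/rate"
)

var rateLimitWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "client_rate_limiter_wait_seconds", // hypothetical metric name
	Help:    "Time outgoing requests spend blocked in the client-side rate limiter.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 12),
})

// throttledDo blocks until the limiter admits the request and records how long
// that took; a saturated limiter (the "pressure" the proposal wants to observe)
// shows up as a growing tail in the histogram.
func throttledDo(ctx context.Context, limiter *rate.Limiter, req *http.Request) (*http.Response, error) {
	start := time.Now()
	if err := limiter.Wait(ctx); err != nil {
		return nil, err
	}
	rateLimitWait.Observe(time.Since(start).Seconds())
	return http.DefaultClient.Do(req)
}

func main() {
	prometheus.MustRegister(rateLimitWait)
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9091", nil)
}
```

Backoff on failed requests could be measured the same way, by observing the delay applied before each retry.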

@timothysc (Member) commented:

Is this really a doc we're going to merge, or more of a hit list of action items?

@jayunit100 (Member) commented:

cc @jimmidyson

@gmarek (Contributor, Author) commented Dec 11, 2015

@timothysc I think we should merge it, with the intent of turning it into a comprehensive monitoring doc once most of the work is done. I'd also want to keep the 'postmortem' part of it.

@k8s-bot commented Dec 11, 2015

GCE e2e test build/test passed for commit bb82299.

@wojtek-t (Member) commented:
This generally LGTM.
Let's merge it and we can update it if needed later.

@wojtek-t added lgtm and e2e-not-required labels on Dec 14, 2015
@k8s-github-robot commented:

@k8s-bot test this

Tests are more than 48 hours old. Re-running tests.

@k8s-bot commented Dec 14, 2015

GCE e2e build/test failed for commit bb82299.

@gmarek (Contributor, Author) commented Dec 14, 2015

@k8s-bot test this

@k8s-github-robot commented Dec 14, 2015

Automatic merge from submit-queue

@k8s-github-robot pushed a commit referencing this pull request on Dec 14, 2015: "Auto commit by PR queue bot"
@k8s-github-robot merged commit 0fba3e4 into kubernetes:master on Dec 14, 2015
@k8s-bot commented Dec 14, 2015

GCE e2e test build/test passed for commit bb82299.

@gmarek deleted the doc branch on March 17, 2016