
99%ile end-to-end pod startup time w/ prepulled images < 5s on 100 node, 3000 pod cluster; linear time to # nodes and pods #3954

Closed
alex-mohr opened this issue Jan 29, 2015 · 123 comments
Labels
area/apiserver, priority/important-soon, sig/scalability, sig/scheduling

Comments

@alex-mohr
Contributor

@davidopp Scheduler throughput is ~1 pod per second; should be >> 1 pod/sec.

In 0.8.1, part of #3936, I quickly created 10 kubernetes/pause pods with unique names by running a bunch of simultaneous command lines. I then repeatedly ran pods list, saw all 10 pods, and watched as the hostname slowly filled in at a rate of ~1 pod per second.

% gcloud preview container pods --cluster=${CLUSTER} create --name=$POD --image=kubernetes/pause --port=${PORT} &

@alex-mohr alex-mohr added the priority/backlog, sig/scheduling, and area/apiserver labels on Jan 29, 2015
@alex-mohr
Contributor Author

FWIW, repeating the experiment with 30 pause pods required 53 seconds until the last pod went from unassigned to having a hostname; 50 pods took ~90 seconds.

@alex-mohr alex-mohr added the priority/important-soon, area/usability, and area/hosting labels and removed the priority/backlog label on Jan 29, 2015
@alex-mohr
Contributor Author

Changing my mind -- P1, both as a usability issue and as something that makes it hard to find the reliability bottlenecks for clusters that are used over time.

@brendandburns
Contributor

I think a more effective data measurement is the latency distribution, rather than the throughput. I sincerely doubt that we're going to have a need for high throughput. But if the latency is high (or variable) we should fix that.

--brendan

@alex-mohr
Contributor Author

@brendandburns Latency to reporting Running for a single pod is also high (#3952), but that appears not to be attributable to the scheduler.

Re: throughput, unless you're scheduling only 1 pod at a time, it manifests as latency. That means our unit tests will run more slowly, developers take longer to test things, usability is worse, and testing how a cluster does with a cumulative total of N containers launched takes a long time. It'd be nice if we could run something like a soak test faster than waiting days/weeks of real time.

@brendandburns brendandburns self-assigned this Feb 2, 2015
@bgrant0607 bgrant0607 added the sig/scalability label and removed the area/usability label on Feb 5, 2015
@davidopp davidopp assigned davidopp and unassigned brendandburns Feb 6, 2015
@davidopp
Member

davidopp commented Feb 6, 2015

At the 1.0 roadmap meeting the following requirement was decided for 1.0: "99% of scheduling decisions made in less than 1s on 100 node, 3000 pod clusters; linear time to number of pods and nodes"

@davidopp
Member

There was some interest expressed today in also having a latency goal.

@lavalamp
Member

lavalamp commented Mar 6, 2015

I just did 100 pods; it took 4.5 minutes for all of them to schedule. HOWEVER, it is unclear whether this is due to the scheduler being slow or the apiserver being slow; I expect that if I profile the scheduler I will find it spends most of its time waiting.

@yujuhong
Contributor

yujuhong commented Mar 7, 2015

Out of curiosity, I ran a simple test using a local cluster, where the only node is localhost. With 4 threads each creating 25 pods (i.e. 100 pods), the total pod creation time + scheduling time (i.e., until the host field is set) is ~1 minute. This is probably close to the maximum performance a single-threaded scheduler can achieve in the current implementation.

@davidopp, what machine/cluster setup should we use for the 1.0 requirement?
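
For reference, a minimal sketch of this kind of parallel-creation experiment, written against today's client-go rather than the 2015 tooling (the kubeconfig path, namespace, and pod names are placeholders, not the original test code): a few goroutines create kubernetes/pause pods, then we poll until every pod has spec.nodeName set, which roughly corresponds to "the host field is set" above.

```go
// Illustrative sketch only (modern client-go, placeholder names), not the original test.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ns, workers, perWorker := "default", 4, 25 // 4 threads x 25 pods, as in the comment above

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				pod := &corev1.Pod{
					ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("pause-%d-%d", w, i)},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{Name: "pause", Image: "kubernetes/pause"}},
					},
				}
				if _, err := client.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
					fmt.Println("create failed:", err)
				}
			}
		}(w)
	}
	wg.Wait()

	// Poll until every pod has been bound to a node (spec.nodeName is non-empty).
	for {
		pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			panic(err)
		}
		unbound := 0
		for _, p := range pods.Items {
			if p.Spec.NodeName == "" {
				unbound++
			}
		}
		if unbound == 0 {
			break
		}
		time.Sleep(time.Second)
	}
	fmt.Println("creation + scheduling took", time.Since(start))
}
```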

@davidopp
Member

davidopp commented Mar 7, 2015

The setup we had decided on earlier was "100 node, 3000 pod clusters"
So I think doing it on one node is cheating. :)

@yujuhong
Contributor

yujuhong commented Mar 7, 2015

Yes, it is cheating, but it's still a good reference point :)

@lavalamp, did you create a 100-node cluster for your test? It'd be a pain to wait for the cluster bring-up/tear-down with so many nodes. Or did you mock-out the nodes...?

@davidopp
Member

davidopp commented Mar 7, 2015

@yujuhong Yes I agree, and in fact trying the same number of pods at different cluster sizes is helpful to try to pinpoint where the slowdowns are coming from.

@lavalamp
Member

lavalamp commented Mar 7, 2015

This was on a two node (+ master VM) cluster (so 100 pods per node, using the kubernetes/pause image).

@lavalamp
Member

Preliminary result from running with #5320 is that scheduler performance is much better with boundPods gone (the density test ran 2x as fast). The ratio of the apiserver's CPU usage to the scheduler's is about 9:1...

@gmarek
Contributor

gmarek commented Jun 17, 2015

Yes, I did it yesterday. It looks better than I expected, as it turned out that I was "lucky" when picking Pods to analyze. The statistics for a saturated cluster are: 50% around 3s, 90% around 6s, 99% around 9s for scheduled-to-running, and 6/10/13 for scheduled-to-watch. These times are measured from when the Pod is scheduled, so they do not include scheduler latency (assuming the watch latency is similar to what I observe for e2e, that is at least 3/5/6s).

I'll run multiple tests for a saturated cluster to check the variance of the results, and a sequential test to approximate scheduler latencies.

@bprashanth
Contributor

Yes, I did it yesterday. It looks better than I expected, as it turned out that I was "lucky" when picking Pods to analyze.

We also got a couple of changes in recently that should help, and if you ran the test yesterday it should have them: /stats no longer hit docker, and we set the right OOM score on cluster addons so nodes don't die (previously, if a node went down and came back up, it would take the kubelet/docker a good minute to reinitialize things). We're also setting a cgroup limit on docker to avoid cases where it hogs memory and makes everything else really slow (#9881).

@gmarek
Contributor

gmarek commented Jun 17, 2015

@bprashanth good to know. Statistics are starting to look promising.
TL;DR: The variance of results for a saturated cluster is pretty big. Schedule latencies are what can be expected when scheduling 100 new pods with a 10 pods/s rate limit in place. As for the Kubelet, it looks quite OK, i.e. it scales ~linearly with the number of Pods, but for a saturated cluster it takes too much time to start a new Pod. Also, watch is slow-ish (usually), with very high variance.

I suggest defining our goal in terms of the latency between schedule time and container start time (excluding schedule latency, because of rate limiting, and the end-user watch). We're already accounting for one watch latency (scheduler->kubelet), and we have only limited control over how efficient the etcd watch is.

There are a few timestamps I look at:
(1) Pod creation time (time when RC created the Pod first time)
(2) Pod schedule time (time when scheduler wrote an Event for Pod scheduling)
(3) Pod startup time (time when all Pods containers are running)
(4) Pod watch time (time when external watch notices that Pod containers are running)

to-schedule := (2) - (1)
from-schedule := (3) - (2)
watch := (4) - (3)
schedule-watch := (4) - (2)
total := (4) - (1)

Current results are:

metric                    3      10     20     30
1perc50 to-schedule       5      5      5      5
1perc90 to-schedule       7      7      7      7
1perc99 to-schedule       8      8      8      7
1perc50 from-schedule     1      1      2      3
1perc90 from-schedule     1      2      4      6
1perc99 from-schedule     2      3      7      9
1perc50 watch             2.4    0.88   0.88   3.36
1perc90 watch             3.56   1.36   1.45   4.83
1perc99 watch             5.25   4.28   1.99   5.83
1perc50 schedule-watch    3.23   2.09   2.85   6.36
1perc90 schedule-watch    4.25   2.94   5.18   10.16
1perc99 schedule-watch    7.25   5.28   8.69   12.98
1perc50 total             7.82   7.2    8.28   11.16
1perc90 total             10.24  9.4    10.65  15.44
1perc99 total             11.56  9.94   12.78  18.98
2perc50 to-schedule       5      4      4      5
2perc90 to-schedule       7      7      7      8
2perc99 to-schedule       8      8      7      8
2perc50 from-schedule     0      1      2      2
2perc90 from-schedule     1      2      4      4
2perc99 from-schedule     1      4      7      5
2perc50 watch             2.27   0.88   0.94   0.87
2perc90 watch             3.48   1.32   1.35   1.47
2perc99 watch             3.83   2.24   1.97   1.89
2perc50 schedule-watch    2.83   1.91   2.87   2.96
2perc90 schedule-watch    4.13   2.82   5      4.62
2perc99 schedule-watch    4.83   5.87   8.19   6.75
2perc50 total             7.48   6.36   7.85   8.21
2perc90 total             10.13  8.64   10.28  10.61
2perc99 total             11.65  9.4    14.19  13.75
3perc50 to-schedule       5      5      5      6
3perc90 to-schedule       7      8      7      8
3perc99 to-schedule       7      9      8      9
3perc50 from-schedule     1      1      2      3
3perc90 from-schedule     1      2      4      5
3perc99 from-schedule     2      4      6      9
3perc50 watch             2.49   0.9    0.9    0.8
3perc90 watch             3.49   1.44   1.3    1.38
3perc99 watch             4.1    2.04   2.22   2.43
3perc50 schedule-watch    2.77   2.11   2.97   3.86
3perc90 schedule-watch    3.79   3.14   4.89   6.27
3perc99 schedule-watch    4.79   4.69   7.14   10.61
3perc50 total             8.1    7.14   8.43   9.43
3perc90 total             9.49   10.01  10.69  12.79
3perc99 total             10.79  11.69  12.92  15.24
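
As a side note, here is a small self-contained Go sketch of how the intervals defined above (to-schedule, from-schedule, watch, schedule-watch, total) and their percentiles can be computed; the struct and function names are invented for the sketch, and the actual test harness code that collects the timestamps is not shown.

```go
// Illustrative only: derive the intervals defined above from four per-pod timestamps
// and print percentiles. Types and names are invented for the sketch.
package main

import (
	"fmt"
	"sort"
	"time"
)

type podTimes struct {
	created   time.Time // (1) RC created the pod
	scheduled time.Time // (2) scheduler wrote the scheduling event
	started   time.Time // (3) all of the pod's containers are running
	watched   time.Time // (4) an external watch noticed the containers running
}

func percentile(ds []time.Duration, p float64) time.Duration {
	if len(ds) == 0 {
		return 0
	}
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	return ds[int(float64(len(ds)-1)*p)]
}

func report(name string, ds []time.Duration) {
	fmt.Printf("%-15s perc50: %v  perc90: %v  perc99: %v\n",
		name, percentile(ds, 0.50), percentile(ds, 0.90), percentile(ds, 0.99))
}

func main() {
	var pods []podTimes // filled in by the test harness (not shown)

	var toSchedule, fromSchedule, watch, scheduleWatch, total []time.Duration
	for _, p := range pods {
		toSchedule = append(toSchedule, p.scheduled.Sub(p.created))
		fromSchedule = append(fromSchedule, p.started.Sub(p.scheduled))
		watch = append(watch, p.watched.Sub(p.started))
		scheduleWatch = append(scheduleWatch, p.watched.Sub(p.scheduled))
		total = append(total, p.watched.Sub(p.created))
	}
	report("to-schedule", toSchedule)
	report("from-schedule", fromSchedule)
	report("watch", watch)
	report("schedule-watch", scheduleWatch)
	report("total", total)
}
```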

@bprashanth
Contributor

My 2c

  1. Watch latency: The variance here goes back to our 'all apiserver operations need to return in 1s' goal. We're clearly not meeting it for watch (time between mutation and watch notification < 1s). We have optimizations planned post 1.0, but I don't think it'll get better before that.
  2. Scheduling latency: if we evolve the test to do what we initially discussed (schedule 1 pod instead of an rc with 100) the scheduling latency should drop to 0.

@gmarek
Contributor

gmarek commented Jun 17, 2015

I ran a serialized version of the test and for some reason it's even worse (12/22/25). It may be a bug in the modified test, but I haven't managed to look into it yet. What's worse, I'll be at conferences for the next week and a half, so I won't have much time to dig in.

@bprashanth
Contributor

cc @bgrant0607 who was curious about why we needed StartTime to measure watch latency of specific pods when we have /metrics (spinoff from #10066).

Pure Prometheus metrics aren't as helpful for this benchmark because we need to bucket them, and it's hard to isolate the last 100 pods that we start up after we've set up a 3000-pod cluster.

@bgrant0607
Member

Why won't a sufficiently small percentile work? Do we expect the first 3000 pods to have slower startup time than the last 100?

@bprashanth
Contributor

The first 3000 are known to have slow startup times and to put enough duress on the apiserver to also have slow watch times (docker/etcd suffer under that initial 3000-pod load).

@bgrant0607
Member

Do we have reason to believe that Kubelet watch behavior is different than that for other watch clients?

@bprashanth
Contributor

That's what we're doing right now (setting up an external watcher to measure latency), but this is to debug after we've observed a slow startup, e.g. CPU/memory pressure on the kubelet can certainly lead to the watch goroutine just not running in some cases (that's what was happening with #9788). It's not that we can't get by without StartTime, just that if there were a field with this specific timestamp, it would be useful.
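
To make the "external watcher" concrete, here is a rough sketch written against today's client-go rather than the 2015 e2e framework (the kubeconfig path and namespace are placeholders): it records the wall-clock time at which the watcher first observes each pod in phase Running, which can then be compared against the kubelet-reported container start times to estimate watch lag.

```go
// Minimal external-watcher sketch (modern client-go, placeholder config), not the e2e code.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	seenRunning := map[string]time.Time{}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		if pod.Status.Phase == corev1.PodRunning {
			if _, seen := seenRunning[pod.Name]; !seen {
				// Comparing this timestamp with the kubelet-reported container start
				// times gives an estimate of the watch lag for this pod.
				seenRunning[pod.Name] = time.Now()
				fmt.Printf("%s observed Running at %v\n", pod.Name, seenRunning[pod.Name])
			}
		}
	}
}
```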

@bgrant0607
Member

Another idea: Kill Kubelet after creating the 3000 pods. That would reset any of its monitoring stats.

@bprashanth
Contributor

Haha, yeah, restarting things would reset stats. It's something I'd consider as a last resort (because restarting kubelet will do a lot more -- it would lead to a thundering herd of relist etc).

@gmarek
Contributor

gmarek commented Jun 19, 2015

@bgrant0607 FYI we currently have data redundancy in the form of Pod.StartTime and the maximum over the pod's containers of the container status Running.StartedAt, which are equal (up to a few milliseconds, truncated to seconds).
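
In today's API field paths, a sketch of that comparison might look like the following (the package name is arbitrary; this is only an illustration of the redundancy, not code from the test):

```go
// Illustrative sketch of the redundancy described above (arbitrary package name).
package podstatus

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// latestContainerStart returns the most recent Running.StartedAt across the pod's
// container statuses; per the comment above it should match pod.Status.StartTime to
// within a few milliseconds (both truncated to seconds).
func latestContainerStart(pod *corev1.Pod) metav1.Time {
	var latest metav1.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Running != nil && latest.Before(&cs.State.Running.StartedAt) {
			latest = cs.State.Running.StartedAt
		}
	}
	return latest
}
```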

@piosz
Member

piosz commented Jun 19, 2015

I've run an experiment similar to what @gmarek did, but with a pod creation limit of 5 pods/sec (#10020), and the results are much better:

metric                    3      10     20     30
1perc50 to-schedule       0      0      0      0
1perc90 to-schedule       1      0      1      1
1perc99 to-schedule       1      1      1      1
1perc50 from-schedule     1      1      2      2
1perc90 from-schedule     1      2      3      3
1perc99 from-schedule     1      3      5      5
1perc50 watch             0.67   0.70   0.69   0.71
1perc90 watch             1.06   1.07   1.04   1.03
1perc99 watch             2.72   1.51   1.48   1.93
1perc50 schedule-watch    1.20   1.83   2.75   2.81
1perc90 schedule-watch    1.69   2.53   4.01   4.03
1perc99 schedule-watch    2.72   4.51   6.08   6.93
1perc50 total             1.43   1.85   2.87   2.93
1perc90 total             1.81   2.53   4.14   4.05
1perc99 total             3.72   4.51   6.08   6.93
2perc50 total             1.46   1.99   2.80   4.21
2perc90 total             1.94   2.82   4.07   6.75
2perc99 total             3.12   4.27   5.67   8.52
3perc50 total             1.42   1.97   2.91   4.11
3perc90 total             1.90   3.08   4.60   7.06
3perc99 total             5.26   4.26   7.40   9.73

It seems that the results get worse with every run, even though the cluster appears to be cleaned up correctly. This needs to be verified.

@piosz
Member

piosz commented Jun 19, 2015

I re-ran the test on a fresh cluster and the total e2e pod startup time is as follows:
perc50: 2.41s, perc90: 3.10s, perc99: 5.14s

So we actually meet the goal :)
@davidopp @fgrzadkowski @wojtek-t

In the earlier run there were events left over from previous runs.

@gmarek
Contributor

gmarek commented Jun 19, 2015

I wouldn't trust a single run - watch time variance is really high, and we still have two watches along the way.

@bprashanth
Contributor

I don't think we need to take the last watch into account to measure 'time to running'. The watch lag is useful to have as a metric, but the createRunningPod routine itself (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/test/e2e/density.go#L295) knows when the pod is running via Get. I feel like this simulates a user waiting for a Running pod better than a watch does. @davidopp do we think users will actually be watching for the Running pod?
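
For illustration, a rough sketch of that "wait for Running via Get" approach, written against today's client-go (not the actual density.go helper; the function name, poll interval, and package name are arbitrary). Polling like this measures what a user running repeated Gets would see, without including the final watch hop discussed above.

```go
// Sketch of polling Get until the pod reports phase Running; not the density.go code.
package waiting

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForRunning polls the API server until the named pod is Running or the timeout expires.
func waitForRunning(client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err == nil && pod.Status.Phase == corev1.PodRunning {
			return nil
		}
		time.Sleep(2 * time.Second) // poll interval; tune as needed
	}
	return fmt.Errorf("pod %s/%s not running after %v", ns, name, timeout)
}
```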

@piosz
Member

piosz commented Jun 19, 2015

I definitely agree with you @gmarek that we need more runs. I can do it on Monday. However the trend seems to be right.

@davidopp
Member

@davidopp do we think users will actually be watching for the Running pod?

Sometimes. But I guess if they poll the API server they won't hit that final watch latency, and the most common use case right now is likely to be the user sitting at the shell prompt manually running 'get pods' over and over again. So measuring without the final watch seems OK.

@piosz
Member

piosz commented Jun 22, 2015

I ran the test several times and the results seem to be pretty stable. @wojtek-t is going to enable the test on Jenkins by default.

  • perc50: 2.46s, perc90: 3.42s, perc99: 5.57s
  • perc50: 2.49s, perc90: 4.09s, perc99: 5.56s
  • perc50: 2.36s, perc90: 3.47s, perc99: 4.78s
  • perc50: 2.37s, perc90: 3.10s, perc99: 3.95s
  • perc50: 2.33s, perc90: 3.82s, perc99: 5.94s

@davidopp
Member

I'm going to move this to 1.0-post since we at least have manually-collected numbers (see previous entry). @wojtek-t can you please update this issue when these numbers are being collected in Jenkins (and give instructions on how to pull the numbers out)?

@davidopp davidopp modified the milestones: v1.0-post, v1.0 Jun 23, 2015
@davidopp
Member

BTW @piosz, just to verify, the numbers you collected above were including the final watch to the user?

@piosz
Member

piosz commented Jun 23, 2015

The results are printed to stdout. Something like this:

INFO: 10% worst e2e total latencies: [{density-latency-pod-2c70d718-19b5-11e5-a288-42010af01555-41 e2e-scalability-minion-wd0b 3.045526765s} ... ]
INFO: perc50: 2.236848859s, perc90: 3.045526765s, perc99: 4.266575615s

The above logs come from http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-scalability/1535/consoleFull

I measured the metric described here #3954 (comment) as total.

@davidopp
Member

Great, thanks, closing this issue. Thanks everyone for all the work on this! It's great that in the end we basically met both the scheduling time goal and the API latency goal!
