
99%ile end-to-end pod startup time w/ prepulled images < 5s on 100 node, 3000 pod cluster; linear time to # nodes and pods #3954

Closed
alex-mohr opened this issue Jan 29, 2015 · 123 comments
Labels
area/apiserver, priority/important-soon, sig/scalability, sig/scheduling

Comments

@alex-mohr
Contributor

@davidopp Scheduler throughput is ~1 pod per second; should be >> 1 pod/sec.

In 0.8.1, part of #3936, I quickly created 10 kubernetes/pause pods with unique names by running a bunch of simultaneous command lines. I then repeatedly ran pods list, saw all 10 pods, and watched as the hostname slowly filled in at a rate of ~1 pod per second.

% gcloud preview container pods --cluster=${CLUSTER} create --name=$POD --image=kubernetes/pause --port=${PORT} &

@alex-mohr alex-mohr added the priority/backlog, sig/scheduling, and area/apiserver labels on Jan 29, 2015
@alex-mohr
Contributor Author

FWIW, repeating the experiment with 30 pause pods required 53 seconds until the last pod went from unassigned to having a hostname; 50 pods took ~90 seconds.

@alex-mohr alex-mohr added the priority/important-soon, area/usability, and area/hosting labels and removed the priority/backlog label on Jan 29, 2015
@alex-mohr
Contributor Author

Changing my mind -- P1, both as a usability issue and as something that makes it hard to find the reliability bottlenecks for clusters that are used over time.

@brendandburns
Contributor

I think a more effective data measurement is the latency distribution, rather than the throughput. I sincerely doubt that we're going to have a need for high throughput. But if the latency is high (or variable) we should fix that.

--brendan

@alex-mohr
Contributor Author

@brendandburns Latency to reporting Running for a single pod is also high (#3952), but that appears not to be attributable to the scheduler.

Re: throughput, unless you're scheduling only 1 pod at a time, it manifests as latency. That means our unit tests will run more slowly, developers take longer to test things, usability is worse, and testing how a cluster does with a cumulative total of N containers launched takes a long time. It'd be nice if we could run something like a soak test faster than waiting days/weeks of real time.

@brendandburns brendandburns self-assigned this Feb 2, 2015
@bgrant0607 bgrant0607 added the sig/scalability label and removed the area/usability label on Feb 5, 2015
@davidopp davidopp assigned davidopp and unassigned brendandburns Feb 6, 2015
@davidopp
Member

davidopp commented Feb 6, 2015

At the 1.0 roadmap meeting the following requirement was decided for 1.0: "99% of scheduling decisions made in less than 1s on 100 node, 3000 pod clusters; linear time to number of pods and nodes"

@davidopp
Member

There was some interest expressed today in also having a latency goal.

@lavalamp
Member

lavalamp commented Mar 6, 2015

I just did 100 pods; it took 4.5 minutes for all of them to schedule. HOWEVER, it is unclear whether this is due to the scheduler being slow or the apiserver being slow; I expect that if I profile the scheduler I will find it spends most of its time waiting.

@yujuhong
Contributor

yujuhong commented Mar 7, 2015

Out of curiosity, I ran a simple test using a local cluster, where the only node is localhost. With 4 threads each creating 25 pods (i.e. 100 pods), the total pod creation time + scheduling time (i.e., until the host field is set) is ~1 minute. This is probably close to the maximum performance a single-threaded scheduler can achieve in the current implementation.

@davidopp, what machine/cluster setup should we use for the 1.0 requirement?
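
For reference, a minimal sketch of this kind of parallel-creation experiment, written against today's client-go rather than the 2015 tooling (the kubeconfig path, namespace, and pod names are placeholders, not the original test code): a few goroutines create kubernetes/pause pods, then we poll until every pod has spec.nodeName set, which roughly corresponds to "the host field is set" above.

```go
// Illustrative sketch only (modern client-go, placeholder names), not the original test.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ns, workers, perWorker := "default", 4, 25 // 4 threads x 25 pods, as in the comment above

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				pod := &corev1.Pod{
					ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("pause-%d-%d", w, i)},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{Name: "pause", Image: "kubernetes/pause"}},
					},
				}
				if _, err := client.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
					fmt.Println("create failed:", err)
				}
			}
		}(w)
	}
	wg.Wait()

	// Poll until every pod has been bound to a node (spec.nodeName is non-empty).
	for {
		pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			panic(err)
		}
		unbound := 0
		for _, p := range pods.Items {
			if p.Spec.NodeName == "" {
				unbound++
			}
		}
		if unbound == 0 {
			break
		}
		time.Sleep(time.Second)
	}
	fmt.Println("creation + scheduling took", time.Since(start))
}
```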

@davidopp
Member

davidopp commented Mar 7, 2015

The setup we had decided on earlier was "100 node, 3000 pod clusters"
So I think doing it on one node is cheating. :)

@yujuhong
Contributor

yujuhong commented Mar 7, 2015

Yes, it is cheating, but it's still a good reference point :)

@lavalamp, did you create a 100-node cluster for your test? It'd be a pain to wait for the cluster bring-up/tear-down with so many nodes. Or did you mock-out the nodes...?

@davidopp
Member

davidopp commented Mar 7, 2015

@yujuhong Yes I agree, and in fact trying the same number of pods at different cluster sizes is helpful to try to pinpoint where the slowdowns are coming from.

@lavalamp
Member

lavalamp commented Mar 7, 2015

This was on a two node (+ master VM) cluster (so 100 pods per node, using the kubernetes/pause image).

@lavalamp
Member

Preliminary result from running with #5320 is that scheduler performance is much better with boundPods gone (the density test ran 2x as fast). The ratio of the apiserver's CPU usage to the scheduler's is about 9:1...

@gmarek
Contributor

gmarek commented Jun 17, 2015

Yes, I did it yesterday. It looks better than I expected, as it turned out that I was "lucky" when picking Pods to analyze. The statistics for a saturated cluster are: 50% around 3s, 90% around 6s, 99% around 9s for scheduled-to-running, and 6/10/13 for scheduled-to-watch. These times are measured from when the Pod is scheduled, so they do not include scheduler latency (assuming the watch latency is similar to what I observe for e2e, that is at least 3/5/6s).

I'll run multiple tests for a saturated cluster to check the variance of the results, and a sequential test to approximate scheduler latencies.

@bprashanth
Contributor

Yes, I did it yesterday. It looks better than I expected, as it turned out that I was "lucky" when picking Pods to analyze.

We also got a couple of changes in recently that should help, and if you ran the test yesterday it should have them: /stats no longer hit docker, and we set the right OOM score on cluster addons so nodes don't die (previously, if a node went down and came back up, it would take the kubelet/docker a good minute to reinitialize things). We're also setting a cgroup limit on docker to avoid cases where it hogs memory and makes everything else really slow (#9881).

@gmarek
Contributor

gmarek commented Jun 17, 2015

@bprashanth good to know. Statistics are starting to look promising.
TL;DR: The variance of results for a saturated cluster is pretty big. Schedule latencies are what can be expected when scheduling 100 new pods with a 10 pods/s rate limit in place. As for the Kubelet, it looks quite OK, i.e. it scales ~linearly with the number of Pods, but for a saturated cluster it takes too much time to start a new Pod. Also, watch is slow-ish (usually), with very high variance.

I suggest defining our goal in terms of the latency between schedule time and container start time (excluding schedule latency, because of rate limiting, and the end-user watch). We're already accounting for one watch latency (scheduler->kubelet), and we have only limited control over how efficient the etcd watch is.

There are a few timestamps I look at:
(1) Pod creation time (time when RC created the Pod first time)
(2) Pod schedule time (time when scheduler wrote an Event for Pod scheduling)
(3) Pod startup time (time when all Pods containers are running)
(4) Pod watch time (time when external watch notices that Pod containers are running)

to-schedule := (2) - (1)
from-schedule := (3) - (2)
watch := (4) - (3)
schedule-watch := (4) - (2)
total := (4) - (1)

Current results are:

metric                    3      10     20     30
1perc50 to-schedule       5      5      5      5
1perc90 to-schedule       7      7      7      7
1perc99 to-schedule       8      8      8      7
1perc50 from-schedule     1      1      2      3
1perc90 from-schedule     1      2      4      6
1perc99 from-schedule     2      3      7      9
1perc50 watch             2.4    0.88   0.88   3.36
1perc90 watch             3.56   1.36   1.45   4.83
1perc99 watch             5.25   4.28   1.99   5.83
1perc50 schedule-watch    3.23   2.09   2.85   6.36
1perc90 schedule-watch    4.25   2.94   5.18   10.16
1perc99 schedule-watch    7.25   5.28   8.69   12.98
1perc50 total             7.82   7.2    8.28   11.16
1perc90 total             10.24  9.4    10.65  15.44
1perc99 total             11.56  9.94   12.78  18.98
2perc50 to-schedule       5      4      4      5
2perc90 to-schedule       7      7      7      8
2perc99 to-schedule       8      8      7      8
2perc50 from-schedule     0      1      2      2
2perc90 from-schedule     1      2      4      4
2perc99 from-schedule     1      4      7      5
2perc50 watch             2.27   0.88   0.94   0.87
2perc90 watch             3.48   1.32   1.35   1.47
2perc99 watch             3.83   2.24   1.97   1.89
2perc50 schedule-watch    2.83   1.91   2.87   2.96
2perc90 schedule-watch    4.13   2.82   5      4.62
2perc99 schedule-watch    4.83   5.87   8.19   6.75
2perc50 total             7.48   6.36   7.85   8.21
2perc90 total             10.13  8.64   10.28  10.61
2perc99 total             11.65  9.4    14.19  13.75
3perc50 to-schedule       5      5      5      6
3perc90 to-schedule       7      8      7      8
3perc99 to-schedule       7      9      8      9
3perc50 from-schedule     1      1      2      3
3perc90 from-schedule     1      2      4      5
3perc99 from-schedule     2      4      6      9
3perc50 watch             2.49   0.9    0.9    0.8
3perc90 watch             3.49   1.44   1.3    1.38
3perc99 watch             4.1    2.04   2.22   2.43
3perc50 schedule-watch    2.77   2.11   2.97   3.86
3perc90 schedule-watch    3.79   3.14   4.89   6.27
3perc99 schedule-watch    4.79   4.69   7.14   10.61
3perc50 total             8.1    7.14   8.43   9.43
3perc90 total             9.49   10.01  10.69  12.79
3perc99 total             10.79  11.69  12.92  15.24
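
As a side note, here is a small self-contained Go sketch of how the intervals defined above (to-schedule, from-schedule, watch, schedule-watch, total) and their percentiles can be computed; the struct and function names are invented for the sketch, and the actual test harness code that collects the timestamps is not shown.

```go
// Illustrative only: derive the intervals defined above from four per-pod timestamps
// and print percentiles. Types and names are invented for the sketch.
package main

import (
	"fmt"
	"sort"
	"time"
)

type podTimes struct {
	created   time.Time // (1) RC created the pod
	scheduled time.Time // (2) scheduler wrote the scheduling event
	started   time.Time // (3) all of the pod's containers are running
	watched   time.Time // (4) an external watch noticed the containers running
}

func percentile(ds []time.Duration, p float64) time.Duration {
	if len(ds) == 0 {
		return 0
	}
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	return ds[int(float64(len(ds)-1)*p)]
}

func report(name string, ds []time.Duration) {
	fmt.Printf("%-15s perc50: %v  perc90: %v  perc99: %v\n",
		name, percentile(ds, 0.50), percentile(ds, 0.90), percentile(ds, 0.99))
}

func main() {
	var pods []podTimes // filled in by the test harness (not shown)

	var toSchedule, fromSchedule, watch, scheduleWatch, total []time.Duration
	for _, p := range pods {
		toSchedule = append(toSchedule, p.scheduled.Sub(p.created))
		fromSchedule = append(fromSchedule, p.started.Sub(p.scheduled))
		watch = append(watch, p.watched.Sub(p.started))
		scheduleWatch = append(scheduleWatch, p.watched.Sub(p.scheduled))
		total = append(total, p.watched.Sub(p.created))
	}
	report("to-schedule", toSchedule)
	report("from-schedule", fromSchedule)
	report("watch", watch)
	report("schedule-watch", scheduleWatch)
	report("total", total)
}
```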

@bprashanth
Contributor

My 2c

  1. Watch latency: The variance here goes back to our 'all apiserver operations need to return in 1s' goal. We're clearly not meeting it for watch (time between mutation and watch notification < 1s). We have optimizations planned post 1.0, but I don't think it'll get better before that.
  2. Scheduling latency: if we evolve the test to do what we initially discussed (schedule 1 pod instead of an rc with 100) the scheduling latency should drop to 0.

@gmarek
Contributor

gmarek commented Jun 17, 2015

I ran a serialized version of the test and for some reason it's even worse (12/22/25). It may be a bug in the modified test, but I haven't managed to look into it yet. What's worse, I'll be at conferences for the next week and a half, so I won't have much time to dig in.

@bprashanth
Contributor

cc @bgrant0607 who was curious about why we needed StartTime to measure watch latency of specific pods when we have /metrics (spinoff from #10066).

Pure Prometheus metrics aren't as helpful for this benchmark because we need to bucket them, and it's hard to isolate the last 100 pods that we start up after we've set up a 3000-pod cluster.

@bgrant0607
Member

Why won't a sufficiently small percentile work? Do we expect the first 3000 pods to have slower startup time than the last 100?

@bprashanth
Contributor

The first 3000 are known to have slow startup times and to put enough duress on the apiserver to also have slow watch times (docker/etcd suffer under that initial 3000-pod load).

@bgrant0607
Member

Do we have reason to believe that Kubelet watch behavior is different than that for other watch clients?

@bprashanth
Contributor

That's what we're doing right now (setting up an external watcher to measure latency), but this is to debug after we've observed a slow startup, e.g. CPU/memory pressure on the kubelet can certainly lead to the watch goroutine just not running in some cases (that's what was happening with #9788). It's not that we can't get by without StartTime, just that if there were a field with this specific timestamp, it would be useful.
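
To make the "external watcher" concrete, here is a rough sketch written against today's client-go rather than the 2015 e2e framework (the kubeconfig path and namespace are placeholders): it records the wall-clock time at which the watcher first observes each pod in phase Running, which can then be compared against the kubelet-reported container start times to estimate watch lag.

```go
// Minimal external-watcher sketch (modern client-go, placeholder config), not the e2e code.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	seenRunning := map[string]time.Time{}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		if pod.Status.Phase == corev1.PodRunning {
			if _, seen := seenRunning[pod.Name]; !seen {
				// Comparing this timestamp with the kubelet-reported container start
				// times gives an estimate of the watch lag for this pod.
				seenRunning[pod.Name] = time.Now()
				fmt.Printf("%s observed Running at %v\n", pod.Name, seenRunning[pod.Name])
			}
		}
	}
}
```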

@bgrant0607
Member

Another idea: Kill Kubelet after creating the 3000 pods. That would reset any of its monitoring stats.

@bprashanth
Contributor

Haha, yeah, restarting things would reset stats. It's something I'd consider as a last resort (because restarting kubelet will do a lot more -- it would lead to a thundering herd of relist etc).

@gmarek
Contributor

gmarek commented Jun 19, 2015

@bgrant0607 FYI we currently have data redundancy in the form of Pod.StartTime and the maximum over the pod's containers of the container status Running.StartedAt, which are equal (up to a few milliseconds, truncated to seconds).
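
In today's API field paths, a sketch of that comparison might look like the following (the package name is arbitrary; this is only an illustration of the redundancy, not code from the test):

```go
// Illustrative sketch of the redundancy described above (arbitrary package name).
package podstatus

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// latestContainerStart returns the most recent Running.StartedAt across the pod's
// container statuses; per the comment above it should match pod.Status.StartTime to
// within a few milliseconds (both truncated to seconds).
func latestContainerStart(pod *corev1.Pod) metav1.Time {
	var latest metav1.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Running != nil && latest.Before(&cs.State.Running.StartedAt) {
			latest = cs.State.Running.StartedAt
		}
	}
	return latest
}
```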

@piosz
Member

piosz commented Jun 19, 2015

I've run an experiment similar to what @gmarek did, but with a pod creation limit of 5 pods/sec (#10020), and the results are much better:

metric                    3      10     20     30
1perc50 to-schedule       0      0      0      0
1perc90 to-schedule       1      0      1      1
1perc99 to-schedule       1      1      1      1
1perc50 from-schedule     1      1      2      2
1perc90 from-schedule     1      2      3      3
1perc99 from-schedule     1      3      5      5
1perc50 watch             0.67   0.70   0.69   0.71
1perc90 watch             1.06   1.07   1.04   1.03
1perc99 watch             2.72   1.51   1.48   1.93
1perc50 schedule-watch    1.20   1.83   2.75   2.81
1perc90 schedule-watch    1.69   2.53   4.01   4.03
1perc99 schedule-watch    2.72   4.51   6.08   6.93
1perc50 total             1.43   1.85   2.87   2.93
1perc90 total             1.81   2.53   4.14   4.05
1perc99 total             3.72   4.51   6.08   6.93
2perc50 total             1.46   1.99   2.80   4.21
2perc90 total             1.94   2.82   4.07   6.75
2perc99 total             3.12   4.27   5.67   8.52
3perc50 total             1.42   1.97   2.91   4.11
3perc90 total             1.90   3.08   4.60   7.06
3perc99 total             5.26   4.26   7.40   9.73

It seems that the results get worse with every run, even though the cluster appears to be cleaned up correctly. This needs to be verified.

@piosz
Member

piosz commented Jun 19, 2015

I re-ran the test on a fresh cluster and the total e2e pod startup time is as follows:
perc50: 2.41s, perc90: 3.10s, perc99: 5.14s

So we actually meet the goal :)
@davidopp @fgrzadkowski @wojtek-t

In the earlier run there were events left over from previous runs.

@gmarek
Contributor

gmarek commented Jun 19, 2015

I wouldn't trust a single run - watch time variance is really high, and we still have two watches along the way.

@bprashanth
Contributor

I don't think we need to take the last watch into account to measure 'time to running'. The watch lag is useful to have as a metric, but the createRunningPod routine itself (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/test/e2e/density.go#L295) knows when the pod is running via Get. I feel like this simulates a user waiting for a Running pod better than a watch does. @davidopp do we think users will actually be watching for the Running pod?
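
For illustration, a rough sketch of that "wait for Running via Get" approach, written against today's client-go (not the actual density.go helper; the function name, poll interval, and package name are arbitrary). Polling like this measures what a user running repeated Gets would see, without including the final watch hop discussed above.

```go
// Sketch of polling Get until the pod reports phase Running; not the density.go code.
package waiting

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForRunning polls the API server until the named pod is Running or the timeout expires.
func waitForRunning(client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err == nil && pod.Status.Phase == corev1.PodRunning {
			return nil
		}
		time.Sleep(2 * time.Second) // poll interval; tune as needed
	}
	return fmt.Errorf("pod %s/%s not running after %v", ns, name, timeout)
}
```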

@piosz
Member

piosz commented Jun 19, 2015

I definitely agree with you @gmarek that we need more runs. I can do it on Monday. However the trend seems to be right.

@davidopp
Member

@davidopp do we think users will actually be watching for the Running pod?

Sometimes. But I guess if they poll the API server they won't hit that final watch latency, and the most common use case right now is likely to be the user sitting at the shell prompt manually running 'get pods' over and over again. So measuring without the final watch seems OK.

@piosz
Member

piosz commented Jun 22, 2015

I ran the test several times and the results seem to be pretty stable. @wojtek-t is going to enable the test on Jenkins by default.

  • perc50: 2.46s, perc90: 3.42s, perc99: 5.57s
  • perc50: 2.49s, perc90: 4.09s, perc99: 5.56s
  • perc50: 2.36s, perc90: 3.47s, perc99: 4.78s
  • perc50: 2.37s, perc90: 3.10s, perc99: 3.95s
  • perc50: 2.33s, perc90: 3.82s, perc99: 5.94s

@davidopp
Member

I'm going to move this to 1.0-post since we at least have manually-collected numbers (see previous entry). @wojtek-t can you please update this issue when these numbers are being collected in Jenkins (and give instructions on how to pull the numbers out)?

@davidopp davidopp modified the milestones: v1.0-post, v1.0 Jun 23, 2015
@davidopp
Member

BTW @piosz, just to verify, the numbers you collected above were including the final watch to the user?

@piosz
Member

piosz commented Jun 23, 2015

The results are printed to stdout. Something like this:

INFO: 10% worst e2e total latencies: [{density-latency-pod-2c70d718-19b5-11e5-a288-42010af01555-41 e2e-scalability-minion-wd0b 3.045526765s} ... ]
INFO: perc50: 2.236848859s, perc90: 3.045526765s, perc99: 4.266575615s

The above logs come from http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-scalability/1535/consoleFull

I measured the metric described here #3954 (comment) as total.

@davidopp
Member

Great, thanks, closing this issue. Thanks everyone for all the work on this! It's great that in the end we basically met both the scheduling time goal and the API latency goal!
