99%ile end-to-end pod startup time w/ prepulled images < 5s on 100 node, 3000 pod cluster; linear time to # nodes and pods #3954
Comments
FWIW, repeating the experiment with 30 pause pods required 53 seconds until the last pod went from unassigned to having a hostname. 50 pods was ~90 seconds.
Changing my mind -- P1 as a usability issue, and also as something that makes it hard to find the reliability bottlenecks for clusters that are used over time.
I think a more effective measurement is the latency distribution. --brendan
@brendandburns Latency to reporting running for a single pod is also high (#3952), but that appears not to be attributable to the scheduler. Re: throughput, unless you're scheduling only 1 pod at a time, it manifests as latency. That means our unit tests will run more slowly, developers take longer to test stuff, the usability is worse, and testing how a cluster does with a cumulative total of N containers launched takes a long time. It'd be nice if we could run something like a soak test faster than waiting days/weeks of real time.
At the 1.0 roadmap meeting the following requirement was decided for 1.0: "99% of scheduling decisions made in less than 1s on 100 node, 3000 pod clusters; linear time to number of pods and nodes"
There was some interest expressed today in also having a latency goal.
I just did 100 pods; it was 4.5 minutes for all of them to schedule. HOWEVER, it is unclear whether this is due to the scheduler being slow or the apiserver being slow; I expect that if I profile the scheduler I will find that it is spending most of its time waiting.
Out of curiosity, I ran a simple test using a local cluster, where the only node is the localhost. With 4 threads each creating 25 pods (i.e. 100 pods), the total pod creation time + scheduling time (i.e., until the host field is set) is ~1 minute. This is probably close to the maximum performance that a single-threaded scheduler can achieve in the current implementation. @davidopp, what machine/cluster setup should we use for the 1.0 requirement?
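To make that measurement concrete, here is a minimal sketch of the kind of check described above: create a pause pod, then poll until the scheduler has set the host field. It uses today's client-go API rather than the client that existed at the time, and the kubeconfig path and pod name are placeholders.

```go
// Minimal sketch, not the original test: create one pause pod and poll until
// the scheduler has set the host (nodeName) field. Uses the current client-go
// API, which differs from the 0.8.x-era client; kubeconfig path and pod name
// are placeholders.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pause-0", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "pause", Image: "kubernetes/pause"}},
		},
	}

	start := time.Now()
	if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Poll until the pod has been assigned a node, i.e. it has been scheduled.
	for {
		p, err := client.CoreV1().Pods("default").Get(context.TODO(), "pause-0", metav1.GetOptions{})
		if err == nil && p.Spec.NodeName != "" {
			fmt.Printf("scheduled to %s after %v\n", p.Spec.NodeName, time.Since(start))
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```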
The setup we had decided on earlier was "100 node, 3000 pod clusters".
Yes, it is cheating, but it's still a good reference point :) @lavalamp, did you create a 100-node cluster for your test? It'd be a pain to wait for the cluster bring-up/tear-down with so many nodes. Or did you mock out the nodes...?
@yujuhong Yes I agree, and in fact trying the same number of pods at different cluster sizes is helpful to try to pinpoint where the slowdowns are coming from.
This was on a two node (+ master VM) cluster (so 100 pods per node, using the kubernetes/pause image).
Preliminary result from running with #5320 is that scheduler performance is much better with boundPods gone (the density test ran 2x as fast). The ratio of the apiserver's CPU usage to the scheduler's is about 9:1...
Yes, I did it yesterday. It looks better than I expected, as it turned out that I was "lucky" when picking Pods to analyze. Statistics are that for a saturated cluster the 50th percentile is around 3s, the 90th around 6s, and the 99th around 9s for scheduled-to-running, and 6/10/13s for scheduled-to-watch. These times are computed after the Pod is scheduled, so they do not include scheduler latency (assuming the watch latency is similar to the ones I observe for e2e, it's at least 3/5/6s). I'll run multiple tests on a saturated cluster to check the variance of the results, and run a sequential test to approximate scheduler latencies.
We also got a couple of changes in recently that should help. If you ran the test yesterday it should have them (/stats don't hit docker, and we set the right oom score on cluster addons so nodes don't die -- if the node went down and came up previously it would take the kubelet/docker a good minute to reinitialize things). We're also setting a cgroup limit on docker to avoid cases where it can hog memory, causing everything else to get really slow (#9881).
@bprashanth good to know. Statistics are starting to look promising. I suggest defining our goal in terms of the latency between schedule time and container start times (excluding schedule latency, because of rate limiting, and the end-user watch). We're already accounting for one watch latency (scheduler->kubelet), and we have only limited control over how efficient the etcd watch is. There are a few timestamps I look at: to-schedule := (2) - (1). Current results are:
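As an illustration of the bookkeeping implied by those timestamps (not the actual test code), a Go sketch that derives to-schedule and to-run per pod and reports the 50th/90th/99th percentiles could look like the following; the struct and function names are invented for the example.

```go
// Illustration only: per-pod timestamp bookkeeping implied by the comment
// above, with 50th/90th/99th percentile reporting. Names are invented for the
// example; the real density test did not look like this.
package latencysketch

import (
	"fmt"
	"sort"
	"time"
)

type podTimestamps struct {
	created   time.Time // (1) pod creation time
	scheduled time.Time // (2) time the scheduler assigned a node
	running   time.Time // (3) time the pod was reported running
}

// percentile returns the p-th percentile (0 <= p <= 1) of the given durations.
func percentile(d []time.Duration, p float64) time.Duration {
	sort.Slice(d, func(i, j int) bool { return d[i] < d[j] })
	return d[int(float64(len(d)-1)*p)]
}

// report prints to-schedule := (2) - (1) and to-run := (3) - (2) percentiles.
func report(pods []podTimestamps) {
	var toSchedule, toRun []time.Duration
	for _, p := range pods {
		toSchedule = append(toSchedule, p.scheduled.Sub(p.created))
		toRun = append(toRun, p.running.Sub(p.scheduled))
	}
	for _, q := range []float64{0.50, 0.90, 0.99} {
		fmt.Printf("p%.0f to-schedule=%v to-run=%v\n",
			q*100, percentile(toSchedule, q), percentile(toRun, q))
	}
}
```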
My 2c
I ran a serialized version of the test and for some reason it's even worse (12/22/25). It may be a bug in the modified test, but I didn't manage to look into it. What's worse, I'll be at conferences for the next week and a half, so I won't have much time to dig in.
cc @bgrant0607, who was curious about why we need StartTime to measure watch latency of specific pods when we have /metrics (spinoff from #10066). Pure prometheus metrics aren't as helpful for this benchmark because we need to bucket them, and it's hard to isolate the last 100 pods that we start up after we've set up a 3000 pod cluster.
Why won't a sufficiently small percentile work? Do we expect the first 3000 pods to have slower startup time than the last 100? |
The first 3000 are known to have slow startup times and to put enough duress on the apiserver to also have slow watch times (docker/etcd suffer under that initial 3000 pod load).
Do we have reason to believe that Kubelet watch behavior is different than that for other watch clients? |
That's what we're doing right now (setting up an external watcher to measure latency), but this is to debug after we've observed a slow startup; e.g. CPU/memory pressure on the kubelet can certainly lead to the watch goroutine just not running in some cases (that's what was happening with #9788). It's not like we can't get by without StartTime, just that if there were a field that had this specific timestamp, it would be useful.
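For reference, an external watcher of this sort is straightforward with the watch API. A minimal sketch, using the current client-go signatures (which differ from the 2015 client) and an invented function name, that records when each pod is first seen Running:

```go
// Sketch of an external watcher that records when each pod is first observed
// Running. Uses the current client-go watch API; the function name is
// invented for the example.
package watchsketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchRunningTimes returns, per pod name, the wall-clock time at which the
// watch first delivered the pod in phase Running.
func watchRunningTimes(client kubernetes.Interface, ns string) (map[string]time.Time, error) {
	w, err := client.CoreV1().Pods(ns).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	defer w.Stop()

	running := map[string]time.Time{}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		if pod.Status.Phase == corev1.PodRunning {
			if _, seen := running[pod.Name]; !seen {
				running[pod.Name] = time.Now()
			}
		}
	}
	return running, nil
}
```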
Another idea: Kill Kubelet after creating the 3000 pods. That would reset any of its monitoring stats. |
Haha, yeah, restarting things would reset stats. It's something I'd consider as a last resort (because restarting kubelet will do a lot more -- it would lead to a thundering herd of relist etc). |
@bgrant0607 FYI we currently have data redundancy in the form of Pod.StartTime and max_containers(Pod."container".Status.Phase.Running.StartedAt), which are equal (up to a few milliseconds, truncated to seconds).
I ran an experiment similar to @gmarek's, but with a pod creation limit of 5 pods/sec (#10020), and the results are much better:
It seems that with every run the results get worse, even though the cluster seems to be cleaned up correctly. This needs to be verified.
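For context, the 5 pods/sec creation limit mentioned above amounts to client-side throttling of the create calls. A rough sketch of that idea, not the actual #10020 change, using golang.org/x/time/rate and the current client-go API with invented names:

```go
// Rough sketch of client-side throttling of pod creation (the 5 pods/sec limit
// mentioned above); this is not the actual #10020 change.
package densitysketch

import (
	"context"
	"fmt"
	"log"

	"golang.org/x/time/rate"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createThrottled creates n pause pods, never issuing creates faster than qps.
func createThrottled(client kubernetes.Interface, ns string, n int, qps float64) {
	limiter := rate.NewLimiter(rate.Limit(qps), 1)
	for i := 0; i < n; i++ {
		if err := limiter.Wait(context.TODO()); err != nil {
			return
		}
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("density-%d", i), Namespace: ns},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "pause", Image: "kubernetes/pause"}},
			},
		}
		if _, err := client.CoreV1().Pods(ns).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			log.Printf("create %s: %v", pod.Name, err)
		}
	}
}
```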
I re-ran the test on a fresh cluster and the total e2e pod startup time is as follows: So we actually meet the goal :) In the previous run there were events left over from previous runs.
I wouldn't trust a single run - watch time variance is really high, and we still have two watches along the way. |
I don't think we need to take the last watch into account to measure 'time to running'. The watch lag is useful to have as a metric, but the createRunningPod routine itself (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/test/e2e/density.go#L295) knows when the pod is running via Get. I feel like this simulates a user waiting for a Running pod better than a watch does. @davidopp, do we think users will actually be watching for the Running pod?
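A Get-based check along those lines (a sketch only, not the density test's actual helper, and using the current client-go and wait packages) could be:

```go
// Sketch only, not the density test's actual helper: poll the apiserver with
// Get (as a user running `get pods` would) until the pod reports Running, and
// return how long that took.
package pollsketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForRunning(client kubernetes.Interface, ns, name string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	err := wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep polling through transient errors
		}
		return pod.Status.Phase == corev1.PodRunning, nil
	})
	return time.Since(start), err
}
```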
I definitely agree with you @gmarek that we need more runs. I can do it on Monday. However the trend seems to be right. |
Sometimes. But I guess if they poll the API server they won't hit that final watch latency, and the most common use case right now is likely to be the user sitting at the shell prompt manually running 'get pods' over and over again. So measuring without the final watch seems OK. |
I ran the test several times and the results seem to be pretty stable. @wojtek-t is going to enable the test on Jenkins by default.
I'm going to move this to 1.0-post since we at least have manually-collected numbers (see previous entry). @wojtek-t can you please update this issue when these numbers are being collected in Jenkins (and give instructions on how to pull the numbers out)? |
BTW @piosz, just to verify: did the numbers you collected above include the final watch to the user?
The results are printed to stdout. Something like this:
The above logs come from http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-scalability/1535/consoleFull. I measured the metric described in #3954 (comment) as the total.
Great, thanks, closing this issue. Thanks everyone for all the work on this! It's great that in the end we basically met both the scheduling time goal and the API latency goal! |
@davidopp Scheduler throughput is ~1 pod per second; should be >> 1 pod/sec.
In 0.8.1, as part of #3936, I quickly created 10 kubernetes/pause pods with unique names by running a bunch of simultaneous command lines. I then repeatedly ran the pods list, saw all 10 pods, and watched as the hostname was slowly filled in at a rate of ~1 pod per second.
% gcloud preview container pods --cluster=${CLUSTER} create --name=$POD --image=kubernetes/pause --port=${PORT} &