
Make 99% of API calls return in less than 1s; constant time to number of nodes and pods #4521

Closed
davidopp opened this issue Feb 18, 2015 · 77 comments
Labels: priority/important-soon · sig/scalability

@davidopp
Member

@lavalamp would you be interested in taking this (it's from the v1.0 roadmap)?

Also cc'ing @roberthbailey and @satnam6502 for good measure.

davidopp added the sig/scalability and team/master labels on Feb 18, 2015
davidopp added this to the v1.0 milestone on Feb 18, 2015
roberthbailey added the priority/important-soon label on Feb 18, 2015
@ghost

ghost commented Feb 18, 2015

I'm looking at the slow 'get pods' call #4196 and have a few observations about API server performance. I added a doc about turning on profiling to docs/devel, which should be helpful for finding bottlenecks. Sadly, AFAICT most of the time is spent in YAML handling, which I guess means that we need to reduce the number of etcd entries parsed in the normal workflow, parallelize some of the parsing, write a custom YAML parser, or change the way things are stored in etcd.
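For reference, enabling Go's built-in profiling endpoints generally looks like the sketch below. This is generic net/http/pprof usage, not the exact wiring described in the docs/devel doc.

```go
// Minimal sketch of enabling Go's HTTP profiling endpoints (generic
// net/http/pprof usage, not the exact kube-apiserver wiring).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// CPU and heap profiles can then be fetched with `go tool pprof` from
	// http://localhost:6060/debug/pprof/profile and /debug/pprof/heap.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```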

As the first two options have no impact on anything outside the master, I'm trying to apply them to the get pods call. I've already tried parallelizing, but it doesn't give a big improvement.

Also, I'm sometimes seeing slow responses from the server when checking metrics. Most of the time it's instantaneous, but once in a while it takes a few seconds. I don't know yet how metrics are stored/computed, but (I guess) it's not a master API call. This could mean that there is a problem with the HTTP server being slow/overcommitted from time to time.

@ghost

ghost commented Feb 18, 2015

There's an issue with this goal: it doesn't say on what machine the API server should achieve this <1 sec performance. That matters, because on GCE we run on a single-core CPU by default, so there isn't much room to parallelize anything.

@lavalamp
Member

  1. Have we measured this, and is it actually not meeting the target?
    1.a) As Marek said, what is the golden system configuration on which we will take this measurement? What is its load? 1s is meaningless by itself. I'd rather say "constant time, with a low constant factor".

Even though it's single-core, parallelizing should still help if there's network access. I think I already did that in the appropriate places before I left a month ago, though.

...also it might be worth double checking if we ever allow more than one thread by setting GOMAXPROCS.

@davidopp
Member Author

@lavalamp No, we haven't measured. It's possible that we already meet the goal. Also, as you noted, the goal here is under-defined, since it doesn't specify a workload. (I believe @satnam6502 is working on workload generators.) So part of this issue is to define a workload.

@ghost

ghost commented Feb 18, 2015

I added parallel decoding of the data extracted from etcd for get pods, as it takes 1.5 to 6 sec when 50 pods are present (it's more like 1.2-4 sec with an empty cluster). It did not help.

Additionally, get replicationcontroller is EXTREMELY slow, as per #4536. What's worse than this 15 sec "normal" runtime is that a get issued while stop is running easily takes 150 sec (yup: 2.5 minutes). I noticed it in the evening, so I didn't have time to investigate further.

I traced only the apiserver part of the execution.

@ghost

ghost commented Feb 19, 2015

Rephrasing my previous comment: get rc running time is far from constant: from <1 sec in an empty cluster, to >15 sec when there are 50 RCs running, to minutes when multiple `stop rc` commands are under way (easily reproducible).

We should agree on some golden master/node configuration. I think it's kind of important: if we assume that the master will run on a single thread, the architecture may be quite different than if we assume a multithreaded one. In addition, if we're aiming at a multithreaded master, I believe we should add a second thread to the default master machine configuration for each provider, to make testing more realistic. A similar thing applies to node machines.

My personal opinion is that we should assume that both the master and the kubelet run on multithreaded machines, and modify our defaults accordingly.

@vmarmol
Contributor

vmarmol commented Feb 19, 2015

As per @gmarek's comment, it seems reasonable to specify a minimum spec of X cores for our SLO of Y. It does seem unreasonable to expect a single-core machine to run 50 pods (of any size) in a timely manner.

@satnam6502
Contributor

When I was last doing experiments I used an n1-highmem-16 master, and I also tried an n1-highcpu-16 master, but still ran into lots of issues -- perhaps because our API server needs to be (more?) multi-threaded?

@lavalamp
Member

OK, I see we don't actually set GOMAXPROCS outside of our tests. I'm making a PR to add that, so at least we'll use more cores when we have them.
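For context, the change amounts to something like the minimal sketch below (generic Go, not the actual Kubernetes PR): before Go 1.5 the runtime defaulted GOMAXPROCS to 1, so a process used a single core unless this was set explicitly at startup.

```go
// Minimal sketch (generic Go, not the actual Kubernetes PR) of allowing the
// Go scheduler to use every available core.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())            // use all available cores
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0)) // passing 0 just reads the current value
}
```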

@gmarek
Contributor

gmarek commented Feb 20, 2015

While testing some change I ran some experiments to see how far from this goal we are in two configurations:

  • stupid one (master 16 cores, 4 kubelets 2 cores each), 50 replication controllers, each with 1 pod
  • more-reasonable-but-still-very-small-one (master 4 cores, 10 kubelets 1 core each), 200 replication controllers, each with 1 pod

I tested get API calls (pods and rc), and it turned out that get pods is WAY faster than get rc, and the difference grows significantly with the size of the cluster. It's also clear that gets get slower as more pods are running.

I'm sorry for the format, but GitHub does not allow PDFs :/ All results are in seconds.

1 master (highcpu-16), 4 nodes (standard-2), 50 1-pod rc

| get rc (empty) | get rc (50 pods) | get pod (50 pods) |
|---|---|---|
| 0.51 | 4.19 | 0.75 |
| 0.48 | 3.53 | 0.71 |
| 0.51 | 3.41 | 0.85 |
| 0.51 | 3.47 | 0.76 |
| 0.5 | 3.43 | 0.72 |
| 0.49 | 4.01 | 0.73 |
| 0.5 | 4.17 | 0.72 |
| 0.51 | 3.89 | 0.72 |
| 0.49 | 3.76 | 0.83 |
| 0.51 | 3.57 | 0.75 |

1 master (highcpu-4), 10 nodes (standard-1), 200 1-pod rc

| get rc (empty) | get rc (200 pods) | get pod (200 pods) |
|---|---|---|
| 0.5 | 1:13.41 | 0.91 |
| 0.52 | 1:09.00 | 0.87 |
| 0.51 | 1:10.98 | 0.85 |
| 0.5 | 6.82 | |
| 0.54 | 1.68 | |
| 0.51 | 1.4 | |
| 0.5 | 1.43 | |
| 0.54 | 1.72 | |
| 0.5 | 1.4 | |
| 0.48 | 1.38 | |
| 0.5 | 1.37 | |

@lavalamp
Member

@bprashanth is working on moving replication controller status computation into the replication controller itself, which will have the side effect of making list rc no longer O(#pods * #rc).
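To make the complexity concrete, here is a hypothetical sketch (illustrative only, not the actual apiserver code) of why computing status at list time is O(#pods * #rc): each RC's replica count requires scanning every pod against that RC's selector.

```go
// Hypothetical illustration of the O(#pods * #rc) cost of computing RC
// status at list time; not the actual apiserver code.
package main

import "fmt"

type Pod struct{ Labels map[string]string }

type RC struct {
	Name     string
	Selector map[string]string
	Replicas int // computed while listing
}

// matches reports whether a pod's labels satisfy an RC's selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// listRCs fills in Replicas for every RC: #rc iterations, each scanning
// all #pods pods, hence O(#pods * #rc) overall.
func listRCs(rcs []RC, pods []Pod) []RC {
	for i := range rcs {
		for _, p := range pods {
			if matches(rcs[i].Selector, p.Labels) {
				rcs[i].Replicas++
			}
		}
	}
	return rcs
}

func main() {
	pods := []Pod{{Labels: map[string]string{"app": "a"}}, {Labels: map[string]string{"app": "b"}}}
	rcs := []RC{{Name: "a", Selector: map[string]string{"app": "a"}}}
	fmt.Println(listRCs(rcs, pods)[0].Replicas) // 1
}
```

Moving the status computation into the controller removes the inner scan from the list path, so listing RCs scales with #rc alone.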

@lavalamp
Member

#4675 will add GOMAXPROCS to the scheduler, which I missed yesterday.

@lavalamp
Member

After #4429 lands, the times for replication controllers should drop down to be about the same as the pod times.

@timothysc
Member

I'll paste my comment in this thread then:

Looking at the performance data for the apiserver under soak testing, it appears that there is a fair amount of GC work going on:

20.43% kube-apiserver [.] runtime.MSpan_Sweep
3.91% kube-apiserver [.] findrunnable
3.05% kube-apiserver [.] runtime.mallocgc
2.29% kube-apiserver [.] scanblock
1.83% kube-apiserver [.] runtime.sweepone
1.75% kube-apiserver [.] runtime.atomicload
1.73% kube-apiserver [.] runtime.xchg
1.54% kube-apiserver [.] runtime.duffcopy
1.47% kube-apiserver [.] flushptrbuf
1.22% kube-apiserver [.] runtime.memclr
1.19% kube-apiserver [.] runqgrab

Seems related to: golang/go#9265

go version go1.3.3 linux/amd64
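For what it's worth, GC pressure like this can also be confirmed from inside the process itself, independently of perf, e.g. with GODEBUG=gctrace=1 or by polling runtime.MemStats. The sketch below is a generic illustration, not something from this thread.

```go
// Generic sketch (not from this thread) of watching GC activity from
// inside a Go process; GODEBUG=gctrace=1 prints similar data to stderr.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var prev uint32
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("GC cycles: %d (+%d since last sample), total pause: %v\n",
			m.NumGC, m.NumGC-prev, time.Duration(m.PauseTotalNs))
		prev = m.NumGC
		time.Sleep(10 * time.Second)
	}
}
```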

@timothysc
Member

More color:

#4862 shows an RC fill time of ~10 minutes.

Part of this lag is the apiserver, but there are other issues in the system that we're tracing.

@lavalamp
Member

@timothysc what process are you using to generate those numbers?

I think taking boundpods out will drastically speed up adding and removing pods in the system; I think there's a lot of contention over updating it.

@timothysc
Member

@lavalamp - #5260 is the load we used for the numbers above.

@davidopp
Member Author

davidopp commented Jun 2, 2015

@fgrzadkowski There was a request to also test on g1-small since that's the smallest master users might run in practice. There's no expectation that the performance will be good -- we just want to make sure it is not unusable.

@fgrzadkowski
Contributor

@piosz has enabled a load test which creates:

  • 300 rc with 5 pods each
  • 25 rc with 30 pods each
  • 3 rc with 250 pods each

All creations are spread across the first 10 minutes. After that we randomly resize/delete/recreate all RCs.

Metrics from the first run:

INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:1.50546s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.360146s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.022879s}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:481.39ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:400.466ms}

@davidopp
Member Author

davidopp commented Jun 3, 2015

Thanks for the update! Can you indicate which PRs related to this issue are still out for review (other than the CoreOS one), if any?

@fgrzadkowski
Contributor

The CoreOS one is the only PR in flight. The numbers in my previous comment were taken from Jenkins.

@davidopp
Member Author

davidopp commented Jun 3, 2015

> The CoreOS one is the only PR in flight. The numbers in my previous comment were taken from Jenkins.

Great, thanks! So the CoreOS PR will improve the above numbers further? (I assume "taken from Jenkins" means only using what is already in the codebase -- but I wanted to make sure I'm understanding correctly.)

@fgrzadkowski
Contributor

@davidopp Yes. I'll send a PR to update go-etcd client library version today.

@davidopp
Member Author

davidopp commented Jun 8, 2015

@fgrzadkowski : Once you're confident we're meeting the goal described in the title of this issue, please move the issue to milestone "1.0-post". (We don't need to wait on the etcd PRs to merge if we're already meeting the goal.)

davidopp added the priority/critical-urgent label and removed the priority/important-soon label on Jun 8, 2015
@fgrzadkowski
Contributor

I enabled the load test. Currently it plays the following scenario:

  1. [0-10min] Run 300 RCs with 5 replicas each, 25 RCs with 30 pods each, 3 RCs with 250 pods each
  2. [10-13min] Scale all RCs randomly and list pods within each RC
  3. [13-16min] Repeat 2
  4. [16-26min] Delete all RCs

The test is configured to keep ~5 creations/deletions per second, which IMO sounds reasonable. The test checks latency metrics after all of the phases to see if we meet the 1s criterion. This means that some of the requests were performed when the cluster was not full.
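As a sanity check on the stated rate (my arithmetic, not from the test output): the scenario above creates 300·5 + 25·30 + 3·250 = 1500 + 750 + 750 = 3000 pods over the first 10 minutes, i.e. 3000 / 600 s = 5 pods/s, which matches the ~5 creations/deletions per second.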

The first run gave really good results (with an 8-core master):

load test:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:707.259ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:382.948ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:239.901ms}
INFO: Top latency metric: {Verb:DELETE Resource:pods Quantile:0.99 Latency:193.937ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:161.933ms}

density 30 pods per node:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.60586s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:969.726ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:513.258ms}
INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:494.578ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:491.616ms}

I think we should pay more attention to the load test, which plays a more realistic scenario.

Tomorrow I will set the thresholds to 1s for the load test and 2s for the density test.

@davidopp @brendanburns @wojtek-t Assuming that the next runs give similar results, I suggest we announce victory. Does that sound reasonable to you?

@bprashanth
Contributor

@fgrzadkowski just FYI, with the load test we resize RCs quickly, so there's a chance you will run into #9147.

@davidopp
Member Author

davidopp commented Jun 9, 2015

> Assuming that the next runs give similar results, I suggest we announce victory. Does that sound reasonable to you?

Yes.

davidopp added the priority/important-soon label and removed the priority/critical-urgent label on Jun 10, 2015
davidopp reopened this on Jun 12, 2015
@fgrzadkowski
Contributor

@davidopp Results with a 4-core master:

Density with 3 pods per node

INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:589.097ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:430.241ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:372.226ms}
INFO: Top latency metric: {Verb:POST Resource:bindings Quantile:0.99 Latency:359.243ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:302.09ms}

Density with 30 pods per node

INFO: WARNING Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:2.656382s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.269396s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:854.817ms}
INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:481.31ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:370.808ms}

Load test

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:575.679ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:372.226ms}
INFO: Top latency metric: {Verb:DELETE Resource:pods Quantile:0.99 Latency:190.719ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:152.142ms}
INFO: Top latency metric: {Verb:PUT Resource:replicationcontrollers Quantile:0.99 Latency:151.097ms}

NOTE: For the load tests, some requests were performed while we were creating/deleting RCs, so the cluster was not full (it covered the full spectrum from 0 to 30 pods per node).

@davidopp I suggest we keep the 4-core master, but increase the threshold for the density test to 3 seconds. WDYT?

@davidopp
Member Author

Sure, sounds good.

@davidopp
Member Author

And thanks for collecting the additional data. That doesn't look too terrible vs. the 8-core data (i.e. it's not 2x the latency).

@davidopp
Member Author

After merging #9862 the latency improved. #9862 (comment) says:

Results from the density test with 30 pods per node in a 100-node cluster with an n1-standard-4 master:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.605515s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.270142s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:529.251ms}
INFO: Top latency metric: {Verb:POST Resource:bindings Quantile:0.99 Latency:444.361ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:426.816ms}

This is way better (~25%) than the usual run of the density test.
