
Make 99% of API calls return in less than 1s; constant time to number of nodes and pods #4521

Closed
davidopp opened this issue Feb 18, 2015 · 77 comments
Labels: priority/important-soon · sig/scalability

@davidopp
Member

@lavalamp would you be interested in taking this (it's from the v1.0 roadmap)?

Also cc'ing @roberthbailey and @satnam6502 for good measure.

davidopp added the sig/scalability and team/master labels on Feb 18, 2015
davidopp added this to the v1.0 milestone on Feb 18, 2015
roberthbailey added the priority/important-soon label on Feb 18, 2015
@ghost

ghost commented Feb 18, 2015

I'm looking at the slow 'get pods' call #4196 and have a few observations about API server performance. I added a doc about turning on profiling to docs/devel, which should be helpful for finding bottlenecks. Sadly, AFAICT most of the time is spent in YAML handling, which I guess means that we need to reduce the number of etcd entries parsed in the normal workflow, parallelize some of the parsing, write a custom YAML parser, or change the way things are stored in etcd.
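For reference, enabling Go's built-in profiling endpoints generally looks like the sketch below. This is generic net/http/pprof usage, not the exact wiring described in the docs/devel doc.

```go
// Minimal sketch of enabling Go's HTTP profiling endpoints (generic
// net/http/pprof usage, not the exact kube-apiserver wiring).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// CPU and heap profiles can then be fetched with `go tool pprof` from
	// http://localhost:6060/debug/pprof/profile and /debug/pprof/heap.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```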

As the first two options have no impact on anything outside the master, I'm trying to apply them to the get pods call. I've already tried parallelizing, but it doesn't give a big improvement.

Also, I'm sometimes seeing slow responses from the server when checking metrics. Most of the time it's instantaneous, but once in a while it takes a few seconds. I don't know yet how metrics are stored/computed, but (I guess) it's not a master API call. This could mean that there is a problem with the HTTP server being slow/overcommitted from time to time.

@ghost

ghost commented Feb 18, 2015

There's an issue with this goal: it doesn't say on what machine the API server should achieve this <1 sec performance. That matters, because on GCE we run on a single-core CPU by default, so there isn't much room to parallelize anything.

@lavalamp
Member

  1. Have we measured this, and is it actually not meeting the target?
    1.a) As Marek said, what is the golden system configuration on which we will take this measurement? What is its load? 1s is meaningless by itself. I'd rather say "constant time, with a low constant factor".

Even though it's single-core, parallelizing should still help if there's network access. I think I already did that in the appropriate places before I left a month ago, though.

...also it might be worth double checking if we ever allow more than one thread by setting GOMAXPROCS.

@davidopp
Member Author

@lavalamp No, we haven't measured. It's possible that we already meet the goal. Also, as you noted, the goal here is under-defined, since it doesn't specify a workload. (I believe @satnam6502 is working on workload generators.) So part of this issue is to define a workload.

@ghost

ghost commented Feb 18, 2015

I added parallel decoding of the data extracted from etcd for get pods, as it takes 1.5 to 6 sec when 50 pods are present (it's more like 1.2-4 sec with an empty cluster). It did not help.

Additionally, get replicationcontroller is EXTREMELY slow, as per #4536. What's worse than this 15 sec "normal" runtime is that a get issued while stop is running easily takes 150 sec (yup: 2.5 minutes). I noticed it in the evening, so I didn't have time to investigate further.

I traced only the apiserver part of the execution.

@ghost

ghost commented Feb 19, 2015

Rephrasing my previous comment: get rc running time is far from constant: from <1 sec in an empty cluster, to >15 sec when there are 50 RCs running, to minutes when multiple `stop rc` commands are under way (easily reproducible).

We should agree on some golden master/node configuration. I think it's kind of important: if we assume that the master will run on a single thread, the architecture may be quite different than if we assume a multithreaded one. In addition, if we're aiming at a multithreaded master, I believe we should add a second thread to the default master machine configuration for each provider, to make testing more realistic. A similar thing applies to node machines.

My personal opinion is that we should assume that both the master and the kubelet run on multithreaded machines, and modify our defaults accordingly.

@vmarmol
Contributor

vmarmol commented Feb 19, 2015

As per @gmarek's comment, it seems reasonable to specify a minimum spec of X cores for our SLO of Y. It does seem unreasonable to expect a single-core machine to run 50 pods (of any size) in a timely manner.

@satnam6502
Contributor

When I was last doing experiments I used an n1-highmem-16 master, and I also tried an n1-highcpu-16 master, but still ran into lots of issues -- perhaps because our API server needs to be (more?) multi-threaded?

@lavalamp
Member

OK, I see we don't actually set GOMAXPROCS outside of our tests. I'm making a PR to add that, so at least we'll use more cores when we have them.
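For context, the change amounts to something like the minimal sketch below (generic Go, not the actual Kubernetes PR): before Go 1.5 the runtime defaulted GOMAXPROCS to 1, so a process used a single core unless this was set explicitly at startup.

```go
// Minimal sketch (generic Go, not the actual Kubernetes PR) of allowing the
// Go scheduler to use every available core.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())            // use all available cores
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0)) // passing 0 just reads the current value
}
```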

@gmarek
Contributor

gmarek commented Feb 20, 2015

While testing some change I ran some experiments to see how far from this goal we are in two configurations:

  • stupid one (master 16 cores, 4 kubelets 2 cores each), 50 replication controllers, each with 1 pod
  • more-reasonable-but-still-very-small-one (master 4 cores, 10 kubelets 1 core each), 200 replication controllers, each with 1 pod

I tested get API calls (pods and rc), and it turned out that get pods is WAY faster than get rc, and the difference grows significantly with the size of the cluster. It's also clear that gets get slower as more pods are running.

I'm sorry for the format, but GitHub does not allow PDFs :/ All results are in seconds.

1 master (highcpu-16), 4 nodes (standard-2), 50 1-pod rc

| get rc (empty) | get rc (50 pods) | get pod (50 pods) |
|---|---|---|
| 0.51 | 4.19 | 0.75 |
| 0.48 | 3.53 | 0.71 |
| 0.51 | 3.41 | 0.85 |
| 0.51 | 3.47 | 0.76 |
| 0.5 | 3.43 | 0.72 |
| 0.49 | 4.01 | 0.73 |
| 0.5 | 4.17 | 0.72 |
| 0.51 | 3.89 | 0.72 |
| 0.49 | 3.76 | 0.83 |
| 0.51 | 3.57 | 0.75 |

1 master (highcpu-4), 10 nodes (standard-1), 200 1-pod rc

| get rc (empty) | get rc (200 pods) | get pod (200 pods) |
|---|---|---|
| 0.5 | 1:13.41 | 0.91 |
| 0.52 | 1:09.00 | 0.87 |
| 0.51 | 1:10.98 | 0.85 |
| 0.5 | 6.82 | |
| 0.54 | 1.68 | |
| 0.51 | 1.4 | |
| 0.5 | 1.43 | |
| 0.54 | 1.72 | |
| 0.5 | 1.4 | |
| 0.48 | 1.38 | |
| 0.5 | 1.37 | |

@lavalamp
Member

@bprashanth is working on moving replication controller status computation into the replication controller itself, which will have the side effect of making list rc no longer O(#pods * #rc).
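To make the complexity concrete, here is a hypothetical sketch (illustrative only, not the actual apiserver code) of why computing status at list time is O(#pods * #rc): each RC's replica count requires scanning every pod against that RC's selector.

```go
// Hypothetical illustration of the O(#pods * #rc) cost of computing RC
// status at list time; not the actual apiserver code.
package main

import "fmt"

type Pod struct{ Labels map[string]string }

type RC struct {
	Name     string
	Selector map[string]string
	Replicas int // computed while listing
}

// matches reports whether a pod's labels satisfy an RC's selector.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// listRCs fills in Replicas for every RC: #rc iterations, each scanning
// all #pods pods, hence O(#pods * #rc) overall.
func listRCs(rcs []RC, pods []Pod) []RC {
	for i := range rcs {
		for _, p := range pods {
			if matches(rcs[i].Selector, p.Labels) {
				rcs[i].Replicas++
			}
		}
	}
	return rcs
}

func main() {
	pods := []Pod{{Labels: map[string]string{"app": "a"}}, {Labels: map[string]string{"app": "b"}}}
	rcs := []RC{{Name: "a", Selector: map[string]string{"app": "a"}}}
	fmt.Println(listRCs(rcs, pods)[0].Replicas) // 1
}
```

Moving the status computation into the controller removes the inner scan from the list path, so listing RCs scales with #rc alone.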

@lavalamp
Member

#4675 will add GOMAXPROCS to the scheduler, which I missed yesterday.

@lavalamp
Member

After #4429 lands, the times for replication controllers should drop down to be about the same as the pod times.

@timothysc
Member

I'll paste my comment in this thread then:

Looking at the performance data for the apiserver under soak testing, it appears that there is a fair amount of GC work going on:

20.43% kube-apiserver [.] runtime.MSpan_Sweep
3.91% kube-apiserver [.] findrunnable
3.05% kube-apiserver [.] runtime.mallocgc
2.29% kube-apiserver [.] scanblock
1.83% kube-apiserver [.] runtime.sweepone
1.75% kube-apiserver [.] runtime.atomicload
1.73% kube-apiserver [.] runtime.xchg
1.54% kube-apiserver [.] runtime.duffcopy
1.47% kube-apiserver [.] flushptrbuf
1.22% kube-apiserver [.] runtime.memclr
1.19% kube-apiserver [.] runqgrab

Seems related to: golang/go#9265

go version go1.3.3 linux/amd64
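For what it's worth, GC pressure like this can also be confirmed from inside the process itself, independently of perf, e.g. with GODEBUG=gctrace=1 or by polling runtime.MemStats. The sketch below is a generic illustration, not something from this thread.

```go
// Generic sketch (not from this thread) of watching GC activity from
// inside a Go process; GODEBUG=gctrace=1 prints similar data to stderr.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	var prev uint32
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("GC cycles: %d (+%d since last sample), total pause: %v\n",
			m.NumGC, m.NumGC-prev, time.Duration(m.PauseTotalNs))
		prev = m.NumGC
		time.Sleep(10 * time.Second)
	}
}
```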

@timothysc
Member

More color:

#4862 shows an RC fill time of ~10 minutes.

Part of this lag is the apiserver, but there are other issues in the system that we're tracing.

@lavalamp
Member

@timothysc what process are you using to generate those numbers?

I think taking boundpods out will drastically speed up adding and removing pods in the system; I think there's a lot of contention over updating it.

@timothysc
Member

@lavalamp - #5260 is the load we used for the numbers above.

@davidopp
Member Author

davidopp commented Jun 2, 2015

@fgrzadkowski There was a request to also test on g1-small since that's the smallest master users might run in practice. There's no expectation that the performance will be good -- we just want to make sure it is not unusable.

@fgrzadkowski
Contributor

@piosz has enabled a load test which creates:

  • 300 rc with 5 pods each
  • 25 rc with 30 pods each
  • 3 rc with 250 pods each

All creations are spread across the first 10 minutes. After that we randomly resize/delete/recreate all RCs.

Metrics from the first run:

INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:1.50546s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.360146s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.022879s}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:481.39ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:400.466ms}

@davidopp
Member Author

davidopp commented Jun 3, 2015

Thanks for the update! Can you indicate which PRs related to this issue are still out for review (other than the CoreOS one), if any?

@fgrzadkowski
Contributor

The CoreOS one is the only PR in flight. The numbers in my previous comment were taken from Jenkins.

@davidopp
Member Author

davidopp commented Jun 3, 2015

> The CoreOS one is the only PR in flight. The numbers in my previous comment were taken from Jenkins.

Great, thanks! So the CoreOS PR will improve the above numbers further? (I assume "taken from Jenkins" means only using what is already in the codebase -- but I wanted to make sure I'm understanding correctly.)

@fgrzadkowski
Contributor

@davidopp Yes. I'll send a PR to update go-etcd client library version today.

@davidopp
Member Author

davidopp commented Jun 8, 2015

@fgrzadkowski : Once you're confident we're meeting the goal described in the title of this issue, please move the issue to milestone "1.0-post". (We don't need to wait on the etcd PRs to merge if we're already meeting the goal.)

davidopp added the priority/critical-urgent label and removed the priority/important-soon label on Jun 8, 2015
@fgrzadkowski
Contributor

I enabled the load test. Currently it plays the following scenario:

  1. [0-10min] Run 300 RCs with 5 replicas each, 25 RCs with 30 pods each, 3 RCs with 250 pods each
  2. [10-13min] Scale all RCs randomly and list pods within each RC
  3. [13-16min] Repeat 2
  4. [16-26min] Delete all RCs

The test is configured to keep ~5 creations/deletions per second, which IMO sounds reasonable. The test checks latency metrics after all of the phases to see if we meet the 1s criterion. This means that some of the requests were performed when the cluster was not full.
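As a sanity check on the stated rate (my arithmetic, not from the test output): the scenario above creates 300·5 + 25·30 + 3·250 = 1500 + 750 + 750 = 3000 pods over the first 10 minutes, i.e. 3000 / 600 s = 5 pods/s, which matches the ~5 creations/deletions per second.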

The first run gave really good results (with an 8-core master):

load test:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:707.259ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:382.948ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:239.901ms}
INFO: Top latency metric: {Verb:DELETE Resource:pods Quantile:0.99 Latency:193.937ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:161.933ms}

density 30 pods per node:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.60586s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:969.726ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:513.258ms}
INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:494.578ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:491.616ms}

I think we should pay more attention to the load test, which plays a more realistic scenario.

Tomorrow I will set the thresholds to 1s for the load test and 2s for the density test.

@davidopp @brendanburns @wojtek-t Assuming that the next runs give similar results, I suggest we announce victory. Does that sound reasonable to you?

@bprashanth
Contributor

@fgrzadkowski just FYI, with the load test we resize RCs quickly, so there's a chance you will run into #9147.

@davidopp
Member Author

davidopp commented Jun 9, 2015

> Assuming that the next runs give similar results, I suggest we announce victory. Does that sound reasonable to you?

Yes.

davidopp added the priority/important-soon label and removed the priority/critical-urgent label on Jun 10, 2015
davidopp reopened this on Jun 12, 2015
@fgrzadkowski
Contributor

@davidopp Results with a 4-core master:

Density with 3 pods per node

INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:589.097ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:430.241ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:372.226ms}
INFO: Top latency metric: {Verb:POST Resource:bindings Quantile:0.99 Latency:359.243ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:302.09ms}

Density with 30 pods per node

INFO: WARNING Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:2.656382s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.269396s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:854.817ms}
INFO: Top latency metric: {Verb:POST Resource:pods Quantile:0.99 Latency:481.31ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:370.808ms}

Load test

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:575.679ms}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:372.226ms}
INFO: Top latency metric: {Verb:DELETE Resource:pods Quantile:0.99 Latency:190.719ms}
INFO: Top latency metric: {Verb:PUT Resource:nodes Quantile:0.99 Latency:152.142ms}
INFO: Top latency metric: {Verb:PUT Resource:replicationcontrollers Quantile:0.99 Latency:151.097ms}

NOTE: For the load tests, some requests were performed while we were creating/deleting RCs, so the cluster was not full (it covered the full spectrum from 0 to 30 pods per node).

@davidopp I suggest we keep the 4-core master, but increase the threshold for the density test to 3 seconds. WDYT?

@davidopp
Member Author

Sure, sounds good.

@davidopp
Member Author

And thanks for collecting the additional data. That doesn't look too terrible vs. the 8-core data (i.e. it's not 2x the latency).

@davidopp
Member Author

After merging #9862 the latency improved. #9862 (comment) says:

Results from the density test with 30 pods per node in a 100-node cluster with an n1-standard-4 master:

INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.99 Latency:1.605515s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.9 Latency:1.270142s}
INFO: Top latency metric: {Verb:LIST Resource:pods Quantile:0.5 Latency:529.251ms}
INFO: Top latency metric: {Verb:POST Resource:bindings Quantile:0.99 Latency:444.361ms}
INFO: Top latency metric: {Verb:PUT Resource:pods Quantile:0.99 Latency:426.816ms}

This is way better (~25%) than the usual run of the density test.
