get pods blocks 7-10 seconds if pods are being created at the same time #4196

Closed
alex-mohr opened this issue Feb 5, 2015 · 10 comments
Labels: priority/backlog (Higher priority than priority/awaiting-more-evidence.), sig/scalability (Categorizes an issue or PR as relevant to SIG Scalability.)

@alex-mohr
Contributor

If I fire off 20 create pod requests (in the background), then issue a get pods request, the api server becomes unresponsive for 7-10 seconds.

% time kubectl --server=https://130.211.162.155 --auth-path=/usr/local/google/home/mohr/.config/gcloud-dogfood/kubernetes/mohr-instance-manager.us-central1-f.prod10d/kubernetes_auth --insecure-skip-tls-verify=true get pods
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS
pause-pod-146e9dc7 pause-pod-146e9dc7 kubernetes/pause k8s-prod10d-node-1.c.mohr-instance-manager.internal/ name=pause-pod-146e9dc7 Unknown
pause-pod-39afa37d 10.32.2.89 pause-pod-39afa37d kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-39afa37d Pending
pause-pod-43531a69 10.32.1.90 pause-pod-43531a69 kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-43531a69 Pending
pause-pod-4d6b99cb 10.32.0.78 pause-pod-4d6b99cb kubernetes/pause k8s-prod10d-node-1.c.mohr-instance-manager.internal/104.154.60.240 name=pause-pod-4d6b99cb Pending
pause-pod-66cb6cc0 10.32.0.80 pause-pod-66cb6cc0 kubernetes/pause k8s-prod10d-node-1.c.mohr-instance-manager.internal/104.154.60.240 name=pause-pod-66cb6cc0 Pending
pause-pod-96aaba36 10.32.1.88 pause-pod-96aaba36 kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-96aaba36 Pending
pause-pod-97b03e05 10.32.2.93 pause-pod-97b03e05 kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-97b03e05 Running
pause-pod-b9709ceb pause-pod-b9709ceb kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-b9709ceb Unknown
pause-pod-c752b5c2 pause-pod-c752b5c2 kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-c752b5c2 Unknown
pause-pod-eb5ecc36 10.32.1.91 pause-pod-eb5ecc36 kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-eb5ecc36 Running
pause-pod-ebd98025 10.32.1.92 pause-pod-ebd98025 kubernetes/pause /130.211.164.97 name=pause-pod-ebd98025 Pending
pause-pod-f5aeaee3 10.32.2.92 pause-pod-f5aeaee3 kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-f5aeaee3 Running
pause-pod-f698b56f 10.32.2.90 pause-pod-f698b56f kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-f698b56f Running
skydns-fplln 10.32.0.3 etcd quay.io/coreos/etcd:latest k8s-prod10d-node-1.c.mohr-instance-manager.internal/104.154.60.240 k8s-app=skydns Running
kube2sky kubernetes/kube2sky:1.0
skydns kubernetes/skydns:2014-12-23-001
0.038u 0.011s 0:07.63 0.5% 0+0k 0+0io 0pf+0w
(or http://pastebin.com/Z5JipRhB)

If I issue a "get pod skydns-fplln", the api request doesn't seem to block.

@alex-mohr added the priority/backlog, sig/scalability, and team/master labels Feb 5, 2015
@gmarek
Contributor

gmarek commented Feb 10, 2015

I ran some experiments and kind of reproduced this problem. In one terminal I ran
for i in $(seq $N); do cluster/kubectl.sh run-container pause$i --image=kubernetes/pause --port=1234 & done
and in a second terminal
time cluster/kubectl.sh get pods

for values of N from 10 to 100. get pods was running for 6-7 seconds on average, but the running time wasn't noticeably increasing with N (which I guess is interesting).

I'm not very sure of those results though, because kubectl was having problems starting/stopping more than ~30 pods in the background. It failed some starts (~5%) and a lot of stops. The number of failed stops increased drastically with N, which may be caused by cluster overload (I have 4 minions in it), but I'm not familiar enough with the architecture to know whether that's expected.

@gmarek
Contributor

gmarek commented Feb 12, 2015

I'm not sure I correctly understood the code, so please correct me if I'm wrong. My take is that this kind of behaviour is inherent to the current implementation. IIUC, 'get pods' calls the List method in pod/rest.go, which consists roughly of two steps: getting the list of pods from etcd by running ListPodsPredicate(..), and then calling GetPodStatus on the PodCache for each pod returned by List... (I wasn't able to find an implementation of the registry other than the etcd one).

I added additional logging and found out that the etcd queries are fast enough (~0.1-0.2 sec), but going through the cache is pretty slow (easily over 2 sec). I'm not sure what's taking so long, but I suspect some lock contention.
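
To make the two steps concrete, here is a minimal sketch of the flow as described above; the registry and podCache interfaces below are hypothetical stand-ins, not the actual pod/rest.go types:

package sketch

// Hypothetical stand-ins for the etcd-backed registry and the PodCache.
type pod struct {
    Name   string
    Status string
}

type registry interface {
    ListPods() ([]pod, error) // step 1: one list read from etcd
}

type podCache interface {
    GetPodStatus(name string) (string, error) // step 2: one lookup per pod
}

// listPods mirrors the described behaviour: a single etcd list followed by a
// per-pod status lookup. If each lookup is slow or serialized behind a lock,
// total latency grows with the number of pods being listed.
func listPods(r registry, c podCache) ([]pod, error) {
    pods, err := r.ListPods()
    if err != nil {
        return nil, err
    }
    for i := range pods {
        status, err := c.GetPodStatus(pods[i].Name)
        if err != nil {
            status = "Unknown"
        }
        pods[i].Status = status
    }
    return pods, nil
}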

@ghost self-assigned this Feb 16, 2015
@ghost

ghost commented Feb 17, 2015

Get pods generally has a HUGE variance in running time (easily from 1.5 to 6 sec) on the same cluster state, so I don't think the create/get interaction is to blame; the get itself is problematic.

Profiling the apiserver does not show anything interesting, as roughly all the CPU time is spent in JSON handling.

I analyzed logs with additional debug info added, and now, contrary to my previous comment, found slow etcd reads:

W0217 12:06:11.804762    5018 rest.go:134] Listing pods 1795005581: run-container=test23, 
W0217 12:06:11.804777    5018 etcd.go:114] Entering list pods: 619754496
W0217 12:06:11.804783    5018 etcd.go:117] Extracting to list: 619754496
W0217 12:06:11.804789    5018 etcd_tools.go:222] Starting ExtractToList: /registry/pods/default <- 186049545
W0217 12:06:11.804797    5018 etcd_tools.go:225] GotItemsPtr: 186049545
W0217 12:06:11.804803    5018 etcd_tools.go:172] Starting list extract for /registry/pods/default: 646422472
...
W0217 12:06:12.313804    5018 etcd_tools.go:174] After listetcdnode: 646422472
W0217 12:06:12.313850    5018 etcd_tools.go:181] Returning %v: %d
[0xc2094f8060 0xc2094f80c0 0xc2094f8120 0xc2094f8180 0xc2094f81e0 0xc2094f8240 0xc2094f82a0 0xc2094f8300 0xc2094f8360 0xc2094f83c0 0xc2094f8420 0xc2094f8480 0xc2094f84e0 0xc2094f8540 0xc2094f85a0 0xc2094f8600 0xc2094f8660 0xc2094f86c0 0xc2094f8720 0xc2094f8780 0xc2094f87e0 0xc2094f8840 0xc2094f88a0 0xc2094f8900 0xc2094f8960 0xc2094f89c0 0xc2094f8a20 0xc2094f8a80 0xc2094f8ae0 0xc2094f8b40 0xc2094f8ba0 0xc2094f8c00 0xc2094f8c60 0xc2094f8cc0 0xc2094f8d20 0xc2094f8d80 0xc2094f8de0 0xc2094f8e40 0xc2094f8ea0 0xc2094f8f00 0xc2094f8f60 0xc2094f8fc0 0xc2094f9020 0xc2094f9080 0xc2094f90e0 0xc2094f9140 0xc2094f91a0 0xc2094f9200 0xc2094f9260 0xc2094f92c0 0xc2094f9320 0xc2094f9380] 646422472
W0217 12:06:12.313893    5018 etcd_tools.go:188] Starting decode... 575711784
W0217 12:06:12.313902    5018 etcd_tools.go:194] Enforced...575711784
...
W0217 12:06:14.515260    5018 etcd_tools.go:214] Decoded...575711784: processed 52 nodes in %!l(int64=2201345705)d ns
W0217 12:06:14.515317    5018 etcd_tools.go:232] Extrated list: 186049545
W0217 12:06:14.515344    5018 etcd.go:119] Extracted to list: 619754496
W0217 12:06:14.515528    5018 etcd.go:130] Returning: 619754496
W0217 12:06:14.515538    5018 rest.go:136] Got pods 1795005581: &{{ } { 813} [{{ } {test23-febia test23- default  57614b98-b69d-11e4-86bb-42010af04153 518 2015-02-17 12:06:01 +0000 UTC map[run-container:test23] map[]} {[] [{test23 kubernetes/pause []  [{ 0 12345 TCP }] [] {map[]} [] <nil> <nil> <nil> /dev/termination-log false IfNotPresent {[] []}}] {0x14a5590 <nil> <nil>} ClusterFirst map[] } {Pending []  kubernetes-minion-4.c.gmarek-kubernetes.internal   map[]}}]}

This shows that we lose most of the time getting individual pods in the for loop in the decodeNodeList(...) method in etcd_tools.go. On average it took 0.042 sec per node, and it processed 52 of them, which accounts for the 2 sec duration.

This leaves us with a couple of questions: why were there 52 nodes to process, can the loop be parallelized, and why did it take 42 ms to process a single node?
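
On the parallelization question, a rough sketch of what fanning the loop out per node could look like; decodeNode, etcdNode and pod are hypothetical stand-ins for the real etcd_tools.go code, and this only helps if the per-node work is not already serialized behind a shared lock or a single OS thread:

package sketch

import "sync"

type etcdNode struct{ Value string }
type pod struct{ Name string }

// decodeNode stands in for the per-node work (JSON decode, status fill-in).
func decodeNode(n *etcdNode) (pod, error) {
    return pod{}, nil
}

// decodeNodesParallel decodes every node in its own goroutine instead of the
// sequential for loop, then returns the first error encountered, if any.
func decodeNodesParallel(nodes []*etcdNode) ([]pod, error) {
    out := make([]pod, len(nodes))
    errs := make([]error, len(nodes))
    var wg sync.WaitGroup
    for i := range nodes {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            out[i], errs[i] = decodeNode(nodes[i])
        }(i)
    }
    wg.Wait()
    for _, err := range errs {
        if err != nil {
            return nil, err
        }
    }
    return out, nil
}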

@ghost

ghost commented Feb 18, 2015

Someone brought to my attention that by default on GCE we run on a single thread. This means that even if there are multiple HTTP servers running in separate goroutines, they physically run interleaved on a single OS thread. Doing a few things at the same time therefore has to make everything noticeably slower, though maybe not to this magnitude. But if we add the fact that a lot of other things (etcd, replication controllers) run on the same machine, it starts to add up.

I ran an experiment with a 4-thread master, which reduced latency by ~33% for pretty much everything I checked, but still, once in a while, some calls took much longer to finish.
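
For reference, a minimal sketch of bumping the thread count; this is just the stock Go runtime API, and the actual master flags/wiring may differ:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Go runtimes of this era default GOMAXPROCS to 1, so goroutines are
    // multiplexed onto a single OS thread unless it is raised explicitly
    // (or the GOMAXPROCS environment variable is set before starting).
    runtime.GOMAXPROCS(runtime.NumCPU())
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}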

@bgrant0607
Member

Yes, we should run with multiple threads.

Also, rather than just profiling CPU time, it may be more useful to profile blocking time.

go tool pprof http://localhost:8080/debug/pprof/block
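
Note that the block profile is empty unless the server opts in; a minimal sketch of the wiring needed, assuming a bare HTTP server rather than the real apiserver setup:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
    "runtime"
)

func main() {
    // Sample every blocking event so /debug/pprof/block has data to report.
    runtime.SetBlockProfileRate(1)
    log.Fatal(http.ListenAndServe(":8080", nil))
}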

@lavalamp
Member

Yeah, I suspect the problem is mostly contention for a lock or etcd or RPCs.

Adding a pod causes multiple etcd reads/writes, as the scheduler will be watching and will (hopefully) quickly bind the pod.
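
To illustrate the fan-out, a rough sketch with a hypothetical key-value store interface (not the real registry or scheduler code): one create from the apiserver turns into a watch event, at least one read, and a binding write from the scheduler.

package sketch

// store is a hypothetical stand-in for the etcd-backed registry.
type store interface {
    Create(key, value string) error    // apiserver: write the new pod
    Watch(prefix string) <-chan string // scheduler: observe new pod keys
    Get(key string) (string, error)    // scheduler: read pod/cluster state
    Set(key, value string) error       // scheduler: write the binding
}

// scheduleLoop shows why each created pod costs several etcd round trips
// on top of the original create.
func scheduleLoop(s store, pickNode func(podKey string) string) {
    for key := range s.Watch("/registry/pods/") {
        if _, err := s.Get(key); err != nil {
            continue
        }
        _ = s.Set(key+"/binding", pickNode(key))
    }
}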

@gmarek
Contributor

gmarek commented Feb 19, 2015

I looked at contention profiles but found nothing interesting.

@lavalamp
Member

lavalamp commented Mar 3, 2015

This would be good to add to our metrics for #4521
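
A minimal sketch of the kind of per-verb latency measurement meant here, using only the standard library; the handler names and paths are made up, and the real metrics work in #4521 would presumably export histograms rather than log lines:

package main

import (
    "log"
    "net/http"
    "time"
)

// withLatency wraps a handler and records how long each request took.
func withLatency(verb string, h http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        h.ServeHTTP(w, r)
        log.Printf("%s %s took %v", verb, r.URL.Path, time.Since(start))
    })
}

func main() {
    listPods := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("pods\n")) // stand-in for the real LIST /pods handler
    })
    http.Handle("/api/v1beta1/pods", withLatency("LIST", listPods))
    log.Fatal(http.ListenAndServe(":8080", nil))
}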

@davidopp added this to the v1.0 milestone Mar 19, 2015
@rjnagal self-assigned this Mar 20, 2015
@gmarek
Contributor

gmarek commented Mar 20, 2015

FYI: I tried to reproduce this problem last week and failed.

@lavalamp
Member

I expect that this is fixed now that we removed boundPods. I'll close this since gmarek couldn't repro.
