get pods blocks 7-10 seconds if pods are being created at the same time #4196

Closed
alex-mohr opened this issue Feb 5, 2015 · 10 comments
Labels: priority/backlog (Higher priority than priority/awaiting-more-evidence.), sig/scalability (Categorizes an issue or PR as relevant to SIG Scalability.)

@alex-mohr
Contributor

If I fire off 20 create pod requests (in the background), then issue a get pods request, the api server becomes unresponsive for 7-10 seconds.

% time kubectl --server=https://130.211.162.155 --auth-path=/usr/local/google/home/mohr/.config/gcloud-dogfood/kubernetes/mohr-instance-manager.us-central1-f.prod10d/kubernetes_auth --insecure-skip-tls-verify=true get pods
POD IP CONTAINER(S) IMAGE(S) HOST LABELS STATUS
pause-pod-146e9dc7 pause-pod-146e9dc7 kubernetes/pause k8s-prod10d-node-1.c.mohr-instance-manager.internal/ name=pause-pod-146e9dc7 Unknown
pause-pod-39afa37d 10.32.2.89 pause-pod-39afa37d kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-39afa37d Pending
pause-pod-43531a69 10.32.1.90 pause-pod-43531a69 kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-43531a69 Pending
pause-pod-4d6b99cb 10.32.0.78 pause-pod-4d6b99cb kubernetes/pause k8s-prod10d-node-1.c.mohr-instance-manager.internal/104.154.60.240 name=pause-pod-4d6b99cb Pending
pause-pod-66cb6cc0 10.32.0.80 pause-pod-66cb6cc0 kubernetes/pause k8s-prod10d-node-1.c.mohr-instance-manager.internal/104.154.60.240 name=pause-pod-66cb6cc0 Pending
pause-pod-96aaba36 10.32.1.88 pause-pod-96aaba36 kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-96aaba36 Pending
pause-pod-97b03e05 10.32.2.93 pause-pod-97b03e05 kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-97b03e05 Running
pause-pod-b9709ceb pause-pod-b9709ceb kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-b9709ceb Unknown
pause-pod-c752b5c2 pause-pod-c752b5c2 kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-c752b5c2 Unknown
pause-pod-eb5ecc36 10.32.1.91 pause-pod-eb5ecc36 kubernetes/pause k8s-prod10d-node-2.c.mohr-instance-manager.internal/130.211.164.97 name=pause-pod-eb5ecc36 Running
pause-pod-ebd98025 10.32.1.92 pause-pod-ebd98025 kubernetes/pause /130.211.164.97 name=pause-pod-ebd98025 Pending
pause-pod-f5aeaee3 10.32.2.92 pause-pod-f5aeaee3 kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-f5aeaee3 Running
pause-pod-f698b56f 10.32.2.90 pause-pod-f698b56f kubernetes/pause k8s-prod10d-node-3.c.mohr-instance-manager.internal/146.148.87.193 name=pause-pod-f698b56f Running
skydns-fplln 10.32.0.3 etcd quay.io/coreos/etcd:latest k8s-prod10d-node-1.c.mohr-instance-manager.internal/104.154.60.240 k8s-app=skydns Running
kube2sky kubernetes/kube2sky:1.0
skydns kubernetes/skydns:2014-12-23-001
0.038u 0.011s 0:07.63 0.5% 0+0k 0+0io 0pf+0w
(or http://pastebin.com/Z5JipRhB)

If I issue a "get pod skydns-fplln", the api request doesn't seem to block.

@alex-mohr added the priority/backlog, sig/scalability, and team/master labels Feb 5, 2015
@gmarek
Contributor

gmarek commented Feb 10, 2015

I ran some experiments and kind of reproduced this problem. In one terminal I ran
for i in $(seq $N); do cluster/kubectl.sh run-container pause$i --image=kubernetes/pause --port=1234 & done
and in a second terminal
time cluster/kubectl.sh get pods

for values of N from 10 to 100. get pods was running for 6-7 seconds on average, but the running time wasn't noticeably increasing with N (which I guess is interesting).

I'm not very sure of those results though, because kubectl was having problems starting/stopping more than ~30 pods in the background. It failed some starts (~5%) and a lot of stops. The number of failed stops increased drastically with N, which may be caused by cluster overload (I have 4 minions in it), but I'm not familiar enough with the architecture to know whether that's expected.

@gmarek
Contributor

gmarek commented Feb 12, 2015

I'm not sure I correctly understood the code, so please correct me if I'm wrong. My take is that this kind of behaviour is inherent to the current implementation. IIUC, 'get pods' calls the List method in pod/rest.go, which consists roughly of two steps: getting the list of pods from etcd by running ListPodsPredicate(..), and then calling GetPodStatus on the PodCache for each pod returned by List... (I wasn't able to find an implementation of the registry other than the etcd one).

I added additional logging and found out that the etcd queries are fast enough (~0.1-0.2 sec), but going through the cache is pretty slow (easily over 2 sec). I'm not sure what's taking so long, but I suspect some lock contention.
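
To make the two steps concrete, here is a minimal sketch of the flow as described above; the registry and podCache interfaces below are hypothetical stand-ins, not the actual pod/rest.go types:

package sketch

// Hypothetical stand-ins for the etcd-backed registry and the PodCache.
type pod struct {
    Name   string
    Status string
}

type registry interface {
    ListPods() ([]pod, error) // step 1: one list read from etcd
}

type podCache interface {
    GetPodStatus(name string) (string, error) // step 2: one lookup per pod
}

// listPods mirrors the described behaviour: a single etcd list followed by a
// per-pod status lookup. If each lookup is slow or serialized behind a lock,
// total latency grows with the number of pods being listed.
func listPods(r registry, c podCache) ([]pod, error) {
    pods, err := r.ListPods()
    if err != nil {
        return nil, err
    }
    for i := range pods {
        status, err := c.GetPodStatus(pods[i].Name)
        if err != nil {
            status = "Unknown"
        }
        pods[i].Status = status
    }
    return pods, nil
}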

@ghost self-assigned this Feb 16, 2015
@ghost

ghost commented Feb 17, 2015

Get pods generally has a HUGE variance in running time (easily from 1.5 to 6 sec) on the same cluster state, so I don't think the create/get interaction is to blame; the get itself is problematic.

Profiling the apiserver does not show anything interesting, as roughly all the CPU time is spent in JSON handling.

I analyzed logs with additional debug info added, and now, contrary to my previous comment, found slow etcd reads:

W0217 12:06:11.804762    5018 rest.go:134] Listing pods 1795005581: run-container=test23, 
W0217 12:06:11.804777    5018 etcd.go:114] Entering list pods: 619754496
W0217 12:06:11.804783    5018 etcd.go:117] Extracting to list: 619754496
W0217 12:06:11.804789    5018 etcd_tools.go:222] Starting ExtractToList: /registry/pods/default <- 186049545
W0217 12:06:11.804797    5018 etcd_tools.go:225] GotItemsPtr: 186049545
W0217 12:06:11.804803    5018 etcd_tools.go:172] Starting list extract for /registry/pods/default: 646422472
...
W0217 12:06:12.313804    5018 etcd_tools.go:174] After listetcdnode: 646422472
W0217 12:06:12.313850    5018 etcd_tools.go:181] Returning %v: %d
[0xc2094f8060 0xc2094f80c0 0xc2094f8120 0xc2094f8180 0xc2094f81e0 0xc2094f8240 0xc2094f82a0 0xc2094f8300 0xc2094f8360 0xc2094f83c0 0xc2094f8420 0xc2094f8480 0xc2094f84e0 0xc2094f8540 0xc2094f85a0 0xc2094f8600 0xc2094f8660 0xc2094f86c0 0xc2094f8720 0xc2094f8780 0xc2094f87e0 0xc2094f8840 0xc2094f88a0 0xc2094f8900 0xc2094f8960 0xc2094f89c0 0xc2094f8a20 0xc2094f8a80 0xc2094f8ae0 0xc2094f8b40 0xc2094f8ba0 0xc2094f8c00 0xc2094f8c60 0xc2094f8cc0 0xc2094f8d20 0xc2094f8d80 0xc2094f8de0 0xc2094f8e40 0xc2094f8ea0 0xc2094f8f00 0xc2094f8f60 0xc2094f8fc0 0xc2094f9020 0xc2094f9080 0xc2094f90e0 0xc2094f9140 0xc2094f91a0 0xc2094f9200 0xc2094f9260 0xc2094f92c0 0xc2094f9320 0xc2094f9380] 646422472
W0217 12:06:12.313893    5018 etcd_tools.go:188] Starting decode... 575711784
W0217 12:06:12.313902    5018 etcd_tools.go:194] Enforced...575711784
...
W0217 12:06:14.515260    5018 etcd_tools.go:214] Decoded...575711784: processed 52 nodes in %!l(int64=2201345705)d ns
W0217 12:06:14.515317    5018 etcd_tools.go:232] Extrated list: 186049545
W0217 12:06:14.515344    5018 etcd.go:119] Extracted to list: 619754496
W0217 12:06:14.515528    5018 etcd.go:130] Returning: 619754496
W0217 12:06:14.515538    5018 rest.go:136] Got pods 1795005581: &{{ } { 813} [{{ } {test23-febia test23- default  57614b98-b69d-11e4-86bb-42010af04153 518 2015-02-17 12:06:01 +0000 UTC map[run-container:test23] map[]} {[] [{test23 kubernetes/pause []  [{ 0 12345 TCP }] [] {map[]} [] <nil> <nil> <nil> /dev/termination-log false IfNotPresent {[] []}}] {0x14a5590 <nil> <nil>} ClusterFirst map[] } {Pending []  kubernetes-minion-4.c.gmarek-kubernetes.internal   map[]}}]}

This shows that we lose most of the time getting individual pods in the for loop in the decodeNodeList(...) method in etcd_tools.go. On average it took 0.042 sec per node, and it processed 52 of them, which accounts for the 2 sec duration.

This leaves us with a couple of questions: why were there 52 nodes to process, can the loop be parallelized, and why did it take 42 ms to process a single node?
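
On the parallelization question, a rough sketch of what fanning the loop out per node could look like; decodeNode, etcdNode and pod are hypothetical stand-ins for the real etcd_tools.go code, and this only helps if the per-node work is not already serialized behind a shared lock or a single OS thread:

package sketch

import "sync"

type etcdNode struct{ Value string }
type pod struct{ Name string }

// decodeNode stands in for the per-node work (JSON decode, status fill-in).
func decodeNode(n *etcdNode) (pod, error) {
    return pod{}, nil
}

// decodeNodesParallel decodes every node in its own goroutine instead of the
// sequential for loop, then returns the first error encountered, if any.
func decodeNodesParallel(nodes []*etcdNode) ([]pod, error) {
    out := make([]pod, len(nodes))
    errs := make([]error, len(nodes))
    var wg sync.WaitGroup
    for i := range nodes {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            out[i], errs[i] = decodeNode(nodes[i])
        }(i)
    }
    wg.Wait()
    for _, err := range errs {
        if err != nil {
            return nil, err
        }
    }
    return out, nil
}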

@ghost

ghost commented Feb 18, 2015

Someone brought to my attention that by default on GCE we run on a single thread. This means that even if there are multiple HTTP servers running in separate goroutines, they physically run interleaved on a single OS thread. Doing a few things at the same time therefore has to make everything noticeably slower, though maybe not to this magnitude. But if we add the fact that a lot of other things (etcd, replication controllers) run on the same machine, it starts to add up.

I ran an experiment with a 4-thread master, which reduced latency by ~33% for pretty much everything I checked, but still, once in a while, some calls took much longer to finish.
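
For reference, a minimal sketch of bumping the thread count; this is just the stock Go runtime API, and the actual master flags/wiring may differ:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Go runtimes of this era default GOMAXPROCS to 1, so goroutines are
    // multiplexed onto a single OS thread unless it is raised explicitly
    // (or the GOMAXPROCS environment variable is set before starting).
    runtime.GOMAXPROCS(runtime.NumCPU())
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}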

@bgrant0607
Member

Yes, we should run with multiple threads.

Also, rather than just profiling CPU time, it may be more useful to profile blocking time.

go tool pprof http://localhost:8080/debug/pprof/block
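
Note that the block profile is empty unless the server opts in; a minimal sketch of the wiring needed, assuming a bare HTTP server rather than the real apiserver setup:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
    "runtime"
)

func main() {
    // Sample every blocking event so /debug/pprof/block has data to report.
    runtime.SetBlockProfileRate(1)
    log.Fatal(http.ListenAndServe(":8080", nil))
}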

@lavalamp
Member

Yeah, I suspect the problem is mostly contention for a lock or etcd or RPCs.

Adding a pod causes multiple etcd reads/writes, as the scheduler will be watching and will (hopefully) quickly bind the pod.
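
To illustrate the fan-out, a rough sketch with a hypothetical key-value store interface (not the real registry or scheduler code): one create from the apiserver turns into a watch event, at least one read, and a binding write from the scheduler.

package sketch

// store is a hypothetical stand-in for the etcd-backed registry.
type store interface {
    Create(key, value string) error    // apiserver: write the new pod
    Watch(prefix string) <-chan string // scheduler: observe new pod keys
    Get(key string) (string, error)    // scheduler: read pod/cluster state
    Set(key, value string) error       // scheduler: write the binding
}

// scheduleLoop shows why each created pod costs several etcd round trips
// on top of the original create.
func scheduleLoop(s store, pickNode func(podKey string) string) {
    for key := range s.Watch("/registry/pods/") {
        if _, err := s.Get(key); err != nil {
            continue
        }
        _ = s.Set(key+"/binding", pickNode(key))
    }
}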

@gmarek
Contributor

gmarek commented Feb 19, 2015

I looked at contention profiles but found nothing interesting.

@lavalamp
Member

lavalamp commented Mar 3, 2015

This would be good to add to our metrics for #4521
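
A minimal sketch of the kind of per-verb latency measurement meant here, using only the standard library; the handler names and paths are made up, and the real metrics work in #4521 would presumably export histograms rather than log lines:

package main

import (
    "log"
    "net/http"
    "time"
)

// withLatency wraps a handler and records how long each request took.
func withLatency(verb string, h http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        h.ServeHTTP(w, r)
        log.Printf("%s %s took %v", verb, r.URL.Path, time.Since(start))
    })
}

func main() {
    listPods := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("pods\n")) // stand-in for the real LIST /pods handler
    })
    http.Handle("/api/v1beta1/pods", withLatency("LIST", listPods))
    log.Fatal(http.ListenAndServe(":8080", nil))
}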

@davidopp added this to the v1.0 milestone Mar 19, 2015
@rjnagal self-assigned this Mar 20, 2015
@gmarek
Contributor

gmarek commented Mar 20, 2015

FYI: I tried to reproduce this problem last week and failed.

@lavalamp
Member

I expect that this is fixed now that we removed boundPods. I'll close this since gmarek couldn't repro.
