Test single minion with many pods. #4119
I will start looking into it.
As a quick update - I was experimenting with resizing a ReplicationController from 0 to 50 today. My experiments show that the whole operation takes ~2 minutes, but after <10 seconds all the pods are reported as pending (i.e. scheduled). So I will continue investigating (tomorrow) what exactly takes time to start them on the node.
See the other "status is flaky" thread for 2 leads on docker problems.
Sure - I'm not going to experiment with more than 50 pods. But even with 50 pods, the current results aren't satisfactory, I believe...
Sure, I believe that, I just wanted to level-set :)
@wojtek-t: I was also looking at scaling until I got deflected with a P1 Fixit (I hope to emerge from the Fixit this week sometime). It might be a good idea to synchronize about experiments etc. and avoid duplicating work? Most of my effort up to the end of last year went into just creating large clusters, but clearly we can identify plenty of issues using very small clusters and many pods.
@satnam6502 SGTM
Excellent! I've been wondering about setting up some kind of page/doc for tracking how performance evolves as we make changes and it would be good to have some kind of baseline to compare to. I think several of us have done some ad hoc experiments but it would be good to devise a systematic set of experiments/measurements and then track them (e.g. like what you report in this thread about the time taken to launch pods).
Yes - I think it would be useful. I will set up a meeting.
Or we can start working on a doc and check it in somewhere in the repo so everyone else can follow along and offer suggestions?
I've spent a significant amount of time looking into the performance problem today and already have some findings. Basically, I continued investigating where the time is spent with the experiment of resizing a ReplicationController from 0 to 50. I think that what we should do here is try to batch the incoming pods whenever we can. In other words, if there is more than one pod in the "updates" channel waiting for processing, I think we shouldn't process them one by one, but try to process all of them at the same time.
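A rough sketch of what that batching could look like, assuming the updates arrive on a Go channel (the `PodUpdate` type and channel here are made up for illustration, not the actual kubelet types):

```go
package main

import "fmt"

// PodUpdate is a stand-in for whatever update type flows through the
// kubelet's updates channel; the real type will differ.
type PodUpdate struct {
	PodName string
}

// drainUpdates blocks for one update, then greedily drains anything else
// already sitting in the channel so the whole batch can be handled in one
// pass instead of pod by pod.
func drainUpdates(updates <-chan PodUpdate) []PodUpdate {
	batch := []PodUpdate{<-updates}
	for {
		select {
		case u := <-updates:
			batch = append(batch, u)
		default:
			return batch
		}
	}
}

func main() {
	updates := make(chan PodUpdate, 10)
	for i := 0; i < 5; i++ {
		updates <- PodUpdate{PodName: fmt.Sprintf("pod-%d", i)}
	}
	fmt.Printf("processing %d pods in one pass\n", len(drainUpdates(updates)))
}
```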
An option to improve the latency of GetKubeletDockerContainers is to query cAdvisor instead of Docker for the containers. It should have the containers within milliseconds of the container being created by Docker. I'm not sure it will matter that it is not the "authoritative" source since even today there is a race between listing and the world changing under us.
Interesting @vmarmol: so basically we can view cAdvisor as being a cache for such information? We could then use it for rapid access to lots of information?
That's what I had in mind @satnam6502. I need to experiment with whether the "staleness" of the data affects us much, but I don't think it will.
I'll try to experiment with this. Unless @wojtek-t would rather do that part too :)
Very exciting. Let us know how we can help!
Isn't this sort of a hack? Like, shouldn't kubelet itself cache this?
We certainly can cache this information in the Kubelet, and it will only be slightly more stale than cAdvisor's. Whether cAdvisor is optional is a discussion on its own, but I don't think it actually is. We're baking in a lot of resource awareness at many levels and we won't be able to do any of that without cAdvisor. I just don't think we've officially accepted that yet :) You can argue that it is still optional with a degraded feature set, but that may be hard to test for, and at what point is the degraded feature set not worthwhile?
@wojtek-t: Is it the docker daemon that is causing such high latency?
From looking at GetKubeletDockerContainers, it looks like the only real work we do is query Docker. It does seem unusual that the latency is so high, but I've observed it before.
@vishh: I think that vmarmol@'s post already answered your question. @davidopp: yes, I think that's definitely related. In fact, when I was doing my experiments, usually all the pods (i.e. 50 of them) were pending within a few seconds (say 5-10), and the remaining minute or more was spent on starting them on the Kubelet. So yes - I haven't experimented with scheduling latency so far, but my experiments show that it isn't the major factor in startup time.
Thanks for running those experiments @wojtek-t. This is looking like a promising first step :) I agree we need to examine our startup goals. If we need to download a Docker image, that itself will take a while and will dominate all other aspects. The Docker container startup itself is also time consuming. I measured ~250ms startup time from a scratch image (which should be a best case). With 100 containers started serially, this alone is 25s. The problem you observed with SyncPods is what I was worried about when we introduced higher latency between us and the "source". I am surprised that Docker doesn't already include the container when "create" finishes though. Your suggested approach sounds good, but let's put it in the Kubelet rather than the cache since it will interact with update intervals and the like.
What does cadvisor do that this cache can't? I.e. why is it going to be …
As for knowing whether we already know a pod or not, could we (eventually) do something like:
// Track all running per-pod goroutines
func SyncPods(...) { ... }
func podManagerLoop(pod api.Pod, updates chan api.Pod) { ... }
It will still race if kubelet crashes in the middle of setting things up, …
@thockin the cache is getting the information straight from Docker (which has higher latency) and holding the data for some time window. The result you get will be as old as that time window. cAdvisor observes the state straight from the kernel and is informed of changes within milliseconds of their occurrence. Thus that time window is much smaller. Given a small enough time window the advantage does not matter. Thread-per-pod looks good :) The case of the Kubelet crashing should be okay since we re-examine all our state when we come back up anyways.
Can't you bundle the cAdvisor "watch the kernel" logic into a reusable pkg?
I mean we can package all of cAdvisor and run it inside Kubelet :) I'm exaggerating a bit, but I just mean that at some point we need a clear separation of responsibilities. We'll need to decide where that is. That's probably out of scope for this issue though.
It just seems potentially useful to have a package that says "I know how to …"
I totally agree with Tim that having a package answering the above question would help with the problem we're facing here. Regarding the "thread-per-pod" idea, I also thought about something very similar at some point. But my fear was that if there are multiple updates to the same pod within a very short time (milliseconds), we would be processing all of them one by one. However, if we assume that there are already a few of them in the channel, it should be possible to process only the last one (i.e. the most recent one). But maybe this scenario is too unrealistic to worry about - what do you think?
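For the burst-of-updates case, a per-pod worker could coalesce by draining its channel and keeping only the most recent update before syncing; a rough sketch with made-up names (not the actual Kubelet code):

```go
package main

import (
	"fmt"
	"time"
)

// PodSpec is a placeholder for the desired state delivered to a pod worker.
type PodSpec struct {
	Name    string
	Version int
}

// podWorker is a hypothetical per-pod goroutine: it blocks for an update,
// then discards any older updates already queued and syncs only the latest.
func podWorker(updates <-chan PodSpec, done chan<- struct{}) {
	for latest := range updates {
	drain:
		for {
			select {
			case newer, ok := <-updates:
				if !ok {
					break drain
				}
				latest = newer
			default:
				break drain
			}
		}
		fmt.Printf("syncing %s at version %d\n", latest.Name, latest.Version)
	}
	done <- struct{}{}
}

func main() {
	updates := make(chan PodSpec, 16)
	done := make(chan struct{})
	go podWorker(updates, done)
	// Three rapid-fire updates; only the most recent one needs a full sync.
	for v := 1; v <= 3; v++ {
		updates <- PodSpec{Name: "nginx", Version: v}
	}
	time.Sleep(50 * time.Millisecond)
	close(updates)
	<-done
}
```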
Any one individual pod will most likely not see multiple updates within …
Thanks Victor. SGTM.
This is the same fundamental model that Omlet uses and was pretty carefully … @dchen1107 as final arbiter.
I couldn't understand why I missed this discussion in the first place. Thanks @wojtek-t for running those experiments and getting this detailed data. The idea of "thread-per-pod" should work, and it is what our internal system aims for to solve performance and scalability issues. Next step: PodManagerLoop could generate PodStatus and cache it somewhere; the Kubelet would periodically POST NodeStatus and PodStatus to the apiserver. One potential catch we might hit is that at a given time, multiple pods might pull the same image from the docker repository. Last time I measured, the latency of simultaneous pulls of the same image was significantly higher than a single pull. I didn't look into it deeply then, but that was a while back; docker might have fixed such an issue. I will measure it tomorrow.
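A rough sketch of that cache-then-push shape (everything here - the types, the cache, and postStatus - is a made-up placeholder, not the real Kubelet or apiserver client):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PodStatus is a placeholder for the real status object.
type PodStatus struct {
	Name  string
	Phase string
}

// statusCache collects the latest status produced by each per-pod loop.
type statusCache struct {
	mu       sync.Mutex
	statuses map[string]PodStatus
}

func (c *statusCache) Set(s PodStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.statuses[s.Name] = s
}

func (c *statusCache) Snapshot() []PodStatus {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]PodStatus, 0, len(c.statuses))
	for _, s := range c.statuses {
		out = append(out, s)
	}
	return out
}

// postStatus stands in for the periodic POST of statuses to the apiserver.
func postStatus(statuses []PodStatus) {
	fmt.Printf("posting %d pod statuses\n", len(statuses))
}

func main() {
	cache := &statusCache{statuses: map[string]PodStatus{}}
	// Per-pod loops would call Set as they observe changes.
	cache.Set(PodStatus{Name: "nginx", Phase: "Running"})

	// The push to the apiserver runs on its own timer, decoupled from the
	// per-pod loops that produce the statuses.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 2; i++ {
		<-ticker.C
		postStatus(cache.Snapshot())
	}
}
```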
Thanks @dchen1107 Regarding simultaneous pulls of the same image - I think it shouldn't make things worse (at least in the currently existing code). Basically, the "pullImage" Kubelet method is protected by "pullLock", so it ensures that we won't be pulling any two images simultaneously. We can leave this for now - what do you think?
So I will try to experiment with the "thread-per-pod" approach anyway. Will post to this thread when I have some findings.
To be clear, thread-per-pod should not be a critical goal by itself. If it …
Sure - but I think that thanks to it we can avoid starting unnecessary pods, which can significantly improve performance. I will write a prototype and try to measure the gains - if they aren't satisfactory I will just drop it (for now).
Docker does its own locking for parallel pulls so it shouldn't be a problem for now.
It's surprising that parallel pulls result in higher latency. As @vmarmol …
@vmarmol: Thanks! So do you suggest that "pullLock" in the Kubelet is completely useless now? If so I'm happy to remove it. Regarding the "thread-per-pod" discussion: I've implemented a very simple prototype of it (i.e. some unit tests are failing, etc.), but it works in the simple cases I'm experimenting with here. And (at least with this simple prototype) it seems to give us quite big gains. In particular, resizing the replication controller from 0 to 50 with the new code (and with the docker cache patched) now seems to finish within 35-40 seconds (which I think may be quite close to our goal).
Great news @wojtek-t! I'd be happy to see the PR whenever you have it ready. After reading the description for pullLock, it looks like we do need it, but it seems we've been using it wrong :-\ I'll go ahead and fix that today. The lock doesn't protect against multiple pulls at the same time, but rather against pulls and deletes racing.
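If that is the intent - pulls may overlap with each other but never with a delete - a reader/writer lock is one way to express it. This is a hedged sketch only, not the actual Kubelet code or a claim about what pullLock is today:

```go
package main

import (
	"fmt"
	"sync"
)

// imagePuller illustrates the pulls-vs-deletes distinction: many pulls may
// run concurrently (read lock), but a delete/GC pass excludes pulls (write lock).
type imagePuller struct {
	pullLock sync.RWMutex
}

func (p *imagePuller) Pull(image string) {
	p.pullLock.RLock() // pulls may overlap with each other
	defer p.pullLock.RUnlock()
	fmt.Println("pulling", image)
}

func (p *imagePuller) GarbageCollect() {
	p.pullLock.Lock() // but a delete pass runs exclusively
	defer p.pullLock.Unlock()
	fmt.Println("removing unused images")
}

func main() {
	p := &imagePuller{}
	var wg sync.WaitGroup
	for _, img := range []string{"nginx", "redis"} {
		wg.Add(1)
		go func(i string) {
			defer wg.Done()
			p.Pull(i)
		}(img)
	}
	wg.Wait()
	p.GarbageCollect()
}
```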
Is there more work to be done on this?
Yes - at least I need to add a unit test for podworkers which I didn't manage to do before my vacation. Will work on it this week.
From another thread, proposed by @thockin:
"Make a cluster of 1 minion, create an RC of 50, then ramp that up in increments of 50 until it breaks. Go figure out why it broke (I already mentioned 2 node-related issues in the other thread) and repeat until we can get 250 pods on a machine with none of them randomly crashing or getting killed. Then focus on making apiserver not suck for listing those 250. Then focus on making scheduler not suck for scheduling 250 new pods quickly."