Test single minion with many pods. #4119

Closed
dchen1107 opened this issue Feb 4, 2015 · 48 comments

@dchen1107
Member

From another thread proposed by @thockin:

"make a cluster of 1 minion, create
an RC of 50, then ramp that up in increments of 50 until it breaks.
Go figure out why it broke (I already mentioned 2 node-related issues
in the other thread) and repeat until we can get 250 pods on a machine
with none of them randomly crashing or getting killed.

Then focus on making apiserver not suck for listing those 250.

Then focus on making scheduler not suck for scheduling 250 new pods quickly.

"

@dchen1107 dchen1107 added area/test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. area/reliability labels Feb 4, 2015
@wojtek-t wojtek-t self-assigned this Feb 16, 2015
@wojtek-t
Member

I will start looking into it.

@wojtek-t
Member

As a quick update - I was experimenting with resizing a ReplicationController from 0 to 50 today. My experiments show that the whole operation takes ~2 minutes, but after <10 seconds all the pods are reported as pending (i.e. scheduled). So I will continue investigating (tomorrow) what exactly takes so long to start them on the node.

@thockin
Member

thockin commented Feb 16, 2015

See the other "status is flaky" thread for 2 leads on docker problems. Note
that our explicit goal is 30 - 50 pods per machine, for now. More than
that is a waste of time - file bugs and move on.

@wojtek-t
Member

Sure - I'm not going to experiment with more than 50 pods. But even with 50 pods, the current results aren't satisfying, I believe...

@thockin
Member

thockin commented Feb 16, 2015

Sure, I believe that, I just wanted to level-set :)

@satnam6502
Contributor

@wojtek-t : I was also looking at scaling until I got deflected with a P1 Fixit (I hope to emerge from the Fixit this week sometime). It might be a good idea to synchronize about experiments etc. and avoid duplicating work? Most of my effort up to the end of last year went into just creating large clusters, but clearly we can identify plenty of issues using very small clusters and many pods.

@wojtek-t
Member

@satnam6502 SGTM

@satnam6502
Contributor

Excellent! I've been wondering about setting up some kind of page/doc for tracking how performance evolves as we make changes and it would be good to have some kind of baseline to compare to. I think several of us have done some ad hoc experiments but it would be good to devise a systematic set of experiments/measurements and then track them (e.g. like what you report in this thread about the time taken to launch pods).

@wojtek-t
Member

Yes - I think it would be useful. I will set up a meeting.

@satnam6502
Contributor

Or we can start working on a doc and check it in somewhere in the repo so everyone else can follow along and offer suggestions?

@wojtek-t
Member

I've spent a significant amount of time looking into the performance problem today and already have some findings. Basically, I continued investigating where the time is spent in the experiment of resizing a ReplicationController from 0 to 50.
It seems that the majority of wall time (more than 90%) on the Kubelet side is spent in the GetKubeletDockerContainers function (i.e. on getting information about containers from Docker). I've seen examples where a single call takes up to 2 seconds.
This wouldn't be a huge problem by itself (I think we can afford 2 seconds of latency in starting containers), but the real problem is how it interacts with the syncLoop Kubelet function. That loop reads new pods from the "updates" channel one by one and after each read performs SyncPods (which internally calls GetKubeletDockerContainers).
This means that when I start 50 pods at the same time on the same Kubelet, they will be started one by one, so by the time the last one starts, GetKubeletDockerContainers will have been called 50 times.

I think what we should do here is try to batch the incoming pods whenever we can. In other words, if there is more than one pod waiting in the "updates" channel, we shouldn't process them one by one, but process all of them at the same time (see the sketch below).
I will try to prepare a PR for it (but it will probably happen tomorrow).
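
A minimal sketch of that batching idea, with hypothetical names (PodUpdate, syncLoopBatched, syncAll) standing in for the real Kubelet types and functions: block for one update, then drain everything already queued before doing a single sync (and hence a single expensive Docker listing).

package kubelet

// PodUpdate is a stand-in for the Kubelet's real update type (hypothetical).
type PodUpdate struct{ Name string }

// syncLoopBatched waits for one update, drains whatever else is already
// queued, and runs a single sync pass for the whole batch.
func syncLoopBatched(updates <-chan PodUpdate, syncAll func([]PodUpdate)) {
	for {
		batch := []PodUpdate{<-updates} // wait for the first update
	drain:
		for {
			select {
			case u := <-updates:
				batch = append(batch, u) // pick up everything already waiting
			default:
				break drain
			}
		}
		syncAll(batch) // one SyncPods-style pass for all batched pods
	}
}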

@vmarmol
Contributor

vmarmol commented Feb 17, 2015

An option to improve the latency of GetKubeletDockerContainers is to query cAdvisor instead of Docker for the containers. It should have the containers within milliseconds of the container being created by Docker. I'm not sure it will matter that it is not the "authoritative" source since even today there is a race between listing and the world changing under us.

@satnam6502
Contributor

Interesting @vmarmol : so basically we can view cAdvisor as being a cache for such information? We could then use it for rapid access to lots of information?

@vmarmol
Contributor

vmarmol commented Feb 17, 2015

That's what I had in mind @satnam6502. I need to experiment whether the "staleness" of the data affects us much, but I don't think it will.

@vmarmol
Contributor

vmarmol commented Feb 17, 2015

I'll try to experiment with this. Unless @wojtek-t would rather do that part too :)

@satnam6502
Contributor

Very exciting. Let us know how we can help!

@thockin
Member

thockin commented Feb 17, 2015

Isn't this sort of a hack? Like, shouldn't kubelet itself cache this, rather than depending on cadvisor (which is strictly optional, is it not?)


@vmarmol
Contributor

vmarmol commented Feb 17, 2015

We certainly can cache this information in the Kubelet and this will only be slightly more stale than cAdvisor's.

Whether cAdvisor is optional is a discussion on its own, but I don't think it actually is. We're baking in a lot of resource awareness at many levels and we won't be able to do any of that without cAdvisor. I just don't think we've officially accepted that yet :) You can argue that it is still optional with a degraded feature set, but that may be hard to test for, and at some point the degraded feature set isn't worthwhile.
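
For reference, a hypothetical sketch of what a small Kubelet-side cache could look like (names like cachedContainerLister and listFromDocker are made up for illustration): the expensive Docker listing runs at most once per TTL window, and callers within that window share a result that is at most that old.

package kubelet

import (
	"sync"
	"time"
)

// Container is a stand-in for the Kubelet's container type (hypothetical).
type Container struct{ ID, Name string }

// cachedContainerLister serves the last Docker listing for up to ttl, so N
// pod syncs within that window cost one Docker round trip instead of N.
type cachedContainerLister struct {
	mu             sync.Mutex
	ttl            time.Duration
	lastUpdated    time.Time
	cached         []Container
	listFromDocker func() ([]Container, error) // the slow, authoritative call (hypothetical)
}

func (c *cachedContainerLister) List() ([]Container, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached != nil && time.Since(c.lastUpdated) < c.ttl {
		return c.cached, nil // data is at most ttl old
	}
	containers, err := c.listFromDocker()
	if err != nil {
		return nil, err
	}
	c.cached, c.lastUpdated = containers, time.Now()
	return containers, nil
}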

@vishh
Contributor

vishh commented Feb 17, 2015

@wojtek-t: Is it the docker daemon that is causing such high latency?

@vmarmol
Contributor

vmarmol commented Feb 17, 2015

From looking at GetKubeletDockerContainers, it looks like the only real work we do is querying Docker. It does seem unusual that the latency is so high, but I've observed it before.

@davidopp
Member

I wonder if this is related to #3954.
Although that issue talks about scheduler throughput, it seems that what Alex observed was end-to-end scheduling time, so the delay may be related to this issue and not actually in the scheduler. @wojtek-t what do you think?

@wojtek-t
Member

@vishh: I think @vmarmol's post above already answered your question.

@davidopp: yes, I think that's definitely related. In fact, when I was doing my experiments, usually all the pods (i.e. 50 of them) were pending within a few seconds (say 5-10), and the remaining minute or more was spent starting them on the Kubelet. So yes - I haven't experimented with scheduling latency so far, but my experiments show it isn't the major factor in startup time.

@wojtek-t
Member

@thockin @vmarmol: I'm happy to work on adding any kind of cache, but I probably need a hint on whether I should proceed with cAdvisor or create some custom cache in the Kubelet. What do you think?

@vmarmol
Contributor

vmarmol commented Feb 19, 2015

Thanks for running those experiments @wojtek-t. This is looking like a promising first step :)

I agree we need to examine our startup goals. If we need to download a Docker image, that by itself will take a while and will dominate all other aspects. The Docker container startup itself is also time-consuming. I measured ~250ms startup time from a scratch image (which should be the best case). With 100 containers started serially, that alone is 25s.

The problem you observed with SyncPods is what I was worried about when we introduced higher latency between us and the "source". I am surprised that Docker doesn't already include the container when "create" finishes, though. Your suggested approach sounds good, but let's put it in the Kubelet rather than the cache, since it will interact with update intervals and the like.
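
To make the arithmetic concrete: 100 serial starts at ~250ms each is ~25s of wall time, while starting them with, say, 10 workers would be closer to ~2.5s (ignoring image pulls and any Docker-side contention). A hypothetical worker-pool sketch, where startOne stands in for the real per-pod startup path:

package kubelet

import "sync"

// startAll starts pods with at most `workers` concurrent container starts.
func startAll(pods []string, workers int, startOne func(pod string) error) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, workers) // limits concurrent starts
	for _, pod := range pods {
		wg.Add(1)
		go func(pod string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			_ = startOne(pod)        // real code would surface the error into pod status
		}(pod)
	}
	wg.Wait()
}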

@thockin
Member

thockin commented Feb 19, 2015

What does cadvisor do that this cache can't? I.e. why is it going to be "more stale"?

As for knowing whether we already know a pod or not, could we (eventually) do something like the below? It's probably too large to do as a first step...

// Track all running per-pod goroutines.
var knownPods map[UID]chan api.Pod

func SyncPods(configs []api.Pod) {
	for _, pod := range configs {
		if mgr, exists := knownPods[pod.UID]; exists {
			// Known pod: hand the update to its existing manager goroutine.
			// (Would need to test whether mgr is closed and, if so, treat the pod as new.)
			mgr <- pod
		} else {
			// New pod: start a dedicated manager goroutine for it.
			c := make(chan api.Pod)
			knownPods[pod.UID] = c
			go podManagerLoop(pod, c)
		}
	}
}

func podManagerLoop(pod api.Pod, updates chan api.Pod) {
	defer close(updates)
	for {
		desired := <-updates
		if current != desired { // "current" = whatever state was last observed/actuated (pseudocode)
			actuate()
		}
	}
}

It will still race if kubelet crashes in the middle of setting things up, but it should not race otherwise...

@dchen1107


@vmarmol
Contributor

vmarmol commented Feb 19, 2015

@thockin the cache would be getting the information straight from Docker (which has higher latency) and holding the data for some time window; the result you get will be as old as that window. cAdvisor observes the state straight from the kernel and is informed of changes within milliseconds of their occurrence, so its window is much smaller. That said, given a small enough time window the advantage does not matter.

Thread-per-pod looks good :) The case of the Kubelet crashing should be okay, since we re-examine all our state when we come back up anyway.

@thockin
Member

thockin commented Feb 19, 2015

Can't you bundle the cAdvisor "watch the kernel" logic into a reusable pkg?


@vmarmol
Contributor

vmarmol commented Feb 19, 2015

I mean, we can package all of cAdvisor and run it inside the Kubelet :) I'm exaggerating a bit, but my point is that at some point we need a clear separation of responsibilities, and we'll need to decide where that is. That's probably out of scope for this issue though.

@thockin
Member

thockin commented Feb 19, 2015

It just seems potentially useful to have a package that says "I know how to tell you about containers without traversing Docker's crazy-slow API".
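
One way to read this suggestion, as a hedged sketch (not an existing package): a small interface for "list the containers on this node", whose implementation can be backed by the Docker API, a Kubelet-side cache, or a cAdvisor-style kernel watcher, without the callers caring which.

package kubelet

// ContainerInfo is a stand-in for whatever per-container data callers need (hypothetical).
type ContainerInfo struct {
	ID   string
	Name string
}

// ContainerLister hides how the node's containers are discovered, so the
// Kubelet can swap Docker, a cache, or a kernel watcher behind the same call.
type ContainerLister interface {
	// ListContainers returns the containers currently known on this node.
	ListContainers() ([]ContainerInfo, error)
}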


@wojtek-t
Member

I totally agree with Tim that having a package answering the above question would help with the problem we're facing here.

Regarding the "thread-per-pod" idea, I also thought about something very similar at some point. My fear was that if there are multiple updates to the same pod within a very short time (milliseconds), we would be processing all of them one by one. However, if we assume that there are already a few of them in the channel, it should be possible to process only the last (i.e. most recent) one. But maybe this scenario is too unrealistic to worry about - what do you think?

@vmarmol
Contributor

vmarmol commented Feb 20, 2015

Any one individual pod will most likely not see multiple updates within milliseconds; we've found that in practice this does not happen even with very heavy users. The thread-per-pod model should work well and scale well for us.

@wojtek-t
Member

Thanks Victor. SGTM.

@thockin
Member

thockin commented Feb 20, 2015

This is the same fundamental model that Omlet uses and was pretty carefully considered for Borglet too.

@dchen1107 as final arbiter


@dchen1107
Member Author

I don't know how I missed this discussion in the first place. Thanks @wojtek-t for running those experiments and getting that detailed data.

The "thread-per-pod" idea should work; it is what our internal system aims for to solve performance and scalability issues. As a next step, PodManagerLoop could generate PodStatus and cache it somewhere; the Kubelet would then periodically POST NodeStatus and PodStatus to the apiserver (see the sketch below).

One potential catch is that, at a given time, multiple pods might pull the same image from the Docker repository. Last time I measured, the latency of simultaneous pulls of the same image was significantly higher than a single pull. I didn't look into it deeply then, and that was a while back - Docker might have fixed the issue since. I will measure it again tomorrow.
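
A hypothetical sketch of the status-caching idea above: each pod worker writes its latest PodStatus into a shared cache, and a separate loop periodically pushes a snapshot upstream, so workers never block on apiserver round trips. Names (statusCache, syncToAPIServer, post) are illustrative, not the real Kubelet API.

package kubelet

import (
	"sync"
	"time"
)

// PodStatus is a stand-in for the real API type (hypothetical).
type PodStatus struct{ Phase string }

// statusCache collects per-pod statuses written by the pod workers.
type statusCache struct {
	mu       sync.Mutex
	statuses map[string]PodStatus // keyed by pod UID
}

func (c *statusCache) Set(uid string, s PodStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.statuses[uid] = s
}

// syncToAPIServer periodically pushes the cached statuses upstream.
func (c *statusCache) syncToAPIServer(interval time.Duration, post func(map[string]PodStatus) error) {
	for range time.Tick(interval) {
		c.mu.Lock()
		snapshot := make(map[string]PodStatus, len(c.statuses))
		for uid, s := range c.statuses {
			snapshot[uid] = s
		}
		c.mu.Unlock()
		_ = post(snapshot) // real code would log/retry errors
	}
}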

@dchen1107 dchen1107 added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 20, 2015
@wojtek-t wojtek-t self-assigned this Feb 20, 2015
@wojtek-t
Member

Thanks @dchen1107

Regarding simultaneous pulls of the same image - I think it shouldn't make things worse (at least in the currently existing code). Basically, the "pullImage" Kubelet method is protected by "pullLock", which ensures that we won't be pulling any two images simultaneously. We can leave this for now - what do you think?

@wojtek-t
Member

So I will try to experiment with the "thread-per-pod" approach anyway. I will post to this thread when I have some findings.

@thockin
Member

thockin commented Feb 20, 2015

To be clear, thread-per-pod should not be a critical goal by itself. If it is a better way to achieve the 1.0 goals, OK, but don't burn time on it for its own sake. At least not yet.


@wojtek-t
Member

Sure - but I think that, thanks to it, we can avoid starting pods unnecessarily, which can significantly improve performance. I will write a prototype and try to measure the gains - if they aren't satisfying, I will just drop it (for now).

@vmarmol
Contributor

vmarmol commented Feb 20, 2015

Docker does its own locking for parallel pulls so it shouldn't be a problem for now.

@vishh
Contributor

vishh commented Feb 20, 2015

It's surprising that parallel pulls result in higher latency. As @vmarmol mentioned, Docker pulls are synchronized by the Docker daemon.


@wojtek-t
Member

@vmarmol: Thanks! So are you suggesting that "pullLock" in the Kubelet is completely useless now? If so, I'm happy to remove it.

Regarding the "thread-per-pod" discussion: I've implemented a very simple prototype (i.e. some unit tests are failing, etc.), but it works in the simple cases I'm experimenting with here.

And (at least with this simple prototype) it seems to give us quite big gains. In particular, resizing a replication controller from 0 to 50 with the new code (and with the Docker cache patched in) now seems to finish within 35-40 seconds (which I think may be quite close to our goal).
So I'm going to spend a bit more time on it and prepare a PR for it.

@vmarmol
Contributor

vmarmol commented Feb 23, 2015

Great news @wojtek-t! I'd be happy to see the PR whenever you have it ready.

After reading the description for pullLock, it looks like we do need it, but it seems we've been using it wrong :-\ I'll go ahead and fix that today. The lock doesn't protect against multiple pulls at the same time, but rather against pulls and deletes racing (see the sketch below).
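
A hedged sketch of one reading of "protect pulls from racing with deletes" (an illustration, not the actual Kubelet locking code; pull/remove are hypothetical stand-ins for the real Docker calls): concurrent pulls share a read lock, since Docker deduplicates parallel pulls itself, while image removal takes the write lock so it never runs in the middle of a pull.

package kubelet

import "sync"

// imagePuller guards pulls and removals against each other, not pulls
// against pulls.
type imagePuller struct {
	pullLock sync.RWMutex
	pull     func(image string) error
	remove   func(image string) error
}

func (p *imagePuller) PullImage(image string) error {
	p.pullLock.RLock() // many pulls may proceed concurrently
	defer p.pullLock.RUnlock()
	return p.pull(image)
}

func (p *imagePuller) RemoveImage(image string) error {
	p.pullLock.Lock() // removal excludes all in-flight pulls
	defer p.pullLock.Unlock()
	return p.remove(image)
}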

@lavalamp
Member

lavalamp commented Mar 3, 2015

Is there more work to be done on this?

@wojtek-t
Member

wojtek-t commented Mar 9, 2015

Yes - at least I need to add a unit test for the pod workers, which I didn't manage to do before my vacation. I will work on it this week.

@wojtek-t
Member

#4839 and #5239 are already merged.
I think this now works pretty reliably for 50 pods.
I'm closing this issue - please reopen it if you find any problems.
