RFC: Determine the scalability goal for v1.2 kubelet #16943
@yujuhong - for the 90 pause pods per node use case, do you have a breakdown between docker and kubelet for those numbers rather than just combined? |
@derekwaynecarr, those are just combined. For the 50th percentile, the numbers vary a lot like I mentioned. As for the 70th percentile, kubelet is ~0.7 core, while docker is ~0.8 core. |
Do we have an understanding of the performance bottlenecks in kubelet and docker? Will the current goals be achievable with just kubelet optimizations or will it require docker changes as well? |
The problem is two-fold:
(1) is the main problem as docker gets overwhelmed pretty quickly as we scale the number of pods. (2) is also significant, but not as much as (1).
The goal is still quite aggressive. If we cannot reach the goal after implementing the solutions, I will look into kubelet's global cleanup routine, cadvisor's housekeeping, or even the docker client, etc. I haven't spent time digging into further/smaller improvements yet. However, I'd try to target the parts that are container-runtime-agnostic. I would like feedback on a few questions though:
|
@yujuhong Can we convert the content of v1.1 Resource Usage to a doc (.md file) like what we discussed the other day? Thanks! |
I think a super-fast reaction time is less important with the addition of |
IIUC, focus will be on kubelet optimizations first.
From a user perspective, low latency seems desirable. Based on the scalability blog, the pod-startup times are expected to be less than 5 seconds. Would it be fair to expect status updates to be within the same bounds?
Yes. I think scalability (# of pods per node) is more important than optimized resource usage as of now. |
@vishh - do you mean scalability on number of nodes in cluster instead of number of containers per node? In my experience, users running smaller cluster sizes are surprised by the overhead of system daemons on the node. |
@derekwaynecarr: I meant the number of pods a node can run without compromising reliability and performance. v1.1 includes quite a bit of optimization as illustrated by @yujuhong's data. Do you think those numbers are not acceptable for production scenarios? By |
@yujuhong Please indicate that your memory stats for both Kubelet and Docker were measured with a hard limit applied to their memcg cgroups, which causes dirty pages to be flushed to disk more frequently. In the 1.1 release, by default, no hard limit is applied to those daemons' memcg cgroups. We plan to have this for the 1.2 release (an aggressive goal). We should document those numbers so that we have a benchmark to compare against for future releases, making it easy to detect regressions or improvements. To answer your above two questions:
cc/ @wojtek-t who is in charge of performance in general.
|
I did in my original comment :)
Both (1) and (3) should be fine as long as kubelet doesn't go crazy with the resource usage (which it might today).
@dchen1107, do you think they are acceptable for 40 pods or for 90 pods? There is a huge difference between the two sets of numbers. |
AFAIK, we are only publishing SLIs. I don't think there are any SLOs documented. IIUC, the improvements you have in the pipeline for node scalability should also result in reduced resource consumption, right? |
My focus is mainly on detecting container failures -- sorry if I didn't make it clear in the original comment. If all we need is ~5 second latency (from container failure to pod worker waking up, excluding the actual container creation time), we might even get by with a relisting PLEG + a dumb pod cache with a global update timestamp. Pod workers only have to wake up and see if the timestamp is newer than the completed time of their last sync. The cache itself, other than being populated by the PLEG, will work similarly to the runtime cache: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/runtime_cache.go. Of course, it will include more information (form
_UPDATE: Caveat: relisting may eventually become the bottleneck if we want to scale further_
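For illustration only, here is a minimal Go sketch of the "relisting PLEG + dumb pod cache with a global update timestamp" idea described above. All names are hypothetical; this is not the kubelet's actual implementation, just a sketch of the pattern.

```go
// A minimal sketch of a pod cache with a single global update timestamp:
// the relisting PLEG refreshes the cache and bumps the timestamp; pod workers
// wake up and re-sync only if the cache was updated after their last completed sync.
package podcache

import (
	"sync"
	"time"
)

// PodStatus is a stand-in for the richer per-pod status the real cache would hold.
type PodStatus struct {
	ContainerStates map[string]string
}

type Cache struct {
	mu          sync.RWMutex
	pods        map[string]*PodStatus // keyed by pod UID (hypothetical)
	lastUpdated time.Time             // global update timestamp set by the relist
}

func NewCache() *Cache {
	return &Cache{pods: map[string]*PodStatus{}}
}

// Update is called by the relisting PLEG with a fresh snapshot of all pods.
func (c *Cache) Update(snapshot map[string]*PodStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods = snapshot
	c.lastUpdated = time.Now()
}

// NeedsSync lets a pod worker decide whether to wake up: true if the cache has
// been refreshed since the worker's last completed sync.
func (c *Cache) NeedsSync(lastSyncCompleted time.Time) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.lastUpdated.After(lastSyncCompleted)
}

// Get returns the cached status for a pod, if any.
func (c *Cache) Get(uid string) (*PodStatus, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.pods[uid]
	return s, ok
}
```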
We are going to scale by reducing the resource usage footprint, which means these two things are essentially the same. Do you mean the resource usage is okay as long as we can scale to 90 pods and maintain the same resource usage as we have for 40 pods today? |
Please see #16943 (comment) for the potential implementation if we can relax the delay :) Probing is the missing piece of the puzzle in my measurement, since users can customize the probing period. I think we should create workloads with different probing periods and see how much kubelet can withstand while still being performant. At the very least, we should have a recommended probing period for N pods/containers on the node. Users can choose to set a smaller period for a select few containers, but hopefully not all of them. |
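As a rough illustration of why the probing period matters for node scalability, here is a back-of-the-envelope Go sketch (with made-up numbers, not kubelet code) showing how the aggregate probe rate grows with the container count and shrinks with the period:

```go
// Back-of-the-envelope helper for reasoning about probe load: with N containers
// probed every periodSeconds, the kubelet issues roughly N/periodSeconds probe
// calls per second on average.
package main

import "fmt"

func probesPerSecond(containers int, periodSeconds float64) float64 {
	return float64(containers) / periodSeconds
}

func main() {
	// Example: 100 single-container pods probed every 10s vs. every 1s.
	fmt.Printf("100 containers @ 10s period: %.1f probes/s\n", probesPerSecond(100, 10)) // 10.0
	fmt.Printf("100 containers @ 1s period:  %.1f probes/s\n", probesPerSecond(100, 1))  // 100.0
}
```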
I agree with @derekwaynecarr on this. The management overhead for a node in the steady state is quite high today. Reducing kubelet's and docker's cpu usage would help us scale the #pods on the node, but more importantly, it would reduce the resource footprint even for fewer (<= 40) pods. However, in quite a few user reports about smaller clusters, it was the monitoring/logging pods that were hogging the resources... :-( |
@yujuhong yes, we only have an issue with case (2) I listed above. What relist interval do you have in mind? We can do some measurements with it. Honestly, I don't think there is a big issue since we already introduced backoff on restarting dead containers. If some users really care about the reaction time for (2), we can add support for using the docker container event stream to cover it. We just need to do it step by step. By the way, we can also listen to the kernel's netlink process/thread events directly to have our own process tracking system without depending on any container runtime's event stream. But that is a separate topic, and we won't get to that soon.
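For reference, a sketch of consuming the Docker container event stream is below. It assumes the Docker Go SDK (github.com/docker/docker/client); the exact types and options vary by SDK version, so treat this as illustrative rather than as what kubelet actually does.

```go
// Sketch: subscribe to the Docker event stream and react to container "die"
// events, as one possible way to detect container failures without relisting.
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}
	// Events returns a message channel and an error channel.
	msgs, errs := cli.Events(context.Background(), types.EventsOptions{})
	for {
		select {
		case m := <-msgs:
			// React to lifecycle events such as "die" or "stop".
			if m.Type == "container" && m.Action == "die" {
				log.Printf("container %s died", m.Actor.ID)
			}
		case err := <-errs:
			log.Printf("event stream error: %v", err)
			return
		}
	}
}
```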
|
Written as a semi-educated, casual user of Kubernetes: Isn't spiking to 65% of a core every 10s essentially as bad as using 65% always? From a capacity planning perspective I still have to shave off 1/2 a core per node for management. If we somehow saved 1/2 a core per node on a 250-node cluster, I'd use that budget to buy a 16-core master and run a 350-node cluster. Written as myself: We argued a lot about whether it was ok to use 2/4 extra cores on the master for 100-250 nodes. |
Update: The generic PLEG has been merged (#13571), and the excessive container listing has been disabled in #17545.
kubelet:
docker:
Note that because the pod syncing period is 1 minute now, the peak cpu usage is affected by how many pods are syncing simultaneously. I picked the largest number among 10 runs for each percentile to account for this fluctuation. |
This is awesome 👏 |
Great progress! Looking forward to seeing the improvement once we have runtime pod cache in. |
@luxas, I run this e2e test to get the cpu usage of kubelet and docker. The test queries cadvisor on each node for data over a preset period of time and calculates the percentiles, etc., locally. |
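Conceptually, the test boils down to polling each node for CPU usage samples and computing percentiles locally. The Go sketch below illustrates the percentile step only; the data source and sample values are placeholders, not the real cadvisor or e2e API.

```go
// Simplified sketch: accumulate CPU usage samples (in cores) for a daemon and
// compute percentiles with the nearest-rank method.
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0-100) of the samples.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// Hypothetical kubelet CPU samples collected over the measurement window.
	kubeletCPU := []float64{0.05, 0.08, 0.12, 0.30, 0.07, 0.65, 0.09, 0.11}
	for _, p := range []float64{50, 90, 95} {
		fmt.Printf("kubelet cpu %gth percentile: %.2f cores\n", p, percentile(kubeletCPU, p))
	}
}
```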
cc @timothysc |
cc @jeremyeder |
We should re-update and compare numbers once #19850 and accompanying bugz are fixed. |
Yes. Hopefully it won't take long... |
UPDATE:
40 pause pods:
docker:
100 pause pods:
docker:
As another reference point, I ran some stress tests with 100 single-container (
100 pods:
Sorry for not generating pretty graphs. My attempts to collect more data in the past week were interrupted a few times by various docker/kubelet issues. In short, running 100 single-container pause pods on a single, reasonable-spec node looks ok as far as the cpu usage of kubelet and docker is concerned. The caveats remain:
|
@yujuhong - Nice! Does this mean that we officially support 100 pods/node for 1.2? |
Could we update the defaults now? /cc @kubernetes/sig-scalability |
100 pods per node support would be awesome |
I have some reservations about bumping the max-pods to 100 because of some of the issues mentioned above (e.g., we don't rate-limit batch creation/deletion, and aggressive probing could really stress the node). More importantly, when under stress, the node is likely to encounter unrecoverable docker/kernel issues. I've run into the kernel bug #20096 (which the cluster team has a workaround in progress), and moby/moby#18527 (with docker v1.9.1). I think we should still update the default, but perhaps we should be more conservative, or stress test more? |
@yujuhong Maybe 60 pods? |
@yujuhong Here is my counter argument on #16943 (comment)
|
In short, I discussed with @dchen1107 offline and we agreed that we'll update the max-pods to 100 pods and monitor our jenkins builds to see how stable node/kubelet/docker is.
Yes, the high number of concurrent docker requests is the key. For reference, there is an old issue for throttling container startup, #3312, which did not receive much love. Perhaps it's time to re-evaluate. We probably won't do this for v1.2 though.
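To illustrate the throttling idea (this is not an existing kubelet feature, just a sketch), a buffered-channel semaphore is one simple way to cap the number of concurrent container-start requests sent to the runtime during batch pod creation:

```go
// Sketch: limit concurrent "start container" calls so a batch of pod creations
// cannot overwhelm docker with parallel requests.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const maxConcurrentStarts = 5 // hypothetical limit
	sem := make(chan struct{}, maxConcurrentStarts)

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks if 5 starts are in flight
			defer func() { <-sem }() // release the slot

			// Stand-in for the actual "create + start container" runtime call.
			time.Sleep(50 * time.Millisecond)
			fmt.Printf("started container %d\n", id)
		}(i)
	}
	wg.Wait()
}
```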
Yes, if a user creates 100 containers per pod, it'd have the same effect as far as probing is concerned. However, this should be unusual. I was simply pointing out that our setting (single-container per pod) is relatively conservative, and users may experience a less responsive kubelet if they keep 2-3 containers per pod.
Besides stability, kubelet will become less responsive during batch pod creation/deletion. This duration will grow as the number of pods scale. |
I don't have a 100 node cluster. Do we have a 100-node cluster on jenkins that runs density test? |
Kubernetes-scale |
I will modify the density test to start 100 pods on a node and then test locally on my 3-node cluster until it passes stably. As for running this in a 100-node cluster, I don't think it makes a huge difference for nodes, but it would probably have more impact on the apiserver. It might not be crazy to let kubernetes-scale catch this, since kubemark should be a good indication that the apiserver can handle more? |
I'm not worried about Kubelet - I'm sure it'll handle. I just want to be sure that increasing pod density 3x won't expose some bugs in API server. And yes - running 100 pod/node Kubemark will do. |
I'll find out soon-ish 🎱 |
@timothysc - thanks! |
I ran the 100 pods/node density test in a 100-node kubemark cluster and it passed three times in a row. I used the same configuration as the jenkins kubemark suite: 10 n1-standard-2 gce instances. |
We have quite a few e2e tests that run 100 pods per node regularly (e.g., tracking resource usage and testing scheduler decisions).
@yujuhong - yes I can. I plan on rethinking our scale testing this week, as these tests consume a lot of resources. I'll take high-density testing into account. |
Let's close this since v1.2 is out. |
The umbrella issue for kubelet's scalability: #12540
Kubelet manages and monitors all user pods on a node to ensure their container states meet the desired pod specifications. Every pod and/or container Kubelet manages incurs a certain resource overhead even in the absence of active events (i.e., no spec change or container lifecycle events). The exact per-pod overhead varies based on the detailed pod specification, but in general, such steady-state management overhead is proportional to the number of active pods/containers. When there are no active pods, the resource overhead should be negligible.
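As a purely illustrative model of the proportionality described above (the coefficients are made up, not measured values), the steady-state usage can be thought of as a fixed baseline plus a per-pod cost:

```go
// Illustrative model only: steady-state kubelet CPU as baseline + per-pod cost.
package main

import "fmt"

func estimatedCPU(baselineCores, perPodCores float64, pods int) float64 {
	return baselineCores + perPodCores*float64(pods)
}

func main() {
	// Hypothetical coefficients for illustration.
	const baseline, perPod = 0.01, 0.005
	for _, pods := range []int{0, 40, 90} {
		fmt.Printf("%3d pods -> ~%.2f cores\n", pods, estimatedCPU(baseline, perPod, pods))
	}
}
```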
Below are some resource usage numbers for the v1.1 kubelet.
v1.1 Resource Usage (commit: dd1187c)
Cluster setup/workload:
10s/1s intervals
0 pause pods per node
40 pause pods per node
kube-proxy’s memory usage depends on the number of services in the cluster. In this test case, the number of services is 7.
Note that kubelet syncs all pods periodically, hence even though the median cpu usage tends to be lower, the cpu spikes are significant.
90 pause pods per node?
v1.1 does not support 90 pods per node, but I did some rough measurements: the median cpu usage of kubelet and docker combined varies from 0.35 to 1.2 cores -- it was right on the edge of a steep curve. Beyond the 70th percentile, it was consistently > 1.4 cores.
Reaction latency
The latency for Kubelet to notice a container event (e.g., a container dying) varies and can be 10+ seconds, since the sync period is 10s.
v1.2 Resource Usage Target
For v1.2 we want to set a goal so that we can work towards it. The goal would affect what approach we pick to tackle the issues.
Below are some tentative targets. Suggestions/comments are welcome.
Caveats
/cc @kubernetes/goog-node