Increase maximum pods per node #23349
How do we vet that a given node configuration is "qualified enough" to meet the 500 pods per node goal? |
As discussed in the meeting, using a single number (max pods) can be misleading for users, given the huge variation in machine specs, workloads, and environments. If we have a node benchmark, we can let users profile their nodes and decide what the best configuration is for them. The benchmark can exist as a node e2e test, or in the contrib repository. @jeremyeder, you mentioned you've tried running more pods in a test environment. What's the machine spec, and could you share the numbers? |
That's where the benchmark can play an important role. The community can also share results on different platforms with each other using a standardized benchmark. I'd suggest we look at:
So far we only have very limited testing for the above (with non-realistic workloads), and Red Hat should be able to contribute more. |
Let's target having a proper benchmark for the next release, so that we can decide how many pods to support. |
@jeremyeder - from the "cluster" perspective, what really matters is the total number of pods. So if you have fewer nodes in the cluster, you can put more pods on them without affecting apiserver, etcd, or controller performance. So that doesn't seem to be a problem given that you are talking about smaller deployments in terms of number of nodes. Also, we are planning to increase the total number of pods in 1.3. The final number is not decided, but I hope it will be 100,000 or even more (pods/cluster). |
@wojtek-t understood, thank you -- that's basically what I was wondering, if the pods-per-cluster limits would be increased during 1.3 cycle. |
@yujuhong I agree with using benchmark testing to decide the max pod number before deploying a cluster. :) |
@jeremyeder Thanks for filing the issue so we can carry on the discussion. Here is a small summary of what I talked about in the sig-node meeting, for the record:
cc/ @bgrant0607 since we talked about this before. |
System reserve is arguably a systemd.slice provisioning constraint from our side, but +1 on reserve. Going forward, I think the only limits should be resource constraints. If needed, we could put pod limits on admission control. /cc @quinton-hoole as he was conversing on the topic in the @kubernetes/sig-scalability meeting last week. |
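For reference, a minimal sketch of what such a reserve looks like with today's KubeletConfiguration (this config format postdates this comment, and the values are purely illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Carve resources out of the node for the OS (systemd slices) and for the
# Kubernetes daemons, so pods cannot consume the whole machine.
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
```

With a reserve like this in place, node allocatable (rather than raw capacity) bounds how much the scheduler can place on the node.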
Agree with all comments about the benchmark. @dchen1107 perhaps we should file a new issue to deliver that and leave this one to delivering the increase, should we be able to agree on something. This test is:
At 500 pods (somewhere around the 700-second mark), the node process and docker combined use a very reasonable amount of CPU (around 1 CPU core). This is the openshift-node process in this test, not strictly kubernetes, but it's based on kubernetes v1.2.0-36-g4a3f9c5. Regarding the memory graphs: I did not restart the node service before running the test, so the numbers are about 300 MB higher than they should be, but the growth/trend in RSS is accurate. |
@dchen1107 as far as what a "benchmark" may look like... Perhaps we generate a "node scaling score" out of factors like cpu_generation+core_count+GB_RAM+kube_version+other_factors. That score would set max-pods dynamically. This way we don't have to inject a "test" into the admission control pipeline or product install paths; the node process could compute the score/max-pods dynamically during its startup phase. Thoughts? |
+1 for dynamic limits. Those limits should also take into account the kubelet's own resource usage. |
Another factor we need to consider is probing. As #16943 (comment) shows, aggressive liveness/readiness probing can have a significant impact on performance. We may eventually need to figure out how to attribute probe usage to the containers being probed. |
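For context on how quickly probe overhead adds up, here is a hypothetical pod (name and image are illustrative) configured with aggressive one-second probes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-overhead-demo      # illustrative name
spec:
  containers:
  - name: web
    image: nginx:1.25            # illustrative image
    livenessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 1           # one liveness probe per second
    readinessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 1           # one readiness probe per second
```

At 500 such pods, the kubelet would be issuing roughly 1,000 probes per second, none of which is currently attributed to the probed containers.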
@timstclair: Unless we can move to an |
That's why I think a benchmark with realistic/customizable workloads is valuable. Users can benchmark their cluster and adjust if they want (e.g., determine the max pods allowed with 10% of dedicated resources). |
We also need to take disk resources into account going forward; right now that's a level of overhead that we haven't really captured. |
@jeremyeder I can file a separate benchmark issue. Actually, I was in the middle of filing that when I saw this issue and everyone jumped in to talk about the benchmark. On the other hand, I think publishing the benchmark can serve the purpose without continually increasing --max-pods per node. The node team signed up to:
But as I mentioned in #23349 (comment), and as several examples listed by others above show, there are too many variables, many of them outside our control, so I don't think we are ready to have one formula that dynamically figures out --max-pods for everyone. Ideally, each cluster should easily be able to apply its own formula with fudge factors based on its environment and requirements, come up with max-pods for its nodes, and apply that to the node config. If applying the value of max-pods to the node config object is too hard, we should solve that usability issue. cc/ @davidopp @bgrant0607 |
I think we all agreed that developing a node benchmark should be the next step. The benchmark will allow users to test their nodes and adjust the kubelet configuration (e.g., --max-pods) accordingly. They can also publish the results and share them with the community. The results can serve as a ballpark for users who just want some configuration to start with. In addition, having the published results will also help us discover issues on different platforms. Some initial thoughts about what we want in the benchmark:
We should have diverse workloads (e.g., probing) and test the node in different scenarios (e.g., steady state vs. batch creation). What we have now in our e2e tests is very limited. Red Hat and/or the community may want to chime in to share what they have. Anyway, below is what we use today:
|
@timothysc Could you please clarify the following comment?
Which resource constraints? The containers'? Each pod and container requires some amount of resources from the management agents (Kubelet, cadvisor, docker) and kernel. These resources can't be attributed to the cgroups of the containers (we've been working on such things for years internally). Depending on the resources allocated to the system and management agents, on the average number of containers per pod, on the rate of container churn, on the number of containers with probes, assumptions about container failure rates, etc., a different number of pods might be supportable, though there are also inherent limits in the management agents, since they are not perfectly scalable. These factors are complex and numerous. Additionally, we currently allow best-effort pods to not reserve any fixed amount of resources. This is a deliberate choice. Admins may choose to impose a minimum resource request, but that's independent of the issues mentioned above. |
OBJECTIVE
I meant available machine resources that exist for the kubelet + container subtree "slice". At some point the machine simply passes an acceptable watermark and no further containers should be accepted. We are not aiming for optimal packing in this use case, so "fudging" the reserve and watermarks can be controlled by the administrator based on their load profiles and history. Admins would like to "set it and forget it". In conversations on sig-node we all agree on the premise of not putting false limits in place, but at this point I believe we need to hash out the designs that were discussed and make it a reality in 1.3. |
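As a rough sketch of the watermark idea in terms of the kubelet's eviction thresholds (these knobs postdate parts of this discussion, and the values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard eviction thresholds act as the watermarks: once the node crosses them,
# the kubelet starts reclaiming resources instead of taking on more work.
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
```

Administrators can "fudge" these thresholds, together with the reserves shown earlier, to match their own load profiles and history.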
@timothysc Other considerations:
I agree that accurate, simple, automatically set limits would be desirable. |
There is also a realistic difference in the time it takes to go from 99 to 100 pods versus creating 100 pods at once. |
@derekwaynecarr, do you mean that batch creation of 100 pods makes the node less stable than starting a single pod at 99 pods? We don't limit the docker QPS in the kubelet, and creating/deleting pods are the heaviest operations for now. We discussed mitigating this before v1.2, but decided that it wasn't necessary for v1.2. If we are going to claim support for higher pod density, we'd need to add tests and re-evaluate this. |
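For illustration, batch-creation/density scenarios like the one described here are typically driven with minimal "do nothing" pods; a hypothetical sketch using today's apps/v1 API (names and image are illustrative, not an actual e2e test definition):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: density-test                       # illustrative name
spec:
  replicas: 100                            # create/delete in one step to stress the runtime
  selector:
    matchLabels:
      app: density-test
  template:
    metadata:
      labels:
        app: density-test
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # minimal container that just sleeps
```

Scaling such a Deployment from 0 to 100 replicas on a single node exercises exactly the container-runtime QPS path discussed above, in contrast to adding one pod at a time.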
cc/ @gmarek @wojtek-t This is the issue I mentioned to @gmarek earlier. For the 1.3 release, we plan to publish a node-level benchmark, per #23349 (comment). To do that, we need to define our performance SLOs at the node level. |
@yujuhong - yes. I think QPS is an important thing to keep in mind as we change this number. I was able to discover the pod-stuck-in-terminating problem by overwhelming the docker daemon with this scenario. |
cc/ @coufon |
Hi, how many pods per node did you test? |
We had production systems running 600 pods/node. The only issue we faced was that on a kubelet restart, the node flaps between Ready and NotReady (PLEG issues). The solution is to delete the docker directory and restart kubelet and docker together. Another issue was this, but I'm not sure if it was related to too many pods/node --
@dElogics how many containers per Pod? With PLEG issues the container count can be an important factor. I learned from a sig-scalability talk at Seattle Kubecon 2018 that scalability performance tests only use 2-container Pods. |
I am wondering why 100 or 110 is such a hard limit. I have two bare-metal workers with 40 CPUs and 125 GB RAM each, and also one virtual worker with a smaller amount of CPU and RAM. As all my containers are defined with a memory limit (most of them <1 GB RAM, only a few with some more GB), memory will not be the limiting factor. The main questions for me seem to be in #23349 (comment)
How do you do that? |
It's a kubelet configuration parameter. |
@AndrewSav thanks. |
Single container per pod. |
Is the state "Open" still current, or is there a kubernetes Release out solving this request? |
/sig scalability It's clear that some people are running with 500+ pods per node already. I'm not sure if we have scalability tests to guarantee it though. Tagging in scalability for a possible update. |
Has anyone actually done testing with really big numbers of pods, say >= 10,000 pods per node? Where are the bottlenecks? As far as I understand, kube-proxy might be problematic because of iptables, but when using a CNI with eBPF like Cilium even that shouldn't be an issue. I'm coming from the Docker Swarm world and am very surprised to see such small numbers here. There we run ~100 containers per node on VMs with 4 CPU cores and 32 GB RAM in production without problems. Swarm does not have real RBAC, which is why we have been forced to use those small VMs, but since Kubernetes has good RBAC and policy engines like Kyverno that make it possible to force containers to run without root privileges, I was expecting that a bare-metal node with 1 TB RAM should easily be able to run something like 1,000-2,000 pods per node. |
Any update on this issue? Why did this happen? |
We also have 3 servers with 88 CPUs and 256 GB RAM each. But increasing the pod limit to 240 leads to PLEG issues. We run directly on containerd, and after adjusting the kubelet parameter, a reboot plus deleting /var/lib/containerd didn't help. |
@olljanat |
The OpenShift docs recommend changing two other params to improve scalability:
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: <pod_count>        # default 110
kubeAPIQPS: <QPS>           # default 50 since 1.27
kubeAPIBurst: <burst_rate>  # default 100 since 1.27
```

See also the related https://www.redhat.com/en/blog/500_pods_per_node. ref: KubeletConfiguration

EDIT: those QPS/Burst defaults were bumped in 1.27 by #116121, so they probably don't need to be changed anymore. |
I was just about to say, this issue is more important than ever. Today you can get servers >96 cores + HT and tons of RAM. The limited number of pods is a severe restriction. And the workaround, introducing another virtualization layer to run your nodes (KVM, etc) might introduce unexpected side effects. |
Answering my own question, as was requested; I hope it helps those who are not (yet) that far along the Kubernetes learning curve. This issue was created in 2016, and at that time the only option was the "iptables" proxy mode. Later, #46580 added support for a new kube-proxy mode, IPVS (see the sketch below). However, because Kubernetes aims to be backward compatible, the default was not changed and most likely never will be.
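A minimal sketch of enabling IPVS via the kube-proxy component config (values are illustrative, and the node needs the IPVS kernel modules loaded):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"        # default remains "iptables" for backward compatibility
ipvs:
  scheduler: "rr"   # round-robin; other IPVS schedulers can be selected
```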
There are multiple good posts about this topic on the Alibaba and Red Hat blogs. From those you can also get an idea of why both Calico and Cilium have created eBPF-based solutions that completely replace kube-proxy, and how that allows scaling to even bigger numbers. |
Yes, but this is the wrong place to fix the issue. The correct place is either your deployment configuration or, alternatively, the Kubernetes distribution you use. |
My recommendation for anyone who is new to Kubernetes is to use a managed Kubernetes distribution from a public cloud like AKS, EKS, or GKE. Those are battle-tested, and Kubernetes is meant to be used with a cluster autoscaler, which those public clouds offer. If you really need to run Kubernetes on-prem, then use something like Nutanix Karbon, Red Hat OpenShift, or VMware Tanzu. Yes, you will have an extra virtualization layer, but that is not the part you should worry about. Instead, focus on planning how you can treat your Kubernetes clusters as cattle that you replace with new ones rather than upgrading existing ones, because eventually there will be some change to your configuration that is very tricky to deploy to existing clusters. If after all these warnings you still want to run Kubernetes on bare metal, then my recommendation would be to look at something like K0s or K3s, as it is much easier to get changes approved there. PS. If someone has more questions about this topic, you can reach me, for example, on the Docker or Kubernetes Slack, send me a DM on LinkedIn, or simply send me an email ( first name . last name @ gmail ). |
As discussed on the sig-node call on March 22:
max-pods on kube-1.1 was 40; on kube-1.2 it is 110 pods per node.
We have use-cases expressed by customers for increased node vertical scalability. This is (generally) for environments using fewer larger capacity nodes and perhaps running lighter-weight pods.
For kube-1.3 we would like to discuss targeting a 100 node cluster running 500 pods per node. This will require coordination with @kubernetes/sig-scalability as it would increase the total pods-per-cluster.
/cc @kubernetes/sig-node @kubernetes/sig-scalability @dchen1107 @timothysc @derekwaynecarr @ncdc @smarterclayton @pmorie
Thoughts?