
Increase maximum pods per node #23349

Open
jeremyeder opened this issue Mar 22, 2016 · 58 comments
Labels
kind/feature, lifecycle/frozen, sig/node, sig/scalability, triage/accepted

Comments

@jeremyeder

As discussed on the sig-node call on March 22:

max-pods on kube-1.1 was 40; on kube-1.2 it is 110 pods per node.

We have use cases expressed by customers for increased node vertical scalability. This is (generally) for environments using fewer, larger-capacity nodes and perhaps running lighter-weight pods.

For kube-1.3 we would like to discuss targeting a 100-node cluster running 500 pods per node. This will require coordination with @kubernetes/sig-scalability as it would increase the total pods per cluster.

/cc @kubernetes/sig-node @kubernetes/sig-scalability @dchen1107 @timothysc @derekwaynecarr @ncdc @smarterclayton @pmorie

Thoughts?

@spiffxp
Member

spiffxp commented Mar 22, 2016

How do we vet that a given node configuration is "qualified enough" to meet the 500 pods per node goal?

@yujuhong
Contributor

As discussed in the meeting, using a single number (max pods) can be misleading for the users, given the huge variation in machine specs, workload, and environment. If we have a node benchmark, we can let users profile their nodes and decide what is the best configuration for them. The benchmark can exist as a node e2e test, or in the contrib repository.

@jeremyeder, you mentioned you've tried running more pods in a test environment. What's the machine spec and could you share the numbers?

@yujuhong added the sig/node label on Mar 22, 2016
@yujuhong
Contributor

How do we vet that a given node configuration is "qualified enough" to meet the 500 pods per node goal?

That's where the benchmark can play an important role. The community can also use a standardized benchmark to share results across different platforms.

I'd suggest we look at:

  • management overhead in terms of resource usage
  • performance (responsiveness) in terms of latency for various operations (pod creation/deletion, etc.), or even detecting container state changes

So far we only have very limited testing for the above (with non-realistic workloads), and Red Hat should be able to contribute more.

@yujuhong yujuhong added this to the next-candidate milestone Mar 22, 2016
@yujuhong
Contributor

Let's target having a proper benchmark for the next release, so that we can decide how many pods to support.

@wojtek-t
Member

@jeremyeder - from the "cluster" perspective, what really matters is the total number of pods. So if you have fewer nodes in the cluster, you can put more pods on them without affecting apiserver, etcd, or controller performance. So that doesn't seem to be a problem, given that you are talking about smaller deployments in terms of number of nodes.

Also, we are planning to increase the total number of pods in 1.3. The final number is not decided, but I hope it will be 100,000 or even more (pods/cluster).

@jeremyeder
Author

@wojtek-t understood, thank you -- that's basically what I was wondering: whether the pods-per-cluster limit would be increased during the 1.3 cycle.

@zhouhaibing089
Contributor

@yujuhong I agree with using benchmark testing to decide the max pod number before deploying a cluster. :)

@dchen1107
Member

@jeremyeder Thanks for filing the issue so we can carry on the discussion.

Here is a short summary of what I talked about in the sig-node meeting, for the record:

  • max-pods is configurable. The default value we chose today targets users who want an out-of-the-box solution, and was decided under several constraints:
    • Docker's performance and limitations. Docker has improved a lot since the 1.8 release, but there is still a lot of room to improve. Unfortunately, docker's management overhead today depends highly on the workloads on the node, which makes it hard to pick a single safe default.
    • Kubelet's performance and management overhead. We improved this dramatically in the 1.2 release through several projects: PLEG, the new metrics API, cAdvisor cleanup, etc. We will continue improving it with each release.
    • Available resources on the node, including cpu, memory, ip, etc.
    • Cluster-level component limits (apiserver, scheduler, controller-manager, heapster, etc.), since you have to think about both the number of nodes and the number of pods per node.
    • Performance SLOs we made for the users
    • etc.
  • Based on my experience with Google's internal system: even though our node team improved node-level scalability and performance dramatically over time, in reality most (>99%) nodes in a shared cluster (shared meaning it hosts both service and batch jobs, both production and test jobs, etc.) host no more than ~100 jobs due to other resource constraints.
  • On the other hand, I understand that certain users want to run simple jobs with limited management overhead. For example, in an experimental test I ran against the 1.1 release on a big node, the node (kubelet and docker) could easily host >200 do-nothing pods with reasonable performance. The 1.2 release would do a much better job if I re-ran the same experiment.
  • I have suggested publishing a node benchmark since the 1.1 release (Export resource usage benchmarks to an easily consumable place #14754, etc.), and we are working toward that goal each release.

cc/ @bgrant0607 since we talked about this before.

@timothysc
Member

Introduced --kube-reserved and --system-reserved (#17201) to the Kubelet so that the admin can configure the proportion of overall resources (cpu and memory) a node devotes to daemons, including the kubelet and docker along with other daemons. Today it is hardly ever properly configured, due to the lack of a benchmark here. :-)

System reserve is arguably a systemd.slice provisioning constraint from our side, but +1 on reserve. Going forward, I think the only limits should be resource constraints. If needed, we could put pod limits in admission control.

/cc @quinton-hoole as he was conversing on the topic in the @kubernetes/sig-scalability meeting last week.
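
For readers coming to this thread later: the reserves discussed above correspond to the kubeReserved/systemReserved fields of the file-based kubelet configuration. A minimal hedged sketch, with values that are purely illustrative assumptions rather than recommendations:

# Hedged sketch: carve out capacity for OS daemons and for the kubelet/runtime
# so that pod density is bounded by what is actually left over for pods.
# The amounts below are illustrative assumptions, not tested recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "500m"      # reserved for OS daemons (systemd, sshd, ...)
  memory: "1Gi"
kubeReserved:
  cpu: "500m"      # reserved for the kubelet and the container runtime
  memory: "1Gi"

The same settings are also exposed as the kubelet's --system-reserved and --kube-reserved flags.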

@jeremyeder
Author

Agree with all comments about the benchmark. @dchen1107 perhaps we should file a new issue to deliver that, and leave this one for delivering the increase, should we be able to agree on something.

This test is:

  1. sleep 60
  2. schedule 100 "hello-openshift" pods across 2 nodes (see the sketch after this list): https://github.com/openshift/origin/tree/master/examples/hello-openshift
  3. wait until they are all running
  4. sleep 60
  5. schedule 100 more
  6. repeat until reaching 800 pods per node
  7. sleep 60
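
For illustration, one 100-pod batch from step 2 could look like the following ReplicationController; the image and port come from the linked hello-openshift example, while the object name is an assumption about this particular test:

# Hedged sketch of a single 100-pod batch for the density test described above.
apiVersion: v1
kind: ReplicationController
metadata:
  name: hello-openshift-batch-1        # name is an assumption
spec:
  replicas: 100
  selector:
    app: hello-openshift
  template:
    metadata:
      labels:
        app: hello-openshift
    spec:
      containers:
      - name: hello-openshift
        image: openshift/hello-openshift   # tiny HTTP server from the linked example
        ports:
        - containerPort: 8080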

[stacked CPU usage graph]

[stacked memory usage graph]

At 500 pods (somewhere around the 700-second mark), CPU usage is very reasonable (around one CPU core) for the node process and docker combined. The node process in this test is openshift-node, not strictly Kubernetes, but it is based on Kubernetes v1.2.0-36-g4a3f9c5.

About the memory graphs: I did not restart the node service before running the test, so the numbers are about 300 MB higher than they should be, but the growth/trend in RSS is accurate.

@jeremyeder
Author

@dchen1107 as far as what a "benchmark" may look like...

Perhaps we generate a "node scaling score" out of factors like cpu_generation+core_count+GB_RAM+kube_version+other_factors. That score would set max-pods dynamically.

This way we don't have to inject a "test" into the admission-control pipeline or product install paths; the node process could compute the score/max-pods dynamically during its startup phase.

Thoughts?

@vishh
Contributor

vishh commented Mar 24, 2016

+1 for dynamic limits. Those limits should also take into account kubelet's internal design though, specifically around latency.


@timstclair

Another factor we need to consider is probing. As #16943 (comment) shows, aggressive liveness/readiness probing can have a significant impact on performance. We may eventually need to figure out how to account probe usage against the containers being probed.
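
For context, that probe load comes from per-container settings like these; a hedged sketch of the knobs that drive it (the pod name, image, and specific values are illustrative assumptions; shrinking periodSeconds across hundreds of pods multiplies kubelet and runtime work):

# Hedged sketch: the per-container settings that determine probe overhead.
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                       # name is an assumption
spec:
  containers:
  - name: app
    image: openshift/hello-openshift     # any small HTTP server works here
    readinessProbe:
      httpGet:
        path: /
        port: 8080
      periodSeconds: 10                  # probe interval; smaller means more kubelet work
      timeoutSeconds: 1
    livenessProbe:
      httpGet:
        path: /
        port: 8080
      periodSeconds: 10
      failureThreshold: 3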

@vishh
Contributor

vishh commented Mar 24, 2016

@timstclair: Unless we can move to an exec model for all probing, it will be difficult to tackle charging.

@yujuhong
Contributor

Another factor we need to consider is probing. As #16943 (comment) shows, aggressive liveness/readiness probing can have a significant impact on performance. We may eventually need to figure out how to account probe usage against the containers being probed.

That's why I think a benchmark with realistic/customizable workloads is valuable. Users can benchmark their cluster and adjust if they want (e.g., determine the max pods allowed with 10% of resources dedicated).

@timothysc
Member

We also need to take disk resources into account going forward; right now that's a level of overhead that we haven't really captured.

@dchen1107
Member

@jeremyeder I can file a separate benchmark issue. Actually, I was in the middle of filing one when I saw this issue and everyone jumping in to talk about the benchmark.

But on the other hand, I think publishing the benchmark can serve the purpose without continually increasing --max-pods per node. The node team signed up to:

  • work with other teams and the community to define the performance SLOs
  • work with the community to choose one or several representative workloads to generate a benchmark for node performance and scalability
  • continue improving performance
  • continue reducing the system overhead introduced by management
  • improve our node test suite to detect regressions in both performance and resource consumption

But as I mentioned in #23349 (comment), and as several examples listed by others above show, there are too many variables, and a lot of them are out of our control; I don't think we are ready to have a formula that applies to everyone to dynamically figure out --max-pods.

Ideally, each cluster should easily be able to have its own formula with fudge factors based on its environment and requirements, come up with the max-pods for its nodes, and apply that to the node config. If applying the value of max-pods to the node config object is too hard, we should solve that usability issue.

cc/ @davidopp @bgrant0607

@yujuhong
Contributor

I think we all agreed that developing a node benchmark should be the next step. The benchmark will allow users to test their nodes and adjust the kubelet configuration (e.g., --max-pods) accordingly. They can also publish the results and share them with the community. The results can serve as a ballpark for users who just want some configuration to start with. In addition, having the published results will also help us discover issues on different platforms.

Some initial thoughts about what we want in the benchmark:

  1. user-observed latency (e.g., pod startup/deletion, time to restart a container that just died). We need to adhere to the k8s SLOs, and maybe add new kubelet SLOs on top of that.
  2. resource usage (e.g., cpu, memory, disk)

We should have diverse workloads (e.g., probing) and test the node in different scenarios (e.g., steady state vs. batch creation).

What we have now in our e2e tests is very limited. Red Hat and/or the community may want to chime in and share what they have. Anyway, below is what we use today:

  • density test: starts up N pods, then starts an additional pod and tracks its latency.
  • resource usage test: tracks the cpu/memory usage of kubelet and docker on a steady node with N pods. The cpu usage collection in this test relies on cAdvisor, at a granularity of 10s by default.

@bgrant0607
Member

@timothysc Could you please clarify the following comment?

Going forward, I think the only limits should be resource constraints. If needed, we could put pod limits in admission control.

Which resource constraints? The containers'?

Each pod and container requires some amount of resources from the management agents (Kubelet, cadvisor, docker) and kernel. These resources can't be attributed to the cgroups of the containers (we've been working on such things for years internally). Depending on the resources allocated to the system and management agents, on the average number of containers per pod, on the rate of container churn, on the number of containers with probes, assumptions about container failure rates, etc., a different number of pods might be supportable, though there are also inherent limits in the management agents, since they are not perfectly scalable. These factors are complex and numerous.

Additionally, we currently allow best-effort pods to not reserve any fixed amount of resources. This is a deliberate choice. Admins may choose to impose a minimum resource request, but that's independent of the issues mentioned above.

@timothysc
Member

OBJECTIVE
Users would like to achieve much higher density numbers per machine with a large number of underutilized pods. The current --max-pods represents an artificial governor for machines which have ample resources available. Instead, admins would prefer to set some system reserve, as well as resource thresholds (watermarks) after which pods are no longer accepted.

Which resource constraints? The containers'?

I meant available machine resources that exist for the kubelet + container subtree "slice". At some point the machine simply passes an acceptable watermark and no further containers should be accepted. We are not aiming for optimal packing in this use case, so "fudging" the reserve and watermarks can be controlled by the administrator based on their load profiles and history. Admins would like to "set it and forget it".

In conversations on sig-node we all agree on the premise of not putting false limits in place, but at this point I believe we need to hash out the designs that were discussed and make it a reality in 1.3.
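
The watermark idea described here is close to what later shipped as kubelet eviction thresholds; a minimal hedged sketch using the file-based kubelet configuration, with threshold values that are illustrative assumptions only:

# Hedged sketch: eviction thresholds as node-level "watermarks". Once a signal
# crosses a threshold, the kubelet reclaims resources and reports pressure
# rather than taking on more work. Values are illustrative assumptions.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
evictionSoft:
  memory.available: "1Gi"
evictionSoftGracePeriod:
  memory.available: "1m30s"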

@bgrant0607
Member

@timothysc Other considerations:

  • Kubelet, Docker, network and storage plugins, etc. are not perfectly scalable
  • Increased load on these agents impacts quality of service from these components
  • An explicit, predictable limit (as opposed to opaquely just denying requests at some point) helps schedulers make better decisions (e.g., avoiding resource stranding) and helps users understand placement decisions

I agree that accurate, simple, automatically set limits would be desirable.

@derekwaynecarr
Member

In my experience, there is also a real difference between the time it takes to go from 99 to 100 running pods and the time it takes to go from 0 to 100 running pods. Right now, I have been running a loop in a three-node cluster that creates a namespace with a single RC with 500 pods, and I wait for at least 200 of those pods to report back as running before terminating the namespace (I am trying to debug a stuck-terminating-pod flake that is hard to reproduce), but it seems extremely obvious to me that we are less stable going from 0-100 pods running on a node than we are going from 99-100.


@yujuhong
Contributor

seems extremely obvious to me that we are less stable going from 0-100 pods running on a node than we are going from 99-100.

@derekwaynecarr, do you mean that batch creation of 100 pods makes the node less stable than starting a single pod when 99 are already running? We don't limit docker QPS in the kubelet, and creating/deleting pods are the heaviest operations for now. We discussed mitigating this before v1.2, but decided it wasn't necessary for that release. If we are going to claim support for higher pod density, we'd need to add tests and re-evaluate this.

@dchen1107
Member

cc/ @gmarek @wojtek-t This is the issue I mentioned to @gmarek earlier. For the 1.3 release, we plan to publish a node-level benchmark (#23349 (comment)). To do that, we need to define our performance SLOs at the node level.

@derekwaynecarr
Member

@yujuhong - yes. I think QPS is an important thing to keep in mind as we change this number. I was able to discover the pod-stuck-in-terminating problem by overwhelming the docker daemon with this scenario.

@dchen1107
Member

cc/ @coufon

@libratiger

More valuable information: increase fs.inotify.max_user_watches and net.ipv4.neigh.default.gc_thresh*, which improves stability.

Hi, how many pods per node did you test?
Can you provide more detailed information about the improvement? Numbers would be helpful.
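
For anyone wanting to apply the quoted sysctls cluster-wide: they normally land in /etc/sysctl.d/ on each node, and one hedged alternative sketch is a privileged DaemonSet (the object names, image, and values below are illustrative assumptions, not tuning advice):

# Hedged sketch: apply node-level sysctls often raised for high pod density.
# Values and object names are illustrative assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-sysctl-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-sysctl-tuning
  template:
    metadata:
      labels:
        app: node-sysctl-tuning
    spec:
      hostNetwork: true                  # so net.* changes hit the host network namespace
      initContainers:
      - name: apply-sysctls
        image: busybox:1.36
        securityContext:
          privileged: true               # required to write host sysctls
        command:
        - sh
        - -c
        - |
          sysctl -w fs.inotify.max_user_watches=1048576
          sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
          sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
          sysctl -w net.ipv4.neigh.default.gc_thresh3=16384
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9 # keeps the pod running after tuning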

@dElogics

We had production systems running 600 pods/node.

The only issue we faced was that on a kubelet restart, the node flaps between Ready and NotReady (PLEG issues). The solution is to delete the docker directory and restart kubelet and docker together.

Another issue was this, but I'm not sure if it was related to too many pods/node --
moby/moby#5618

@jstangroome

jstangroome commented Mar 30, 2020

We had production systems running 600 pods/node.

The only issue we faced was that on a kubelet restart, the node flaps between Ready and NotReady (PLEG issues). The solution is to delete the docker directory and restart kubelet and docker together.

@dElogics how many containers per Pod? With PLEG issues, the container count can be an important factor. I learned from a sig-scalability talk at KubeCon Seattle 2018 that scalability performance tests only use 2-container Pods.

@rdxmb

rdxmb commented Apr 14, 2020

I am wondering why 100 or 110 is such a hard limit. I have two bare-metal workers, each with 40 CPUs and 125 GB RAM, and also one virtual worker with a smaller amount of CPU and RAM.

As all my containers are defined with a memory limit (most of them <1 GB RAM, only a few with some more GB), this will not be the limiting factor.

The main questions for me seem to be in #23349 (comment)

max-pods is configurable.
We had production systems running 600 pods/node.

How do you do that?

@AndrewSav

max-pods is configurable.
We had production systems running 600 pods/node.

How do you do that?

It's a kubelet configuration parameter.
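
Concretely, a minimal hedged sketch using the file-based kubelet configuration; 600 simply mirrors the number reported above and is not a recommendation:

# Hedged sketch: raising the kubelet pod limit.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 600   # mirrors the report above; the default is 110

The same limit can also be set with the kubelet's --max-pods flag.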

@rdxmb

rdxmb commented Apr 15, 2020

@AndrewSav thanks.

@dElogics

We had production systems running 600 pods/node.
The only issue we faced was that on a kubelet restart, the node flaps between Ready and NotReady (PLEG issues). The solution is to delete the docker directory and restart kubelet and docker together.

@dElogics how many containers per Pod? With PLEG issues, the container count can be an important factor. I learned from a sig-scalability talk at KubeCon Seattle 2018 that scalability performance tests only use 2-container Pods.

Single container per pod.

@airbjorn

Is the state "Open" still current, or is there a kubernetes Release out solving this request?

@ehashman
Member

/sig scalability
/kind feature
/triage accepted

It's clear that some people are running with 500+ pods per node already. I'm not sure if we have scalability tests to guarantee it though. Tagging in scalability for a possible update.

@k8s-ci-robot added the sig/scalability, kind/feature, and triage/accepted labels on Jun 25, 2021
@olljanat

olljanat commented Nov 5, 2021

Has anyone actually done testing with some big numbers of pods? I mean like >= 10,000 pods per node? Where are the bottlenecks? AFAIU kube-proxy might be problematic because of iptables, but when using a CNI with eBPF like Cilium, even that shouldn't be an issue.

I'm coming from the Docker Swarm world and am very surprised to see these very small numbers here. There we run ~100 containers per node in production, on VMs with 4 CPU cores and 32 GB RAM, without problems.

Swarm does not have real RBAC, which is why we have been forced to use those small VMs; but since Kubernetes has good RBAC and policy engines like Kyverno, which make it possible to force containers to run without root privileges, I was expecting that a bare-metal node with 1 TB RAM should easily be able to run something like 1000-2000 pods per node.

@ccnupq

ccnupq commented Apr 19, 2022

Any update on this issue?
When I tested >110 pods/node, we found that new pods launch really slowly (>30s), with no other negative effects.
My node still has lots of CPU and memory.

Why does this happen?
Does the kubelet have some limitation?

@DopeforHope

We also have 3 servers, each with 88 CPUs and 256 GB RAM. But increasing the pod limit to 240 leads to PLEG issues. We run directly on containerd, and after adjusting the kubelet parameter, a reboot plus deleting /var/lib/containerd didn't help.
Any updates or better guides on this one? Maybe some help with debugging the PLEG?

@mboukhalfa
Member

mboukhalfa commented Jan 2, 2024

Where are the bottlenecks?

@olljanat
Any related findings?

@sathieu

sathieu commented Jan 15, 2024

The OpenShift docs recommend changing two other params to improve scalability:

  • kubeAPIBurst
  • kubeAPIQPS
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: <pod_count>  # default 110
kubeAPIQPS: <QPS>  # default 50 since 1.27
kubeAPIBurst: <burst_rate>  # default 100 since 1.27

See also related https://www.redhat.com/en/blog/500_pods_per_node.

ref: KubeletConfiguration

EDIT: those QPS/burst defaults were bumped in 1.27 by #116121, so raising them is probably not needed anymore.

@lindhe

lindhe commented Nov 22, 2024

@harridu @kwenzh

Please refrain from posting comments just to subscribe to the issue; this contributes to a spammy environment for everyone else. Please give the issue a 👍 if you want to draw attention to it, or click "subscribe" on GitHub to get notifications without commenting.

@harridu

harridu commented Nov 24, 2024

I was just about to say, this issue is more important than ever. Today you can get servers >96 cores + HT and tons of RAM. The limited number of pods is a severe restriction. And the workaround, introducing another virtualization layer to run your nodes (KVM, etc) might introduce unexpected side effects.

@olljanat

Where are the bottlenecks?

Answering my own question, as it was requested, and I hope it helps those who are not (yet) that far along the Kubernetes learning curve.

This issue was created in 2016, and at that time the only option was the "iptables" proxy mode. Later, #46580 added support for the kube-proxy parameter --proxy-mode ipvs, which overcomes many of those limitations. With that, you can pass the --max-pods parameter to the kubelet and get a lot more pods running per node.

However, because Kubernetes aims to be backward compatible, the default was not changed and most likely never will be.
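
For reference, a minimal hedged sketch of opting into that mode via the kube-proxy configuration file rather than the --proxy-mode flag (the scheduler choice is an assumption):

# Hedged sketch: selecting IPVS mode in the kube-proxy configuration.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round-robin; scheduler choice is an assumption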

Has anyone actually done testing with some big numbers of pods? I mean like >= 10,000 pods per node?

There are multiple good posts about this topic on the Alibaba and Red Hat blogs. From those you can also get an idea of why both Calico and Cilium have created eBPF-based solutions that completely replace kube-proxy, and how that allows scaling to even bigger numbers.

this issue is more important than ever. Today you can get servers >96 cores + HT and tons of RAM. The limited number of pods is a severe restriction.

Yes, but this is the wrong place to fix the issue. The correct place is either your deployment configuration or, alternatively, the Kubernetes distribution you use.

And the workaround, introducing another virtualization layer to run your nodes (KVM, etc) might introduce unexpected side effects.

My recommendation for anyone who is new to Kubernetes is to use a managed Kubernetes distribution from a public cloud, like AKS, EKS, or GKE. This is because those are battle-tested, and because Kubernetes is intended to be used with a cluster autoscaler, which those public clouds offer.

If you really need to run Kubernetes on-prem, then use something like Nutanix Karbon, Red Hat OpenShift, or VMware Tanzu.

Yes, you will have an extra virtualization layer, but that is not the part you should worry about. Instead, focus on planning how you can treat your Kubernetes clusters as cattle that you can replace with new ones rather than upgrading existing ones, because eventually there will be some change needed in your configuration that will be very tricky to deploy to existing clusters.

If after all these warnings you still want to run Kubernetes on bare metal, then my recommendation would be to look at something like k0s or k3s, as it is much easier to get changes approved in those.

PS. If someone has more questions about this topic, you can reach me, for example, on the Docker or Kubernetes Slack, send me a DM on LinkedIn, or simply send me an email (first name . last name @ gmail).
