Enable dynamic allocation of heapster/eventer cpu request/limit #27185
Conversation
{% set eventer_memory_per_node = 500 -%}
{% set num_nodes = pillar.get('num_nodes', -1) -%}
{% if num_nodes >= 0 -%}
{% set metrics_memory = (200 + num_nodes * metrics_memory_per_node)|string + "Mi" -%}
{% set metrics_cpu = (num_nodes + 80)|string + "m" -%}
This would mean that we would use 2.080 cores for 2000-node clusters. Zach said that would be painful, because heapster then wouldn't fit on 2-core machines.
On the other hand, 1 core seems to be good enough for heapster, but IIUC it's not possible to pass 0.5 millicore to the pod nanny, which is painful...
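For context, a quick shell sketch (illustrative, not from the PR) of what the formula above requests at a few cluster sizes:

# CPU request under the formula above: 80m base + 1m per node.
for num_nodes in 100 1000 2000; do
  echo "${num_nodes} nodes -> $((80 + num_nodes))m"
done
# 2000 nodes -> 2080m, i.e. 2.08 cores, which leaves no room on a 2-core machine.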
As an update - it is possible to pass 0.5 millicore to the nanny.
However, the nanny would treat it as 1 millicore anyway. We would need to change the nanny, which is not trivial, because resource.Quantity (which the nanny uses) doesn't support a "multiplication" operation... heh...
OK - I think I know how to solve it relatively easily. We should add another flag to pod-nanny that allows us to add a given number of millicores per given number of nodes. If we set the second flag to 2, and the existing flag to 1 millicore, this will result in adding 1 millicore per 2 nodes. This should be easy to do - I can help with that tomorrow.
Actually, I have a PR in flight to pod-nanny that will enable passing less-than-millicore values and have them handled correctly:
kubernetes-retired/contrib#1195
So once this is in, we should just change it to "0.5m" and we will all be happy, I believe.
OK - so I implemented and pushed the new version of pod-nanny.
I'm bumping its version in #27268, and in the same PR I'm using the new feature by passing 0.5 millicore per node. That is the most important part of this change, so maybe we can push this one out of 1.3?
I need to add multiplication to Quantity anyway for another use case. I'll take a TODO.
{% set base_eventer_memory = "190Mi" -%}
{% set eventer_memory = base_eventer_memory -%}
{% set metrics_memory_per_node = 4 -%}
{% set metrics_cpu_per_node = 1 -%}
This should be 0.5m per node.
Anyway - I think we should first merge #27268 and only then rebase this one to be compatible with it.
Force-pushed from c50b193 to f6c73c8.
Possible conflict with #27187?
{% set eventer_memory_per_node = 500 -%}
{% set num_nodes = pillar.get('num_nodes', -1) -%}
{% if num_nodes >= 0 -%}
{% set metrics_memory = (200 + num_nodes * metrics_memory_per_node)|string + "Mi" -%}
{% set metrics_cpu = (80 + num_nodes * metrics_cpu_per_node)|string + "m" -%}
Is this equation sufficient? Looking at some sample num_nodes values:
1 node -> 80 + 1 * 0.5 = 80.5m
10 nodes -> 80 + 10 * 0.5 = 85m
100 nodes -> 80 + 100 * 0.5 = 130m
1000 nodes -> 80 + 1000 * 0.5 = 580m
The delta between 10 and 100 nodes is roughly 45m (is that enough?), while from 100 to 1000 it is only 450m, which doesn't seem enough. I wonder if we need adjustments for various "tiers" of nodes.
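A minimal shell sketch (illustrative, not from the PR) that reproduces these sample values; awk is used because the per-node value is fractional:

# Proposed formula: 80m base + 0.5m per node.
for num_nodes in 1 10 100 1000; do
  awk -v n="${num_nodes}" 'BEGIN { printf "%d nodes -> %gm\n", n, 80 + n * 0.5 }'
done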
Yes - it's good enough. In fact, we used to have 100m even for 1000 nodes in the 1.2 release, and it somehow worked.
cc @mwielgus
Note that the limits were used for scheduling but not strictly enforced (ContainerVM didn't support it). The new GCI nodes should enforce them, so this may need to be checked and perhaps revised.
Resolved. @fabioy @Q-Lee @andyzheng0831 PTAL
local -r eventer_memory_per_node="500"
local -r nanny_memory_per_node="200"
if [[ -n "${NUM_NODES:-}" && "${NUM_NODES}" -ge 1 ]]; then
  # NUM_NODES plus one, so the master is counted as well.
  num_kube_nodes="$((${NUM_NODES}+1))"
  metrics_memory="$((${num_kube_nodes} * ${metrics_memory_per_node} + 200))Mi"
  eventer_memory="$((${num_kube_nodes} * ${eventer_memory_per_node} + 200 * 1024))Ki"
  nanny_memory="$((${num_kube_nodes} * ${nanny_memory_per_node} + 90 * 1024))Ki"
  # metrics_cpu_per_node is a float (0.5), so bash integer arithmetic won't do; use awk.
  metrics_cpu=$(echo - | awk "{print ${num_kube_nodes} * ${metrics_cpu_per_node} + 80}")m
awk is a heavyweight tool. Can we use the same approach as the line above to calculate this variable?
It's a float. We could redefine it here as metrics_millicpu_per_node and keep it in integer space, though?
The thing is that I want it to be 0.5. I can implement it as an ordinary fraction, but that seems even worse than awk. I can switch to bc if that's better for some reason.
I did not notice it is a float, so please ignore my previous comment about awk.
@zmerlynn - it's already 0.5 millicpu (0.0005 cpu). So if we want to keep it in integer space, we should probably move to micro-cpus...
Math is hard.
OK, not to complain about bash not having floats, but are we OK with awk, should we move to bc, use micro-cpus, or something else?
I am fine with awk.
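For reference, a minimal sketch of the three options discussed above (awk, bc, and integer micro-cpus). The variable values are illustrative, and the micro-cpu variant assumes the consumer accepts Quantity's "u" suffix:

num_kube_nodes=1001
metrics_cpu_per_node="0.5"  # millicores per node; a float, so bash $(( )) can't handle it

# Option 1: awk, as in the PR.
metrics_cpu="$(echo - | awk "{print ${num_kube_nodes} * ${metrics_cpu_per_node} + 80}")m"

# Option 2: bc, an equally common choice for float math in shell.
metrics_cpu_bc="$(echo "${num_kube_nodes} * ${metrics_cpu_per_node} + 80" | bc)m"

# Option 3: stay in integer space by switching units to micro-cpus.
metrics_microcpu_per_node=500  # 0.5m = 500 micro-cpus
metrics_microcpu="$((num_kube_nodes * metrics_microcpu_per_node + 80 * 1000))u"

echo "${metrics_cpu} ${metrics_cpu_bc} ${metrics_microcpu}"  # 580.5m 580.5m 580500u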
Change is ok, just wanted to highlight a comment I made that may have gotten lost in the discussion: please retest these limits on GCI, where resource limits are enforced more strictly (ContainerVM didn't have all the proper flags set).
@fabioy - this PR only changes the initial resources; pod-nanny adjusts them anyway, and pod-nanny was already changed a few days ago and works fine. So there isn't much risk in it.
@wojtek-t Cool, that's good to know. Thanks.
GCE e2e build/test passed for commit 8617f70.
@k8s-bot verify this issue: #IGNORE
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 8617f70.
Automatic merge from submit-queue
cc @mwielgus @piosz @zmerlynn