Proposal: Node Allocatable Resources #17201
Conversation
Labelling this PR as size/L
GCE e2e test build/test passed for commit 214fb85.
"/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage | ||
patterns and detecting regressions. Eventually we want to cap system component usage to a certain | ||
limit / request. However this is not currently feasible because 1) docker still uses tons of | ||
computing resources and 2) there is no NodeSpec yet, so we cannot control Kubenetes nodes or know |
"NodeSpec" is confusing because there is a Kubernetes type called NodeSpec. I'd suggest you come up with a different term for this.
Done. I believe this is referring to the node config. @dchen1107 is that correct?
I think she is referring to her proposal to have a standardized node-level conformance test and definition of what components must be present (and perhaps not present) on the node for a platform to be supported. She has been calling it "node spec" but I think it is confusing due to our having a NodeSpec object.
Ah, that makes more sense. Fixed.
Force-pushed from 2f125de to 50fa720.
Thanks @davidopp
GCE e2e test build/test passed for commit 2f125dec11abc7ee6a477a272bc454290dbe3cf8.
GCE e2e test build/test passed for commit 50fa720.
defined to be:

```
[Allocatable] = [Machine Capacity] - ([Reserved] + [System Reserved] + [Kernel usage (smoothed)])
```
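As a reader's aid (not part of the proposal itself), here is a minimal Go sketch of how this formula could be evaluated. `ResourceList`, the function name, and all the numbers are made up for illustration; the real kubelet uses richer resource types.

```go
package main

import "fmt"

// ResourceList is a stand-in for a map of resource name to quantity
// (e.g. millicores of CPU, bytes of memory).
type ResourceList map[string]int64

// allocatable applies the proposed formula:
// Allocatable = Capacity - (Reserved + SystemReserved + SmoothedKernelUsage)
func allocatable(capacity, reserved, systemReserved, kernelUsage ResourceList) ResourceList {
	result := ResourceList{}
	for name, capQty := range capacity {
		qty := capQty - (reserved[name] + systemReserved[name] + kernelUsage[name])
		if qty < 0 {
			qty = 0 // never report a negative allocatable value
		}
		result[name] = qty
	}
	return result
}

func main() {
	capacity := ResourceList{"cpu": 4000, "memory": 16 << 30}       // 4 cores (millicores), 16 GiB
	reserved := ResourceList{"cpu": 500, "memory": 1 << 30}         // operator-set reservation
	systemReserved := ResourceList{"cpu": 200, "memory": 512 << 20} // non-Kubernetes system daemons
	kernelUsage := ResourceList{"cpu": 0, "memory": 256 << 20}      // smoothed kernel observation
	fmt.Println(allocatable(capacity, reserved, systemReserved, kernelUsage))
}
```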
This formula makes it pretty much impossible to run the kubelet under the control of another "node agent" (like the Mesos slave) which wants to have control over resources:
- [Machine Capacity] is still a number which is derived from the system with some magic inside the kubelet
- [Reserved] can be set via command line (proposed), but not changed on-the-fly. Even with the Config API I guess this value can only be constant in the whole cluster. Or will such a Config API allow configs per node?
- [System Reserved] + [Kernel usage (smoothed)] collide with the same logic in that "node agent".
What we were hoping for in the case of Mesos is to be able to control [Allocatable] directly, dynamically, and independently on each node. This could be done through the API, i.e. the "node agent" would set and update the values in the NodeStatus and the kubelet would use them for admission control. Alternatively, some Go-level call inside the kubelet+mesos-executor code would be fine as well.
IIUC, there are two systems attempting to improve utilization of the resources. Kubelet requires a constant capacity to guarantee quality of service. If Mesos is altering the capacity of the node using the allocatable construct, that will confuse the kubelet.
I vote for a pluggable design in the kubelet, where the Mesos version of the kubelet can do whatever it needs without complicating the rest of the kubelet. The Mesos version might have to stub out most of the resource optimizations that the kubelet will perform in the future, because it swaps out the default modules.
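To make the idea concrete, a rough sketch of what such a pluggable seam could look like; the interface and every name in it are hypothetical, not anything that exists in the kubelet today:

```go
package kubelet

// NodeResources is a hypothetical summary of the values the kubelet would
// publish in NodeStatus.
type NodeResources struct {
	Capacity    map[string]int64
	Allocatable map[string]int64
}

// ResourceProvider is a hypothetical plug point. The default implementation
// would derive Allocatable from cAdvisor capacity minus the reservations in
// this proposal, while a Mesos-flavored kubelet could simply return the
// values handed down by its executor.
type ResourceProvider interface {
	NodeResources() (NodeResources, error)
}
```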
A pluggable system sounds good. We run the kubelet code in-process in the executor already. So if we could swap out that part of the default kubelet which takes care of machine capacity, system and kernel resources, that would be great. We can easily provide current [Allocatable] from the executor and the kubelet can keep updating the NodeStatus.
@sttts the formula given above is how the Kubelet calculates the node allocatable and propagates that information to the upstream Kubernetes layers. On a Mesos node, swapping system and kernel resources should be very easy; I'm not sure about machine capacity here. We expect capacity from cAdvisor. What about Mesos?
I updated the proposal to address the concerns over dynamically adjusting the resources. While Allocatable cannot be controlled directly in this model, SystemReserved & Reserved will eventually be able to be changed per-node.
Our problem with this as a framework is that we don't know the underlying logic, i.e. we only see the final resources which correspond to [Allocatable].
Do you mean your node agent is only aware of the Mesos equivalent of Allocatable? So you don't know what `SystemReserved` and `Reserved` should be set to?
Ok, suppose we replace `--reserved` with a new flag `--allocatable`. If `--allocatable` is set, the Kubelet doesn't do any calculations for those resources and just reports the value of `--allocatable`. Would that meet your needs? The flag would be moved to the config API as soon as that is ready, and could be changed at that point (although an eviction policy will not come until later).
@timstclair the node agent (in the sense of mesos-slave) might know details about `SystemReserved`/`Reserved`. Kubernetes – as a Mesos framework – only sees the outcome of the resource calculation, which is what we get from the `SlaveInfo` (https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L502), i.e. the equivalent of [Allocatable].
Having `--allocatable` as a flag (and later as a config API object) would be what we need 👍 That said, do I understand correctly that the config API object will be per node? That's important, of course, as nodes might differ.
As another idea: if we can enforce `[Kernel usage (smoothed)]` and `[System Reserved]` to be zero, and if we know `[Capacity]` beforehand, setting `[Reserved]` to the difference between `[Capacity]` and what Mesos reports would suffice as well.
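For example (illustrative numbers only): if `[Capacity]` is 32 GiB of memory and Mesos reports 28 GiB as offerable, the executor would set `[Reserved]` to 32 - 28 = 4 GiB, and with the other two terms forced to zero the kubelet would compute `[Allocatable]` = 28 GiB, matching the Mesos view.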
@sttts Kernel usage is observed from the runtime, and we won't enforce it.
What we propose with `--allocatable` is that once you have configured your node with `--allocatable`, all other knobs are ignored.
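Reusing the `ResourceList` and `allocatable` helpers from the sketch near the top of this thread (again, hypothetical names only, not the proposed flag handling itself), that precedence might look like:

```go
// KubeletConfig is a hypothetical slice of kubelet configuration relevant here.
type KubeletConfig struct {
	Allocatable ResourceList // from a hypothetical --allocatable flag; nil if unset
	Reserved    ResourceList // from the proposed --reserved flag
}

// effectiveAllocatable sketches the described precedence: if --allocatable is
// given it is reported verbatim and every other reservation knob is ignored;
// otherwise fall back to the capacity-minus-reservations formula.
func effectiveAllocatable(cfg KubeletConfig, capacity, systemReserved, kernelUsage ResourceList) ResourceList {
	if cfg.Allocatable != nil {
		return cfg.Allocatable
	}
	return allocatable(capacity, cfg.Reserved, systemReserved, kernelUsage)
}
```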
GCE e2e build/test failed for commit 5e93c14.
GCE e2e build/test failed for commit 2ddfdce.
- Convert kubelet flags to Config API - Cleaner API, better contract, etc. See
[#12245](https://github.com/kubernetes/kubernetes/issues/12245). Once the flags are migrated, we
may consider allowing updates to reservations.
- Set cgroup limits according to the system reservation - as described in the [overview](#overview)
Note: Kubelet must be compatible with environments where these limits are applied outside of kubelet's scope - for example using systemd.
We were "fighting" with that as well in Mesos, setting the container names to values which made the kubelet to not do any cgroup magic. So +1 from us.
Acknowledged. I don't think this changes anything with the rest of this proposal, since that's already the case today? Do you just mean that when we start setting limits we need to account for limits already being set?
Kubelet should not perform any re-organization of system daemons in certain environments. We will have to validate that the limits provided to kubelet are not greater than the cgroup limits imposed by the admin or user.
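One way to picture that validation, as a minimal sketch only: it assumes a cgroup v1 memory hierarchy mounted at the conventional path and a hypothetical reservation value, and is not how the kubelet actually reads limits.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// externalMemoryLimit reads the memory limit already imposed on a cgroup by
// the admin or by systemd, so a reservation larger than that limit could be
// rejected. Assumes a cgroup v1 memory hierarchy at the usual mount point.
func externalMemoryLimit(cgroup string) (int64, error) {
	raw, err := os.ReadFile("/sys/fs/cgroup/memory" + cgroup + "/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}

func main() {
	requestedSystemReserved := int64(2 << 30) // hypothetical 2 GiB reservation
	limit, err := externalMemoryLimit("/system.slice")
	if err != nil {
		fmt.Println("no externally imposed limit found:", err)
		return
	}
	if requestedSystemReserved > limit {
		fmt.Println("system reservation exceeds the cgroup limit set by the admin")
	}
}
```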
Acknowledged. This discussion falls outside the scope of this document though.
#### System Reserved

- **TODO**: Determine which resources *must* be specified (CPU+memory?)
I think the system needs to support reservation of CPU, memory, and local disk storage.
Thanks, updated to reflect this.
LGTM. Could you please do a git rebase and squash those commits? We are ready to go. :-)
Force-pushed from 3ee7e29 to 1d0b59f.
Thanks! Done.
@derekwaynecarr re: #17201 (comment)
System Reserved is renamed to Kube Reserved now. In the long run, we don't expect operators to assign a fixed amount of resources to the Kubernetes daemons. Instead, we are going to calculate that based on a formula roughly like this: KubeReserved = benchmarked kube-daemons' usage + max_pods_per_node * management_overhead_per_pod + padding. For each release, we are going to measure that benchmark. We already attempted to measure such stats for the 1.1 release, but did not finalize those numbers given that both the kubelet and Docker changed dramatically over the release. max_pods_per_node is a kubelet flag, and eventually it should be configured based on the machine size.
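For readers, a back-of-the-envelope version of that formula in Go; every number and name below is a placeholder, not a measured benchmark:

```go
// kubeReservedMemory sketches the estimation described above:
// KubeReserved = benchmarked daemon usage + maxPods * per-pod overhead + padding.
func kubeReservedMemory(daemonUsageBytes, perPodOverheadBytes, paddingBytes, maxPods int64) int64 {
	return daemonUsageBytes + maxPods*perPodOverheadBytes + paddingBytes
}

// Example with made-up numbers: a 300 MiB daemon baseline, 110 pods at 4 MiB
// of management overhead each, and 100 MiB of padding gives roughly 840 MiB.
```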
GCE e2e build/test failed for commit 1d0b59fba6fb502d89f34e144873485e6720755f.
@timstclair - can you please fix the error: FAIL: changes needed but not made due to --verify
@k8s-bot unit test this
Force-pushed from 1d0b59f to 54a6da9.
Ran
PR changed after LGTM, removing LGTM.
LGTM again!
GCE e2e test build/test passed for commit 54a6da9.
@k8s-bot unit test this please
Proposal: Node Allocatable Resources
cc/ @jdef
Adds `Allocatable` to `NodeStatus`, as described in [17201](kubernetes#17201)
Proposal for #13984
Topics for discussion:
cc/ @dchen1107 @bgrant0607 @davidopp @mikedanese @karlkfi @kubernetes/goog-node