
Proposal: Node Allocatable Resources #17201

Merged
1 commit merged on Dec 8, 2015

Conversation

timstclair

Proposal for #13984

Topics for discussion:

  • Which resources must be specified for the system reservation?
  • Should the reserved resource flags be a serialized map or have a flag per reservable resource?
  • How should we handle system components overflowing the system reservation?

cc/ @dchen1107 @bgrant0607 @davidopp @mikedanese @karlkfi @kubernetes/goog-node

@k8s-github-robot

Labelling this PR as size/L

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 13, 2015
@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e test build/test passed for commit 214fb85.

"/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage
patterns and detecting regressions. Eventually we want to cap system component usage to a certain
limit / request. However, this is not currently feasible because 1) docker still uses a significant amount of
computing resources and 2) there is no NodeSpec yet, so we cannot control Kubernetes nodes or know
Member

"NodeSpec" is confusing because there is a Kubernetes type called NodeSpec. I'd suggest you come up with a different term for this.

Author

Done. I believe this is referring to the node config. @dchen1107 is that correct?

Member

I think she is referring to her proposal to have a standardized node-level conformance test and definition of what components must be present (and perhaps not present) on the node for a platform to be supported. She has been calling it "node spec" but I think it is confusing due to our having a NodeSpec object.

Author

Ah, that makes more sense. Fixed.

@timstclair
Author

Thanks @davidopp

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e test build/test passed for commit 2f125dec11abc7ee6a477a272bc454290dbe3cf8.

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e test build/test passed for commit 50fa720.

defined to be:

```
[Allocatable] = [Machine Capacity] - ([Reserved] + [System Reserved] + [Kernel usage (smoothed)])
```
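As a purely illustrative worked example of this formula (none of these numbers appear in the proposal):

```
[Allocatable CPU]    = 8 CPU - (1 CPU + 0.5 CPU + 0.1 CPU) = 6.4 CPU
[Allocatable memory] = 32Gi  - (2Gi + 1Gi + 0.5Gi)         = 28.5Gi
```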
Contributor

This formula makes it pretty much impossible to run the kubelet under the control of another "node agent" (like the Mesos slave) which wants to have control over resources:

  • [Machine Capacity] is still a number derived from the system with some magic inside the kubelet.
  • [Reserved] can be set via the command line (proposed), but not changed on the fly. Even with the Config API, I guess this value can only be constant across the whole cluster. Or will such a Config API allow configs per node?
  • [System Reserved] + [Kernel usage (smoothed)] collide with the same logic in that "node agent".

What we were hoping for in the case of Mesos is to be able to control [Allocatable] directly, dynamically, and independently on each node. This could be done through the API, i.e. the "node agent" would set and update the values in the NodeStatus and the kubelet would use them for admission control. Alternatively, a Go-level call inside the kubelet+mesos-executor code would be fine as well.
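For illustration only, a minimal sketch of the "node agent updates NodeStatus" idea, written against today's client-go API (which postdates this discussion); the node name and resource values are made up, and in practice the kubelet's own status sync would also have to honor the externally set values:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// setAllocatable overwrites a node's reported Allocatable with values chosen
// by an external agent (e.g. a Mesos slave). Illustrative sketch only.
func setAllocatable(ctx context.Context, cs kubernetes.Interface, nodeName string, alloc v1.ResourceList) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Status.Allocatable = alloc
	_, err = cs.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Example: the agent decides this node may hand 6 CPUs and 24Gi to pods.
	alloc := v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("6"),
		v1.ResourceMemory: resource.MustParse("24Gi"),
	}
	if err := setAllocatable(context.Background(), cs, "node-1", alloc); err != nil {
		panic(err)
	}
}
```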

Contributor

IIUC, there are two systems attempting to improve utilization of the resources.
Kubelet requires a constant capacity to guarantee quality of service.
If Mesos is altering the capacity of the node using the allocatable construct, that will confuse the kubelet.

I vote for a pluggable design in the kubelet, where the Mesos version of the kubelet can do whatever it needs without complicating the rest of the kubelet. The Mesos version might have to stub out most of the resource optimizations that the kubelet will perform in the future, because it swaps out the default modules.

Contributor

A pluggable system sounds good. We run the kubelet code in-process in the executor already. So if we could swap out that part of the default kubelet which takes care of machine capacity, system and kernel resources, that would be great. We can easily provide current [Allocatable] from the executor and the kubelet can keep updating the NodeStatus.

Member

@sttts the formula given above is how the kubelet calculates the node allocatable and propagates that information to the upstream Kubernetes layers. On a Mesos node, swapping out system and kernel resources should be very easy; I'm not sure about machine capacity here. We expect capacity from cAdvisor. What about Mesos?

Author

I updated the proposal to address the concerns over dynamically adjusting the resources. While Allocatable cannot be controlled directly in this model, it will eventually be possible to change SystemReserved & Reserved per node.

Author

@sttts

> Our problem with this as a framework is that we don't know the underlying logic, i.e. we only see the final resources which correspond to [Allocatable].

Do you mean your node agent is only aware of the Mesos equivalent of Allocatable? So you don't know what SystemReserved and Reserved should be set to?

Author

OK, suppose we replace `--reserved` with a new flag, `--allocatable`. If `--allocatable` is set, the kubelet doesn't do any calculations for those resources and just reports the value of `--allocatable`. Would that meet your needs? The flag would be moved to the config API as soon as that is ready, and could be changed at that point (although an eviction policy will not come until later).
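A minimal sketch of the semantics being proposed here, under the assumption of a hypothetical `--allocatable` flag already parsed into per-resource amounts; the names and the plain-integer representation are illustrative, not the kubelet's actual code:

```go
// ResourceAmounts maps a resource name (e.g. "cpu", "memory") to an amount in
// its base unit (millicores, bytes). Purely illustrative representation.
type ResourceAmounts map[string]int64

// nodeAllocatable computes what the kubelet would publish in NodeStatus.
// If allocatableOverride is non-nil (i.e. --allocatable was set), every other
// knob is ignored and the override is reported verbatim.
func nodeAllocatable(capacity, reserved, systemReserved, kernelUsage, allocatableOverride ResourceAmounts) ResourceAmounts {
	if allocatableOverride != nil {
		return allocatableOverride
	}
	out := ResourceAmounts{}
	for name, total := range capacity {
		out[name] = total - (reserved[name] + systemReserved[name] + kernelUsage[name])
	}
	return out
}
```

Under this sketch, an external agent such as the Mesos executor could supply the override directly and skip the kubelet's own accounting.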

Contributor

@timstclair the node agent (in the sense of the mesos-slave) might know details about SystemReserved/Reserved. Kubernetes – as a Mesos framework – only sees the outcome of the resource calculation, which is what we get from the SlaveInfo (https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L502), i.e. the equivalent of [Allocatable].

Having `--allocatable` as a flag (and later as a config API object) would be what we need 👍 That said, do I understand correctly that the config API object will be per node? That's important, of course, as nodes might differ.

Contributor

As another idea: if we can enforce [Kernel usage (smoothed)] and [System Reserved] to be zero, and if we know [Capacity] beforehand, then setting [Reserved] to the difference between [Capacity] and what Mesos reports would suffice as well.
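A worked instance of that alternative, with illustrative numbers only:

```
[Reserved] = [Capacity] - [Mesos-reported allocatable]
           = 8 CPU - 6 CPU
           = 2 CPU        (with [System Reserved] = [Kernel usage (smoothed)] = 0)
```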

Member

@sttts Kernel usage is observed from the runtime, and we won't enforce it.

What we propose with `--allocatable` is that once you configure your node with `--allocatable`, all other knobs are ignored.

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e build/test failed for commit 5e93c14.

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e build/test failed for commit 2ddfdce.

- Convert kubelet flags to Config API - Cleaner API, better contract, etc. See
[#12245](https://github.com/kubernetes/kubernetes/issues/12245). Once the flags are migrated, we
may consider allowing updates to reservations.
- Set cgroup limits according to the system reservation - as described in the [overview](#overview)
Contributor

Note: the kubelet must be compatible with environments where these limits are applied outside of the kubelet's scope, for example using systemd.

Contributor

We were "fighting" with that as well in Mesos, setting the container names to values which made the kubelet to not do any cgroup magic. So +1 from us.

Author

Acknowledged. I don't think this changes anything with the rest of this proposal, since that's already the case today? Do you just mean that when we start setting limits we need to account for limits already being set?

Contributor

Kubelet should not perform any re-organization of system daemons in certain environments. We will have to validate that the limits provided to kubelet are not greater than the cgroup limits imposed by the admin or user.
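A minimal sketch of that kind of check, assuming cgroup v1 paths and a hypothetical `systemReservedMemory` value; this is not the kubelet's actual validation code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cgroupMemoryLimit reads the memory limit (in bytes) of a cgroup under the
// cgroup v1 memory hierarchy, e.g. "/system.slice" when systemd manages it.
func cgroupMemoryLimit(cgroupPath string) (int64, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/memory" + cgroupPath + "/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	const systemReservedMemory int64 = 2 << 30 // hypothetical 2Gi reservation
	limit, err := cgroupMemoryLimit("/system.slice")
	if err != nil {
		fmt.Println("could not read cgroup limit:", err)
		return
	}
	if systemReservedMemory > limit {
		fmt.Printf("system reservation (%d bytes) exceeds the admin-imposed cgroup limit (%d bytes)\n",
			systemReservedMemory, limit)
	}
}
```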


Author

Acknowledged. This discussion falls outside the scope of this document though.


#### System Reserved

- **TODO**: Determine which resources *must* be specified (CPU+memory?)
Member

I think the system needs to support reservation of cpu/memory and local disk storage.

Author

Thanks, updated to reflect this.

@dchen1107
Member

LGTM. Could you please do a git rebase and squash those commits? We are ready to go. :-)

@timstclair
Author

Thanks! Done.

@timstclair timstclair added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2015
@dchen1107
Member

@derekwaynecarr re: #17201 (comment)

> Setting a fixed amount of System Reserved is something I want to get away from in the next milestone. Asking operators to assign a fixed amount of resources for daemons that Kubernetes brings to their host (kubelet, kube-proxy, docker) is a really difficult conversation and doesn't work well based on the shape of the pods that may get placed on the node. If I run a 1G pod and a 50MB pod on the same node, my system reserved overhead can be much lower than if I run 40 50MB pods on that node.

System Reserved is renamed to Kube Reserved now. In the long run, we don't expect operators to assign a fixed amount of resources to Kubernetes daemons. Instead, we are going to calculate that based on a formula roughly like this:

KubeReserved = benchmarked kube-daemons' usage + max_pods_per_node * management_overhead_per_pod + padding.

For each release, we are going to measure the benchmark. We already attempted to measure such stats for the 1.1 release, but did not finalize those numbers given that both the kubelet and docker change dramatically from release to release.

max_pods_per_node is a kubelet flag, and eventually it should be configured based on the machine size.
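A purely illustrative instance of that formula (none of these are benchmarked numbers):

```
KubeReserved(memory) = 300Mi (benchmarked kube daemons) + 40 pods * 5Mi/pod + 100Mi (padding)
                     = 600Mi
```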

@k8s-bot

k8s-bot commented Dec 5, 2015

GCE e2e build/test failed for commit 1d0b59fba6fb502d89f34e144873485e6720755f.

@wojtek-t
Member

wojtek-t commented Dec 6, 2015

@timstclair - can you please fix the error:

```
FAIL: changes needed but not made due to --verify
/home/travis/gopath/src/github.com/kubernetes/kubernetes/docs/ is out of date. Please run hack/update-generated-docs.sh
!!! Error in hack/verify-generated-docs.sh:28
```

@timstclair
Author

@k8s-bot unit test this

@timstclair
Author

Ran hack/update-generated-docs.sh

@k8s-github-robot

PR changed after LGTM, removing LGTM.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2015
@timstclair timstclair added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2015
@dchen1107
Member

LGTM again!

@k8s-bot

k8s-bot commented Dec 8, 2015

GCE e2e test build/test passed for commit 54a6da9.

@wojtek-t
Member

wojtek-t commented Dec 8, 2015

@k8s-bot unit test this please

@dchen1107
Member

cc/ @jdef

timstclair pushed a commit to timstclair/kubernetes that referenced this pull request Dec 11, 2015
Adds `Allocatable` to `NodeStatus`, as described in
[17201](kubernetes#17201)
timstclair pushed a commit to timstclair/kubernetes that referenced this pull request Dec 16, 2015
Adds `Allocatable` to `NodeStatus`, as described in
[17201](kubernetes#17201)