
Proposal: Node Allocatable Resources #17201

Merged
1 commit merged on Dec 8, 2015

Conversation

timstclair

Proposal for #13984

Topics for discussion:

  • Which resources must be specified for the system reservation?
  • Should the reserved resource flags be a serialized map or have a flag per reservable resource?
  • How should we handle system components overflowing the system reservation?

cc/ @dchen1107 @bgrant0607 @davidopp @mikedanese @karlkfi @kubernetes/goog-node

@k8s-github-robot

Labelling this PR as size/L

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 13, 2015
@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e test build/test passed for commit 214fb85.

"/kube-proxy", "/system" etc. raw containers for monitoring system component resource usage
patterns and detecting regressions. Eventually we want to cap system component usage to a certain
limit / request. However, this is not currently feasible because 1) docker still uses a significant amount of
computing resources and 2) there is no NodeSpec yet, so we cannot control Kubernetes nodes or know
Member

"NodeSpec" is confusing because there is a Kubernetes type called NodeSpec. I'd suggest you come up with a different term for this.

Author

Done. I believe this is referring to the node config. @dchen1107 is that correct?

Member

I think she is referring to her proposal to have a standardized node-level conformance test and definition of what components must be present (and perhaps not present) on the node for a platform to be supported. She has been calling it "node spec" but I think it is confusing due to our having a NodeSpec object.

Author

Ah, that makes more sense. Fixed.

@timstclair
Author

Thanks @davidopp

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e test build/test passed for commit 2f125dec11abc7ee6a477a272bc454290dbe3cf8.

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e test build/test passed for commit 50fa720.

defined to be:

```
[Allocatable] = [Machine Capacity] - ([Reserved] + [System Reserved] + [Kernel usage (smoothed)])
```
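As a purely illustrative worked example of this formula (none of these numbers appear in the proposal):

```
[Allocatable CPU]    = 8 CPU - (1 CPU + 0.5 CPU + 0.1 CPU) = 6.4 CPU
[Allocatable memory] = 32Gi  - (2Gi + 1Gi + 0.5Gi)         = 28.5Gi
```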
Contributor

This formula makes it pretty much impossible to run the kubelet under the control of another "node agent" (like the Mesos slave) which wants to have control over resources:

  • [Machine Capacity] is still a number derived from the system with some magic inside the kubelet.
  • [Reserved] can be set via the command line (proposed), but not changed on the fly. Even with the Config API, I guess this value can only be constant across the whole cluster. Or will such a Config API allow configs per node?
  • [System Reserved] + [Kernel usage (smoothed)] collide with the same logic in that "node agent".

What we were hoping for in the case of Mesos is to be able to control [Allocatable] directly, dynamically, and independently on each node. This could be done through the API, i.e. the "node agent" would set and update the values in the NodeStatus and the kubelet would use them for admission control. Alternatively, a Go-level call inside the kubelet+mesos-executor code would be fine as well.
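For illustration only, a minimal sketch of the "node agent updates NodeStatus" idea, written against today's client-go API (which postdates this discussion); the node name and resource values are made up, and in practice the kubelet's own status sync would also have to honor the externally set values:

```go
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// setAllocatable overwrites a node's reported Allocatable with values chosen
// by an external agent (e.g. a Mesos slave). Illustrative sketch only.
func setAllocatable(ctx context.Context, cs kubernetes.Interface, nodeName string, alloc v1.ResourceList) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Status.Allocatable = alloc
	_, err = cs.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// Example: the agent decides this node may hand 6 CPUs and 24Gi to pods.
	alloc := v1.ResourceList{
		v1.ResourceCPU:    resource.MustParse("6"),
		v1.ResourceMemory: resource.MustParse("24Gi"),
	}
	if err := setAllocatable(context.Background(), cs, "node-1", alloc); err != nil {
		panic(err)
	}
}
```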

Contributor

IIUC, there are two systems attempting to improve utilization of the resources.
Kubelet requires a constant capacity to guarantee quality of service.
If Mesos is altering the capacity of the node using the allocatable construct, that will confuse the kubelet.

I vote for a pluggable design in the kubelet, where the Mesos version of the kubelet can do whatever it needs without complicating the rest of the kubelet. The Mesos version might have to stub out most of the resource optimizations that the kubelet will perform in the future, because it swaps out the default modules.

Contributor

A pluggable system sounds good. We run the kubelet code in-process in the executor already. So if we could swap out that part of the default kubelet which takes care of machine capacity, system and kernel resources, that would be great. We can easily provide current [Allocatable] from the executor and the kubelet can keep updating the NodeStatus.

Member

@sttts the formula given above is how the kubelet calculates the node allocatable and propagates that information to the upstream Kubernetes layers. On a Mesos node, swapping out system and kernel resources should be very easy; I'm not sure about machine capacity here. We expect capacity from cAdvisor. What about Mesos?

Author

I updated the proposal to address the concerns over dynamically adjusting the resources. While Allocatable cannot be controlled directly in this model, it will eventually be possible to change SystemReserved & Reserved per node.

Author

@sttts

> Our problem with this as a framework is that we don't know the underlying logic, i.e. we only see the final resources which correspond to [Allocatable].

Do you mean your node agent is only aware of the Mesos equivalent of Allocatable? So you don't know what SystemReserved and Reserved should be set to?

Author

OK, suppose we replace `--reserved` with a new flag, `--allocatable`. If `--allocatable` is set, the kubelet doesn't do any calculations for those resources and just reports the value of `--allocatable`. Would that meet your needs? The flag would be moved to the config API as soon as that is ready, and could be changed at that point (although an eviction policy will not come until later).
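A minimal sketch of the semantics being proposed here, under the assumption of a hypothetical `--allocatable` flag already parsed into per-resource amounts; the names and the plain-integer representation are illustrative, not the kubelet's actual code:

```go
// ResourceAmounts maps a resource name (e.g. "cpu", "memory") to an amount in
// its base unit (millicores, bytes). Purely illustrative representation.
type ResourceAmounts map[string]int64

// nodeAllocatable computes what the kubelet would publish in NodeStatus.
// If allocatableOverride is non-nil (i.e. --allocatable was set), every other
// knob is ignored and the override is reported verbatim.
func nodeAllocatable(capacity, reserved, systemReserved, kernelUsage, allocatableOverride ResourceAmounts) ResourceAmounts {
	if allocatableOverride != nil {
		return allocatableOverride
	}
	out := ResourceAmounts{}
	for name, total := range capacity {
		out[name] = total - (reserved[name] + systemReserved[name] + kernelUsage[name])
	}
	return out
}
```

Under this sketch, an external agent such as the Mesos executor could supply the override directly and skip the kubelet's own accounting.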

Contributor

@timstclair the node agent (in the sense of the mesos-slave) might know details about SystemReserved/Reserved. Kubernetes – as a Mesos framework – only sees the outcome of the resource calculation, which is what we get from the SlaveInfo (https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L502), i.e. the equivalent of [Allocatable].

Having `--allocatable` as a flag (and later as a config API object) would be what we need 👍 That said, do I understand correctly that the config API object will be per node? That's important, of course, as nodes might differ.

Contributor

As another idea: if we can enforce [Kernel usage (smoothed)] and [System Reserved] to be zero, and if we know [Capacity] beforehand, then setting [Reserved] to the difference between [Capacity] and what Mesos reports would suffice as well.
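A worked instance of that alternative, with illustrative numbers only:

```
[Reserved] = [Capacity] - [Mesos-reported allocatable]
           = 8 CPU - 6 CPU
           = 2 CPU        (with [System Reserved] = [Kernel usage (smoothed)] = 0)
```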

Member

@sttts Kernel usage is observed from the runtime, and we won't enforce it.

What we propose with `--allocatable` is that once you configure your node with `--allocatable`, all other knobs are ignored.

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e build/test failed for commit 5e93c14.

@k8s-bot

k8s-bot commented Nov 13, 2015

GCE e2e build/test failed for commit 2ddfdce.

- Convert kubelet flags to Config API - Cleaner API, better contract, etc. See
[#12245](https://github.com/kubernetes/kubernetes/issues/12245). Once the flags are migrated, we
may consider allowing updates to reservations.
- Set cgroup limits according to the system reservation - as described in the [overview](#overview)
Contributor

Note: the kubelet must be compatible with environments where these limits are applied outside of the kubelet's scope, for example using systemd.

Contributor

We were "fighting" with that as well in Mesos, setting the container names to values which made the kubelet to not do any cgroup magic. So +1 from us.

Author

Acknowledged. I don't think this changes anything with the rest of this proposal, since that's already the case today? Do you just mean that when we start setting limits we need to account for limits already being set?

Contributor

Kubelet should not perform any re-organization of system daemons in certain environments. We will have to validate that the limits provided to kubelet are not greater than the cgroup limits imposed by the admin or user.
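A minimal sketch of that kind of check, assuming cgroup v1 paths and a hypothetical `systemReservedMemory` value; this is not the kubelet's actual validation code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cgroupMemoryLimit reads the memory limit (in bytes) of a cgroup under the
// cgroup v1 memory hierarchy, e.g. "/system.slice" when systemd manages it.
func cgroupMemoryLimit(cgroupPath string) (int64, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/memory" + cgroupPath + "/memory.limit_in_bytes")
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	const systemReservedMemory int64 = 2 << 30 // hypothetical 2Gi reservation
	limit, err := cgroupMemoryLimit("/system.slice")
	if err != nil {
		fmt.Println("could not read cgroup limit:", err)
		return
	}
	if systemReservedMemory > limit {
		fmt.Printf("system reservation (%d bytes) exceeds the admin-imposed cgroup limit (%d bytes)\n",
			systemReservedMemory, limit)
	}
}
```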


Author

Acknowledged. This discussion falls outside the scope of this document though.


#### System Reserved

- **TODO**: Determine which resources *must* be specified (CPU+memory?)
Member

I think the system needs to support reservation of cpu/memory and local disk storage.

Author

Thanks, updated to reflect this.

@dchen1107
Member

LGTM. Could you please do a git rebase and squash those commits? We are ready to go. :-)

@timstclair
Author

Thanks! Done.

@timstclair timstclair added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2015
@dchen1107
Member

@derekwaynecarr re: #17201 (comment)

> Setting a fixed amount of System Reserved is something I want to get away from in the next milestone. Asking operators to assign a fixed amount of resources for daemons that Kubernetes brings to their host (kubelet, kube-proxy, docker) is a really difficult conversation and doesn't work well based on the shape of the pods that may get placed on the node. If I run a 1G pod and a 50MB pod on the same node, my system reserved overhead can be much lower than if I run 40 50MB pods on that node.

System Reserved is renamed to Kube Reserved now. In the long run, we don't expect operators to assign a fixed amount of resources to Kubernetes daemons. Instead, we are going to calculate that based on a formula roughly like this:

KubeReserved = benchmarked kube-daemons' usage + max_pods_per_node * management_overhead_per_pod + padding.

For each release, we are going to measure the benchmark. We already attempted to measure such stats for the 1.1 release, but did not finalize those numbers given that both the kubelet and docker change dramatically from release to release.

max_pods_per_node is a kubelet flag, and eventually it should be configured based on the machine size.
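A purely illustrative instance of that formula (none of these are benchmarked numbers):

```
KubeReserved(memory) = 300Mi (benchmarked kube daemons) + 40 pods * 5Mi/pod + 100Mi (padding)
                     = 600Mi
```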

@k8s-bot

k8s-bot commented Dec 5, 2015

GCE e2e build/test failed for commit 1d0b59fba6fb502d89f34e144873485e6720755f.

@wojtek-t
Member

wojtek-t commented Dec 6, 2015

@timstclair - can you please fix the error:

```
FAIL: changes needed but not made due to --verify
/home/travis/gopath/src/github.com/kubernetes/kubernetes/docs/ is out of date. Please run hack/update-generated-docs.sh
!!! Error in hack/verify-generated-docs.sh:28
```

@timstclair
Author

@k8s-bot unit test this

@timstclair
Author

Ran hack/update-generated-docs.sh

@k8s-github-robot

PR changed after LGTM, removing LGTM.

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2015
@timstclair timstclair added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 8, 2015
@dchen1107
Member

LGTM again!

@k8s-bot

k8s-bot commented Dec 8, 2015

GCE e2e test build/test passed for commit 54a6da9.

@wojtek-t
Member

wojtek-t commented Dec 8, 2015

@k8s-bot unit test this please

@dchen1107
Member

cc/ @jdef

timstclair pushed a commit to timstclair/kubernetes that referenced this pull request Dec 11, 2015
Adds `Allocatable` to `NodeStatus`, as described in
[17201](kubernetes#17201)
timstclair pushed a commit to timstclair/kubernetes that referenced this pull request Dec 16, 2015
Adds `Allocatable` to `NodeStatus`, as described in
[17201](kubernetes#17201)