Introduce Allocable to API Node.Status #13984

Closed
dchen1107 opened this issue Sep 15, 2015 · 10 comments

@dchen1107 (Member) commented Sep 15, 2015

Currently Node.Status has Capacity, but no concept of machine Allocable. Introducing it would serve several purposes (a sketch of the proposed shape follows the list):

  • For the Kubernetes 1.0 release, we introduced raw containers such as "/docker-daemon", "/kubelet", "/kube-proxy", and "/system" so that we could monitor their resource usage patterns and detect regressions easily. In the long run, we want to cap their usage under certain limits / requests. We don't do that yet because 1) docker still uses a lot of compute resources, and the consequences of constraining docker's resource consumption are severe; 2) there is no NodeSpec yet, we cannot fully control Kubernetes nodes, and OSS users might introduce arbitrary daemons to a given node, which makes /system unmanageable. Even with the raw containers above, we cannot do full resource management / control on the node, but introducing an Allocable concept to the node could prevent really bad resource overcommit.
  • For Mesos, Hadoop, etc. integrations, those systems might want to partition the compute resources on a given node and limit how much the Kubelet uses; meanwhile they can query the Kubelet and reserve some portion of the rest for their own purposes.
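A minimal sketch of what the proposed shape could look like, assuming Allocable sits next to Capacity in NodeStatus. The types below are simplified stand-ins for the real api package (api.ResourceList maps resource names to resource.Quantity values), and the "allocable" field name follows this proposal rather than a settled API:

```go
// Hypothetical sketch only: simplified stand-ins for api.ResourceList and
// the proposed NodeStatus field; not the final API.
package api

// ResourceList maps a resource name (e.g. "cpu", "memory") to a quantity
// string (e.g. "3500m", "7Gi"); the real type uses resource.Quantity values.
type ResourceList map[string]string

type NodeStatus struct {
	// Capacity is the total amount of resources on the node.
	Capacity ResourceList `json:"capacity,omitempty"`
	// Allocable is the portion of Capacity available to pods: Capacity minus
	// whatever is reserved for system daemons (docker, kubelet, kube-proxy,
	// /system, ...). The name follows this proposal.
	Allocable ResourceList `json:"allocable,omitempty"`
	// ... existing NodeStatus fields unchanged ...
}
```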

I propose:

  1. Introduce a flag called --allocable-resources to the Kubelet for now, and report Allocable to upstream layers. In the long run, we could replace such a flag with more sophisticated configuration through some machine / node management components.
  2. All upstream control components, including the scheduler and kubelet, should do feasibility checking against Node.Status.Allocable instead of Capacity (see the sketch after this list).
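A minimal sketch of point 2, assuming a simplified scheduler-style feasibility check; the fits function, the reduced resource model (CPU and memory only), and the example numbers are illustrative, not actual scheduler code:

```go
// Sketch of checking pod feasibility against Allocable rather than Capacity.
package main

import "fmt"

// millicores / bytes keep the arithmetic simple for the sketch.
type resources struct {
	milliCPU int64
	memory   int64
}

type nodeStatus struct {
	capacity  resources
	allocable resources // capacity minus reservations for system daemons
}

// fits reports whether the pod's requests, added to what is already
// requested on the node, stay within Allocable.
func fits(podRequest, requested resources, status nodeStatus) bool {
	return podRequest.milliCPU+requested.milliCPU <= status.allocable.milliCPU &&
		podRequest.memory+requested.memory <= status.allocable.memory
}

func main() {
	node := nodeStatus{
		capacity:  resources{milliCPU: 4000, memory: 8 << 30},
		allocable: resources{milliCPU: 3500, memory: 7 << 30},
	}
	requested := resources{milliCPU: 3000, memory: 5 << 30}
	pod := resources{milliCPU: 600, memory: 1 << 30}
	// Fits under Capacity (3600m <= 4000m) but not under Allocable (3600m > 3500m).
	fmt.Println("fits:", fits(pod, requested, node))
}
```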

cc/ @bgrant0607 @davidopp @sttts @karlkfi @vishh

@dchen1107 dchen1107 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/api Indicates an issue on api area. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 15, 2015
@dchen1107 (Member, Author)

cc/ @kubernetes/goog-node

@sttts (Contributor) commented Sep 15, 2015

Allocable values should be changeable during the node life cycle. In the Mesos case, the resources of a slave might change dynamically (technically, when the executor re-registers). It's enough if the values can be patched on the apiserver object by the executor.
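A minimal sketch of that flow, assuming the executor patches the node status subresource over the apiserver's REST API. The "allocable" field name follows this proposal, and the apiserver address, node name, and quantities are placeholders (auth omitted):

```go
// Hypothetical sketch: PATCH the node status subresource with new Allocable values.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Field name "allocable" and the quantities are illustrative only.
	patch := []byte(`{"status":{"allocable":{"cpu":"3","memory":"6Gi"}}}`)

	req, err := http.NewRequest(http.MethodPatch,
		"http://localhost:8080/api/v1/nodes/node-1/status", bytes.NewReader(patch))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/strategic-merge-patch+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```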

@bgrant0607 (Member)

I assume the kubelet will post the value to the apiserver.

Would a Kubelet config file be easier for you to update than its flags? #12245 is in progress, and we also hope to wrap up #1627 in the 1.2 timeframe.

@karlkfi (Contributor) commented Sep 17, 2015

As long as the config is settable for initial launch of a new kubelet and updatable at runtime, I'm not sure the method is super important.

@dchen1107 (Member, Author)

#14532

@derekwaynecarr (Member)

The primary operator goal here is that I should be able to eliminate the need to run a static pod for resource reservation, and the kubelet should support a dynamic resource reservation model for incompressible resources like memory/disk. For things like CPU, I know we have issues where CPU usage spikes as the number of pods on the node increases, but I am less concerned about that in the near term.

I need to take a deeper look tomorrow, but I think I recall that there are open issues to resolve around how we re-parent system daemons when running in a systemd environment.

Open question:

If/when we reparent all containers in a common cgroup based on qos tier, do you guys have any thoughts on differentiating allocable based on qos tier at all?

@timstclair

I'm not sure I understand the question. Are you proposing having different reservations at different QoS tiers? I don't see how that would work since kubelet doesn't control what is running in the reserved portions.

@vishh (Contributor) commented Nov 16, 2015

> I need to take a deeper look tomorrow, but I think I recall that there are open issues to resolve around how we re-parent system daemons when running in a systemd environment.

Kubelet can auto-detect systemd deployments and avoid re-parenting system daemons.

> If/when we reparent all containers in a common cgroup based on qos tier, do you guys have any thoughts on differentiating allocable based on qos tier at all?

Are you referring to per-QoS-class quotas? If the node exposes detailed usage information, the policy around how resources are distributed across QoS classes can probably be managed in higher layers.

@derekwaynecarr (Member)

@vishh - makes sense.

@dchen1107 (Member, Author)

I am closing this one. We are going to take measurements once the release is cut and decide on the values for those flags.
