Description
There are lots of reasons why the "system" wants to look at an object, like a pod, and modify and/or act on it. There are also a number of places where we can put in "hooks" for these actions. We'll end up with a better system in the long run, I think, if we put some thought into what hooks to use in what situations.
Places for hooks
- kubectl, other clients
- proxy in front of apiserver
- apiserver
- post-apiserver
Within apiserver, there are both hardcoded actions on objects, and extensible ones (such as what @derekwaynecarr has implemented with "admission control", #3472 and #3319).
The last location, post-apiserver, is a somewhat new concept, so it deserves a bit of explanation. After a pod is POSTed to the apiserver and persisted to etcd, some other component (a different process than the apiserver) that is watching for new objects will see it, and act on it and/or modify it. The pod should not start running on a Kubelet until all the things that need to act on it have done so. @smarterclayton called this a "finalizer". The scheduler can be viewed as a finalizer that takes pods that have everything except a HostIP set, and sets the HostIP.
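To make the "scheduler as a finalizer" idea concrete, here is a minimal sketch in Go. The `Pod` struct, the `scheduleFinalizer` function, and the round-robin host choice are all hypothetical simplifications for illustration; they are not the real API types or the real scheduling logic.

```go
package main

import "fmt"

// Pod is a hypothetical, heavily simplified stand-in for the real API object.
type Pod struct {
	Name   string
	HostIP string // empty until the scheduler finalizer assigns a node
}

// scheduleFinalizer sets HostIP on a pod that does not have one yet.
// It reports whether it changed the pod. Picking hosts[next%len(hosts)]
// is a placeholder for a real placement decision.
func scheduleFinalizer(pod *Pod, hosts []string, next int) bool {
	if pod.HostIP != "" || len(hosts) == 0 {
		return false // already scheduled, or nowhere to put it
	}
	pod.HostIP = hosts[next%len(hosts)]
	return true
}

func main() {
	hosts := []string{"10.0.0.1", "10.0.0.2"}
	pending := []*Pod{{Name: "a"}, {Name: "b", HostIP: "10.0.0.9"}, {Name: "c"}}
	for i, p := range pending {
		changed := scheduleFinalizer(p, hosts, i)
		fmt.Printf("%s -> %s (changed=%v)\n", p.Name, p.HostIP, changed)
	}
}
```

The point of the sketch is only that the finalizer acts on objects with a specific field missing and leaves everything else alone.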
Pros/cons of each hook location
- kubectl, other clients
  - hooks won't be mandatory; users could just use a different client.
  - updating all clients to change behavior is very difficult.
  - can interact with the user's local directory, templates, etc.
- proxy
  - can be made mandatory.
  - may be a pain to set up?
- apiserver
  - can prevent an object from ever being persisted to storage.
  - currently the only place where atomic read-then-write is possible.
  - adding an action requires modifying the apiserver binary.
  - adding an action requires checking in changes to the kubernetes project or maintaining a branch.
- post-apiserver (finalizer)
  - easy for users to customize by maintaining their own component, in a separate repository if necessary.
  - can't "reject" an object; it is hard for a user to understand why the POST succeeds but the pod never runs.
Use cases for hooked actions
- Set default resource limits on pods
- Reject pods with resource limits above or below sane levels. Admission control plugins: LimitRanger and ResourceQuota #3057
- Limit aggregate amount of resources used by a tenant, or by objects matching some selector. Admission control plugins: LimitRanger and ResourceQuota #3057
- Reject Pods with resource limits which are difficult shapes for the system to schedule (e.g. lots of ram and very little CPU or vice versa.)
- Prevent creation of too many objects of any kind, either because it is obviously user error, or because it will hurt the system. Admission control plugins: LimitRanger and ResourceQuota #3057
- Schedule pods to nodes (the existing scheduler)
- Set up network routes for pods (@pmorie working on something like this for OpenShift, I think)
- Custom allocator for IP addresses, vlans, etc (Proposal: deouple networking for segmentation and other use cases #3350)
- Something that distributes secrets to nodes via some side channel for the pods to use
- Pod limit auto-adjuster
Which hooks to use for what
Suggested guidelines for what hooks to use for what types of actions
- Prefer the apiserver if you need to prevent some object from ever "executing"
  - Prefer to say no as early as possible, since that makes debugging easier.
  - e.g. resource quotas in apiserver
- Put it in the apiserver if the act of persisting the object could be harmful
  - protect apiserver storage space
  - e.g. object quotas in apiserver
- Prefer a finalizer (outside apiserver) otherwise.
  - No need to recompile apiserver or commit to GitHub.
  - Separation of responsibilities
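The guidelines above boil down to a two-question decision. A rough sketch in Go follows; the `Action` type, its field names, and `chooseHook` are made up for illustration and only encode the rule as stated in this proposal.

```go
package main

import "fmt"

// Action describes a hooked action in terms of the two questions the
// guidelines ask. All names here are hypothetical.
type Action struct {
	Name           string
	MustReject     bool // must be able to stop the object from ever "executing"
	PersistHarmful bool // merely persisting the object could be harmful
}

// chooseHook applies the suggested guideline: apiserver when rejection or
// protection of storage is needed, a finalizer otherwise.
func chooseHook(a Action) string {
	if a.MustReject || a.PersistHarmful {
		return "apiserver"
	}
	return "finalizer"
}

func main() {
	actions := []Action{
		{Name: "resource quota", MustReject: true},
		{Name: "object quota", PersistHarmful: true},
		{Name: "default resource limits"},
	}
	for _, a := range actions {
		fmt.Printf("%-24s -> %s\n", a.Name, chooseHook(a))
	}
}
```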
Next steps
- debate the above proposal
On finalizers from #3586
Read #3585 too.
We should have a general framework that lets a pod or other object be POSTed in an incomplete state, persisted to etcd, and then handled by a series of "finalizers" that fill in the missing fields. Once all fields are filled in, the object can be picked up by a kubelet and run.
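As a rough illustration of such a framework, the Go sketch below models each finalizer as a pair of functions: one that says whether its field is already filled in, and one that fills it. The `Pod` fields, the `Finalizer` type, the three example finalizers, and the values they write are all assumptions made for this example, not an existing API.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified object with fields that may be
// left empty when it is first POSTed.
type Pod struct {
	Name     string
	CPULimit string
	HostIP   string
	PodIP    string
}

// Finalizer is a hypothetical description of one post-apiserver component.
type Finalizer struct {
	Name string
	Done func(p *Pod) bool // true once this finalizer's field is set
	Fill func(p *Pod)      // fills in the missing field
}

// defaultFinalizers returns three example finalizers with placeholder values.
func defaultFinalizers() []Finalizer {
	return []Finalizer{
		{
			Name: "limit-defaulter",
			Done: func(p *Pod) bool { return p.CPULimit != "" },
			Fill: func(p *Pod) { p.CPULimit = "100m" },
		},
		{
			Name: "scheduler",
			Done: func(p *Pod) bool { return p.HostIP != "" },
			Fill: func(p *Pod) { p.HostIP = "10.0.0.1" },
		},
		{
			Name: "ip-allocator",
			Done: func(p *Pod) bool { return p.PodIP != "" },
			Fill: func(p *Pod) { p.PodIP = "172.16.0.5" },
		},
	}
}

// runFinalizers lets each finalizer act once, in order, and returns the
// names of those that actually changed the pod.
func runFinalizers(p *Pod, fs []Finalizer) []string {
	var acted []string
	for _, f := range fs {
		if !f.Done(p) {
			f.Fill(p)
			acted = append(acted, f.Name)
		}
	}
	return acted
}

// isRunnable reports whether every finalizer considers the pod complete,
// i.e. whether a kubelet could pick it up.
func isRunnable(p *Pod, fs []Finalizer) bool {
	for _, f := range fs {
		if !f.Done(p) {
			return false
		}
	}
	return true
}

func main() {
	fs := defaultFinalizers()
	pod := &Pod{Name: "web", CPULimit: "500m"} // user already set a CPU limit
	fmt.Println("runnable before:", isRunnable(pod, fs))
	fmt.Println("acted:", runFinalizers(pod, fs))
	fmt.Println("runnable after:", isRunnable(pod, fs))
}
```

In a real system each finalizer would be a separate process watching the apiserver rather than a function in one loop; the single loop here only shows the "fill in what is missing, then run" contract.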
Use cases
Use cases for filling in fields in pods after they are stored.
- Set default resource limits on pods
- Schedule pods to nodes (the existing scheduler sets the HostIP field)
- custom allocator for PodIP addresses (and set up vlans, etc; Proposal: deouple networking for segmentation and other use cases #3350)
- pod limit auto-adjuster. sets unspecified cpu and memory limits
- pod template. a permanently underspecified pod could be a template for a replication controller
Availability
The availability of the cluster is limited by the availability of the finalizers, so they need to be replicated. Fortunately, because they typically act on one object at a time and can use resource versioning, it should be easy to parallelize them.
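The reason resource versioning makes replication safe can be sketched with a compare-and-swap style update: a write succeeds only if the writer's copy is still current. The in-memory `Store`, its methods, and `ErrConflict` below are hypothetical stand-ins for the apiserver and etcd, written in Go purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict is returned when a write is based on a stale copy.
var ErrConflict = errors.New("resource version conflict")

// Pod is a hypothetical, simplified object carrying a version counter.
type Pod struct {
	Name            string
	HostIP          string
	ResourceVersion int
}

// Store is an in-memory stand-in for the apiserver's storage.
type Store struct {
	mu   sync.Mutex
	pods map[string]Pod
}

func NewStore() *Store { return &Store{pods: map[string]Pod{}} }

// Create stores a new pod at version 1.
func (s *Store) Create(p Pod) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p.ResourceVersion = 1
	s.pods[p.Name] = p
}

// Get returns a copy of the stored pod.
func (s *Store) Get(name string) (Pod, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pods[name]
	return p, ok
}

// Update writes p only if p.ResourceVersion matches the stored version,
// then bumps the version.
func (s *Store) Update(p Pod) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.pods[p.Name]
	if !ok {
		return errors.New("not found")
	}
	if cur.ResourceVersion != p.ResourceVersion {
		return ErrConflict
	}
	p.ResourceVersion++
	s.pods[p.Name] = p
	return nil
}

func main() {
	s := NewStore()
	s.Create(Pod{Name: "web"})

	// Two finalizer replicas read the same version of the pod.
	a, _ := s.Get("web")
	b, _ := s.Get("web")
	a.HostIP = "10.0.0.1"
	b.HostIP = "10.0.0.2"

	fmt.Println("replica A:", s.Update(a)) // first write wins
	fmt.Println("replica B:", s.Update(b)) // stale copy is rejected
	final, _ := s.Get("web")
	fmt.Println("final host:", final.HostIP)
}
```

A replica that receives a conflict would re-read the object, notice the field is already set, and move on, so running several replicas costs some wasted work but does not produce conflicting writes.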
Bootstrapping
There has to be some way to get pods onto minions without waiting for finalizers to act on them, when turning up a cluster, or upgrading a finalizer. The scheduler is a special case of this.
Some options:
- allow privileged user to write Pods to apiserver with all fields finalized, including HostIP, so that kubelets pick them up immediately
- talk directly to a particular kubelet and make it start a pod. We should make kubelets accept pods to run in "api.Pod" format.
- do rolling updates of finalizers wherever possible, so that there is always at least one good one around to help out.
State and sequence
- what phase/condition/reason pods are in as they work their way through finalizers
- sequencing and composing multiple finalizers; how a finalizer knows when it is its turn to act.