Kubelet to understand pods, and to be able to pull from apiserver #2483

Closed
erictune opened this issue Nov 20, 2014 · 47 comments

@erictune
Member

Several inter-related goals:

  1. Kubelet continues to be usable in isolation, as well as in a Kubernetes cluster. (composability)
  2. Config defined in terms of Pods should work both with an apiserver, and directly with Kubelets (bootstrapping, debugging, composability)
  3. Kubelets can only see pods that they need to see (security)
  4. Don't create more apiserver API objects than necessary (ease of understanding)
  5. Node etcd need not be connected to master etcd. (security, isolation, scalability)
  6. Not have clients (be they kubectl, or kubelets or whatever) depend on the storage layer of the apiserver. (abstraction, allowing multiple implementations of the API)

Current state:

  1. Kubelet can read a container manifest from a local file, or from a URL. Used by Google ContainerVM. (A rough sketch of these modes follows this list.)
  2. Kubelets can read boundPods from their local etcd, which is connected to the master machine's etcd.
  3. Kubelets can talk to the apiserver, and do write to /events. They don't read /pods or /boundPods.
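
As a rough sketch of the local-file and URL modes in item 1 (the flag names are from memory and may not match every Kubelet release exactly):

kubelet --config=/etc/kubernetes/manifests                # read container manifests from a local directory
kubelet --manifest-url=http://example.com/manifest.json   # poll a single container manifest over HTTP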

Suggested changes:

  1. Kubelet learns the definition of the Pod type.
  2. Kubelet can read JSON pod definitions from local files or URLs.
    • It also still supports containerManifests for the foreseeable future, with some way for it to determine which type to expect from a source.
  3. In a "typical" Kubernetes cluster, a Kubelet watches /api/v1beta3/pods for what pods it should run.
  4. Get rid of BoundPod and BoundPods since nothing reads them anymore.
  5. Add a Host field to the PodStatus (but not PodSpec).
  6. When the scheduler writes a Binding, the PodStatus.Host is set and the resourceVersion is updated.
  7. A cluster is bootstrapped by first starting one or more VMs with,

Concerns that may be raised and responses:

Q1: Are changes to the set of pods bound made by the scheduler machine atomic or eventually consistent?

A1: It could work either way. If we want atomic behavior, we could implement that in the apiserver more readily than we could when we directly expose our storage via etcd.

Q2: Should the kubelet be allowed to see CurrentState (now PodStatus in v1beta3)? It generates (some of) the status, so why let it see that?

A2: We could implement this if it is important. The Kubelet would watch pods with a selector that matches only pods with PodStatus.Host == the kubelet's hostname, and could use a field selector so that only the PodSpec, and not the PodStatus, is returned.
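
As a purely illustrative sketch (the parameter name and field path here are assumptions, not a settled API), such a watch might look like:

GET /api/v1beta3/watch/pods?fieldSelector=status.host%3Dnode-1.example.com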

Q3: How do we prevent Kubelets from seeing other nodes' pods?
A3: There are a couple of ways I can think to do this with small changes to our current authorization policy.

  1. One way is to have a distinguished "kubelet" user and special case its authorization.
  2. Another is to create a new policy as each "kubelet" is added which matches on the SourceIP of the request, and requires a selector to be part of the request which selects on "PodStatus.Host=$SourceIP". This may make some assumptions about the network security of the cluster, but seems like it could work.
  3. A variation on the previous is to only have one line of policy for all kubelets, but have a "condition" field in the policy that checks that PodStatus.Host matches the SourceIP (a sketch of such a policy line follows this list).
  4. Another variation is to have a different token for each kubelet and a separate policy for each.
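
A hypothetical policy line for option 3, assuming a one-JSON-object-per-line policy file with user/resource/readonly fields as in the current authorization policy; the "condition" field does not exist today and is purely illustrative:

{"user": "kubelet", "resource": "pods", "readonly": true, "condition": "PodStatus.Host == SourceIP"}
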
@erictune
Member Author

@a-robinson @roberthbailey
I think both of you were interested in how to remove the dependency between master etcd and kubelet. This proposal, if implemented, would answer that question.

@erictune
Member Author

@thockin
I know you had some opinions about BoundPods. But, since we last discussed that, there have been changes to authorization, kubelet-to-apiserver communication (events), and the restructuring of desiredState/currentState to PodSpec/PodStatus (v1beta3).

@lavalamp
Member

  1. In a "typical" Kubernetes cluster, a Kubelet watches /api/v1beta3/pods for what pods it should run.
  2. Get rid of BoundPod and BoundPods since nothing reads them anymore.

This is going to be controversial. For expediency's sake, I would like to start out with kubelet watching /api/v1beta3/boundPods. The boundPods/pods argument can be had later.

@erictune
Member Author

Filed #2484 @lavalamp

@thockin
Member

thockin commented Nov 20, 2014

The goal of BoundPods was to simplify Kubelet and the kubelet experience by having it only understand what it needs to know in order to do its job.

I'm not against teaching Kubelet about Pods instead of BoundPods, but a number of questions (as you called out) remain.

Whatever watch the kubelet does probably has to be scalable - we don't want it waking up every time any pod in a cluster is created/destroyed/updated.

I dislike the magic that happens between scheduler and apiserver with Bindings - it is a pet belief of mine (in fact I started fiddling with it on the plane) that the REST objects should be less entangled than they are.


@erictune
Member Author

The selector on the watch which I described should prevent excessive kubelet wakeups.

An improvement over the current situation with Bindings might be to replace:

POST /api/v1beta3/bindings
{ ...
  "podID": "...",
  "host": "...."
}

with a body-less POST to a special verb, like this:

POST /api/v1beta3/bind/pods/mypod?host="...."

That would mean we could delete type Bindings from pkg/api/types.go, which seems like it would address your concern about entangled REST objects.

@lavalamp
Member

with a body-less POST to a special verb

That's much harder to change in the future if we want to, say, add a field. (Imagine the scheduler wanting to claim ssd0 for the pod or something.)
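
For example, keeping a request body leaves room for extra fields later; "claimedDevices" below is hypothetical, purely to illustrate the point:

POST /api/v1beta3/bindings
{
  "podID": "mypod",
  "host": "node-1",
  "claimedDevices": ["ssd0"]
}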

@erictune
Member Author

@thockin points out that I need to change kube-proxy to watch /endpoints.

@erictune
Member Author

Just realized that this is a dup of #860

@erictune
Member Author

erictune commented Dec 2, 2014

Related discussion happening on #846

@erictune
Member Author

erictune commented Dec 2, 2014

More motivation for this change: #2715

@goltermann goltermann added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Dec 3, 2014
@bgrant0607 bgrant0607 added area/kubelet-api area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Dec 16, 2014
@bgrant0607
Member

Yet another motivation to get rid of BoundPods: We want to allow users to specify the host directly -- #3020.

/cc @brendandburns

@bgrant0607
Member

@erictune @dchen1107 @lavalamp @smarterclayton Is anyone working on making the Kubelet deal with pods rather than BoundPods?

@dchen1107
Member

I am not at this moment, but if no one else is working on it, I can pick it up tomorrow. @lavalamp ?

@smarterclayton
Contributor

Eric was going to do parts of it when he got back. My existing pull covers bound pods.


@smarterclayton
Contributor

Before we commit to a final design on how the kubelet pulls pod info, I'd like to settle on how information can be added to pods at runtime that is not present in the initial pod definition. We have one concrete and two proposed scenarios:

  • specify environment variables for service references that are bound late
    • as I understand it, the proposal was to join services to pods for the host on the kubelet (all services in namespaces of the pods on this host as a minimum)
  • volume references to a pod (persistent disk proposal where volumes could ref an external resource)
  • secret or late bound environment associated with the pod (secrets associated with a service account to populate .kubernetes_auth, or additional ENV associated with a service or other resource in the namespace)

The first two can be implemented by retrieving more info from the api server. The last is an extensibility concern - there is no non-racy way today to mutate the pod definition on creation and ensure the first pod start will have the extra data assigned to the pod (subsequent restarts might). Admission control or an in-flow mutation inside the create flow might be able to solve that, but is under-specified. As we're trying very hard to avoid adding our own code around the kubelet or in front of pod submission (so we can run on top of existing kube deployments), I was hoping to get a definitive design for that use case.
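
For the first (service reference) scenario, the late-bound data in question is the per-service environment variables the Kubelet injects at container start, along these lines (the names follow the usual {SERVICE}_SERVICE_* convention; the values here are made up):

REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379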

@smarterclayton
Contributor

#846 is a stepping stone change that gets us closer to this without substantial changes to api server and kubelet.

@smarterclayton
Contributor

There hasn't, but there should be. The annotations are also stripped out right now, before the Kubelet even sees them.

----- Original Message -----

I'd like for the scheduler to be able to add metadata to a "bound pod",
however that concept ends up being realized. Currently the etcd registry
ApplyBinding() implementation drops annotations on the floor.

Has there been any discussion about how a scheduler can inject annotations
into a bound pod?



@bgrant0607
Member

@jdef Into the pod? Into the containers of the pod? That's relevant to #386.

My objective is for the full pod to be made available to kubelet.

@jdef
Contributor

jdef commented Feb 3, 2015

@bgrant0607 into the pod "binding", not into the containers. If Binding and BoundPod are going away, then this means storing the binding annotations in the Pod somewhere.

Use case: a scheduler wants to add metadata into the binding that may be used on failover. For example, a recovering/restarted scheduler could query the existing "bindings" (in whatever form k8s decides to store them) and extract previously stored annotations to help rebuild internal state that was lost when the previous scheduler instance crashed. It would be very convenient to transparently leverage the existing state abstraction (registry) that kubernetes already provides.

As per the description in the Suggested Changes above:

When the scheduler writes a Binding, the PodStatus.Host is set and the resourceVersion is updated.

It would be awesome if the annotations were stored atomically with the binding, as part of the etcd registry implementation:

"When the scheduler writes a Binding, the PodStatus.Host is set, additional binding annotations are saved, and the resourceVersion is updated."

@erictune
Member Author

erictune commented Feb 3, 2015

Binding is not going away, but it is currently write-only. A proposal to make it read-write could be considered.

@jdef
Contributor

jdef commented Feb 4, 2015

OK, so Binding is staying but BoundPod is going away.

What do you think about storing binding-related annotations in a new field: Pod.Status.Annotations?

  • Whenever Pod.Status.Host is cleared, binding annotations are cleared
  • When setting Pod.Status.Host="{something}" via assignPod(), binding annotations are copied in from Binding
  • When cloning Pod.Status.Host in UpdatePod(), the Pod.Status.Annotations are also cloned

I don't think /bindings has to be a readable endpoint, as long as the binding metadata is exposed some other way. I like having changes to the binding annotations closely track changes to the Host: Pod.Status.Host seems relatively immutable, except at binding (and delete) time -- binding annotations could be the same.
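
A sketch of what a bound pod might then look like with the proposed field (the field itself does not exist today; the annotation key is made up):

{
  "kind": "Pod",
  "apiVersion": "v1beta3",
  "metadata": {"name": "mypod"},
  "spec": { ... },
  "status": {
    "host": "node-1",
    "annotations": {"example-scheduler/offerId": "offer-42"}
  }
}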

@smarterclayton
Contributor

Can we simply namespace the annotations onto the pod?


@bgrant0607
Member

Note, this is really off topic for this issue. However...

@smarterclayton Not sure what you mean "namespace onto". I would namespace annotations attached via an automation component.

@jdef I'm not opposed to this idea, though I'd probably want to distinguish annotations on the Binding vs. those added to the pod, but do you have a concrete example of the information you'd want to record in pod annotations? If sent via Binding, I'm not sure exactly how they'd be of use to the scheduler later. By definition, once scheduled, the scheduler is done with that pod.

BTW, once set, Pod.Status.Host will never be cleared. Pods are created, scheduled, and terminate. See #3949 for more details.

@smarterclayton
Contributor

On Feb 3, 2015, at 8:17 PM, Brian Grant notifications@github.com wrote:

@smarterclayton Not sure what you mean "namespace onto". I would namespace annotations attached via an automation component.

Binding annotation "foo=bar" -> pod annotation "binding/foo=bar"

@jdef
Contributor

jdef commented Feb 4, 2015

@bgrant0607 I'm thinking about how to implement mesos task reconciliation in the kubernetes-mesos scheduler. If it crashes and then restarts, I want to be able to recover things like taskID, slaveID, offerID, etc. so as to rebuild the internal state of the scheduler. Once state is recovered, I can reconcile against the tasks that the mesos master knows about.

If I can't use binding annotations for this, then I need to store state some other (super inconvenient) way.

Pod.Status.Host should never be cleared, but it looks like there's still a hack that performs exactly that in some error handler:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/registry/etcd/etcd.go#L226

@jdef
Contributor

jdef commented Feb 4, 2015

@smarterclayton the "namespacing" approach would work for me, though it seems like it would be more effort to roll-back a merge of annotation bindings if an error strikes. Also, what if a binding k/v pair is named "alpha/beta=gamma" which would translate to "binding/alpha/beta=gamma" -- is that even allowed?

@bgrant0607
Member

@jdef Thanks for the use case. Yes, I agree we should support such annotations. Please file a separate issue with the example copy/pasted. As for namespaces, I think we should allow the client to specify the label namespaces. We should strongly encourage such clients to be well behaved. Since this would be a cluster service, it doesn't seem like an unreasonable imposition/assumption.

The hack you mention should be resolved by this issue, since the failure mode will be eliminated when we eliminate BoundPods.

@smarterclayton
Contributor

How are atomic scheduling constraints going to be modeled?

@lavalamp
Member

kubelet will do the constraint check(s) and reject offending pods.

@smarterclayton
Contributor

So the status would be updated to "rejected" or similar? Is rejection temporary or permanent (ie should rcs give up immediately and delete the pod)?


@erictune
Member Author

rcs should retry, but maybe back off on the rate after a while.

If the port conflict is with an existing apiserver-based pod, then the scheduler will probably notice the node conflict next time and find a different node that works (or leave the pod pending if there is no node that works).

If the port conflict is with an existing file-based pod, which the schedulers don't know about (yet), then the behavior is also similar to today, but we will be able to fix it once file-based pods are surfaced (read-only) to the apiserver.


@bgrant0607
Member

Similar to discussion re. out of disk, OOM, host port conflict, etc., the pod should just be failed, with a specific reason.

Whenever a pod's Spec.Host field is set (at creation time or during binding), the fit predicates should be tested. The update should fail if they don't pass at that time, as a best-effort check. At least policy-style checks should be applied (e.g., sole tenancy).

However, there's no significant difference between a bad scheduling decision based on stale data and a change in conditions on the host shortly after the decision is made that results in failure. The scheduler needs to be resilient to such failures. There are an unbounded number of reasons why they can happen. For instance, hosts can be broken in subtle ways that only affect containers using certain devices or other system resources or features (e.g., perhaps pulls of new images could be broken but existing images would work). New pods could be started directly on the Kubelets. Existing containers could launch new ones, or otherwise start to consume a lot more resources. Currently we don't have differentiated quality of service, so nothing stops launching new pods that don't specify resource limits (unless using the admission controller, but even then we don't have cpu or disk enforcement). Containers/processes that are supposed to be dead could be unkillable, preventing reuse of their resources. The kernel could consume an unexpectedly large amount of memory and never give it back. The host could disappear. On at least physical machines, individual (non-root) disks could fail. 2 containers could massively conflict in the cache, reducing performance by 10x.

Omega implemented scheduling atomicity at the storage level. It's not worth it.

We're also far from being in a position where it would be important. For example, we're neither replicating apiserver nor running multiple schedulers at the moment.

Just as the node controller kills pods on nodes that are unresponsive, it could (reactively) remove pods from overcommitted nodes.

Eventually there will be a lot of churn, both on individual machines and in the cluster as a whole:

  • horizontal auto-scaling
  • vertical auto-sizing
  • automatic pushes
  • batch, CI, and workflow workloads
  • automated node maintenance (e.g., kernel updates)

We're going to need a background (proactive) self-healing controller to rebalance pods, which I think of as similar to a generational garbage collector.

@erictune
Member Author

erictune commented Mar 3, 2015

agree with everything @bgrant0607 just said

@bgrant0607
Member

BoundPods is hypothesized to be gating pod scheduling and deletion performance, and perhaps is causing pathologically bad resource conflicts resulting in high latency and apiserver cpu consumption. #3954, #4521.

@erictune
Member Author

Closing in favor of targeted issues:
