Kubelet to understand pods, and to be able to pull from apiserver #2483

Closed
erictune opened this issue Nov 20, 2014 · 47 comments

@erictune
Member

Several inter-related goals:

  1. Kubelet continues to be usable in isolation, as well as in a Kubernetes cluster. (composability)
  2. Config defined in terms of Pods should work both with an apiserver, and directly with Kubelets (bootstrapping, debugging, composability)
  3. Kubelets can only see pods that they need to see (security)
  4. Don't create more apiserver API objects than necessary (ease of understanding)
  5. Node etcd need not be connected to master etcd. (security, isolation, scalability)
  6. Not have clients (be they kubectl, or kubelets or whatever) depend on the storage layer of the apiserver. (abstraction, allowing multiple implementations of the API)

Current state:

  1. Kubelet can read a container manifest from a local file, or from a URL. Used by Google ContainerVM. (A rough sketch of these modes follows this list.)
  2. Kubelets can read boundPods from their local etcd, which is connected to the master machine's etcd.
  3. Kubelets can talk to the apiserver, and do write to /events. They don't read /pods or /boundPods.
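
As a rough sketch of the local-file and URL modes in item 1 (the flag names are from memory and may not match every Kubelet release exactly):

kubelet --config=/etc/kubernetes/manifests                # read container manifests from a local directory
kubelet --manifest-url=http://example.com/manifest.json   # poll a single container manifest over HTTP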

Suggested changes:

  1. Kubelet learns the definition of the Pod type.
  2. Kubelet can read JSON pod definitions from local files or URLs.
    • It also still supports containerManifests for the foreseeable future, with some way for it to determine which type to expect from a source.
  3. In a "typical" Kubernetes cluster, a Kubelet watches /api/v1beta3/pods for what pods it should run.
  4. Get rid of BoundPod and BoundPods since nothing reads them anymore.
  5. Add a Host field to the PodStatus (but not PodSpec).
  6. When the scheduler writes a Binding, the PodStatus.Host is set and the resourceVersion is updated.
  7. A cluster is bootstrapped by first starting one or more VMs with,

Concerns that may be raised and responses:

Q1: Are changes to the set of pods bound made by the scheduler machine atomic or eventually consistent?

A1: It could work either way. If we want atomic behavior, we could implement that in the apiserver more readily than we could when we directly expose our storage via etcd.

Q2: Should the kubelet be allowed to see CurrentState (now PodStatus in v1beta3)? It generates (some of) the status, so why let it see that?

A2: We could implement this if it is important. The Kubelet would watch pods with a selector that matches only pods with PodStatus.Host == the kubelet's hostname, and could use a field selector so that only the PodSpec, and not the PodStatus, is returned.
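
As a purely illustrative sketch (the parameter name and field path here are assumptions, not a settled API), such a watch might look like:

GET /api/v1beta3/watch/pods?fieldSelector=status.host%3Dnode-1.example.com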

Q3: How do we prevent Kubelets from seeing other nodes' pods?
A3: There are a couple of ways I can think to do this with small changes to our current authorization policy.

  1. One way is to have a distinguished "kubelet" user and special case its authorization.
  2. Another is to create a new policy as each "kubelet" is added which matches on the SourceIP of the request, and requires a selector to be part of the request which selects on "PodStatus.Host=$SourceIP". This may make some assumptions about the network security of the cluster, but seems like it could work.
  3. A variation on the previous is to only have one line of policy for all kubelets, but have a "condition" field in the policy that checks that PodStatus.Host matches the SourceIP (a sketch of such a policy line follows this list).
  4. Another variation is to have a different token for each kubelet and a separate policy for each.
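
A hypothetical policy line for option 3, assuming a one-JSON-object-per-line policy file with user/resource/readonly fields as in the current authorization policy; the "condition" field does not exist today and is purely illustrative:

{"user": "kubelet", "resource": "pods", "readonly": true, "condition": "PodStatus.Host == SourceIP"}
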
@erictune
Member Author

@a-robinson @roberthbailey
I think both of you were interested in how to remove the dependency between master etcd and kubelet. This proposal, if implemented, would answer that question.

@erictune
Member Author

@thockin
I know you had some opinions about BoundPods. But, since we last discussed that, there have been changes to authorization, kubelet-to-apiserver communication (events), and the restructuring of desiredState/currentState to PodSpec/PodStatus (v1beta3).

@lavalamp
Member

  1. In a "typical" Kubernetes cluster, a Kubelet watches /api/v1beta3/pods for what pods it should run.
  2. Get rid of BoundPod and BoundPods since nothing reads them anymore.

This is going to be controversial. For expediency's sake, I would like to start out with kubelet watching /api/v1beta3/boundPods. The boundPods/pods argument can be had later.

@erictune
Member Author

Filed #2484 @lavalamp

@thockin
Member

thockin commented Nov 20, 2014

The goal of BoundPods was to simplify Kubelet and the kubelet experience by having it only understand what it needs to know in order to do its job.

I'm not against teaching Kubelet about Pods instead of BoundPods, but a number of questions (as you called out) remain.

Whatever watch the kubelet does probably has to be scalable - we don't want it waking up every time any pod in a cluster is created/destroyed/updated.

I dislike the magic that happens between scheduler and apiserver with Bindings - it is a pet belief of mine (in fact I started fiddling with it on the plane) that the REST objects should be less entangled than they are.


@erictune
Member Author

The selector on the watch which I described should prevent excessive kubelet wakeups.

An improvement over the current situation with Bindings might be to replace:

POST /api/v1beta3/bindings
{ ...
  "podID": "...",
  "host": "...."
}

with a body-less POST to a special verb, like this:

POST /api/v1beta3/bind/pods/mypod?host="...."

That would mean we could delete type Bindings from pkg/api/types.go, which seems like it would address your concern about entangled REST objects.

@lavalamp
Member

with a body-less POST to a special verb

That's much harder to change in the future if we want to, say, add a field. (Imagine the scheduler wanting to claim ssd0 for the pod or something.)
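
For example, keeping a request body leaves room for extra fields later; "claimedDevices" below is hypothetical, purely to illustrate the point:

POST /api/v1beta3/bindings
{
  "podID": "mypod",
  "host": "node-1",
  "claimedDevices": ["ssd0"]
}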

@erictune
Member Author

@thockin points out that I need to change kube-proxy to watch /endpoints.

@erictune
Member Author

Just realized that this is a dup of #860

@erictune
Member Author

erictune commented Dec 2, 2014

Related discussion happening on #846

@erictune
Member Author

erictune commented Dec 2, 2014

More motivation for this change: #2715

@goltermann goltermann added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Dec 3, 2014
@bgrant0607 bgrant0607 added area/kubelet-api area/kubelet priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Dec 16, 2014
@bgrant0607
Member

Yet another motivation to get rid of BoundPods: We want to allow users to specify the host directly -- #3020.

/cc @brendandburns

@bgrant0607
Member

@erictune @dchen1107 @lavalamp @smarterclayton Is anyone working on making the Kubelet deal with pods rather than BoundPods?

@dchen1107
Member

I am not at this moment, but if no one else is working on it, I can pick it up tomorrow. @lavalamp ?

@smarterclayton
Contributor

Eric was going to do parts of it when he got back. My existing pull covers bound pods.


@smarterclayton
Contributor

Before we commit to a final design on how the kubelet pulls pod info, I'd like to settle on how information can be added to pods at runtime that is not present in the initial pod definition. We have one concrete and two proposed scenarios:

  • specify environment variables for service references that are bound late
    • as I understand it, the proposal was to join services to pods for the host on the kubelet (all services in namespaces of the pods on this host as a minimum)
  • volume references to a pod (persistent disk proposal where volumes could ref an external resource)
  • secret or late bound environment associated with the pod (secrets associated with a service account to populate .kubernetes_auth, or additional ENV associated with a service or other resource in the namespace)

The first two can be implemented by retrieving more info from the api server. The last is an extensibility concern - there is no non-racy way today to mutate the pod definition on creation and ensure the first pod start will have the extra data assigned to the pod (subsequent restarts might). Admission control or an in-flow mutation inside the create flow might be able to solve that, but is under-specified. As we're trying very hard to avoid adding our own code around the kubelet or in front of pod submission (so we can run on top of existing kube deployments), I was hoping to get a definitive design for that use case.
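
For the first (service reference) scenario, the late-bound data in question is the per-service environment variables the Kubelet injects at container start, along these lines (the names follow the usual {SERVICE}_SERVICE_* convention; the values here are made up):

REDIS_MASTER_SERVICE_HOST=10.0.0.11
REDIS_MASTER_SERVICE_PORT=6379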

@smarterclayton
Contributor

#846 is a stepping stone change that gets us closer to this without substantial changes to api server and kubelet.

@smarterclayton
Contributor

There hasn't, but there should be. The annotations are also stripped out right now, before the Kubelet even sees them.

----- Original Message -----

I'd like for the scheduler to be able to add metadata to a "bound pod",
however that concept ends up being realized. Currently the etcd registry
ApplyBinding() implementation drops annotations on the floor.

Has there been any discussion about how a scheduler can inject annotations
into a bound pod?



@bgrant0607
Member

@jdef Into the pod? Into the containers of the pod? That's relevant to #386.

My objective is for the full pod to be made available to kubelet.

@jdef
Contributor

jdef commented Feb 3, 2015

@bgrant0607 into the pod "binding", not into the containers. If Binding and BoundPod are going away, then this means storing the binding annotations in the Pod somewhere.

Use case: a scheduler wants to add metadata into the binding that may be used on failover. For example, a recovering/restarted scheduler could query the existing "bindings" (in whatever form k8s decides to store them) and extract previously stored annotations to help rebuild internal state that was lost when the previous scheduler instance crashed. It would be very convenient to transparently leverage the existing state abstraction (registry) that kubernetes already provides.

As per the description in the Suggested Changes above:

When the scheduler writes a Binding, the PodStatus.Host is set and the resourceVersion is updated.

It would be awesome if the annotations were stored atomically with the binding, as part of the etcd registry implementation:

"When the scheduler writes a Binding, the PodStatus.Host is set, additional binding annotations are saved, and the resourceVersion is updated."

@erictune
Member Author

erictune commented Feb 3, 2015

Binding is not going away, but it is currently write-only. A proposal to make it read-write could be considered.

@jdef
Contributor

jdef commented Feb 4, 2015

OK, so Binding is staying but BoundPod is going away.

What do you think about storing binding-related annotations in a new field: Pod.Status.Annotations?

  • Whenever Pod.Status.Host is cleared, binding annotations are cleared
  • When setting Pod.Status.Host="{something}" via assignPod(), binding annotations are copied in from Binding
  • When cloning Pod.Status.Host in UpdatePod(), the Pod.Status.Annotations are also cloned

I don't think /bindings has to be a readable endpoint, as long as the binding metadata is exposed some other way. I like having changes to the binding annotations closely track changes to the Host: Pod.Status.Host seems relatively immutable, except at binding (and delete) time -- binding annotations could be the same.
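
A sketch of what a bound pod might then look like with the proposed field (the field itself does not exist today; the annotation key is made up):

{
  "kind": "Pod",
  "apiVersion": "v1beta3",
  "metadata": {"name": "mypod"},
  "spec": { ... },
  "status": {
    "host": "node-1",
    "annotations": {"example-scheduler/offerId": "offer-42"}
  }
}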

@smarterclayton
Contributor

Can we simply namespace the annotations onto the pod?


@bgrant0607
Member

Note, this is really off topic for this issue. However...

@smarterclayton Not sure what you mean "namespace onto". I would namespace annotations attached via an automation component.

@jdef I'm not opposed to this idea, though I'd probably want to distinguish annotations on the Binding vs. those added to the pod, but do you have a concrete example of the information you'd want to record in pod annotations? If sent via Binding, I'm not sure exactly how they'd be of use to the scheduler later. By definition, once scheduled, the scheduler is done with that pod.

BTW, once set, Pod.Status.Host will never be cleared. Pods are created, scheduled, and terminate. See #3949 for more details.

@smarterclayton
Contributor

On Feb 3, 2015, at 8:17 PM, Brian Grant notifications@github.com wrote:

@smarterclayton Not sure what you mean "namespace onto". I would namespace annotations attached via an automation component.

Binding annotation "foo=bar" -> pod annotation "binding/foo=bar"

@jdef
Contributor

jdef commented Feb 4, 2015

@bgrant0607 I'm thinking about how to implement mesos task reconciliation in the kubernetes-mesos scheduler. If it crashes and then restarts, I want to be able to recover things like taskID, slaveID, offerID, etc. so as to rebuild the internal state of the scheduler. Once state is recovered, I can reconcile against the tasks that the mesos master knows about.

If I can't use binding annotations for this, then I need to store state some other (super inconvenient) way.

Pod.Status.Host should never be cleared, but it looks like there's still a hack that performs exactly that in some error handler:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/registry/etcd/etcd.go#L226

@jdef
Contributor

jdef commented Feb 4, 2015

@smarterclayton the "namespacing" approach would work for me, though it seems like it would be more effort to roll-back a merge of annotation bindings if an error strikes. Also, what if a binding k/v pair is named "alpha/beta=gamma" which would translate to "binding/alpha/beta=gamma" -- is that even allowed?

@bgrant0607
Member

@jdef Thanks for the use case. Yes, I agree we should support such annotations. Please file a separate issue with the example copy/pasted. As for namespaces, I think we should allow the client to specify the label namespaces. We should strongly encourage such clients to be well behaved. Since this would be a cluster service, it doesn't seem like an unreasonable imposition/assumption.

The hack you mention should be resolved by this issue, since the failure mode will be eliminated when we eliminate BoundPods.

@smarterclayton
Contributor

How are atomic scheduling constraints going to be modeled?

@lavalamp
Member

kubelet will do the constraint check(s) and reject offending pods.

@smarterclayton
Contributor

So the status would be updated to "rejected" or similar? Is rejection temporary or permanent (ie should rcs give up immediately and delete the pod)?


@erictune
Member Author

rcs should retry, but maybe back off on the rate after a while.

If the port conflict is with an existing apiserver-based pod, then the scheduler will probably notice the node conflict next time and find a different node that works (or leave the pod pending if there is no node that works).

If the port conflict is with an existing file-based pod, which the schedulers don't know about (yet), then the behavior is also similar to today, but we will be able to fix it once file-based pods are surfaced (read-only) to the apiserver.


@bgrant0607
Member

Similar to discussion re. out of disk, OOM, host port conflict, etc., the pod should just be failed, with a specific reason.

Whenever a pod's Spec.Host field is set (at creation time or during binding), the fit predicates should be tested. The update should fail if they don't pass at that time, as a best-effort check. At least policy-style checks should be applied (e.g., sole tenancy).

However, there's no significant difference between a bad scheduling decision based on stale data and a change in conditions on the host shortly after the decision is made that results in failure. The scheduler needs to be resilient to such failures. There are an unbounded number of reasons why they can happen. For instance, hosts can be broken in subtle ways that only affect containers using certain devices or other system resources or features (e.g., perhaps pulls of new images could be broken but existing images would work). New pods could be started directly on the Kubelets. Existing containers could launch new ones, or otherwise start to consume a lot more resources. Currently we don't have differentiated quality of service, so nothing stops launching new pods that don't specify resource limits (unless using the admission controller, but even then we don't have cpu or disk enforcement). Containers/processes that are supposed to be dead could be unkillable, preventing reuse of their resources. The kernel could consume an unexpectedly large amount of memory and never give it back. The host could disappear. On at least physical machines, individual (non-root) disks could fail. 2 containers could massively conflict in the cache, reducing performance by 10x.

Omega implemented scheduling atomicity at the storage level. It's not worth it.

We're also far from being in a position where it would be important. For example, we're neither replicating apiserver nor running multiple schedulers at the moment.

Just as the node controller kills pods on nodes that are unresponsive, it could (reactively) remove pods from overcommitted nodes.

Eventually there will be a lot of churn, both on individual machines and in the cluster as a whole:

  • horizontal auto-scaling
  • vertical auto-sizing
  • automatic pushes
  • batch, CI, and workflow workloads
  • automated node maintenance (e.g., kernel updates)

We're going to need a background (proactive) self-healing controller to rebalance pods, which I think of as similar to a generational garbage collector.

@erictune
Member Author

erictune commented Mar 3, 2015

agree with everything @bgrant0607 just said

@bgrant0607
Member

BoundPods is hypothesized to be gating pod scheduling and deletion performance, and perhaps is causing pathologically bad resource conflicts resulting in high latency and apiserver cpu consumption. #3954, #4521.

@erictune
Member Author

Closing in favor of targeted issues:
