Kubelet to understand pods, and to be able to pull from apiserver #2483
@a-robinson @roberthbailey
@thockin
This is going to be controversial. For expediency's sake, I would like to start out with kubelet watching /api/v1beta3/boundPods. The boundPods/pods argument can be had later.
The goal of BoundPods was to simplify Kubelet and the kubelet experience. I'm not against teaching Kubelet about Pods instead of BoundPods. Whatever watch the kubelet does probably has to be scalable - we don't want every kubelet waking up on every pod change. I dislike the magic that happens between scheduler and apiserver with Bindings.
The selector on the watch which I described should prevent excessive kubelet wakeups. An improvement over the current situation with Bindings might be to replace:
with a body-less POST to a special verb, like this:
That would mean we could delete …
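The request examples in this comment were lost; as a rough sketch of the contrast being drawn (the paths, field names, and the bind verb below are assumptions for illustration, not the actual v1beta3 API):

```go
// Hypothetical sketch only: endpoint paths and payload fields are assumptions.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	apiserver := "http://localhost:8080"

	// Current style: POST a Binding object whose body names the pod and host.
	binding := []byte(`{"podID": "mypod", "host": "node-1"}`)
	resp, err := http.Post(apiserver+"/api/v1beta3/bindings", "application/json", bytes.NewReader(binding))
	if err == nil {
		fmt.Println("binding POST:", resp.Status)
	}

	// Proposed style: a body-less POST to a special verb on the pod itself,
	// with the target host carried in the URL instead of a separate object.
	resp, err = http.Post(apiserver+"/api/v1beta3/pods/mypod/bind?host=node-1", "application/json", nil)
	if err == nil {
		fmt.Println("body-less verb POST:", resp.Status)
	}
}
```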
That's much harder to change in the future if we want to, say, add a field. (Imagine the scheduler wanting to claim ssd0 for the pod or something.)
@thockin points out that I need to change kube-proxy to watch /endpoints.
Just realized that this is a dup of #860
Related discussion happening on #846
More motivation for this change: #2715
Yet another motivation to get rid of BoundPods: We want to allow users to specify the host directly -- #3020. /cc @brendandburns
@erictune @dchen1107 @lavalamp @smarterclayton Is anyone working on making the Kubelet deal with pods rather than BoundPods?
I am not at this moment, but if no one is working on it, I can pick this up tomorrow. @lavalamp?
Eric was going to do parts of it when he got back. My existing pull covers bound pods.
Before we commit to a final design on how the kubelet pulls pod info, I'd like to settle on how information can be added to pods at runtime that is not present in the initial pod definition. We have one concrete and two proposed scenarios.
The first two can be implemented by retrieving more info from the api server. The last is an extensibility concern - there is no non-racy way today to mutate the pod definition on creation and ensure the first pod start will have extra data assigned to the pod (subsequent restarts might). Admission control or an in-flow mutation inside the create flow might be able to solve that, but is under-specified. As we're trying very hard to avoid adding our own code around the kubelet or in front of pod submission (so we can run on top of existing kube deployments), I was hoping to get a definitive design for that use case.
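A minimal sketch of the "in-flow mutation" idea, with an invented hook interface (nothing here is a specified Kubernetes API): a mutator runs inside the create flow and attaches data before the pod is persisted, so even the first pod start sees it.

```go
package main

import "fmt"

// Illustrative types only; not the real API structs.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// CreateMutator runs during pod creation, before the object is persisted,
// so the first pod start already carries the extra data (no race with the kubelet).
type CreateMutator interface {
	Mutate(pod *Pod) error
}

type exampleMutator struct{}

func (exampleMutator) Mutate(pod *Pod) error {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["example.io/injected-at-create"] = "true"
	return nil
}

func createPod(pod *Pod, mutators []CreateMutator) error {
	for _, m := range mutators {
		if err := m.Mutate(pod); err != nil {
			return err
		}
	}
	// ... persist pod to storage here ...
	return nil
}

func main() {
	p := &Pod{Name: "mypod"}
	_ = createPod(p, []CreateMutator{exampleMutator{}})
	fmt.Println(p.Annotations)
}
```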
#846 is a stepping-stone change that gets us closer to this without substantial changes to apiserver and kubelet.
The annotations are also stripped out in the Kubelet right now, before the Kubelet even sees them.
@bgrant0607 The annotations would go into the pod "binding", not into the containers. If Binding and BoundPod are going away, then this means storing the binding annotations in the Pod somewhere. Use case: a scheduler wants to add metadata to the binding that may be used on failover. For example, a recovering/restarted scheduler could query the existing "bindings" (in whatever form k8s decides to store them) and extract previously stored annotations to help rebuild internal state that was lost when the previous scheduler instance crashed. It would be very convenient to transparently leverage the existing state abstraction (registry) that kubernetes already provides. As per the description in the Suggested Changes above:
It would be awesome if the annotations were stored atomically with the binding, as part of the etcd registry implementation: "When the scheduler writes a Binding, the PodStatus.Host is set, additional binding annotations are saved, and the resourceVersion is updated."
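A small sketch of the atomicity being requested, using invented types rather than the real registry structs: the Host, the binding annotations, and the resourceVersion bump land on the pod as one update.

```go
package main

import "fmt"

// Illustrative types only; not the actual API or registry structs.
type PodStatus struct{ Host string }
type Pod struct {
	Annotations     map[string]string
	Status          PodStatus
	ResourceVersion int
}
type Binding struct {
	PodID       string
	Host        string
	Annotations map[string]string
}

// applyBinding models the atomicity being asked for: setting Host, merging
// the binding annotations, and bumping resourceVersion happen as one update.
func applyBinding(pod *Pod, b Binding) {
	pod.Status.Host = b.Host
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	for k, v := range b.Annotations {
		pod.Annotations[k] = v
	}
	pod.ResourceVersion++
}

func main() {
	pod := &Pod{}
	applyBinding(pod, Binding{PodID: "mypod", Host: "node-1",
		Annotations: map[string]string{"example.io/scheduler-note": "abc"}})
	fmt.Println(pod.Status.Host, pod.Annotations, pod.ResourceVersion)
}
```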
Binding is not going away but is currently write-only. Proposal to make …
OK, so Binding is staying but BoundPod is going away. What do you think about storing binding-related annotations in a new field: Pod.Status.Annotations?
I don't think /bindings has to be a readable endpoint, as long as the binding metadata is exposed some other way. I like having changes to the binding annotations closely track changes to the Host: Pod.Status.Host seems relatively immutable, except at binding (and delete) time -- binding annotations could be the same.
Can we simply namespace the annotations onto the pod?
Note, this is really off topic for this issue. However... @smarterclayton Not sure what you mean by "namespace onto". I would namespace annotations attached via an automation component. @jdef I'm not opposed to this idea, though I'd probably want to distinguish annotations on the Binding vs. those added to the pod, but do you have a concrete example of the information you'd want to record in pod annotations? If sent via Binding, I'm not sure exactly how they'd be of use to the scheduler later. By definition, once scheduled, the scheduler is done with that pod. BTW, once set, Pod.Status.Host will never be cleared. Pods are created, scheduled, and terminate. See #3949 for more details.
@bgrant0607 I'm thinking about how to implement mesos task reconciliation in the kubernetes-mesos scheduler. If it crashes and then restarts, I want to be able to recover things like taskID, slaveID, offerID, etc. so as to rebuild the internal state of the scheduler. Once state is recovered, I can reconcile against the tasks that the mesos master knows about. If I can't use binding annotations for this, then I need to store state some other (super inconvenient) way. Pod.Status.Host should never be cleared, but it looks like there's still a hack that performs exactly that in some error handler.
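As a purely illustrative example of the bookkeeping described above (the annotation keys and values are made up for the example, not the actual kubernetes-mesos names):

```go
package main

import "fmt"

func main() {
	// Hypothetical binding annotations a Mesos-backed scheduler might persist
	// so a restarted scheduler instance can rebuild its in-memory task state.
	bindingAnnotations := map[string]string{
		"mesos.example.io/taskId":  "task-7f3a",
		"mesos.example.io/slaveId": "slave-20150101-0001",
		"mesos.example.io/offerId": "offer-42",
	}
	fmt.Println(bindingAnnotations)
}
```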
@smarterclayton the "namespacing" approach would work for me, though it seems like it would be more effort to roll back a merge of binding annotations if an error strikes. Also, what if a binding k/v pair is named "alpha/beta=gamma", which would translate to "binding/alpha/beta=gamma" -- is that even allowed?
@jdef Thanks for the use case. Yes, I agree we should support such annotations. Please file a separate issue with the example copy/pasted. As for namespaces, I think we should allow the client to specify the label namespaces. We should strongly encourage such clients to be well behaved. Since this would be a cluster service, it doesn't seem like an unreasonable imposition/assumption. The hack you mention should be resolved by this issue, since the failure mode will be eliminated when we eliminate BoundPods.
How are atomic scheduling constraints going to be modeled?
kubelet will do the constraint check(s) and reject offending pods.
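For example, a minimal sketch of one such constraint check (a host-port conflict), with illustrative types rather than the Kubelet's real code:

```go
package main

import "fmt"

// Illustrative types only; not the real API structs.
type Container struct{ HostPort int }
type Pod struct {
	Name       string
	Containers []Container
}

// conflictsWith reports whether candidate asks for a host port already
// claimed by a running pod, in which case the kubelet would reject it
// and report the rejection in the pod's status.
func conflictsWith(running []Pod, candidate Pod) bool {
	used := map[int]bool{}
	for _, p := range running {
		for _, c := range p.Containers {
			if c.HostPort != 0 {
				used[c.HostPort] = true
			}
		}
	}
	for _, c := range candidate.Containers {
		if c.HostPort != 0 && used[c.HostPort] {
			return true
		}
	}
	return false
}

func main() {
	running := []Pod{{Name: "a", Containers: []Container{{HostPort: 8080}}}}
	fmt.Println(conflictsWith(running, Pod{Name: "b", Containers: []Container{{HostPort: 8080}}}))
}
```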
So the status would be updated to "rejected" or similar? Is rejection temporary or permanent (i.e., should rcs give up immediately and delete the pod)?
rcs should retry, but maybe back off on the rate after a while. If the conflict is on a single node with an apiserver-based … If the port conflict is with an existing apiserver-based pod, then the … If the port conflict is with an existing file-based pod, which the …
Similar to discussion re. out of disk, OOM, host port conflict, etc., the pod should just be failed, with a specific reason.

Whenever a pod's Spec.Host field is set (at creation time or during binding), the fit predicates should be tested. The update should fail if they don't pass at that time, as a best-effort check. At least policy-style checks should be applied (e.g., sole tenancy).

However, there's no significant difference between a bad scheduling decision based on stale data and a change in conditions on the host shortly after the decision is made that results in failure. The scheduler needs to be resilient to such failures. There are an unbounded number of reasons why they can happen. For instance, hosts can be broken in subtle ways that only affect containers using certain devices or other system resources or features (e.g., perhaps pulls of new images could be broken but existing images would work). New pods could be started directly on the Kubelets. Existing containers could launch new ones, or otherwise start to consume a lot more resources. Currently we don't have differentiated quality of service, so nothing stops launching new pods that don't specify resource limits (unless using the admission controller, but even then we don't have cpu or disk enforcement). Containers/processes that are supposed to be dead could be unkillable, preventing reuse of their resources. The kernel could consume an unexpectedly large amount of memory and never give it back. The host could disappear. On at least physical machines, individual (non-root) disks could fail. 2 containers could massively conflict in the cache, reducing performance by 10x.

Omega implemented scheduling atomicity at the storage level. It's not worth it. We're also far from being in a position where it would be important. For example, we're neither replicating apiserver nor running multiple schedulers at the moment.

Just as the node controller kills pods on nodes that are unresponsive, it could (reactively) remove pods from overcommitted nodes. Eventually there will be a lot of churn, both on individual machines and in the cluster as a whole:
We're going to need a background (proactive) self-healing controller to rebalance pods, which I think of as similar to a generational garbage collector.
agree with everything @bgrant0607 just said
Closing in favor of targeted issues:
Several inter-related goals:
Current state:
Suggested changes:
Concerns that may be raised and responses:
Q1: Are changes to the set of pods bound to a machine, made by the scheduler, atomic or eventually consistent?
A1: It could work either way. If we want atomic behavior, we could implement that in apiserver more readily than we could when we directly expose our storage via etcd.
Q2: Should the kubelet be allowed to see CurrentState (now PodStatus in v1beta3)? It generates (some of) the status, so why let it see that?
A2: We could implement this if it is important. Kubelet would watch pods with a selector that matches only pods with PodStatus.Host == the kubelet's hostname, and could use a field selector so that only PodSpec and not PodStatus is returned.
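A rough sketch of what such a filtered watch might look like, assuming a fieldSelector query parameter and a status.host field name (both assumptions for illustration, not necessarily the exact v1beta3 API):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	hostname := "node-1"

	// Watch only pods bound to this kubelet's host. The field selector keeps
	// other nodes' pods (and, if supported, PodStatus) out of the stream.
	q := url.Values{}
	q.Set("fieldSelector", "status.host="+hostname)

	resp, err := http.Get("http://localhost:8080/api/v1beta3/watch/pods?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Each line of the response is one watch event (ADDED/MODIFIED/DELETED).
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```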
Q3: How do we prevent Kubelets from seeing other nodes' pods?
A3: There are a couple of ways I can think of to do this with small changes to our current authorization policy.