Prepare for external scheduler #557
Conversation
I'd like @bgrant0607 to take a look at the PodStatuses I'm adding.
// a pod would get this status after having had status "New" because a scheduler decided
// that the system doesn't have a space for the pod. Pods that were previously "Admitted"
// should rarely, if ever, be subsequently rejected with this status.
PodInfeasible PodStatus = "Infeasible"
Is Infeasible a commonly used term in scheduling? The condition feels more like "GaveUp" or "CouldNotSchedule" - I just anticipate explaining Infeasible to users of naive clients in the future, was hoping for something more related to scheduling. "OutOfSpace"? "NotPossible"?
@smarterclayton Being out of space can be a temporary condition, like if some machines are temporarily removed from the cluster due to an error, but will be replaced soon.
So, is the right behavior to accept the request and keep working on it, or to reject the request after some amount of time? The former has the advantage that client code does not need to retry in the case of a temporary shortage of machines.
On the other hand, if the client is never going to succeed without changing something external to the system (auth failed, or not enough quota, etc) then a prompt rejection makes sense.
Thoughts?
Agree with your statements. I'm mostly looking for a better word than Infeasible that conveys the intent more precisely without being too vernacular (i.e. GaveUp).
OK, so my thoughts:
We need authentication/authorization checks. But those are things we (should) check at an RPC level. So if you fail them, we reject your RPC. So we don't need a pod status saying it was authorized/authenticated.
The next thing we want to check is "do you have quota/does the cluster have space". In my mind, that's a check that a scheduler has to have anyway, so we don't want to do it synchronously in the apiserver. Therefore, we make a pod with status "New" and store it in our storage layer.
A scheduler sees that a wild pod suddenly appears. It does the quota/cluster space check, which is cheap. If the pod fails, the scheduler marks it as infeasible. If it passes, the scheduler marks it as "Accepted" or "Admitted" or some such.
At this point, the apiserver can notice the change in etcd and return a response to the user. We currently block until a pod is actually running on its assigned host, but I'm pretty sure that's a little extreme. We could consider having the apiserver simply delete pods marked infeasible (rather than accepted) from the system entirely.
The next step, from the pod's perspective, assuming it's accepted, is that the scheduler runs a scheduling pass and finds it a place to go. Then its status changes to "Assigned" and it has a host to live on.
Then the kubelet runs it, and the podcache eventually sees that it's running and updates its status to "Running".
I'm not in love with any of the names for these states, or even this particular control flow. The only things that are important to me is that a) the scheduler runs asynchronously and in a separate process from apiserver, and b) the scheduler does the quota/fit test in addition to the final scheduling.
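For concreteness, here's a minimal sketch of the statuses that control flow implies; the names are placeholders (as noted above, none of them are final), not the actual API definition:

```go
// Hypothetical sketch of the statuses implied by the flow described above.
// Names are placeholders, not the final API.
type PodStatus string

const (
	// PodNew: the apiserver has stored the pod, but no scheduler has looked at it yet.
	PodNew PodStatus = "New"
	// PodInfeasible: a scheduler decided the cluster cannot fit the pod (quota/space check failed).
	PodInfeasible PodStatus = "Infeasible"
	// PodAdmitted: the pod passed the quota/space check and is waiting for a scheduling pass.
	PodAdmitted PodStatus = "Admitted"
	// PodAssigned: the scheduler has picked a host for the pod.
	PodAssigned PodStatus = "Assigned"
	// PodRunning: the kubelet has started the pod and the podcache has observed it running.
	PodRunning PodStatus = "Running"
)
```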
Why does the scheduler need to do the quota check?
Abusive scheduling (schedule X thousand of something) seems like something the apiserver has to defend against...
That's a good point. The apiserver should defend against this, not a scheduler. OK. So the apiserver has to do a quota check. :( :(
@lavalamp @smarterclayton @erictune
Sorry for joining the discussion late. Many deep issues affecting the API layering and system architecture are entangled in this discussion. So, first I'll do a brain dump, then we can discuss how to break up the problem and converge on concrete solutions.
Admission control: Should the request to run this particular Pod be accepted? Clients definitely need to be able to distinguish permanent from temporary rejection, for some definition of "permanent" in a dynamic environment, and users, and in some cases systems, need actionable information about the cause of the rejection. For instance, the rejection could be due to authorization failure, invalid request, exceeding budget or quota, etc.
However, the same question may also apply to sets of pods. For instance, a batch scheduler may want to make an admission-control decision based on whether a whole set of pods could be gang-scheduled simultaneously. A higher-level service placement system may want to determine whether a whole service could "fit". Update/rollout tools may want to verify that the updated resource requirements will work in aggregate. So, we'll likely need a way to invoke an admission-control decision on an aggregate resource request of some kind, without actually instantiating the underlying resource consumers (i.e., pods in this case).
Additionally, typically these kinds of decisions also want some assurance that the resources will be available beyond the current moment. For deployments of k8s on physical hosts, this may involve ensuring that enough spare capacity has been provisioned, taking into account potential host failures and maintenance. Being able to satisfy the resource requests "now" isn't sufficient. For deployments on virtual hosts, we'll need to figure out what interfaces we need to wrap around the underlying virtual resource provisioning mechanisms. In general, we want to ensure we have the means to communicate resource demand to provisioning mechanisms, including both host provisioning and quota provisioning.
We'll also need to figure out how we should handle resource (de)fragmentation and workload churn in these decisions. This is related to Clayton's point about retry and back-off.
If we reject pod creation requests due to instantaneous infeasibility/unsatisfiability, then we create a situation where services could be starved indefinitely of resources, regardless of priority, fairness, etc. Mesos addressed this with its resource offer mechanism: http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf. Some sort of resource request queue or scoreboard is needed. Probably we should involve Mesosphere (@adam-mesos) in the discussion of the approach to take here, as well as in the resource demand signaling discussed above.
Speaking of priority, that raises the issue of preemption in order to satisfy higher-priority requests.
Synchronous vs. asynchronous admission control and/or scheduling: Synchronous is easier for users to deal with, but asynchronous provides more flexibility and transparency to higher-level infrastructure layers. A synchronous interface can always be layered upon an asynchronous API, on the server or in libraries, UI, CLI, etc. If asynchronous, that's where it becomes necessary to represent that the request is awaiting an external dependency to be satisfied. Which leads me to...
An explicit enumeration of states and state machine, especially where those states may be persisted and/or interpreted by clients: This approach has a number of problems (e.g., the persisted state value may diverge from the actual current status), but the biggest is that it's not extensible. Components of the system, clients, user applications, etc. will bake in a particular set of states. The bigger the ecosystem becomes, the harder it becomes to add to that set of states, and it quickly becomes impossible to even change the observed runtime behavior of the existing states. Components don't know how to handle new, unrecognized states. Extensions to the system have no way of injecting new states dynamically in a running system, so they instead use hacks rather than states. Instead, I think there are just 3 relevant states: not yet running, running, and terminated.
There may be many reasons for not yet running. We should have a separate field to indicate the reason, which shouldn't be a strict enumeration, but which should represent what component/decision we're waiting on. It should be possible to dynamically orchestrate a chain of decisions (e.g., resource prediction -> cluster selection -> gang scheduling -> cluster admission control -> instance scheduling) using a continuation-passing approach. (Related to that, some kind of event bus will be needed in order to kick off automation that isn't on the critical path of instantiating the objects.)
Similarly, there are many reasons for termination -- see #137 .
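A minimal sketch of what that three-state model with a free-form reason might look like; the type and field names here are purely illustrative, not the actual schema:

```go
// Illustrative only: three coarse states plus an open-ended Reason, rather than
// a strict enumeration of intermediate states that clients would have to bake in.
type PodState string

const (
	PodNotYetRunning PodState = "NotYetRunning"
	PodRunning       PodState = "Running"
	PodTerminated    PodState = "Terminated"
)

type PodCurrentStatus struct {
	State PodState
	// Reason is deliberately not a closed enumeration; it records which
	// component or decision the pod is waiting on (e.g. "gang-scheduling",
	// "cluster-admission-control") or why it terminated, so extensions can
	// add new reasons without breaking existing clients.
	Reason string
}
```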
Interface with scheduler: Did this PR imply that the interface should be implicit through the shared state in etcd? Let's not do that. Let's hash out a real API, here or elsewhere.
Master vs. Kubelet APIs: We're going to have to split these, ideally in a way that they cleanly compose.
Desired state vs. current state: I don't like the way these are represented in the current API. We need to split them into separate schemas with distinct (but related) API endpoints. The current approach is a source of user confusion, is hard to even document, and will be problematic for declarative config systems.
Did I miss any issues?
Thanks Brian, I thought you might have something to say about this PR. :)
Two comments:
Synchronous vs. asynchronous admission control and/or scheduling: Synchronous is easier for users to deal with, but asynchronous provides more flexibility and transparency to higher-level infrastructure layers. A synchronous interface can always be layered upon an asynchronous API, on the server or in libraries, UI, CLI, etc. If asynchronous, that's where it becomes necessary to represent that the request is awaiting an external dependency to be satisfied.
k8s currently has such a setup (an asynchronous API, with timeouts and, if needed, client-side polling to make it synchronous).
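As an illustration of that layering, a client can turn the asynchronous API into a synchronous call with a polling loop roughly like this (hypothetical helper using only the standard fmt and time packages, not the actual client library):

```go
// waitForPodRunning is a hypothetical client-side helper: it polls the pod's
// status until it reaches "Running" or the timeout expires, turning the
// asynchronous create-then-observe flow into a synchronous call.
func waitForPodRunning(getStatus func() (string, error), timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		status, err := getStatus()
		if err != nil {
			return err
		}
		if status == "Running" {
			return nil
		}
		time.Sleep(time.Second) // poll interval; purely illustrative
	}
	return fmt.Errorf("timed out waiting for pod to reach Running")
}
```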
Interface with scheduler: Did this PR imply that the interface should be implicit through the shared state in etcd? Let's not do that. Let's hash out a real API, here or elsewhere.
No, I wasn't intending to make a shared-state interface; I don't plan to have the scheduler talk to etcd at all. I'll make a PR with an explicit scheduler interface soonish.
Sorry for being even later to the discussion; still trying to grok all these related threads. Here are my thoughts, particularly as related to Mesos (or a Mesos framework) as a consumer of the scheduler API and pod states.
PodStatus correlates closely to Mesos TaskStates. For your reference Mesos has the following TaskStates:
- Staging (Mesos asked to launch task, but not yet downloaded to slave)
- Starting (task+executor downloaded to slave, executor started and ready to run task)
- Running (task is now running)
- Finished (task completed successfully)
- Failed (task failed during execution)
- Killed (killed explicitly by client/user)
- Lost (invalid task info, not enough resources, slave disconnected/removed, task no longer found, authorization or other system failure; each distinguished by an additional message string passed alongside the TaskState).
Mesos has no notion of 'New' or 'Admitted' as those are states internal to the framework-scheduler which will choose to eventually accept one of Mesos' resource offers with a LaunchTask message, at which point the Mesos master considers the task 'Staging'.
PodStatuses: I don't see the value in separating the Accepted state from the Assigned. Since this is an asynchronous, distributed system, a pod that was once Accepted/Feasible may no longer be feasible by the time the scheduler gets ahold of it to schedule/assign it. In fact, even after the scheduler has made an assignment and posted that to the registry, that machine could fill up or go down, meaning that the pod must be rejected/rescheduled anyway.
Brian's simplified NotYetRunning, Running, and Terminated makes sense to me. Mesos clients often write a helper function like isTerminal() that checks for any of Finished/Failed/Killed/Lost, so I can see the benefits of grouping those together alongside an additional reason/message. But I also see from the Scheduler API perspective the benefit of separating New (please schedule me) from Pending/Assigned/Staging (trying to schedule/start). Seems like you can keep Pending, Running, and Stopped, and just add New (or New can be expressed as a message to the scheduler).
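For example, such a helper over the TaskStates listed above might look like this (illustrative Go, not actual Mesos client code):

```go
// TaskState mirrors the Mesos task states listed above (illustrative only).
type TaskState string

const (
	TaskStaging  TaskState = "Staging"
	TaskStarting TaskState = "Starting"
	TaskRunning  TaskState = "Running"
	TaskFinished TaskState = "Finished"
	TaskFailed   TaskState = "Failed"
	TaskKilled   TaskState = "Killed"
	TaskLost     TaskState = "Lost"
)

// isTerminal reports whether a task has reached a terminal state; clients group
// Finished/Failed/Killed/Lost and rely on the accompanying message for the cause.
func isTerminal(s TaskState) bool {
	switch s {
	case TaskFinished, TaskFailed, TaskKilled, TaskLost:
		return true
	}
	return false
}
```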
Scheduler interface: +1 on not relying on shared state as an implicit API. I want an explicit API where the scheduler is notified of new pods and the scheduler can announce its assignment for each pod.
Handling infeasibility: "is the right behavior to accept the request and keep working on it, or to reject the request after some amount of time" - exactly the reason Mesos uses a resource offer model. If the client asked for something infeasible, Mesos responds with "Here's what I can offer you." and the client can decide whether to accept or decline the offer. In the Kubernetes scenario, I would lean toward a prompt rejection (with helpful error message/code) and let the client retry to its heart's content.
Pod-validity checks: I feel like can-it-fit and quota/fairness checks should be handled by the scheduler. The API server should just be responsible for adding a pod request to the registry, and then the scheduler can figure out whether/where to launch/schedule the pod. In a Mesos scheduler implementation, the pool of available resource offers is constantly changing, so the can-it-fit check would need to be closely tied to the scheduling/assignment.
Maybe apiserver could also have its own preliminary check for user quotas, abusive scheduling, etc. Each step in the process should have a way of erroring-out with a helpful error message/code.
Gang Scheduling: +1. To hoard or wait? Probably deferrable to a separate discussion.
Preemption: +1, but this can probably be deferred to a separate discussion too.
Scheduler interface: +1 on not relying on shared state as an implicit API. I want an explicit API where the scheduler is notified of new pods and the scheduler can announce its assignment for each pod.
@adam-mesos Please also take a look at #592.
@@ -62,30 +60,21 @@ func (registry *EtcdRegistry) helper() *tools.EtcdHelper {
// ListPods obtains a list of pods that match selector.
func (registry *EtcdRegistry) ListPods(selector labels.Selector) ([]api.Pod, error) {
Doesn't look like you're using the selector anymore, just getting a list of all pods in the registry. Shouldn't you loop through the pods returned by ExtractList and check selector.Matches(labels.Set(pod.Labels))?
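Something along these lines, presumably (a sketch only; it assumes api.Pod's Labels field and the labels package already used in this PR):

```go
// filterPods keeps only the pods whose labels match the selector -- the check
// suggested above for the slice that ExtractList returns. Sketch only.
func filterPods(pods []api.Pod, selector labels.Selector) []api.Pod {
	filtered := []api.Pod{}
	for _, pod := range pods {
		if selector.Matches(labels.Set(pod.Labels)) {
			filtered = append(filtered, pod)
		}
	}
	return filtered
}
```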
Yes, that's a good catch. Watch will do the same (I can't easily implement watch until after some of the changes this PR makes get in).
This is fixed now, below.
Thanks everyone for your input. I will iterate on this PR and #592 by the end of the week, I hope.
I believe I've responded to everyone's concerns. I called the three states "Waiting", "Running", and "Terminated".
@@ -200,9 +200,12 @@ type PodStatus string

// These are the valid statuses of pods.
const (
	// PodWaiting means that we're waiting for the pod to begin running.
	PodWaiting = "Waiting"
Type as a PodStatus?
Thanks, will fix.
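For reference, the typed form being asked for would look roughly like this (sketch):

```go
const (
	// PodWaiting means that we're waiting for the pod to begin running.
	PodWaiting PodStatus = "Waiting"
)
```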
Are there any further comments on this? It's a pain to keep it rebased.
LGTM
This PR looks good. I'd still like to discuss Bindings with you separately.
Thanks, @adam-mesos. I'll rebase this and get it pushed later tonight or tomorrow.
1. Change names of Pod statuses (Waiting, Running, Terminated). 2. Store assigned host in etcd. 3. Change pod key to /registry/pods/<podid>. Container location remains the same (/registry/hosts/<machine>/kubelet).
Finally got this thing rebased & tests passing again.
@brendandburns @thockin @smarterclayton Merge me, maybe?
And make tests pass again.
if !ok {
	return nil, fmt.Errorf("unexpected object: %#v", obj)
}
pod.CurrentState.Host = machine
Can you make it clear that this is for convenience, it doesn't represent truth
nm, discussed on IM, this should be desired state, and then current state should be set via observation of the kubelet.
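A sketch of the agreed direction (the DesiredState.Host field name is assumed from this PR's API, so treat this as illustrative rather than the final code):

```go
// assignPod records the scheduler's decision: the chosen machine goes into the
// pod's desired state, while CurrentState.Host is left to be filled in later
// via observation of the kubelet (through the podcache). Illustrative sketch only.
func assignPod(pod *api.Pod, machine string) {
	pod.DesiredState.Host = machine
	// pod.CurrentState.Host is deliberately not set here.
}
```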
@brendandburns PTAL
LGTM. Will merge when Travis passes.
Prepare for external scheduler
Add ManagerMock.