Prepare for external scheduler #557
Conversation
I'd like @bgrant0607 to take a look at the PodStatuses I'm adding.
// a pod would get this status after having had status "New" because a scheduler decided
// that the system doesn't have a space for the pod. Pods that were previously "Admitted"
// should rarely, if ever, be subsequently rejected with this status.
PodInfeasible PodStatus = "Infeasible"
Is Infeasible a commonly used term in scheduling? The condition feels more like "GaveUp" or "CouldNotSchedule" - I just anticipate explaining Infeasible to users of naive clients in the future, was hoping for something more related to scheduling. "OutOfSpace"? "NotPossible"?
@smarterclayton Being out of space can be a temporary condition, like if some machines are temporarily removed from the cluster due to an error, but will be replaced soon.
So, is the right behavior to accept the request and keep working on it, or to reject the request after some amount of time? The former has the advantage that client code does not need to retry in the case of a temporary shortage of machines.
On the other hand, if the client is never going to succeed without changing something external to the system (auth failed, or not enough quota, etc) then a prompt rejection makes sense.
Thoughts?
Agree with your statements. I'm mostly looking for a better word than Infeasible that conveys the intent more precisely without being too vernacular (i.e. GaveUp).
OK, so my thoughts:
We need authentication/authorization checks. But those are things we (should) check at an RPC level. So if you fail them, we reject your RPC. So we don't need a pod status saying it was authorized/authenticated.
The next thing we want to check is "do you have quota/does the cluster have space". In my mind, that's a check that a scheduler has to have anyway, so we don't want to do it synchronously in the apiserver. Therefore, we make a pod with status "New" and store it in our storage layer.
A scheduler sees that a wild pod suddenly appears. It does the quota/cluster space check, which is cheap. If the pod fails, the scheduler marks it as infeasible. If it passes, the scheduler marks it as "Accepted" or "Admitted" or some such.
At this point, the apiserver can notice the change in etcd and return a response to the user. We currently block until a pod is actually running on its assigned host, but I'm pretty sure that's a little extreme. We could consider having the apiserver simply delete pods marked infeasible (rather than accepted) from the system entirely.
The next step, from the pod's perspective, assuming it's accepted, is that the scheduler runs a scheduling pass and finds it a place to go. Then its status changes to "Assigned" and it has a host to live on.
Then the kubelet runs it, and the podcache eventually sees that it's running and updates its status to "Running".
I'm not in love with any of the names for these states, or even this particular control flow. The only things that are important to me is that a) the scheduler runs asynchronously and in a separate process from apiserver, and b) the scheduler does the quota/fit test in addition to the final scheduling.
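For concreteness, here's a minimal sketch of the statuses that control flow implies; the names are placeholders (as noted above, none of them are final), not the actual API definition:

```go
// Hypothetical sketch of the statuses implied by the flow described above.
// Names are placeholders, not the final API.
type PodStatus string

const (
	// PodNew: the apiserver has stored the pod, but no scheduler has looked at it yet.
	PodNew PodStatus = "New"
	// PodInfeasible: a scheduler decided the cluster cannot fit the pod (quota/space check failed).
	PodInfeasible PodStatus = "Infeasible"
	// PodAdmitted: the pod passed the quota/space check and is waiting for a scheduling pass.
	PodAdmitted PodStatus = "Admitted"
	// PodAssigned: the scheduler has picked a host for the pod.
	PodAssigned PodStatus = "Assigned"
	// PodRunning: the kubelet has started the pod and the podcache has observed it running.
	PodRunning PodStatus = "Running"
)
```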
Why does the scheduler need to do the quota check?
Abusive scheduling (schedule X thousand of something) seems like something the apiserver has to defend against...
That's a good point. The apiserver should defend against this, not a scheduler. OK. So the apiserver has to do a quota check. :( :(
@lavalamp @smarterclayton @erictune
Sorry for joining the discussion late. Many deep issues affecting the API layering and system architecture are entangled in this discussion. So, first I'll do a brain dump, then we can discuss how to break up the problem and converge on concrete solutions.
Admission control: Should the request to run this particular Pod be accepted? Clients definitely need to be able to distinguish permanent from temporary rejection, for some definition of "permanent" in a dynamic environment, and users, and in some cases systems, need actionable information about the cause of the rejection. For instance, the rejection could be due to authorization failure, invalid request, exceeding budget or quota, etc.
However, the same question may also apply to sets of pods. For instance, a batch scheduler may want to make an admission-control decision based on whether a whole set of pods could be gang-scheduled simultaneously. A higher-level service placement system may want to determine whether a whole service could "fit". Update/rollout tools may want to verify that the updated resource requirements will work in aggregate. So, we'll likely need a way to invoke an admission-control decision on an aggregate resource request of some kind, without actually instantiating the underlying resource consumers (i.e., pods in this case).
Additionally, typically these kinds of decisions also want some assurance that the resources will be available beyond the current moment. For deployments of k8s on physical hosts, this may involve ensuring that enough spare capacity has been provisioned, taking into account potential host failures and maintenance. Being able to satisfy the resource requests "now" isn't sufficient. For deployments on virtual hosts, we'll need to figure out what interfaces we need to wrap around the underlying virtual resource provisioning mechanisms. In general, we want to ensure we have the means to communicate resource demand to provisioning mechanisms, including both host provisioning and quota provisioning.
We'll also need to figure out how we should handle resource (de)fragmentation and workload churn in these decisions. This is related to Clayton's point about retry and back-off.
If we reject pod creation requests due to instantaneous infeasibility/unsatisfiability, then we create a situation where services could be starved indefinitely of resources, regardless of priority, fairness, etc. Mesos addressed this with its resource offer mechanism: http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf. Some sort of resource request queue or scoreboard is needed. Probably we should involve Mesosphere (@adam-mesos) in the discussion of the approach to take here, as well as in the resource demand signaling discussed above.
Speaking of priority, that raises the issue of preemption in order to satisfy higher-priority requests.
Synchronous vs. asynchronous admission control and/or scheduling: Synchronous is easier for users to deal with, but asynchronous provides more flexibility and transparency to higher-level infrastructure layers. A synchronous interface can always be layered upon an asynchronous API, on the server or in libraries, UI, CLI, etc. If asynchronous, that's where it becomes necessary to represent that the request is awaiting an external dependency to be satisfied. Which leads me to...
An explicit enumeration of states and state machine, especially where those states may be persisted and/or interpreted by clients: This approach has a number of problems (e.g., the persisted state value may diverge from the actual current status), but the biggest is that it's not extensible. Components of the system, clients, user applications, etc. will bake in a particular set of states. The bigger the ecosystem becomes, the harder it becomes to add to that set of states, and it quickly becomes impossible to even change the observed runtime behavior of the existing states. Components don't know how to handle new, unrecognized states. Extensions to the system have no way of injecting new states dynamically in a running system, so they instead use hacks rather than states. Instead, I think there are just 3 relevant states: not yet running, running, and terminated.
There may be many reasons for not yet running. We should have a separate field to indicate the reason, which shouldn't be a strict enumeration, but which should represent what component/decision we're waiting on. It should be possible to dynamically orchestrate a chain of decisions (e.g., resource prediction -> cluster selection -> gang scheduling -> cluster admission control -> instance scheduling) using a continuation-passing approach. (Related to that, some kind of event bus will be needed in order to kick off automation that isn't on the critical path of instantiating the objects.)
Similarly, there are many reasons for termination -- see #137 .
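A minimal sketch of what that three-state model with a free-form reason might look like; the type and field names here are purely illustrative, not the actual schema:

```go
// Illustrative only: three coarse states plus an open-ended Reason, rather than
// a strict enumeration of intermediate states that clients would have to bake in.
type PodState string

const (
	PodNotYetRunning PodState = "NotYetRunning"
	PodRunning       PodState = "Running"
	PodTerminated    PodState = "Terminated"
)

type PodCurrentStatus struct {
	State PodState
	// Reason is deliberately not a closed enumeration; it records which
	// component or decision the pod is waiting on (e.g. "gang-scheduling",
	// "cluster-admission-control") or why it terminated, so extensions can
	// add new reasons without breaking existing clients.
	Reason string
}
```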
Interface with scheduler: Did this PR imply that the interface should be implicit through the shared state in etcd? Let's not do that. Let's hash out a real API, here or elsewhere.
Master vs. Kubelet APIs: We're going to have to split these, ideally in a way that they cleanly compose.
Desired state vs. current state: I don't like the way these are represented in the current API. We need to split them into separate schemas with distinct (but related) API endpoints. The current approach is a source of user confusion, is hard to even document, and will be problematic for declarative config systems.
Did I miss any issues?
Thanks Brian, I thought you might have something to say about this PR. :)
Two comments:
Synchronous vs. asynchronous admission control and/or scheduling: Synchronous is easier for users to deal with, but asynchronous provides more flexibility and transparency to higher-level infrastructure layers. A synchronous interface can always be layered upon an asynchronous API, on the server or in libraries, UI, CLI, etc. If asynchronous, that's where it becomes necessary to represent that the request is awaiting an external dependency to be satisfied.
k8s currently has such a setup (an asynchronous API, with timeouts and, if needed, client-side polling to make it synchronous).
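As an illustration of that layering, a client can turn the asynchronous API into a synchronous call with a polling loop roughly like this (hypothetical helper using only the standard fmt and time packages, not the actual client library):

```go
// waitForPodRunning is a hypothetical client-side helper: it polls the pod's
// status until it reaches "Running" or the timeout expires, turning the
// asynchronous create-then-observe flow into a synchronous call.
func waitForPodRunning(getStatus func() (string, error), timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		status, err := getStatus()
		if err != nil {
			return err
		}
		if status == "Running" {
			return nil
		}
		time.Sleep(time.Second) // poll interval; purely illustrative
	}
	return fmt.Errorf("timed out waiting for pod to reach Running")
}
```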
Interface with scheduler: Did this PR imply that the interface should be implicit through the shared state in etcd? Let's not do that. Let's hash out a real API, here or elsewhere.
No, I wasn't intending to make a shared-state interface; I don't plan to have the scheduler talk to etcd at all. I'll make a PR with an explicit scheduler interface soonish.
Sorry for being even later to the discussion; still trying to grok all these related threads. Here are my thoughts, particularly as related to Mesos (or a Mesos framework) as a consumer of the scheduler API and pod states.
PodStatus correlates closely to Mesos TaskStates. For your reference Mesos has the following TaskStates:
- Staging (Mesos asked to launch task, but not yet downloaded to slave)
- Starting (task+executor downloaded to slave, executor started and ready to run task)
- Running (task is now running)
- Finished (task completed successfully)
- Failed (task failed during execution)
- Killed (killed explicitly by client/user)
- Lost (invalid task info, not enough resources, slave disconnected/removed, task no longer found, authorization or other system failure; each distinguished by an additional message string passed alongside the TaskState).
Mesos has no notion of 'New' or 'Admitted' as those are states internal to the framework-scheduler which will choose to eventually accept one of Mesos' resource offers with a LaunchTask message, at which point the Mesos master considers the task 'Staging'.
PodStatuses: I don't see the value in separating the Accepted state from the Assigned. Since this is an asynchronous, distributed system, a pod that was once Accepted/Feasible may no longer be feasible by the time the scheduler gets ahold of it to schedule/assign it. In fact, even after the scheduler has made an assignment and posted that to the registry, that machine could fill up or go down, meaning that the pod must be rejected/rescheduled anyway.
Brian's simplified NotYetRunning, Running, and Terminated makes sense to me. Mesos clients often write a helper function like isTerminal() that checks for any of Finished/Failed/Killed/Lost, so I can see the benefits of grouping those together alongside an additional reason/message. But I also see from the Scheduler API perspective the benefit of separating New (please schedule me) from Pending/Assigned/Staging (trying to schedule/start). Seems like you can keep Pending, Running, and Stopped, and just add New (or New can be expressed as a message to the scheduler).
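For example, such a helper over the TaskStates listed above might look like this (illustrative Go, not actual Mesos client code):

```go
// TaskState mirrors the Mesos task states listed above (illustrative only).
type TaskState string

const (
	TaskStaging  TaskState = "Staging"
	TaskStarting TaskState = "Starting"
	TaskRunning  TaskState = "Running"
	TaskFinished TaskState = "Finished"
	TaskFailed   TaskState = "Failed"
	TaskKilled   TaskState = "Killed"
	TaskLost     TaskState = "Lost"
)

// isTerminal reports whether a task has reached a terminal state; clients group
// Finished/Failed/Killed/Lost and rely on the accompanying message for the cause.
func isTerminal(s TaskState) bool {
	switch s {
	case TaskFinished, TaskFailed, TaskKilled, TaskLost:
		return true
	}
	return false
}
```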
Scheduler interface: +1 on not relying on shared state as an implicit API. I want an explicit API where the scheduler is notified of new pods and the scheduler can announce its assignment for each pod.
Handling infeasibility: "is the right behavior to accept the request and keep working on it, or to reject the request after some amount of time" - exactly the reason Mesos uses a resource offer model. If the client asked for something infeasible, Mesos responds with "Here's what I can offer you." and the client can decide whether to accept or decline the offer. In the Kubernetes scenario, I would lean toward a prompt rejection (with helpful error message/code) and let the client retry to its heart's content.
Pod-validity checks: I feel like can-it-fit and quota/fairness checks should be handled by the scheduler. The API server should just be responsible for adding a pod request to the registry, and then the scheduler can figure out whether/where to launch/schedule the pod. In a Mesos scheduler implementation, the pool of available resource offers is constantly changing, so the can-it-fit check would need to be closely tied to the scheduling/assignment.
Maybe apiserver could also have its own preliminary check for user quotas, abusive scheduling, etc. Each step in the process should have a way of erroring-out with a helpful error message/code.
Gang Scheduling: +1. To hoard or wait? Probably deferrable to a separate discussion.
Preemption: +1, but this can probably be deferred to a separate discussion too.
Scheduler interface: +1 on not relying on shared state as an implicit API. I want an explicit API where the scheduler is notified of new pods and the scheduler can announce its assignment for each pod.
@adam-mesos Please also take a look at #592.
@@ -62,30 +60,21 @@ func (registry *EtcdRegistry) helper() *tools.EtcdHelper {
// ListPods obtains a list of pods that match selector.
func (registry *EtcdRegistry) ListPods(selector labels.Selector) ([]api.Pod, error) {
Doesn't look like you're using the selector anymore, just getting a list of all pods in the registry. Shouldn't you loop through the pods returned by ExtractList and check selector.Matches(labels.Set(pod.Labels))?
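Something along these lines, presumably (a sketch only; it assumes api.Pod's Labels field and the labels package already used in this PR):

```go
// filterPods keeps only the pods whose labels match the selector -- the check
// suggested above for the slice that ExtractList returns. Sketch only.
func filterPods(pods []api.Pod, selector labels.Selector) []api.Pod {
	filtered := []api.Pod{}
	for _, pod := range pods {
		if selector.Matches(labels.Set(pod.Labels)) {
			filtered = append(filtered, pod)
		}
	}
	return filtered
}
```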
Yes, that's a good catch. Watch will do the same (I can't easily implement watch until after some of the changes this PR makes get in).
This is fixed now, below.
Thanks everyone for your input. I will iterate on this PR and #592 by the end of the week, I hope.
I believe I've responded to everyone's concerns. I called the three states "Waiting", "Running", and "Terminated".
@@ -200,9 +200,12 @@ type PodStatus string

// These are the valid statuses of pods.
const (
	// PodWaiting means that we're waiting for the pod to begin running.
	PodWaiting = "Waiting"
Type as a PodStatus?
Thanks, will fix.
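For reference, the typed form being asked for would look roughly like this (sketch):

```go
const (
	// PodWaiting means that we're waiting for the pod to begin running.
	PodWaiting PodStatus = "Waiting"
)
```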
Are there any further comments on this? It's a pain to keep it rebased.
LGTM
This PR looks good. I'd still like to discuss Bindings with you separately.
Thanks, @adam-mesos. I'll rebase this and get it pushed later tonight or tomorrow.
1. Change names of Pod statuses (Waiting, Running, Terminated). 2. Store assigned host in etcd. 3. Change pod key to /registry/pods/<podid>. Container location remains the same (/registry/hosts/<machine>/kubelet).
Finally got this thing rebased & tests passing again.
@brendandburns @thockin @smarterclayton Merge me, maybe?
And make tests pass again.
if !ok {
	return nil, fmt.Errorf("unexpected object: %#v", obj)
}
pod.CurrentState.Host = machine
Can you make it clear that this is for convenience, it doesn't represent truth
nm, discussed on IM, this should be desired state, and then current state should be set via observation of the kubelet.
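A sketch of the agreed direction (the DesiredState.Host field name is assumed from this PR's API, so treat this as illustrative rather than the final code):

```go
// assignPod records the scheduler's decision: the chosen machine goes into the
// pod's desired state, while CurrentState.Host is left to be filled in later
// via observation of the kubelet (through the podcache). Illustrative sketch only.
func assignPod(pod *api.Pod, machine string) {
	pod.DesiredState.Host = machine
	// pod.CurrentState.Host is deliberately not set here.
}
```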
@brendandburns PTAL
LGTM. Will merge when Travis passes.
Prepare for external scheduler
Add ManagerMock.