diff --git a/keps/sig-scheduling/20180409-scheduling-framework-extensions.png b/keps/sig-scheduling/20180409-scheduling-framework-extensions.png index 25f50471010..e2c1a2f841e 100644 Binary files a/keps/sig-scheduling/20180409-scheduling-framework-extensions.png and b/keps/sig-scheduling/20180409-scheduling-framework-extensions.png differ diff --git a/keps/sig-scheduling/20180409-scheduling-framework-threads.png b/keps/sig-scheduling/20180409-scheduling-framework-threads.png index ae9e1965d6d..34c2bde759c 100644 Binary files a/keps/sig-scheduling/20180409-scheduling-framework-threads.png and b/keps/sig-scheduling/20180409-scheduling-framework-threads.png differ diff --git a/keps/sig-scheduling/20180409-scheduling-framework.md b/keps/sig-scheduling/20180409-scheduling-framework.md index 5caf7ff9d9d..ff69f209f35 100644 --- a/keps/sig-scheduling/20180409-scheduling-framework.md +++ b/keps/sig-scheduling/20180409-scheduling-framework.md @@ -1,7 +1,9 @@ --- +kep-number: 34 title: Scheduling Framework authors: - - "@bsalamat" + - '@bsalamat' + - '@misterikkit' owning-sig: sig-scheduling participating-sigs: [] reviewers: @@ -10,443 +12,648 @@ approvers: - TBD editor: TBD creation-date: 2018-04-09 -last-updated: 2018-08-15 +last-updated: 2019-01-29 status: draft see-also: [] replaces: - - https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduling-framework.md + - >- + https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduling-framework.md superseded-by: [] --- - # Scheduling Framework - - -- [SUMMARY ](#summary-) -- [OBJECTIVE](#objective) - - [Terminology](#terminology) -- [BACKGROUND](#background) -- [OVERVIEW](#overview) - - [Non-goals](#non-goals) -- [DETAILED DESIGN](#detailed-design) - - [Bare bones of scheduling](#bare-bones-of-scheduling) - - [Communication and statefulness of plugins](#communication-and-statefulness-of-plugins) - - [Plugin registration](#plugin-registration) 
- - [Extension points](#extension-points) - - [Scheduling queue sort](#scheduling-queue-sort) - - [Pre-filter](#pre-filter) - - [Filter](#filter) - - [Post-filter](#post-filter) - - [Scoring](#scoring) - - [Post-scoring/pre-reservation](#post-scoringpre-reservation) - - [Reserve](#reserve) - - [Permit](#permit) - - [Approving a Pod binding](#approving-a-pod-binding) - - [Reject](#reject) - - [Pre-Bind](#pre-bind) - - [Bind](#bind) - - [Post Bind](#post-bind) -- [USE-CASES](#use-cases) - - [Dynamic binding of cluster-level resources](#dynamic-binding-of-cluster-level-resources) - - [Gang Scheduling](#gang-scheduling) -- [OUT OF PROCESS PLUGINS](#out-of-process-plugins) -- [CONFIGURING THE SCHEDULING FRAMEWORK](#configuring-the-scheduling-framework) -- [BACKWARD COMPATIBILITY WITH SCHEDULER v1](#backward-compatibility-with-scheduler-v1) -- [DEVELOPMENT PLAN](#development-plan) -- [TESTING PLAN](#testing-plan) -- [WORK ESTIMATES ](#work-estimates) - -# SUMMARY + + +* [SUMMARY](#summary) +* [MOTIVATION](#motivation) + * [Goals](#goals) + * [Non-Goals](#non-goals) +* [PROPOSAL](#proposal) + * [Scheduling Cycle](#scheduling-cycle) + * [Extension points](#extension-points) + * [Queue sort](#queue-sort) + * [Pre-filter](#pre-filter) + * [Filter](#filter) + * [Post-filter](#post-filter) + * [Scoring](#scoring) + * [Normalize scoring](#normalize-scoring) + * [Reserve](#reserve) + * [Permit](#permit) + * [Pre-bind](#pre-bind) + * [Bind](#bind) + * [Post-bind](#post-bind) + * [Un-reserve](#un-reserve) + * [Plugin API](#plugin-api) + * [PluginContext](#plugincontext) + * [PluginHandle](#pluginhandle) + * [Plugin Registration](#plugin-registration) + * [Plugin Lifecycle](#plugin-lifecycle) + * [Initialization](#initialization) + * [Concurrency](#concurrency) + * [Configuring Plugins](#configuring-plugins) + * [Enable/Disable](#enabledisable) + * [Change Evaluation Order](#change-evaluation-order) + * [Optional Args](#optional-args) + * [Backward 
compatibility](#backward-compatibility) + * [Interactions with Cluster Autoscaler](#interactions-with-cluster-autoscaler) +* [USE CASES](#use-cases) + * [Coscheduling](#coscheduling) + * [Dynamic Resource Binding](#dynamic-resource-binding) + * [Custom Scheduler Plugins (out of tree)](#custom-scheduler-plugins-out-of-tree) +* [GRADUATION CRITERIA](#graduation-criteria) +* [IMPLEMENTATION HISTORY](#implementation-history) + + + +# SUMMARY This document describes the Kubernetes Scheduling Framework. The scheduling -framework implements only basic functionality, but exposes many extension points -for plugins to expand its functionality. The plan is that this framework (with -its plugins) will eventually replace the current Kubernetes scheduler. - -# OBJECTIVE - -- make scheduler more extendable. -- Make scheduler core simpler by moving some of its features to plugins. -- Propose extension points in the framework. -- Propose a mechanism to receive plugin results and continue or abort based - on the received results. -- Propose a mechanism to handle errors and communicate it with plugins. - -## Terminology - -Scheduler v1, current scheduler: refer to existing scheduler of Kubernetes. -Scheduler v2, scheduling framework: refer to the new scheduler proposed in this -doc. - -# BACKGROUND - -Many features are being added to the Kubernetes default scheduler. They keep -making the code larger and logic more complex. A more complex scheduler is -harder to maintain, its bugs are harder to find and fix, and those users running -a custom scheduler have a hard time catching up and integrating new changes. -The current Kubernetes scheduler provides -[webhooks to extend](./scheduler_extender.md) -its functionality. However, these are limited in a few ways: - -1. The number of extension points are limited: "Filter" extenders are called - after default predicate functions. "Prioritize" extenders are called after - default priority functions. 
"Preempt" extenders are called after running - default preemption mechanism. "Bind" verb of the extenders are used to bind - a Pod. Only one of the extenders can be a binding extender, and that - extender performs binding instead of the scheduler. Extenders cannot be - invoked at other points, for example, they cannot be called before running - predicate functions. -1. Every call to the extenders involves marshaling and unmarshalling JSON. - Calling a webhook (HTTP request) is also slower than calling native functions. -1. It is hard to inform an extender that scheduler has aborted scheduling of - a Pod. For example, if an extender provisions a cluster resource and - scheduler contacts the extender and asks it to provision an instance of the - resource for the Pod being scheduled and then scheduler faces errors - scheduling the Pod and decides to abort the scheduling, it will be hard to - communicate the error with the extender and ask it to undo the provisioning - of the resource. -1. Since current extenders run as a separate process, they cannot use - scheduler's cache. They must either build their own cache from the API - server or process only the information they receive from the default scheduler. +framework is a new set of "plugin" APIs being added to the existing Kubernetes +Scheduler. Plugins are compiled into the scheduler, and these APIs allow many +scheduling features to be implemented as plugins, while keeping the scheduling +"core" simple and maintainable. + +*Note: Previous versions of this document proposed replacing the existing +scheduler with a new implementation.* + +# MOTIVATION + +Many features are being added to the Kubernetes Scheduler. They keep making the +code larger and the logic more complex. A more complex scheduler is harder to +maintain, its bugs are harder to find and fix, and those users running a custom +scheduler have a hard time catching up and integrating new changes. 
The current
+Kubernetes scheduler provides [webhooks to extend][] its functionality. However,
+these are limited in a few ways:
+
+[webhooks to extend]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/scheduler_extender.md
+
+1. The number of extension points is limited: "Filter" extenders are called
+   after default predicate functions. "Prioritize" extenders are called after
+   default priority functions. "Preempt" extenders are called after running the
+   default preemption mechanism. The "Bind" verb of the extenders is used to
+   bind a Pod. Only one of the extenders can be a binding extender, and that
+   extender performs binding instead of the scheduler. Extenders cannot be
+   invoked at other points; for example, they cannot be called before running
+   predicate functions.
+1. Every call to the extenders involves marshalling and unmarshalling JSON.
+   Calling a webhook (HTTP request) is also slower than calling native
+   functions.
+1. It is hard to inform an extender that the scheduler has aborted scheduling
+   of a Pod. For example, suppose an extender provisions a cluster resource and
+   the scheduler asks it to provision an instance of that resource for the pod
+   being scheduled; if the scheduler then faces errors and decides to abort the
+   scheduling, it is hard to communicate the error to the extender and ask it
+   to undo the provisioning of the resource.
+1. Since current extenders run as a separate process, they cannot use the
+   scheduler's cache. They must either build their own cache from the API
+   server or process only the information they receive from the default
+   scheduler. The above limitations hinder building high performance and versatile scheduler
We would ideally like to have an extension mechanism that is fast -enough to allow keeping a bare minimum logic in the scheduler core and convert -many of the existing features of default scheduler, such as predicate and -priority functions and preemption into plugins. Such plugins will be compiled -with the scheduler. We would also like to provide an extension mechanism that do -not need recompilation of scheduler. The expected performance of such plugins is -lower than in-process plugins. Such out-of-process plugins should be used in -cases where quick invocation of the plugin is not a constraint. - -# OVERVIEW - -Scheduler v2 allows both built-in and out-of-process extenders. This new -architecture is a scheduling framework that exposes several extension points -during a scheduling cycle. Scheduler plugins can register to run at one or more -extension points. - -#### Non-goals - -- We will keep Kubernetes API backward compatibility, but keeping scheduler - v1 backward compatibility is a non-goal. Particularly, scheduling policy - config and v1 extenders won't work in this new framework. -- Solve all the scheduler v1 limitations, although we would like to ensure - that the new framework allows us to address known limitations in the future. -- Provide implementation details of plugins and call-back functions, such as - all of their arguments and return values. - -# DETAILED DESIGN - -## Bare bones of scheduling - -Pods that are not assigned to any node go to a scheduling queue and sorted by -order specified by plugins (described [here](#scheduling-queue-sort)). The -scheduling framework picks the head of the queue and starts a **scheduling -cycle** to schedule the pod. At the end of the cycle scheduler determines -whether the pod is schedulable or not. If the pod is not schedulable, its status -is updated and goes back to the scheduling queue. If the pod is schedulable (one -or more nodes are found that can run the Pod), the scoring process is started. 
-The scoring process finds the best node to run the Pod. Once the best node is -picked, the scheduler updates its cache and then a bind go routine is started to -bind the pod. -The above process is the same as what Kubernetes scheduler v1 does. Some of the -essential features of scheduler v1, such as leader election, will also be -transferred to the scheduling framework. -In the rest of this section we describe how various plugins are used to enrich -this basic workflow. This document focuses on in-process plugins. -Out-of-process plugins are discussed later in a separate doc. - -## Communication and statefulness of plugins - -The scheduling framework provides a library that plugins can use to pass -information to other plugins. This library keeps a map from keys of type string -to opaque pointers of type interface{}. A write operation takes a key and a -pointer and stores the opaque pointer in the map with the given key. Other -plugins can provide the key and receive the opaque pointer. Multiple plugins can -share the state or communicate via this mechanism. -The saved state is preserved only during a single scheduling cycle. At the end -of a scheduling cycle, this map is destructed. So, plugins cannot keep shared -state across multiple scheduling cycle. They can, however, update the scheduler -cache via the provided interface of the cache. The cache interface allows -limited state preservation across multiple scheduling cycle. -It is worth noting that plugins are assumed to be **trusted**. Scheduler does -not prevent one plugin from accessing or modifying another plugin's state. - -## Plugin registration - -Plugin registration is done by providing an extension point and a function that -should be called at that extension point. This step will be something like: +features. We would ideally like to have an extension mechanism that is fast +enough to allow existing features to be converted into plugins, such as +predicate and priority functions. 
Such plugins will be compiled into the +scheduler binary. Additionally, authors of custom schedulers can compile a +custom scheduler using (unmodified) scheduler code and their own plugins. -```go -register("pre-filter", plugin.foo) -``` +## Goals + +- Make scheduler more extendable. +- Make scheduler core simpler by moving some of its features to plugins. +- Propose extension points in the framework. +- Propose a mechanism to receive plugin results and continue or abort based on + the received results. +- Propose a mechanism to handle errors and communicate them with plugins. + +## Non-Goals + +- Solve all scheduler limitations, although we would like to ensure that the + new framework allows us to address known limitations in the future. +- Provide implementation details of plugins and call-back functions, such as + all of their arguments and return values. + +# PROPOSAL + +The Scheduling Framework defines new extension points and Go APIs in the +Kubernetes Scheduler for use by "plugins". Plugins add scheduling behaviors to +the scheduler, and are included at compile time. The scheduler's ComponentConfig +will allow plugins to be enabled, disabled, and reordered. Custom schedulers can +write their plugins "[out-of-tree](#custom-scheduler-plugins-out-of-tree)" and +compile a scheduler binary with their own plugins included. -The details of the function signature will be provided later. +## Scheduling Cycle + +The main loop of the scheduler is referred to as a "scheduling cycle". Each +cycle covers the complete process of assigning one pod to a node (or determining +that the pod cannot be scheduled). Multiple scheduling cycles are started +serially, but some parts may run concurrently. (See [Concurrency](#concurrency)) ## Extension points -The following picture shows the scheduling cycle of a Pod and the extension +The following picture shows the scheduling cycle of a pod and the extension points that the scheduling framework exposes. 
In this picture "Filter" is -equivalent to "Predicate" in scheduler v1 and "Scoring" is equivalent to -"Priority function". Plugins are go functions. They are registered to be called -at one of these extension points. They are called by the framework in the same -order they are registered for each extension point. -In the following sections we describe each extension point in the same order -they are called in a schedule cycle. +equivalent to "Predicate" and "Scoring" is equivalent to "Priority function". +Plugins are registered to be called at one or more of these extension points. In +the following sections we describe each extension point in the same order they +are called in a scheduling cycle. + +One plugin may register at multiple extension points to perform more complex or +stateful tasks. ![image](20180409-scheduling-framework-extensions.png) -### Scheduling queue sort +### Queue sort -These plugins indicate how Pods should be sorted in the scheduling queue. A -plugin registered at this point only returns greater, smaller, or equal to -indicate an ordering between two Pods. In other words, a plugin at this -extension point returns the answer to "less(pod1, pod2)". Multiple plugins may -be registered at this point. Plugins registered at this point are called in -order and the invocation continues as long as plugins return "equal". Once a -plugin returns "greater" or "smaller" the invocation of these plugins are -stopped. +These plugins are used to sort pods in the scheduling queue. A queue sort plugin +essentially will provide a "less(pod1, pod2)" function. Only one queue sort +plugin may be enabled at a time. ### Pre-filter -These plugins are generally useful to check certain conditions that the cluster -or the Pod must meet. These are also useful to perform pre-processing on the pod -and store some information about the pod that can be used by other plugins. -The pod pointer is passed as an argument to these plugins. 
If any of these
-plugins return an error, the scheduling cycle is aborted.
-These plugins are called serially in the same order registered.
+These plugins are used to pre-process info about the pod, or to check certain
+conditions that the cluster or the pod must meet. If a pre-filter plugin returns
+an error, the scheduling cycle is aborted. Pre-filter plugins are called
+serially within a scheduling cycle.

### Filter

-Filter plugins filter out nodes that cannot run the Pod. Scheduler runs these
-plugins per node in the same order that they are registered, but scheduler may
-run these filter function for multiple nodes in parallel. So, these plugins must
-use synchronization when they modify state.
-Scheduler stops running the remaining filter functions for a node once one of
-these filters fails for the node.
+These plugins are used to filter out nodes that cannot run the Pod. For each
+node, the scheduler will call filter plugins in their configured order. If any
+filter plugin marks the node as infeasible, the remaining plugins will not be
+called for that node. Nodes may be evaluated concurrently.

### Post-filter

-The Pod and the set of nodes that can run the Pod are passed to these plugins.
-They are called whether Pod is schedulable or not (whether the set of nodes is
-empty or non-empty).
-If any of these plugins return an error or if the Pod is determined
-unschedulable, the scheduling cycle is aborted.
-These plugins are called serially.
+This is an informational extension point. Plugins will be called with a list
+of nodes that passed the filtering phase. A plugin may use this data to update
+internal state or to generate logs/metrics.
+
+**Note:** Plugins wishing to perform "pre-scoring" work should use the
+post-filter extension point.

### Scoring

-These plugins are similar to priority function in scheduler v1. They are
-utilized to rank nodes that have passed the filtering stage.
Similar to Filter -plugins, these are called per node serially in the same order registered, but -scheduler may run them for multiple nodes in parallel. -Each one of these functions return a score for the given node. The score is -multiplied by the weight of the function and aggregated with the result of other -scoring functions to yield a total score for the node. -These functions can never block scheduling. In case of an error they should -return zero for the Node being ranked. +These plugins are used to rank nodes that have passed the filtering phase. The +scheduler will call each scoring plugin for each node. There will be a well +defined range of integers representing the minimum and maximum scores. After the +[normalize scoring](#normalize-scoring) phase, the scheduler will combine node +scores from all plugins according to the configured plugin weights. -### Post-scoring/pre-reservation +If a scoring plugin returns an error, the scheduler will treat it as a zero +score. -After all scoring plugins are invoked and the score of nodes are determined, the -framework picks the best node with the highest score and then it calls -post-scoring plugins. The Pod and the chosen Node are passed to these plugins. -These plugins have one more chance to check any conditions about the assignment -of the Pod to this Node and reject the node if needed. +### Normalize scoring -![image](20180409-scheduling-framework-threads.png) +These plugins are used to modify scores before the scheduler computes a final +ranking of Nodes. A plugin that registers for this extension point will be +called with the [scoring](#scoring) results from the same plugin. This is called +once per plugin per scheduling cycle. + +For example, suppose a plugin `BlinkingLightScorer` ranks Nodes based on how +many blinking lights they have. 
+
+```go
+func ScoreNode(_ *v1.Pod, n *v1.Node) (int, error) {
+	return getBlinkingLightCount(n)
+}
+```
+
+However, the maximum count of blinking lights may be small compared to
+`NodeScoreMax`. To fix this, `BlinkingLightScorer` should also register for this
+extension point.
+
+```go
+func NormalizeScores(scores map[string]int) {
+	highest := 0
+	for _, score := range scores {
+		highest = max(highest, score)
+	}
+	if highest == 0 {
+		// All nodes scored zero; leave the scores as-is to avoid
+		// dividing by zero below.
+		return
+	}
+	for node, score := range scores {
+		scores[node] = score * NodeScoreMax / highest
+	}
+}
+```
+
+If any normalize-scoring plugin returns an error, the scheduling cycle is
+aborted.
+
+**Note:** Plugins wishing to perform "pre-reserve" work should use the
+normalize-scoring extension point.

### Reserve

-At this point scheduler updates its cache by "reserving" a Node (partially or
-fully) for the Pod. In scheduler v1 this stage is called "assume".
-At this point, only the scheduler cache is updated to
-reflect that the Node is (partially) reserved for the Pod. The scheduling
-framework calls plugins registered at this extension points so that they get a
-chance to perform cache updates or other accounting activities. These plugins
-do not return any value (except errors).
+This is an informational extension point. Plugins which maintain runtime state
+(aka "stateful plugins") should use this extension point to be notified by the
+scheduler when resources on a node are being reserved for a given Pod. This
+happens before the scheduler actually binds the pod to the Node, and it exists
+to prevent race conditions while the scheduler waits for the bind to succeed.
+
+Once a pod is in the reserved state, it will either trigger
+[Un-reserve](#un-reserve) plugins (on failure) or [Post-bind](#post-bind)
+plugins (on success).

-The actual assignment of the Node to the Pod happens during the "Bind" phase.
-That is when the API server updates the Pod object with the Node information.
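To illustrate the reserve accounting a stateful plugin might keep, here is a minimal, self-contained sketch. The types and method names (`pod`, `Reserve`, `Unreserve`) are stand-ins for illustration only; the real hooks would receive `*v1.Pod` and framework-provided state, not these toy structs.

```go
package main

import (
	"fmt"
	"sync"
)

// pod is a toy stand-in for *v1.Pod with only the fields this sketch needs.
type pod struct {
	name     string
	milliCPU int64
}

// reservePlugin mirrors the scheduler's reserve notifications in its own
// accounting. A mutex guards the map because scheduling and binding work
// can overlap in time.
type reservePlugin struct {
	mu       sync.Mutex
	reserved map[string]int64 // node name -> milliCPU held for in-flight pods
}

func newReservePlugin() *reservePlugin {
	return &reservePlugin{reserved: make(map[string]int64)}
}

// Reserve records that a node has been tentatively assigned to the pod.
func (r *reservePlugin) Reserve(p *pod, node string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.reserved[node] += p.milliCPU
	return nil
}

// Unreserve undoes Reserve when the pod is later rejected.
func (r *reservePlugin) Unreserve(p *pod, node string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.reserved[node] -= p.milliCPU
}

func main() {
	r := newReservePlugin()
	p := &pod{name: "web-0", milliCPU: 500}
	_ = r.Reserve(p, "node-a")
	fmt.Println(r.reserved["node-a"]) // 500
	r.Unreserve(p, "node-a")
	fmt.Println(r.reserved["node-a"]) // 0
}
```

The point of the sketch is the pairing: whatever state Reserve adds, a later rejection must remove, which is why plugins registering here usually also handle the failure path.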
+*Note: This concept used to be referred to as "assume".* ### Permit -Permit plugins run in a separate go routine (in parallel). Each plugin can return -one of the three possible values: 1) "permit", 2) "deny", or 3) "wait". If all -plugins registered at this extension point return "permit", the pod is sent to -the next step for binding. If any of the plugins returns "deny", the pod is -rejected and sent back to the scheduling queue. If any of the plugins returns -"wait", the Pod is kept in reserved state until it is explicitly approved for -binding. A plugin that returns "wait" must return a "timeout" as well. If the -timeout expires, the pod is rejected and goes back to the scheduling queue. +These plugins are used to prevent or delay the binding of a Pod. A permit plugin +can do one of three things. -#### Approving a Pod binding +1. **approve** \ + Once all permit plugins approve a pod, it is sent for binding. -While any plugin can receive the list of reserved Pod from the cache and approve -them, we expect only the "Permit" plugins to approve binding of reserved Pods -that are in "waiting" state. Once a Pod is approved, it is sent to the Bind -stage. +1. **deny** \ + If any permit plugin denies a pod, it is returned to the scheduling queue. + This will trigger [Un-reserve](#un-reserve) plugins. -### Reject +1. **wait** (with a timeout) \ + If a permit plugin returns "wait", then the pod is kept in the permit phase + until a [plugin approves it](#pluginhandle). If a timeout occurs, **wait** + becomes **deny** and the pod is returned to the scheduling queue, triggering + [un-reserve](#un-reserve) plugins. -Plugins called at "Permit" may perform some operations that should be undone if -the Pod reservation fails. The "Reject" extension point allows such clean-up -operations to happen. Plugins registered at this point are called if the -reservation of the Pod is cancelled. 
The reservation is cancelled if any of the
-"Permit" plugins returns "reject" or if a Pod reservation, which is in "wait"
-state, times out.
+**Approving a pod binding**

-### Pre-Bind

+While any plugin can receive the list of reserved pods from the cache and
+approve them (see [`PluginHandle`](#pluginhandle)), we expect only the permit
+plugins to approve binding of reserved Pods that are in "waiting" state. Once a
+pod is approved, it is sent to the pre-bind phase.

-When a Pod is approved for binding it reaches to this stage. These plugins run
-before the actual binding of the Pod to a Node happens. The binding starts only
-if all of these plugins return true. If any returns false, the Pod is rejected
-and sent back to the scheduling queue. These plugins run in a separate go
-routine. The same go routine runs "Bind" after these plugins when all of them
-return true.

+### Pre-bind
+
+These plugins are used to perform any work required before a pod is bound. For
+example, a pre-bind plugin may provision a network volume and mount it on the
+target node before allowing the pod to run there.
+
+If any pre-bind plugin returns an error, the pod is [rejected](#un-reserve) and
+returned to the scheduling queue.

### Bind

-Once all pre-bind plugins return true, the Bind plugins are executed. Multiple
-plugins may be registered at this extension point. Each plugin may return true
-or false (or an error). If a plugin returns false, the next plugin will be
- -### Informer Events - -The scheduling framework, similar to Scheduler v1, will have informers that let -the framework keep its copy of the state of the cluster up-to-date. The -informers generate events, such as "PodAdd", "PodUpdate", "PodDelete", etc. The -framework allows plugins to register their own handlers for any of these events. -The handlers allow plugins with internal state or caches to keep their state -updated. - -# USE-CASES - -In this section we provide a couple of examples on how the scheduling framework -can be used to solve common scheduling scenarios. - -### Dynamic binding of cluster-level resources - -Cluster level resources are resources which are not immediately available on -nodes at the time of scheduling Pods. Scheduler needs to ensure that such -cluster level resources are bound to a chosen Node before it can schedule a Pod -that requires such resources to the Node. We refer to this type of binding of -resources to Nodes at the time of scheduling Pods as dynamic resource binding. -Dynamic resource binding has proven to be a challenge in Scheduler v1, because -Scheduler v1 is not flexible enough to support various types of plugins at -different phases of scheduling. As a result, binding of storage volumes is -integrated in the scheduler code and some non-trivial changes are done to the -scheduler extender to support dynamic binding of network GPUs. -The scheduling framework allows such dynamic bindings in a cleaner way. The main -thread of scheduling framework process a pending Pod that requests a network -resource and finds a node for the Pod and reserves the Pod. A dynamic resource -binder plugin installed at "Pre-Bind" stage is invoked (in a separate thread). -It analyzes the Pod and when detects that the Pod needs dynamic binding of the -resource, the plugin tries to attach the cluster resource to the chosen node and -then returns true so that the Pod can be bound. 
If the resource attachment -fails, it returns false and the Pod will be retried. -When there are multiple of such network resources, each one of them installs one -"pre-bind" plugin. Each plugin looks at the Pod and if the Pod is not requesting -the resource that they are interested in, they simply return "true" for the -pod. - -### Gang Scheduling - -Gang scheduling allows a certain number of Pods to be scheduled simultaneously. -If all the members of the gang cannot be scheduled at the same time, none of -them should be scheduled. Gang scheduling may have various other features as -well, but in this context we are interested in simultaneous scheduling of Pods. -Gang scheduling in the scheduling framework can be done with an "Permit" plugin. -The main scheduling thread processes pods one by one and reserves nodes for -them. The gang scheduling plugin at the Permit stage is invoked for each pod. -When it finds that the pod belongs to a gang, it checks the properties of the -gang. If there are not enough members of the gang which are scheduled or in -"wait" state, the plugin returns "wait". When the number reaches the desired -value, all the Pods in wait state are approved and sent for binding. - -# OUT OF PROCESS PLUGINS - -Out of process plugins (OOPP) are called via JSON over an HTTP interface. In -other words, the scheduler will support webhooks at most (maybe all) of the -extension points. Data sent to an OOPP must be marshalled to JSON and data -received must be unmarshalled. So, calling an OOPP is significantly slower than -in-process plugins. -We do not plan to build OOPPs in the first version of the scheduling framework. -So, more details on them is to be determined. - - -# DEVELOPMENT PLAN - -Earlier, we wanted to develop the scheduling framework as an independent project -from scheduler V1. However, that would need much engineering resources. 
-It would also be more difficult to roll out a new and not fully-backward -compatible scheduler in Kubernetes where tens of thousands of users depend on -the behavior of the scheduler. -After revisiting the ideas and challenges, we changed our plan and have decided -to build some of the ideas of the scheduling framework into Scheduler V1 to make -it more extendable. - -As the first step, we would like to build: - 1. [Pre-bind](#pre-bind) and [Reserve](#reserve) plugin points. These will - help us move our existing cluster resource binding code, such as persistent - volume binding, to plugins. - 1. We will also build - [the plugin communication mechanism](#communication-and-statefulness-of-plugins). - This will allow us to build more sophisticated plugins that would require - communication and also help us clean up existing scheduler's code by removing - existing transient cache data. - -More features of the framework can be added to the Scheduler in the future based -on the requirements. - - -# CONFIGURING THE SCHEDULING FRAMEWORK - -TBD - -# BACKWARD COMPATIBILITY WITH SCHEDULER v1 - -We will build a new set of plugins for scheduler v2 to ensure that the existing -behavior of scheduler v1 in placing Pods on nodes is preserved. This includes -building plugins that replicate default predicate and priority functions of -scheduler v1 and its binding mechanism, but scheduler extenders built for -scheduler v1 won't be compatible with scheduler v2. Also, predicate and priority -functions which are not enabled by default (such as service affinity) are not -guaranteed to exist in scheduler v2. - -# DEVELOPMENT PLAN - -We will develop the scheduling framework as an incubator project in SIG -scheduling. It will be built in a separate code-base independently from -scheduler v1, but we will probably use a lot of code from scheduler v1. - -# TESTING PLAN - -We will add unit-tests as we build functionalities of the scheduling framework. 
-The scheduling framework should eventually be able to pass integration and e2e
-tests of scheduler v1, excluding those tests that involve scheduler extensions.
-The e2e and integration tests may need to be modified slightly as the
-initialization and configuration of the scheduling framework will be different
-than scheduler v1.
-
-# WORK ESTIMATES
-
-We expect to see an early version of the scheduling framework in two release
-cycles (end of 2018). If things go well, we will start offering it as an
-alternative to the scheduler v1 by the end of Q1 2019 and start the deprecation
-of scheduler v1. We will make it the default scheduler of Kubernetes in Q2 2019,
-but we will keep the option of using scheduler v1 for at least two more release
-cycles.
-
+These plugins are used to bind a pod to a Node. Bind plugins will not be called
+until all pre-bind plugins have completed. Each bind plugin is called in the
+configured order. A bind plugin may choose whether or not to handle the given
+Pod. If a bind plugin chooses to handle a Pod, **the remaining bind plugins are
+skipped**.
+
+### Post-bind
+
+This is an informational extension point. Post-bind plugins are called after a
+pod is successfully bound. This is the end of a scheduling cycle, and can be
+used to clean up associated resources.
+
+### Un-reserve
+
+This is an informational extension point. If a pod was reserved and then
+rejected in a later phase, un-reserve plugins will be notified. Un-reserve
+plugins should clean up state associated with the reserved Pod.
+
+Plugins that use this extension point should usually also use
+[Reserve](#reserve).
+
+## Plugin API
+
+There are two steps to the plugin API. First, plugins must register and get
+configured; then they use the extension point interfaces. Extension point
+interfaces have the following form.
+ +```go +type Plugin interface { + Name() string +} + +type QueueSortPlugin interface { + Plugin + Less(*v1.Pod, *v1.Pod) bool +} + +type PreFilterPlugin interface { + Plugin + PreFilter(PluginContext, *v1.Pod) error +} + +// ... +``` + +### PluginContext + +Most* plugin functions will be called with a `PluginContext` argument. A +`PluginContext` represents the current scheduling cycle. + +A `PluginContext` provides read-only APIs for accessing the scheduler's cache of +cluster state. This is the preferred way for plugins to iterate over nodes, +iterate over pods on one node, check available resources, and other tasks. The +scheduler will provide a consistent view of the cluster through these APIs, even +if the data is a little stale. Since two scheduling cycles can overlap in time, +plugins should not assume that they will see the same data from two different +`PluginContext`s. + +The `PluginContext` also provides an API similar to +[`context.WithValue`](https://godoc.org/context#WithValue) that can be used to +pass data between plugins at different extension points. Multiple plugins can +share the state or communicate via this mechanism. The state is preserved only +during a single scheduling cycle. It is worth noting that plugins are assumed to +be **trusted**. The scheduler does not prevent one plugin from accessing or +modifying another plugin's state. + +\* *The only exception is for [queue sort](#queue-sort) plugins.* + +**WARNING**: The data available through a `PluginContext` is not valid after a +scheduling cycle ends, and plugins should not hold references to that data +longer than necessary. + +### PluginHandle + +While the `PluginContext` provides APIs relevant to a single scheduling cycle, +the `PluginHandle` provides APIs relevant to the lifetime of a plugin. +Specifically, `PluginHandle` provides a client (`kubernetes.Interface`) and +`SharedInformerFactory`. The handle will also provide APIs to list and approve +or reject [waiting pods](#permit). 
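The `PluginContext` data-sharing mechanism described above can be sketched with a self-contained toy. Everything here (the `Write`/`Read` method names, the key type, and the example key) is an illustrative assumption, not the final scheduler API:

```go
package main

import (
	"fmt"
	"sync"
)

// contextKey is a toy stand-in for whatever key type the framework settles
// on; a named type helps avoid collisions between cooperating plugins.
type contextKey string

// PluginContext sketches the WithValue-style store shared by plugins during
// a single scheduling cycle. All names here are assumptions.
type PluginContext struct {
	mu   sync.RWMutex
	data map[contextKey]interface{}
}

func NewPluginContext() *PluginContext {
	return &PluginContext{data: make(map[contextKey]interface{})}
}

// Write stores a value for plugins at later extension points in this cycle.
func (c *PluginContext) Write(k contextKey, v interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[k] = v
}

// Read retrieves a value written earlier in the same cycle.
func (c *PluginContext) Read(k contextKey) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[k]
	return v, ok
}

const nodeCountKey contextKey = "example.com/filtered-node-count"

func main() {
	ctx := NewPluginContext()

	// A plugin at an early extension point (e.g. pre-filter) records state...
	ctx.Write(nodeCountKey, 12)

	// ...and a cooperating plugin at a later point (e.g. pre-bind) reads it.
	if v, ok := ctx.Read(nodeCountKey); ok {
		fmt.Println("nodes surviving filter:", v.(int))
	}
}
```

Because two scheduling cycles can overlap, a real plugin would receive a fresh `PluginContext` each cycle and, per the warning earlier in this section, must not hold references to its data after the cycle ends.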
+
+**WARNING**: `PluginHandle` provides access to both the Kubernetes API server
+and the scheduler's internal cache. The two are **not guaranteed to be in sync**
+and extreme care should be taken when writing a plugin that uses data from both
+of them.
+
+Providing plugins access to the API server is necessary to implement useful
+features, especially when those features consume object types that the scheduler
+does not normally consider. Providing a `SharedInformerFactory` allows plugins
+to share caches safely.
+
+### Plugin Registration
+
+Each plugin must define a constructor and add it to the hard-coded registry. For
+more information about constructor args, see [Optional Args](#optional-args).
+
+Example:
+
+```go
+type PluginFactory = func(json.RawMessage, PluginHandle) (Plugin, error)
+
+type Registry map[string]PluginFactory
+
+func NewRegistry() Registry {
+	return Registry{
+		fooplugin.Name: fooplugin.New,
+		barplugin.Name: barplugin.New,
+		// New plugins are registered here.
+	}
+}
+```
+
+It is also possible to add plugins to a `Registry` object and inject that into a
+scheduler. See [Custom Scheduler Plugins](#custom-scheduler-plugins-out-of-tree).
+
+## Plugin Lifecycle
+
+### Initialization
+
+There are two steps to plugin initialization. First,
+[plugins are registered](#plugin-registration). Second, the scheduler uses its
+configuration to decide which plugins to instantiate. If a plugin registers for
+multiple extension points, *it is instantiated only once*.
+
+When a plugin is instantiated, it is passed [config args](#optional-args) and a
+[`PluginHandle`](#pluginhandle).
+
+### Concurrency
+
+There are two types of concurrency that plugin writers should consider. A plugin
+might be invoked several times concurrently when evaluating multiple nodes, and
+a plugin may be called concurrently from *different
+[scheduling cycles](#scheduling-cycle)*.
+
+In the main thread of the scheduler, only one scheduling cycle is processed at a
+time. Any extension point up to and including [reserve](#reserve) will be
+finished before the next scheduling cycle begins*. After the reserve phase, the
+[permit](#permit) and [bind](#bind) phases are executed asynchronously. This
+means that a plugin could be called concurrently from two different scheduling
+cycles, provided that at least one of the calls is to an extension point after
+reserve. Stateful plugins should take care to handle these situations.
+
+Finally, [un-reserve](#un-reserve) plugins may be called from either the Permit
+thread or the Bind thread, depending on how the pod was rejected.
+
+\* *The queue sort extension point is a special case. It is not part of a
+scheduling cycle and may be called concurrently for many pod pairs.*
+
+![image](20180409-scheduling-framework-threads.png)
+
+## Configuring Plugins
+
+The scheduler's component configuration will allow for plugins to be enabled,
+disabled, or otherwise configured. Plugin configuration is separated into two
+parts.
+
+1. A list of enabled plugins for each extension point (and the order they
+   should run in). If one of these lists is omitted, the default list will be
+   used.
+1. An optional set of custom plugin arguments for each plugin. Omitting config
+   args for a plugin is equivalent to using the default config for that plugin.
+
+The plugin configuration is organized by extension points. A plugin that
+registers for multiple extension points must be included in each list.
+
+```go
+type KubeSchedulerConfiguration struct {
+	// ... other fields
+	Plugins      Plugins
+	PluginConfig []PluginConfig
+}
+
+type Plugins struct {
+	QueueSort      []Plugin
+	PreFilter      []Plugin
+	Filter         []Plugin
+	PostFilter     []Plugin
+	Score          []Plugin
+	NormalizeScore []Plugin
+	Reserve        []Plugin
+	Permit         []Plugin
+	PreBind        []Plugin
+	Bind           []Plugin
+	PostBind       []Plugin
+	UnReserve      []Plugin
+}
+
+type Plugin struct {
+	Name   string
+	Weight int // Only valid for Score plugins
+}
+
+type PluginConfig struct {
+	Name string
+	Args json.RawMessage
+}
+```
+
+Example:
+
+```json
+{
+  "plugins": {
+    "preFilter": [
+      {
+        "name": "PluginA"
+      },
+      {
+        "name": "PluginB"
+      },
+      {
+        "name": "PluginC"
+      }
+    ],
+    "score": [
+      {
+        "name": "PluginA",
+        "weight": 30
+      },
+      {
+        "name": "PluginX"
+      },
+      {
+        "name": "PluginY"
+      }
+    ]
+  },
+  "pluginConfig": [
+    {
+      "name": "PluginX",
+      "args": {
+        "favorite_color": "#326CE5",
+        "favorite_number": 7,
+        "thanks_to": "thockin"
+      }
+    }
+  ]
+}
+```
+
+### Enable/Disable
+
+When specified, the plugins listed for a particular extension point are the
+only ones enabled. If an extension point is omitted from the config, then the
+default set of plugins is used for that extension point.
+
+### Change Evaluation Order
+
+When relevant, plugin evaluation order is specified by the order the plugins
+appear in the configuration. A plugin that registers for multiple extension
+points can have different ordering at each extension point.
+
+### Optional Args
+
+Plugins may receive arguments from their config with arbitrary structure.
+Because one plugin may appear in multiple extension points, the config is in a
+separate list of `PluginConfig`.
+
+For example,
+
+```json
+{
+  "name": "ServiceAffinity",
+  "args": {
+    "LabelName": "app",
+    "LabelValue": "mysql"
+  }
+}
+```
+
+```go
+func NewServiceAffinity(args json.RawMessage, h PluginHandle) (Plugin, error) {
+	var config struct {
+		LabelName, LabelValue string
+	}
+	if err := json.Unmarshal(args, &config); err != nil {
+		return nil, errors.Wrap(err, "could not parse args")
+	}
+	//...
+}
+```
+
+### Backward compatibility
+
+The current `KubeSchedulerConfiguration` kind has `apiVersion:
+kubescheduler.config.k8s.io/v1alpha1`. This new config format will be either
+`v1alpha2` or `v1beta1`. When a newer version of the scheduler parses a
+`v1alpha1` config, the "policy" section will be used to construct an equivalent
+plugin configuration.
+
+*Note: Moving `KubeSchedulerConfiguration` to `v1` is outside the scope of this
+design, but see also
+https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/0032-create-a-k8s-io-component-repo.md
+and https://github.com/kubernetes/community/pull/3008*
+
+## Interactions with Cluster Autoscaler
+
+TODO
+
+# USE CASES
+
+These are just a few examples of how the scheduling framework can be used.
+
+## Coscheduling
+
+Functionality similar to
+[kube-batch](https://github.com/kubernetes-sigs/kube-batch) (sometimes called
+"gang scheduling") could be implemented as a plugin. For pods in a batch, the
+plugin would "accumulate" pods in the [permit](#permit) phase by using the
+"wait" option. Because the permit stage happens after [reserve](#reserve),
+subsequent pods will be scheduled as if the waiting pod is using those
+resources. Once enough pods from the batch are waiting, they can all be
+approved.
+
+## Dynamic Resource Binding
+
+[Topology-Aware Volume Provisioning](https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/)
+can be (re)implemented as a plugin that registers for [filter](#filter) and
+[pre-bind](#pre-bind) extension points. At the filtering phase, the plugin can
+ensure that the pod will be scheduled in a zone which is capable of provisioning
+the desired volume. Then at the pre-bind phase, the plugin can provision the
+volume before letting the scheduler bind the pod.
+
+## Custom Scheduler Plugins (out of tree)
+
+The scheduling framework allows people to write custom, performant scheduler
+features without forking the scheduler's code. To accomplish this, developers
+write their own `main()` wrapper around the scheduler. Because plugins must be
+compiled with the scheduler, this wrapper is needed to avoid modifying code in
+`vendor/k8s.io/kubernetes`.
+
+```go
+import (
+	"k8s.io/kubernetes/pkg/scheduler/plugins"
+	scheduler "k8s.io/kubernetes/cmd/kube-scheduler/app"
+)
+
+func main() {
+	registry := plugins.NewRegistry()
+	registry.Add("MyPlugin", NewMyPlugin)
+	scheduler.Main(registry)
+}
+```
+
+*Note: The above code is an example, and might not match the implemented API.*
+
+The custom plugin would be enabled in the scheduler config.
+
+```json
+{
+  "name": "MyPlugin"
+}
+```
+
+# GRADUATION CRITERIA
+
+TODO
+
+# IMPLEMENTATION HISTORY
+
+TODO: write down milestones and target releases, and a plan for how we will
+gracefully move to the new system