Scheduler extension Proposal #11470
Comments
You gave the example of "put a pod where its storage is". Couldn't this be implemented via a node selector on the pod?
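For concreteness, a minimal sketch of that node-selector approach, assuming a hypothetical node label (`storage-volume=vol-123`) published by whatever component knows where the volume lives; the `k8s.io/api` import paths are today's, used purely for illustration:

```go
// A pod that can only land on nodes carrying a hypothetical label published
// by the storage system, e.g. nodes labeled storage-volume=vol-123.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "db"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{Name: "db", Image: "postgres:13"}},
			// The default scheduler only considers nodes whose labels match.
			NodeSelector: map[string]string{"storage-volume": "vol-123"},
		},
	}
	fmt.Println(pod.Spec.NodeSelector)
}
```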
Speaking of labels, we'll need a way for Kubelets to export labels with these attributes/resources provided by a configuration file and/or plugin -- supporting both file (which the kubelet would monitor for changes) and HTTP would likely be adequate. With respect to co-location and other commonly desired features, attempting to arrive at a general solution would be more desirable than hiding the behavior in an extension. As for actual extension of the scheduler, there are a couple of possible approaches:
The current cloudprovider API isn't a good model -- see #2770 and #10503. I do want to support multiple schedulers, including user-provided schedulers.

The main thing we need is a way to not apply the default scheduling behavior, perhaps by namespace, using some sort of "admission control" plugin. I want that to be a fairly generic mechanism, since I expect to need it at least for horizontal auto-scaling and vertical auto-sizing as well, if not other things in the future. This is related to the "initializers" topic (#3585).

A custom scheduler would then need to be configured for which pods to schedule. This probably requires adding information of some form to the pods. We should take that into account when thinking about the general initializer mechanism.

The custom scheduler should then be able to watch all pods and nodes to keep its state up to date, but I imagine we could add more information, events, or something to make this more convenient, as requested in #1517.
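A minimal sketch of that custom-scheduler direction, written against today's client-go purely for illustration (the thread predates it); the "my-scheduler" name, the use of `spec.schedulerName` as the per-pod marker, and the placement logic are all assumptions:

```go
// Minimal sketch of a standalone scheduler: list pods that have no node yet,
// pick a node for each, and post a Binding. Informers, error handling, and
// real predicates/priorities are omitted.
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Unscheduled pods; a real scheduler would watch rather than poll.
	pods, _ := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=",
	})
	nodes, _ := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})

	for _, pod := range pods.Items {
		// The "information on the pod" mentioned above: only handle pods that
		// opted into this (hypothetical) scheduler by name.
		if pod.Spec.SchedulerName != "my-scheduler" {
			continue
		}
		node := pickNode(pod, nodes.Items) // placeholder placement logic
		if node == "" {
			continue
		}
		binding := &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
			Target:     v1.ObjectReference{Kind: "Node", Name: node},
		}
		_ = client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
	}
}

// pickNode stands in for real predicates and priority functions.
func pickNode(pod v1.Pod, nodes []v1.Node) string {
	if len(nodes) == 0 {
		return ""
	}
	return nodes[0].Name
}
```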
Also, I think the scheduler is a candidate for moving to its own GitHub project, once we figure out how to manage testing, releases, etc.
In this case, the pod user doesn't need to know where the volume is actually deployed. Also, it may not be a predicate, but a priority function that prefers to deploy the pod on the host where the volume physically resides. There are a few more examples I can think of.

@bgrant0607 For our use case, we have QoS, network, and storage requirements that affect scheduling. A different vendor may have different requirements. If there is a way to generically extend the scheduler, it will help everyone. We would also like to be able to use the stock Kubernetes distribution (through one of the many OS vendors who support it) rather than ship our own scheduler. Hence the push for extensibility. Worth discussing tomorrow, at a face-to-face meeting, or at LinuxCon?
Aspects of this topic were discussed earlier in #9920.
How about the August 7 community hangout?
@bgrant0607
I agree with @bgrant0607 that writing your own scheduler is probably a better direction. We should try to make the default Kubernetes scheduler factored in a way that people who want to write their own schedulers can import/reuse the core parts as a library; otherwise there will be a lot of code duplication across all the schedulers people write. @lavalamp's controller framework and modeler are steps towards that, in a sense. But it would be great to hear more about your thoughts.
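One hypothetical shape such a library factoring could take; the names and signatures below are illustrative only and are not an existing Kubernetes API:

```go
// Hypothetical "scheduler core as a library" factoring: predicates and
// priority functions exposed as plain Go types that a custom scheduler
// binary composes with its own. Illustrative only; not an existing API.
package schedlib

import v1 "k8s.io/api/core/v1"

// Predicate reports whether a pod fits on a node.
type Predicate func(pod *v1.Pod, node *v1.Node) (bool, error)

// Priority scores candidate nodes for a pod (node name -> score).
type Priority func(pod *v1.Pod, nodes []*v1.Node) (map[string]int, error)

// Config is what a custom scheduler binary would assemble, mixing stock
// predicates/priorities with vendor-specific ones.
type Config struct {
	Predicates []Predicate
	Priorities []Priority
}
```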
Thanks for sharing your thoughts. Technically, either approach works. As I mentioned earlier in the thread, it's more for business reasons that we would like to use the stock Kubernetes distribution from one of the OS vendors. They sell the support for the distro; if we ship our own scheduler, it violates that support model. I will post some prototype changes to this thread in the next couple of days. The scope of change to the scheduler is minimal. Hope that alleviates some of the concerns. Will discuss the rest, hopefully on a call/in person.
An extensible scheduler has value beyond a third-party support model. Being able to plug more flexible placement information into an existing Kubernetes deployment, where the default scheduler is otherwise doing a fine job, saves rework and porting/updating every time the default scheduler (or the plugin) changes. Having many different people release their own schedulers (even with libraries) seems to defeat the point of a common project that takes the best of what everyone has to offer, and instead makes people choose all-or-nothing with a monolithic scheduler.

There may be valid concerns over destabilizing the scheduler with non-deterministic responses. We can overcome these as other projects have done, with well-defined interfaces. The plugin is responsible for adhering to the interface and any response criteria that are established. Are there any specific concerns with the concept of an extensible scheduler, or is this more a matter of priority?
Sample code here: ravigadde/kube-scheduler@23ad25b. Please let me know your thoughts.
Sorry, @ravigadde. Still digging out of the pre-1.0 backlog. I have a few quick comments:
Thanks for your comments.
@ravigadde Please correct me if I'm wrong -- IIUC, your PR allows the scheduler to outsource "Prioritize" and "Filter" operations (i.e., priority functions and predicates) to an external process which it contacts via HTTP. (This other process also has endpoints for "bind" and "unbind" so it can be informed of changes in cluster state -- BTW, I'm not sure this is sufficient, since it also needs to know each machine's capacity, labels, etc.) Is that correct?

If that's correct, I'm not understanding why this is considered to be using the "stock Kubernetes distribution" whereas writing your own scheduler is not. The only difference I see is the direction of the communication -- in your model, the Kubernetes scheduler calls out to your scheduler, while in the model @bgrant0607 and I were discussing above (see also #11793), your scheduler calls into Kubernetes to watch state and post bindings.

That said, I guess there isn't any reason why we couldn't have the scheduler call out to another process (modulo performance concerns). But I agree with @bgrant0607 that there's no reason to use the cloud provider interface -- you could probably just configure the identity of the remote endpoint using the scheduler config file (see plugin/pkg/scheduler/api/).
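A rough sketch of what such an external process could look like on the wire: a small HTTP server exposing a filter endpoint for the scheduler to call. The request/response shapes here are assumptions for illustration, not the exact types from the PR:

```go
// Sketch of an external "extender" process: an HTTP server with a /filter
// endpoint the scheduler could call during scheduling. Shapes are illustrative.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type filterArgs struct {
	Pod       map[string]interface{} `json:"pod"`
	NodeNames []string               `json:"nodenames"`
}

type filterResult struct {
	NodeNames []string          `json:"nodenames"`             // nodes that still fit
	Failed    map[string]string `json:"failedNodes,omitempty"` // node -> reason
}

func filter(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Placeholder policy: accept every candidate node. A real extender would
	// consult its own view of storage/network/QoS state here.
	res := filterResult{NodeNames: args.NodeNames, Failed: map[string]string{}}
	json.NewEncoder(w).Encode(res)
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

The scheduler side would presumably learn this endpoint's address from its config file, as suggested above, and intersect the returned node list with the output of its built-in predicates.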
@davidopp Yes, that is correct. The other process that is handling these calls is aware of node resources/labels; it watches the apiserver for any changes. There are subtle differences between the two models (mostly non-technical); the proposal aims to address the above issues.

I originally intended for the apiserver to invoke Unbind on pod deletion so the resources associated with the pod can be cleaned up, hence the need for an interface that can be shared by both. But this could also be achieved by watching the apiserver for pod deletion. Will go with your suggestion of using the scheduler config file and create a PR. Please let me know if anything is not clear; we can discuss in the call tomorrow.
@ravigadde I'd be happy to discuss with you offline. Send me an email at (my github username) @ google.com and we can arrange a time.
I don't see how this proposal addresses your expressed concerns: a) your custom scheduling logic would be in a non-maintained/supported binary.

As for this specific API: we need to be able to run the "fit" check in several places. In addition to the scheduler, we already also run it in the Kubelet, and we have discussed also running it in the apiserver, upon calls to /binding. We're not going to call out to another endpoint in all those places. Also, while we need to support the current scheduler configuration for some time, we anticipate creating a new approach to prioritization, and would like to avoid creating hard-to-remove dependencies on the current approach.

Notification of binds and unbinds is insufficient to communicate the state of the cluster. How would the scheduler extension get the initial state? How would it update its state after an outage of the apiserver? What should be done if the scheduler were unable to contact your extension? The logic needs to be "level-based": https://github.com/kubernetes/kubernetes/blob/master/docs/design/principles.md#control-logic

Is there something we could do to make it easier to fork and extend the scheduler, such as refactoring more of it into reusable libraries/frameworks? We could also investigate how to make it easier to keep a scheduler's state up to date using get/watch: #1517. That will likely happen as a consequence of increasing the amount of caching done in the scheduler.
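A sketch of the level-based list/watch pattern referred to here, shown with today's client-go informers purely for illustration (they postdate this thread): the full state is rebuilt by a List and kept current by a Watch with periodic resync, so an extension's view converges even after an apiserver outage or missed events.

```go
// Level-based state tracking: informers do list+watch with periodic resync,
// so local state converges to actual cluster state rather than depending on
// having observed every individual bind/unbind notification.
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	client := kubernetes.NewForConfigOrDie(cfg)

	// Rebuilds the full node view on start and after watch failures.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { _ = obj.(*v1.Node) /* add to local state */ },
		UpdateFunc: func(_, obj interface{}) { _ = obj.(*v1.Node) /* update local state */ },
		DeleteFunc: func(obj interface{}) { /* remove from local state */ },
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, nodeInformer.HasSynced)
	<-stop
}
```

This is also how an extension could answer the initial-state and outage questions above: by re-listing rather than relying on having seen every bind/unbind.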
Meeting summary:
In the proposal, or our API? |
@smarterclayton - In the proposal. |
imho this whole proposal sounds like what you really want is #17197 |
No it isn't. I will explain in the other thread. |
The Kubernetes scheduler schedules based on resources managed by Kubernetes. Scheduling based on opaque resource counting helps extend this further. But when there is a need for contextual scheduling for resources managed outside of Kubernetes (for example, placing a pod where its storage is), there is no mechanism to do it today.

The proposal is to make the Kubernetes scheduler extensible by adding the capability to make HTTP calls out to another endpoint to help achieve this functionality. I am curious whether you think the cloud provider abstraction is the right abstraction for the implementation.

Here is a rough draft of what I am thinking about. I would like to solicit community feedback.
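One hypothetical reading of the extension point being proposed, sketched as a Go interface the scheduler might delegate to over HTTP; the method names and types are illustrative only and are not taken from the draft:

```go
// Hypothetical extension point: the scheduler forwards filtering,
// prioritization, and bind/unbind notifications to an external endpoint.
package scheduler

import v1 "k8s.io/api/core/v1"

// HostPriority associates a node name with an integer score.
type HostPriority struct {
	Host  string
	Score int
}

// Extender would be implemented by a client that forwards each call to the
// configured HTTP endpoint (e.g. POST /filter, /prioritize, /bind, /unbind).
type Extender interface {
	Filter(pod *v1.Pod, nodes []*v1.Node) (feasible []*v1.Node, err error)
	Prioritize(pod *v1.Pod, nodes []*v1.Node) ([]HostPriority, error)
	Bind(pod *v1.Pod, node string) error
	Unbind(pod *v1.Pod) error
}
```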