Jupyter - Culling of Idle Pods #1394
It looks like this information is also available from the Jupyter servers themselves. I think it's `last_activity`. I mention this because we are considering maybe getting rid of JupyterHub: #1630
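For reference, a culler could read that `last_activity` field from the Jupyter server's `GET /api/status` response. A minimal sketch of the idleness check; the function name and default timeout are illustrative, not part of any existing controller:

```python
from datetime import datetime, timedelta, timezone

def is_idle(status_json, now, timeout_minutes=60):
    # Jupyter's GET /api/status returns an ISO-8601 `last_activity`
    # timestamp such as "2020-01-01T10:00:00.000000Z".
    last = datetime.strptime(status_json["last_activity"], "%Y-%m-%dT%H:%M:%S.%fZ")
    last = last.replace(tzinfo=timezone.utc)
    return now - last > timedelta(minutes=timeout_minutes)
```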
A bit more info on notebook self-culling vs cull-idle-server: #56 (comment). The former is also useful in non-JupyterHub scenarios, if we ever get there. Note this needs the user pod's restartPolicy set to OnFailure, which was broken in kubespawner until jupyterhub/kubespawner#251 was merged.
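For context, notebook self-culling is driven by traitlets in `jupyter_notebook_config.py`. This is a config fragment, not standalone code (`c` is provided by Jupyter's config loader), and the values are illustrative; the exact option names can vary between notebook versions:

```python
# jupyter_notebook_config.py -- self-culling knobs (values illustrative)
c.MappingKernelManager.cull_idle_timeout = 3600    # cull kernels idle for > 1h
c.MappingKernelManager.cull_interval = 300         # check every 5 minutes
c.MappingKernelManager.cull_connected = False      # also cull kernels with open clients
c.NotebookApp.shutdown_no_activity_timeout = 3600  # shut the whole server down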
Adding this to 0.6. It would be great to make this an option in the Jupyter notebook controller and support it in the UI. For example, it might be nice to terminate idle Jupyter pods without deleting the Jupyter resource, so that from the UI a user could restart it with one click.
cc
@jlewi we would like to help drive this forward.
We should try to get this into the controller in 0.7 in preparation for going 1.0. |
@jlewi yes, I was planning to work on this one for 0.7. The overall experience could be the following: the UI would support … One thing we need to clarify is what an idle Pod is. If we take for granted that Istio will be installed in the cluster, so we can use metrics from Prometheus, then an idle Pod could be a Pod that doesn't receive HTTP traffic for a given period. For this UX, the changes we would need to make would be:

Notebook Controller: …
Notebook Culler: it will be a sidecar container in the Notebook controller. It will periodically query Prometheus for the existing Notebook CRs and stop any idle Notebooks. To stop a Notebook, it will make a PATCH to the CR and set the number of Pods to zero.

Notebooks Manager UI: …
@jlewi WDYT?
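The culler described above boils down to a periodic "find idle Notebooks, then PATCH them to zero Pods" loop. A minimal sketch of the selection step, assuming per-Notebook last-request timestamps have already been pulled from Prometheus; the names and data shape are illustrative, not an actual controller API:

```python
from datetime import datetime, timedelta

def select_idle_notebooks(last_request, now, timeout):
    # `last_request` maps Notebook name -> time of the last HTTP request
    # observed for it (e.g. from Istio metrics in Prometheus). A Notebook
    # is idle when that request is older than `timeout`.
    return sorted(name for name, ts in last_request.items() if now - ts > timeout)
```

The culler would then PATCH each selected CR to scale it down, as described in the comment above.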
Why does the culling logic need to be in a sidecar? Why can't we just make this part of the regular notebook controller logic and invoke it in the reconcile function for each notebook?
I was thinking that we might need this or similar logic for the Viewer CRD as well. So instead of repeating the logic in the Viewer controller, we could have it in a central place. Then we could launch the culling container as a sidecar to the controllers and only change the culler's configuration for each case. Regarding the Viewer CR: right now the controller has a maximum number of allowed live instances. If a new Viewer is created and the total number exceeds this maximum, then the oldest one is deleted. Wouldn't we want similar culling logic here?
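The Viewer controller's cap-based eviction mentioned above can be sketched as a small pure function; the names are illustrative, not the controller's actual API:

```python
def viewers_to_evict(created_at, max_live):
    # `created_at` maps Viewer name -> creation time. Return the oldest
    # entries that must be deleted so at most `max_live` instances remain.
    if len(created_at) <= max_live:
        return []
    oldest_first = sorted(created_at, key=created_at.get)
    return oldest_first[: len(created_at) - max_live]
```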
@kimwnasptd If we need similar logic in other CRDs, can't we just make it a Go library and import that into the viewer controller?
@jlewi yes, that would also work. I don't have a hard preference as long as we make the code reusable. Regarding making this part of the reconciliation loop: we will be making an HTTP request to the Prometheus server for each Notebook, in … But if you think this shouldn't be much of a problem, then I can just make it a Go package and trigger the logic on the …
I think we should be fine with having this in the reconcile function. We can also set it to re-trigger the reconcile at specific time intervals. Opened a PR to move forward with the implementation.
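Folding the check into the reconcile function with a periodic re-trigger could look roughly like this. This is a Python sketch for illustration only; the real controller is written against controller-runtime in Go, where the equivalent is returning `Result{RequeueAfter: ...}`, and the types here are stand-ins:

```python
from datetime import timedelta

CULL_CHECK_PERIOD = timedelta(minutes=5)  # assumed value; would be configurable

def reconcile(notebook, is_idle):
    # If the Notebook is idle, mark it stopped (standing in for the PATCH
    # that scales it to zero Pods), then ask the framework to requeue so
    # the check repeats even without new watch events.
    if is_idle(notebook):
        notebook["stopped"] = True
    return {"requeue_after": CULL_CHECK_PERIOD}
```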
Per the discussion in #3856, I think we might try to use the IDLE signal that the Jupyter kernel provides.
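The kernel-level IDLE signal is visible through the Jupyter server's `GET /api/kernels` endpoint, where each kernel reports an `execution_state`. A sketch of how a culler might use it (the function name and the choice to treat a kernel-less server as idle are assumptions):

```python
def all_kernels_idle(kernels):
    # `kernels` is the JSON list from Jupyter's GET /api/kernels; each
    # entry carries an `execution_state` such as "idle" or "busy".
    # A server with no kernels at all also counts as idle here.
    return all(k.get("execution_state") == "idle" for k in kernels)
```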
What about automatic unidling? Both Red Hat (in OpenShift) and Deis have idling/unidling controllers that can scale a resource down to zero, then scale it back up when new requests come in, with a proxy forwarding any requests that arrived while no pods were up. See e.g. https://github.com/deislabs/osiris
@therc If people want to unidle, they would do that through the UI. @kimwnasptd Any update on this?
@kimwnasptd any update on this? Since notebooks are going beta in 0.7 and we don't want more API changes after that, specific notebook idle times will likely need to be set via annotation.
@kimwnasptd Could you provide an update on this? Specifically, what do you think the 1.0 behavior will be?
Sure! For 1.0 I was thinking to have 2-3 … By default JWA won't have any … Also, the UI will allow the user to start a culled Notebook or stop a running one. I want to ensure this functionality for 1.0, and if I have the time we can iterate and add more functionality for 1.0. @jlewi WDYT?
@kimwnasptd Why would we use PodDefaults? Why not just add a field in the UI that allows people to set the culling time?
@jlewi I was thinking that most of the time users will select the same culling times for their different Notebook classes. Most of the time they would want their GPU Notebooks to be up as little as possible, and could maybe have two or three different culling times for their normal Notebooks. Because of this, I thought it would make more sense to have a drop-down for users to select a predefined culling time rather than setting it directly. This drop-down would also need to be configurable by the admin in JWA's config. Both of these could be satisfied with PodDefaults, so I preferred it. Do you think this behavior should be more visible and have its own section?
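A hedged sketch of what such a predefined culling-time PodDefault might look like. The field layout follows the `kubeflow.org/v1alpha1` PodDefault CRD from memory, and the culling annotation key is purely hypothetical, not an annotation the notebook controller is known to read:

```yaml
# Sketch only: the annotation name below is hypothetical.
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: cull-after-1h
  namespace: my-namespace
spec:
  desc: "Stop this notebook after 1 hour of inactivity"
  selector:
    matchLabels:
      cull-after-1h: "true"
  annotations:
    example.kubeflow.org/cull-idle-time: "60"  # hypothetical, minutes
```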
@kimwnasptd Thanks for the explanation. Can we add docs and examples for PodDefaults that people can use to set up this behavior? Then we can close this issue out.
Is there any action on this? Is there any ability in Kubeflow to cull idle pods? We want the ability to shut down Jupyter notebooks after a certain period of idle time. |
It looks like this may have been added in #3856. But I am not sure we should be tightly integrating the status check with Jupyter itself (so that we can support other notebook types).
Seeing issues with the changes merged from #3856: the activity check at `../api/status` fails with a 403 error, `RBAC: access denied`, and I'm seeing log entries in the notebook controller after enabling the …
Got the same issue. |
@kimwnasptd thoughts?
I got around this with a simple hack in this case: setting the `kubeflow-userid` header to a value in the culler allowed the request to complete.
Could you provide more details on your hack? :)
I consider it a hack because I didn't have time to understand enough to fix it correctly. I assume this works because the Istio service mesh and/or Dex expects this header, even though the `api/status` call itself is unsecured. git diff:
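The workaround amounts to attaching the header to the culler's activity probe. A minimal sketch, where the URL and user value are placeholders:

```python
import urllib.request

def status_probe(base_url, user_id):
    # Build the activity-check request with the `kubeflow-userid` header
    # set, mirroring the workaround described above. Without the header,
    # the Istio/Dex auth layer rejects the request with 403.
    req = urllib.request.Request(f"{base_url}/api/status")
    req.add_header("kubeflow-userid", user_id)
    return req
```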
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
JupyterHub has a demo of culling idle pods using a script; it is present as an example JupyterHub service.
I have a somewhat hacky solution for implementing this within Kubeflow using a ConfigMap. We add the following configuration into the `jupyterhub-config` ConfigMap. The solution works by having JupyterHub start two managed services: one to fetch the `cull_idle_servers.py` file and one to start it. Hope this is useful!

I'm actually wondering if we should start a specific section in the documentation that deals with configuration easily done through a ConfigMap (like this one).
We could talk about things like form customisation, adding more fields (like node selection via labels), etc. I feel there's a lot of potential there.
I will be glad to contribute to that documentation.
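For completeness, the managed-service registration that a `cull_idle_servers.py` setup relies on looks roughly like this in `jupyterhub_config.py`. This is a config fragment, not standalone code (`c` is provided by JupyterHub's config loader), and the timeout value is illustrative:

```python
# jupyterhub_config.py -- register the culler as a JupyterHub managed service
import sys

c.JupyterHub.services = [
    {
        "name": "cull-idle",
        "admin": True,  # the culler needs admin rights to stop user servers
        "command": [sys.executable, "cull_idle_servers.py", "--timeout=3600"],
    }
]
```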