
Jupyter - Culling of Idle Pods #1394

Closed
tlkh opened this issue Aug 21, 2018 · 34 comments

@tlkh

tlkh commented Aug 21, 2018

JupyterHub has a demo of culling idle servers using a script, available as an example JupyterHub service.
I have a somewhat hacky solution for implementing this within Kubeflow using a ConfigMap.

We add the following configuration to the ConfigMap jupyterhub-config:

c.JupyterHub.services = [
    {
        # Managed service that downloads the culler script into the Hub's
        # working directory (-N only re-downloads if the remote file is newer)
        'name': 'wget-cull-idle',
        'admin': True,
        'command': ['wget', 'https://raw.githubusercontent.com/jupyterhub/jupyterhub/master/examples/cull-idle/cull_idle_servers.py', '-N']
    },

    {
        # Managed service that runs the culler with a one-hour idle timeout
        'name': 'cull-idle',
        'admin': True,
        'command': ['python', 'cull_idle_servers.py', '--timeout=3600']
    }
]

The solution works by having JupyterHub start two managed services: one to fetch the cull_idle_servers.py file and one to run it. Hope this is useful!
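A possible variant (my sketch, not taken from the JupyterHub examples) combines the download and the run into a single managed service, which avoids any ordering race between the two services. It assumes sh, wget and python are available in the Hub image:

```python
# jupyterhub_config.py -- sketch: one managed service that fetches the culler
# script and then runs it (assumes sh, wget and python in the Hub image)
c.JupyterHub.services = [
    {
        'name': 'cull-idle',
        'admin': True,
        'command': [
            'sh', '-c',
            'wget -N https://raw.githubusercontent.com/jupyterhub/jupyterhub/'
            'master/examples/cull-idle/cull_idle_servers.py'
            ' && python cull_idle_servers.py --timeout=3600'
        ]
    }
]
```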

I'm actually wondering if we should start a dedicated section in the documentation for configuration that can easily be done through the ConfigMap (like this one).
We could cover things like form customisation, adding more fields (e.g. node selection via labels), etc. I feel there's a lot of potential there.
I'd be glad to contribute to that documentation.

@jlewi
Contributor

jlewi commented Sep 27, 2018

It looks like this information is also available from the Jupyter servers themselves. I think it's last_activity:
http://petstore.swagger.io/?url=https://raw.githubusercontent.com/jupyter/notebook/master/notebook/services/api/api.yaml#/kernels/get_api_kernels__kernel_id_

I mention this because we are considering maybe getting rid of JupyterHub #1630

@clkao

clkao commented Sep 30, 2018

A bit more info on notebook self-culling vs cull-idle-server: #56 (comment). The former is also useful in non-JupyterHub scenarios if we ever get there.

Note this requires the user pod's restartPolicy to be set to OnFailure, which was broken in kubespawner until jupyterhub/kubespawner#251 was merged.

@jlewi jlewi added area/jupyter Issues related to Jupyter priority/p1 labels Mar 28, 2019
@jlewi jlewi changed the title JupyterHub - Culling of Idle Pods Jupyter - Culling of Idle Pods Mar 28, 2019
@jlewi
Contributor

jlewi commented Mar 28, 2019

Adding this to 0.6.

It would be great to make this an option in the Jupyter Notebook Controller and support it in the UI.
A CUJ (critical user journey) might be helpful.

For example, it might be nice to terminate idle Jupyter pods without deleting the Jupyter resource so that from the UI a user could restart it with 1 click.

/cc @vkoukis @zabbasi

@lluunn
Contributor

lluunn commented Mar 28, 2019

cc

@kimwnasptd
Member

@jlewi we would like to help drive this forward
/assign kimwnasptd

@jlewi
Contributor

jlewi commented Jul 23, 2019

We should try to get this into the controller in 0.7 in preparation for going 1.0.

@kimwnasptd
Member

@jlewi yes, I was planning to work on this one for 0.7

The overall experience could be the following: the UI would support stopping/starting the Notebook Pods, and we would have a culler, running as a sidecar container to the Controller, that stops idle Notebooks. The user can then simply start a stopped Notebook from the UI.

One thing we need to clarify is what counts as an idle Pod. If we take for granted that Istio is installed in the cluster, so that we can use metrics from Prometheus, then an idle Pod could be one that hasn't received HTTP traffic for a {configurable} amount of time.
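Assuming Istio's default metrics, the "no HTTP traffic for a {configurable} window" check could be phrased as a PromQL query along these lines (a sketch: istio_requests_total and the destination_workload* labels are Istio's standard names, but the exact query and the traffic_query helper are my assumptions, not the eventual implementation):

```python
# Sketch: PromQL asking "how many requests reached this notebook's workload
# over the culling window?" -- a zero result makes the Notebook a culling candidate.
def traffic_query(workload: str, namespace: str, window: str = "1h") -> str:
    return (
        f'sum(increase(istio_requests_total{{'
        f'destination_workload="{workload}",'
        f'destination_workload_namespace="{namespace}"}}[{window}]))'
    )
```

The culler would send this to the Prometheus HTTP API in istio-system and treat an empty or zero result as idle.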

For this UX, the changes we would need to make would be:

Notebook Controller:

  1. Add a field to the Notebook Spec specifying the number of Pods to create. We need this so we can set the number of Pods in the StatefulSet to zero, which stops the Notebook.

Notebook Culler

It will be a sidecar container in the Notebook Controller Pod. It will periodically query Prometheus, check the existing Notebook CRs, and stop any idle Notebooks. To stop a Notebook, it will PATCH the CR and set the number of Pods to zero.

Notebooks Manager UI

  1. Give users the option to stop/start a Notebook. Under the hood, the backend will modify the number of Pods in the CR accordingly.

@jlewi WDYT
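The core idleness decision in the proposal above could be sketched as follows (a minimal illustration; is_idle, CULL_TIMEOUT and the ISO-8601 timestamp format are my assumptions, not actual controller code):

```python
from datetime import datetime, timedelta, timezone

# Assumed, configurable threshold after which a Notebook counts as idle.
CULL_TIMEOUT = timedelta(hours=1)

def is_idle(last_activity_iso: str, now: datetime) -> bool:
    """True when the last observed activity is older than CULL_TIMEOUT."""
    last = datetime.fromisoformat(last_activity_iso.replace("Z", "+00:00"))
    return now - last > CULL_TIMEOUT
```

A culler loop would evaluate this per Notebook and, for idle ones, PATCH the CR to set the Pod count to zero.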

@jlewi
Contributor

jlewi commented Jul 30, 2019

Why does the culling logic need to be in a side car?

Why can't we just make this a part of the regular notebook controller logic and invoke it in the reconcile function for each notebook?

@kimwnasptd
Member

I was thinking that we might need this (or similar) logic for the Viewer CRD as well. Instead of repeating the logic in the Viewer Controller, we could keep it in a central place, launch the culling container as a sidecar to each controller, and only change the culler's configuration for each case.

Regarding the Viewer CR: right now the controller has a maximum number of allowed live instances, and if a new Viewer is created that exceeds this maximum, the oldest one is deleted. Wouldn't we want similar culling logic here?

@jlewi
Contributor

jlewi commented Aug 6, 2019

@kimwnasptd If we need similar logic in other CRDs, can't we just make it a Go library and import it into the viewer controller?

@kimwnasptd
Member

@jlewi yes, that would also work. I don't have a hard preference as long as we make the code reusable.

Regarding making this part of the reconciliation loop: we would be making an HTTP request to the Prometheus server in istio-system for each Notebook. Since the Reconcile function is called at high frequency, that would produce a significant number of requests, which is why I was thinking of making it a standalone process/container triggered periodically.

But if you think this shouldn't be much of a problem, then I can just make it a Go package and trigger the logic in the controller's Reconcile function.

@kimwnasptd
Member

I think we should be fine having this in the Reconcile function; we can also re-trigger the reconcile at specific time intervals. I've opened a PR to move forward with the implementation.

@jlewi
Contributor

jlewi commented Aug 18, 2019

Per the discussion in #3856 I think we might try to use the IDLE signal that the Jupyter kernel provides.
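For reference, Jupyter's GET /api/kernels endpoint returns one model per kernel, each carrying an execution_state field ('idle', 'busy', ...). A kernel-level check could then be sketched as follows (illustrative only; all_kernels_idle is a hypothetical helper, not the controller's actual code):

```python
def all_kernels_idle(kernels: list) -> bool:
    """True when no kernel is busy: every kernel in the /api/kernels payload
    reports execution_state == 'idle' (an empty list also counts as idle)."""
    return all(k.get("execution_state") == "idle" for k in kernels)
```

Combined with each kernel's last_activity timestamp, this gives a culling signal that does not depend on Istio or Prometheus.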

@therc

therc commented Aug 29, 2019

What about automatic unidling? Both Red Hat (in OpenShift) and Deis have idling/unidling controllers that can scale a resource down to zero and then scale it back up when new requests come in, with a proxy forwarding any requests that arrived while no pods were up.

See e.g. https://github.com/deislabs/osiris

@jlewi
Contributor

jlewi commented Sep 10, 2019

@therc If people want to unidle they would do that through the UI.

@kimwnasptd Any update on this?

@jlewi
Contributor

jlewi commented Oct 5, 2019

@kimwnasptd any update on this?

Since notebooks are going beta in 0.7 and we don't want more API changes after that, specific notebook idle times will likely need to be set via annotation.

@jlewi
Contributor

jlewi commented Dec 16, 2019

@kimwnasptd Could you provide an update on this specifically what do you think the 1.0 behavior will be?

@kimwnasptd
Member

kimwnasptd commented Dec 17, 2019

Sure! For 1.0 I was thinking of having 2-3 PodDefaults that represent different culling behaviors. The user will then be able to select which of them to apply to a new Notebook they create.

By default, JWA won't have any PodDefaults preselected (although the admin can change that in JWA's ConfigMap). This way new Notebooks will not be culled unless the user explicitly requests it, and the culling behavior can be tailored on a per-Notebook basis.

Also, the UI will allow the user to start a culled Notebook or stop a running one.

I want to ensure this functionality lands in 1.0, and if I have the time we can iterate and add more functionality for 1.0. @jlewi WDYT?

@jlewi
Contributor

jlewi commented Jan 6, 2020

@kimwnasptd Why would we use PodDefaults? Why not just add a field in the UI that allows people to set the culling time?

@kimwnasptd
Member

@jlewi I was thinking that most of the time users will select the same culling times for their different Notebook classes: they'd want their GPU Notebooks up for as little time as possible, and might have two or three different culling times for their normal Notebooks.

Because of this, I thought it would make more sense to have a drop-down for users to select a predefined culling time rather than setting it directly. This drop-down would also need to be configurable by the admin in JWA's config.

Both requirements can be satisfied with PodDefaults, which is why I preferred them. Do you think this behavior should be more visible and have its own section?

@jlewi
Contributor

jlewi commented Jan 27, 2020

@kimwnasptd Thanks for the explanation. Can we add docs and examples for PodDefaults that people can use to set up this behavior? Then we can close this issue out.

@nsshah1288

Is there any action on this? Does Kubeflow have any ability to cull idle pods? We want the ability to shut down Jupyter notebooks after a certain period of idle time.

@thesuperzapper
Member

It looks like this may have been added in: #3856

But I am not sure we should tightly integrate the status check with Jupyter itself, so that we can support other notebook types.

@jshelman

Seeing issues with the changes merged from #3856: the activity check at ../api/status fails with a 403 error, RBAC: access denied.

I'm also seeing log entries in the notebook controller:
2020-04-11T01:50:31.165Z INFO culler Warning: GET to http://testnb.namespace.svc.cluster.local/notebook/namespace/testnb/api/status: 403

This is after enabling ENABLE_CULLING: true as an env var for the deployment.

@descampsk

Got the same issue.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/question 0.65


@jlewi
Contributor

jlewi commented Jun 2, 2020

@kimwnasptd thoughts?

@jshelman

jshelman commented Jun 3, 2020

I got around this with a simple hack: setting the kubeflow-userid header to a value in the culler allowed the request to complete.

@descampsk

Could you provide more details on your hack? :)

@jshelman

jshelman commented Jun 4, 2020

I consider it a hack because I didn't have time to understand the setup well enough to fix it correctly. I assume this works because the Istio service mesh and/or Dex expects this header, even though the API status call itself is unsecured.

The git diff:

+++ b/components/notebook-controller/pkg/culler/culler.go
@@ -142,7 +142,15 @@ func getNotebookApiStatus(nm, ns string) *NotebookStatus {
                "http://%s.%s.svc.%s/notebook/%s/%s/api/status",
                nm, ns, domain, ns, nm)

-       resp, err := client.Get(url)
+       req, err := http.NewRequest("GET", url, nil)
+       if err != nil {
+               log.Info(fmt.Sprintf("Error creating request to %s", url), "error", err)
+               return nil
+       }
+
+       req.Header.Set("kubeflow-userid", "user@email.com")
+
+       resp, err := client.Do(req)
        if err != nil {

@stale

stale bot commented Sep 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Sep 2, 2020
@stale stale bot closed this as completed Sep 10, 2020
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
* Add waitAllProcesses to metricsCollector config

* wait_all_processes config for python metricscollector main

* Correct boolean check

* waitAllProcesses config as bool

* add omitempty to suggestion and earlystopping

* correct default config
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Moving central dashboard links to configmap

* Updating tests for centraldashboard configmap

10 participants