
Jupyter - Culling of Idle Pods #1394

Closed
tlkh opened this issue Aug 21, 2018 · 34 comments

@tlkh

tlkh commented Aug 21, 2018

JupyterHub has a demo of culling idle servers using a script, available as an example JupyterHub service.
I have a somewhat hacky solution for implementing this within Kubeflow using a ConfigMap.

We add the following configuration to the ConfigMap jupyterhub-config:

c.JupyterHub.services = [
    {
        # Managed service that downloads the culler script into the Hub's
        # working directory (-N only re-downloads if the remote file is newer)
        'name': 'wget-cull-idle',
        'admin': True,
        'command': ['wget', 'https://raw.githubusercontent.com/jupyterhub/jupyterhub/master/examples/cull-idle/cull_idle_servers.py', '-N']
    },

    {
        # Managed service that runs the culler with a one-hour idle timeout
        'name': 'cull-idle',
        'admin': True,
        'command': ['python', 'cull_idle_servers.py', '--timeout=3600']
    }
]

The solution works by having JupyterHub start two managed services: one to fetch the cull_idle_servers.py file and one to run it. Hope this is useful!
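A possible variant (my sketch, not taken from the JupyterHub examples) combines the download and the run into a single managed service, which avoids any ordering race between the two services. It assumes sh, wget and python are available in the Hub image:

```python
# jupyterhub_config.py -- sketch: one managed service that fetches the culler
# script and then runs it (assumes sh, wget and python in the Hub image)
c.JupyterHub.services = [
    {
        'name': 'cull-idle',
        'admin': True,
        'command': [
            'sh', '-c',
            'wget -N https://raw.githubusercontent.com/jupyterhub/jupyterhub/'
            'master/examples/cull-idle/cull_idle_servers.py'
            ' && python cull_idle_servers.py --timeout=3600'
        ]
    }
]
```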

I'm actually wondering if we should start a dedicated section in the documentation for configuration that can easily be done through the ConfigMap (like this one).
We could cover things like form customisation, adding more fields (e.g. node selection via labels), etc. I feel there's a lot of potential there.
I'd be glad to contribute to that documentation.

@jlewi
Contributor

jlewi commented Sep 27, 2018

It looks like this information is also available from the Jupyter servers themselves. I think it's last_activity:
http://petstore.swagger.io/?url=https://raw.githubusercontent.com/jupyter/notebook/master/notebook/services/api/api.yaml#/kernels/get_api_kernels__kernel_id_

I mention this because we are considering maybe getting rid of JupyterHub #1630

@clkao

clkao commented Sep 30, 2018

A bit more info on notebook self-culling vs cull-idle-server: #56 (comment). The former is also useful in non-JupyterHub scenarios if we ever get there.

Note this requires the user pod's restartPolicy to be set to OnFailure, which was broken in kubespawner until jupyterhub/kubespawner#251 was merged.

@jlewi jlewi added area/jupyter Issues related to Jupyter priority/p1 labels Mar 28, 2019
@jlewi jlewi changed the title JupyterHub - Culling of Idle Pods Jupyter - Culling of Idle Pods Mar 28, 2019
@jlewi
Contributor

jlewi commented Mar 28, 2019

Adding this to 0.6.

It would be great to make this an option in the Jupyter Notebook Controller and support it in the UI.
A CUJ (critical user journey) might be helpful.

For example, it might be nice to terminate idle Jupyter pods without deleting the Jupyter resource so that from the UI a user could restart it with 1 click.

/cc @vkoukis @zabbasi

@lluunn
Contributor

lluunn commented Mar 28, 2019

cc

@kimwnasptd
Member

@jlewi we would like to help drive this forward
/assign kimwnasptd

@jlewi
Contributor

jlewi commented Jul 23, 2019

We should try to get this into the controller in 0.7 in preparation for going 1.0.

@kimwnasptd
Member

@jlewi yes, I was planning to work on this one for 0.7

The overall experience could be the following: the UI would support stopping/starting the Notebook Pods, and we would have a culler, running as a sidecar container to the Controller, that stops idle Notebooks. The user can then simply start a stopped Notebook from the UI.

One thing we need to clarify is what counts as an idle Pod. If we take for granted that Istio is installed in the cluster, so that we can use metrics from Prometheus, then an idle Pod could be one that hasn't received HTTP traffic for a {configurable} amount of time.
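Assuming Istio's default metrics, the "no HTTP traffic for a {configurable} window" check could be phrased as a PromQL query along these lines (a sketch: istio_requests_total and the destination_workload* labels are Istio's standard names, but the exact query and the traffic_query helper are my assumptions, not the eventual implementation):

```python
# Sketch: PromQL asking "how many requests reached this notebook's workload
# over the culling window?" -- a zero result makes the Notebook a culling candidate.
def traffic_query(workload: str, namespace: str, window: str = "1h") -> str:
    return (
        f'sum(increase(istio_requests_total{{'
        f'destination_workload="{workload}",'
        f'destination_workload_namespace="{namespace}"}}[{window}]))'
    )
```

The culler would send this to the Prometheus HTTP API in istio-system and treat an empty or zero result as idle.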

For this UX, the changes we would need to make would be:

Notebook Controller:

  1. Add a field to the Notebook Spec specifying the number of Pods to create. We need this so we can set the number of Pods in the StatefulSet to zero, which stops the Notebook.

Notebook Culler

It will be a sidecar container in the Notebook Controller Pod. It will periodically query Prometheus, check the existing Notebook CRs, and stop any idle Notebooks. To stop a Notebook, it will PATCH the CR and set the number of Pods to zero.

Notebooks Manager UI

  1. Give users the option to stop/start a Notebook. Under the hood, the backend will modify the number of Pods in the CR accordingly.

@jlewi WDYT
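The core idleness decision in the proposal above could be sketched as follows (a minimal illustration; is_idle, CULL_TIMEOUT and the ISO-8601 timestamp format are my assumptions, not actual controller code):

```python
from datetime import datetime, timedelta, timezone

# Assumed, configurable threshold after which a Notebook counts as idle.
CULL_TIMEOUT = timedelta(hours=1)

def is_idle(last_activity_iso: str, now: datetime) -> bool:
    """True when the last observed activity is older than CULL_TIMEOUT."""
    last = datetime.fromisoformat(last_activity_iso.replace("Z", "+00:00"))
    return now - last > CULL_TIMEOUT
```

A culler loop would evaluate this per Notebook and, for idle ones, PATCH the CR to set the Pod count to zero.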

@jlewi
Contributor

jlewi commented Jul 30, 2019

Why does the culling logic need to be in a side car?

Why can't we just make this a part of the regular notebook controller logic and invoke it in the reconcile function for each notebook?

@kimwnasptd
Member

I was thinking that we might need this (or similar) logic for the Viewer CRD as well. Instead of repeating the logic in the Viewer Controller, we could keep it in a central place, launch the culling container as a sidecar to each controller, and only change the culler's configuration for each case.

Regarding the Viewer CR: right now the controller has a maximum number of allowed live instances, and if a new Viewer is created that exceeds this maximum, the oldest one is deleted. Wouldn't we want similar culling logic here?

@jlewi
Contributor

jlewi commented Aug 6, 2019

@kimwnasptd If we need similar logic in other CRDs, can't we just make it a Go library and import it into the viewer controller?

@kimwnasptd
Member

@jlewi yes, that would also work. I don't have a hard preference as long as we make the code reusable.

Regarding making this part of the reconciliation loop: we would be making an HTTP request to the Prometheus server in istio-system for each Notebook. Since the Reconcile function is called at high frequency, that would produce a significant number of requests, which is why I was thinking of making it a standalone process/container triggered periodically.

But if you think this shouldn't be much of a problem, then I can just make it a Go package and trigger the logic in the controller's Reconcile function.

@kimwnasptd
Member

I think we should be fine having this in the Reconcile function; we can also re-trigger the reconcile at specific time intervals. I've opened a PR to move forward with the implementation.

@jlewi
Contributor

jlewi commented Aug 18, 2019

Per the discussion in #3856 I think we might try to use the IDLE signal that the Jupyter kernel provides.
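For reference, Jupyter's GET /api/kernels endpoint returns one model per kernel, each carrying an execution_state field ('idle', 'busy', ...). A kernel-level check could then be sketched as follows (illustrative only; all_kernels_idle is a hypothetical helper, not the controller's actual code):

```python
def all_kernels_idle(kernels: list) -> bool:
    """True when no kernel is busy: every kernel in the /api/kernels payload
    reports execution_state == 'idle' (an empty list also counts as idle)."""
    return all(k.get("execution_state") == "idle" for k in kernels)
```

Combined with each kernel's last_activity timestamp, this gives a culling signal that does not depend on Istio or Prometheus.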

@therc

therc commented Aug 29, 2019

What about automatic unidling? Both Red Hat (in OpenShift) and Deis have idling/unidling controllers that can scale a resource down to zero and then scale it back up when new requests come in, with a proxy forwarding any requests that arrived while no pods were up.

See e.g. https://github.com/deislabs/osiris

@jlewi
Contributor

jlewi commented Sep 10, 2019

@therc If people want to unidle they would do that through the UI.

@kimwnasptd Any update on this?

@jlewi
Contributor

jlewi commented Oct 5, 2019

@kimwnasptd any update on this?

Since notebooks are going beta in 0.7 and we don't want more API changes after that, specific notebook idle times will likely need to be set via annotation.

@jlewi
Contributor

jlewi commented Dec 16, 2019

@kimwnasptd Could you provide an update on this specifically what do you think the 1.0 behavior will be?

@kimwnasptd
Member

kimwnasptd commented Dec 17, 2019

Sure! For 1.0 I was thinking of having 2-3 PodDefaults that represent different culling behaviors. The user will then be able to select which of them to apply to a new Notebook they create.

By default, JWA won't have any PodDefaults preselected (although the admin can change that in JWA's ConfigMap). This way new Notebooks will not be culled unless the user explicitly requests it, and the culling behavior can be tailored on a per-Notebook basis.

Also, the UI will allow the user to start a culled Notebook or stop a running one.

I want to ensure this functionality lands in 1.0, and if I have the time we can iterate and add more functionality for 1.0. @jlewi WDYT?

@jlewi
Contributor

jlewi commented Jan 6, 2020

@kimwnasptd Why would we use PodDefaults? Why not just add a field in the UI that allows people to set the culling time?

@kimwnasptd
Member

@jlewi I was thinking that most of the time users will select the same culling times for their different Notebook classes: they'd want their GPU Notebooks up for as little time as possible, and might have two or three different culling times for their normal Notebooks.

Because of this, I thought it would make more sense to have a drop-down for users to select a predefined culling time rather than setting it directly. This drop-down would also need to be configurable by the admin in JWA's config.

Both requirements can be satisfied with PodDefaults, which is why I preferred them. Do you think this behavior should be more visible and have its own section?

@jlewi
Contributor

jlewi commented Jan 27, 2020

@kimwnasptd Thanks for the explanation. Can we add docs and examples for PodDefaults that people can use to set up this behavior? Then we can close this issue out.

@nsshah1288

Is there any action on this? Does Kubeflow have any ability to cull idle pods? We want the ability to shut down Jupyter notebooks after a certain period of idle time.

@thesuperzapper
Member

It looks like this may have been added in: #3856

But I am not sure we should tightly integrate the status check with Jupyter itself, so that we can support other notebook types.

@jshelman

Seeing issues with the changes merged from #3856: the activity check at ../api/status fails with a 403 error, RBAC: access denied.

I'm also seeing log entries in the notebook controller:
2020-04-11T01:50:31.165Z INFO culler Warning: GET to http://testnb.namespace.svc.cluster.local/notebook/namespace/testnb/api/status: 403

This is after enabling ENABLE_CULLING: true as an env var for the deployment.

@descampsk

Got the same issue.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/question 0.65


@jlewi
Contributor

jlewi commented Jun 2, 2020

@kimwnasptd thoughts?

@jshelman

jshelman commented Jun 3, 2020

I got around this with a simple hack: setting the kubeflow-userid header to a value in the culler allowed the request to complete.

@descampsk

Could you provide more details on your hack? :)

@jshelman

jshelman commented Jun 4, 2020

I consider it a hack because I didn't have time to understand the setup well enough to fix it correctly. I assume this works because the Istio service mesh and/or Dex expects this header, even though the API status call itself is unsecured.

The git diff:

+++ b/components/notebook-controller/pkg/culler/culler.go
@@ -142,7 +142,15 @@ func getNotebookApiStatus(nm, ns string) *NotebookStatus {
                "http://%s.%s.svc.%s/notebook/%s/%s/api/status",
                nm, ns, domain, ns, nm)

-       resp, err := client.Get(url)
+       req, err := http.NewRequest("GET", url, nil)
+       if err != nil {
+               log.Info(fmt.Sprintf("Error creating request to %s", url), "error", err)
+               return nil
+       }
+
+       req.Header.Set("kubeflow-userid", "user@email.com")
+
+       resp, err := client.Do(req)
        if err != nil {

@stale

stale bot commented Sep 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Sep 2, 2020
@stale stale bot closed this as completed Sep 10, 2020
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
* Add waitAllProcesses to metricsCollector config

* wait_all_processes config for python metricscollector main

* Correct boolean check

* waitAllProcesses config as bool

* add omitempty to suggestion and earlystopping

* correct default config
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Moving central dashboard links to configmap

* Updating tests for centraldashboard configmap

10 participants