[Discussion] Do we need JupyterHub? Should we just use a web app + K8s API Server? #1630

Closed · jlewi opened this issue Sep 26, 2018 · 20 comments
Labels: area/jupyter (Issues related to Jupyter)

Comments

@jlewi (Contributor) commented Sep 26, 2018

At the contributor summit yesterday, one question that came up is whether we should replace JupyterHub with a bunch of separate microservices, e.g.:

  • A custom web app to spawn/manage pods for notebooks using the K8s API server (a sketch follows this list)
  • Ambassador, Istio, or similar to handle setting up reverse proxy routes to notebooks
  • An auth service like oauth2_proxy to handle authentication
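
A minimal sketch of the first piece, a web app creating a notebook pod straight through the K8s API server with the official Python client (image, namespace, and labels are illustrative, not Kubeflow's actual conventions):

```python
# Sketch: spawn a notebook pod directly via the K8s API server.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="notebook-alice",
        labels={"app": "jupyter-notebook", "user": "alice"},  # hypothetical labels
    ),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="notebook",
                image="jupyter/base-notebook:latest",
                ports=[client.V1ContainerPort(container_port=8888)],
            )
        ]
    ),
)
v1.create_namespaced_pod(namespace="kubeflow", body=pod)
```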

There are a couple of potential reasons for doing this:

  • Long term we might be able to build a richer UI that would be better for users
    * Lyft showed a picture of their UI (hopefully they'll share slides)

  • I don't think we really want to use JupyterHub for authentication
    * Kubeflow consists of many web apps (TensorBoard, TFJobs UI, CentralDashboard, model servers)
    * I think an architecture (see #11, Secure proxy) where we put authentication in front of all the services makes sense
    * Ideally we would use Istio to restrict access to individual services (e.g. ensure only user X can send requests to user X's notebook).

  • Scalability
    * We think the JupyterHub reverse proxy might be a blocker to taking advantage of K8s to scale out (see #239, Use Ambassador/Envoy as proxy for JupyterHub)

It would be valuable to understand why the folks working on JupyterHub/KubeSpawner went that route as opposed to a more microservice-style, K8s-native architecture.

@yuvipanda @foxish Can you provide any background?

/cc @pdmack @kkasravi @ioandr @inc0

Related Issues
#34 JupyterHub UI element for spawning notebooks

@yuvipanda (Contributor) commented Sep 26, 2018

JupyterHub is built to run in a lot of places, not just Kubernetes. This is the primary reason it's built the way it is. It's heavily used in simple single-node installs, Docker-based installs, HPC systems, etc.

We have an Outreachy project to make the proxy HA: https://github.com/jupyterhub/outreachy/blob/master/ideas/traefik-jupyterhub-proxy.rst. There's also an Ingress-based proxy implementation already. mybinder.org also runs a JupyterHub that sees over 60k sessions per week with no scalability problems, so we're pretty confident we're OK for now, and there should be an HA solution in the next 6 months or so.

In general, we have found that every re-implementation of 'let us spawn a bunch of notebooks' has had to fix bits and pieces of the various bugs the JupyterHub community has fixed over the last few years. Many are in the process of moving towards replacing their custom-built solutions with JupyterHub, contributing fixes & features upstream as they go.

re: Istio & others, we'd love to have ways to plug into that without tying ourselves to require Istio or any specific implementation. I'm trying to find some time to work with the PANGEO Community to explore things like this.

We'd love to have more engagement from the Kubeflow community in the JupyterHub community than we have right now. This will be mutually beneficial in the long run, but only if we both make explicit efforts around it. I don't really think that's happening right now in any significant form.

@jlewi (Contributor, Author) commented Sep 26, 2018

> JupyterHub is built to run in a lot of places, not just Kubernetes. This is the primary reason it's built the way it is. It's heavily used in simple single-node installs, Docker-based installs, HPC systems, etc.

That was my conjecture. It makes sense to me that JupyterHub provides a lot of functionality in order to support non-K8s platforms. My question is whether just running JupyterHub is really the best long-term solution for a good K8s-native deployment, or whether a more microservice-style, K8s-native architecture might be better.

> In general, we have found that every re-implementation of 'let us spawn a bunch of notebooks' has had to fix bits and pieces of the various bugs the JupyterHub community has fixed over the last few years. Many are in the process of moving towards replacing their custom-built solutions with JupyterHub, contributing fixes & features upstream as they go.

Can you provide more examples? What sorts of problems would we run into managing the notebooks directly? On K8s I'd expect K8s to do all the heavy lifting of managing notebooks, so it's unclear to me what role the Hub is really playing that is likely to break if we reimplement it ourselves.

> We'd love to have more engagement from the Kubeflow community in the JupyterHub community than we have right now. This will be mutually beneficial in the long run, but only if we both make explicit efforts around it. I don't really think that's happening right now in any significant form.

This is a good point, and I'd welcome that too. Along those lines, let me lay out what I'm thinking, and maybe you can point out where we'd benefit by sticking with JupyterHub:

  • We create a lightweight K8s custom resource for managing notebooks by using something like metacontroller (see the sketch after this list)
    • I think we can pretty much use kube_spawner.py as a starting point and wrap it
  • We create a rich UI on top of this K8s CR using Angular/Polymer/React, etc.
  • We can make it identity-aware using JWTs
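
A sketch of what such a custom resource might look like from a client's point of view (the group/version/kind and field names are hypothetical):

```python
# Sketch: create a hypothetical Notebook custom resource via the
# generic custom-objects API.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

notebook = {
    "apiVersion": "kubeflow.org/v1alpha1",  # hypothetical group/version
    "kind": "Notebook",                     # hypothetical kind
    "metadata": {"name": "alice-notebook", "namespace": "kubeflow"},
    "spec": {
        "template": {  # PodTemplateSpec-style spec
            "spec": {
                "containers": [{
                    "name": "notebook",
                    "image": "jupyter/base-notebook:latest",
                }]
            }
        }
    },
}
api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1alpha1",
    namespace="kubeflow", plural="notebooks", body=notebook,
)
```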

@inc0 commented Sep 26, 2018

I'm +1 on removing JupyterHub (I mentioned it several months ago). The number one issue for me is that user management in JH is not the same as in K8s. That means RBAC won't work for JH, which means we can't securely let notebooks talk to Kubernetes underneath and spawn new pods (for example TFJobs). We could rip out user management from JH, but what's left then?

JH is stateful and that's an issue: what happens when the pod dies? Do you manage state externally? That's an issue for operators.

I'd be much more inclined towards something like a Jupyter CRD. You would spawn the deployment, figure out storage backends (PVC? Object?), set up clustering for distributed training and data science (a Dask cluster configuration, for example), and so on.

We can also create a REST API or frontend for that, but I'd rather create a K8s CRD and maybe a frontend for all of Kubeflow (spawn me a Jupyter notebook, a TFJob, a Katib study, etc., from the same UI).

@vkoukis (Member) commented Sep 28, 2018

We (@ioandr, @iliastsi, myself) agree completely that it would be best to replace JupyterHub in the long run. I have tried to lay out a plan to do that in #34 (comment).

Using JupyterHub as essentially a wrapper around KubeSpawner has a number of problems, as @inc0 also mentions in the comment above, including:

  • JupyterHub creates all pods inside the same namespace
  • JupyterHub users are not K8s users, and there is no way for K8s to make fine-grained, user-specific authorization decisions
  • As a result, users in a notebook cannot interact with K8s and work with Kubeflow components [e.g., TFJob] securely, in a unified workflow from a single pane, their notebook
  • There is no easy way to enable other authentication methods and have them shared across Kubeflow components
  • There is no easy way for end users to specify K8s-native resources to access from the spawned pods. In #34 (Extend KubeSpawner and its UI to handle Persistent Volume Claims) we are specifically targeting PVCs as the K8s resource we would like to be able to create or attach to a Jupyter notebook pod as it is being spawned.

However, we don’t have to do this now. We propose we continue using JupyterHub to iterate on the user-facing UI and improve the user experience, focusing on changes to KubeSpawner; document exactly what our limitations with JupyterHub are; and then switch to a K8s-specific web app to target them explicitly.
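
For concreteness, the PVC item above boils down to being able to render something like the following into the spawned pod's spec (a sketch; claim name and mount path are illustrative):

```python
# Sketch: mount a pre-existing PVC into a notebook pod at spawn time.
from kubernetes import client

volume = client.V1Volume(
    name="workspace",
    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
        claim_name="alice-workspace"  # hypothetical pre-created PVC
    ),
)
container = client.V1Container(
    name="notebook",
    image="jupyter/base-notebook:latest",
    volume_mounts=[client.V1VolumeMount(name="workspace", mount_path="/home/jovyan")],
)
pod_spec = client.V1PodSpec(containers=[container], volumes=[volume])
```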

@jlewi (Contributor, Author) commented Sep 28, 2018

#34 (comment)

Sounds like a great plan to me.

@kkasravi (Contributor) commented Oct 1, 2018

@jlewi @vkoukis @inc0 @ashahba +1 on metacontroller. It allows rapid prototyping and could leverage a CRD. A CompositeController, for example, would intercept the CRD sent from KubeSpawner (or its replacement) and create a deployment whose pod template specifies a ServiceAccountName and PodSecurityContext appropriate for the user.

@dhirschfeld commented Oct 4, 2018

I've stumbled across this as a JupyterHub admin running a custom Windows HPC/Docker setup but interested in moving to Kubernetes. As such, I really don't know what I'm talking about when it comes to K8s, but some of the assertions above don't seem entirely correct to me.

I'll pull out authorization in particular:

> JupyterHub users are not K8s users, and there is no way for K8s to make fine-grained, user-specific authorization decisions

IIUC the JupyterHub authentication system is pluggable, so what's preventing anyone from writing a K8sAuthenticator that integrates with the rest of the K8s ecosystem? For example, I'm using the LDAPAuthenticator to authenticate users against our AD service. User resource limits, mapped volumes, and available images are all controlled by AD group membership, completely external to anything to do with JupyterHub itself.
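
To illustrate how small such a plug-in can be, here is a minimal sketch of an authenticator that trusts an identity header set by an upstream auth proxy (the class and header name are hypothetical; it assumes the standard jupyterhub.auth.Authenticator interface):

```python
# Hypothetical authenticator: trust a user identity asserted by an upstream
# auth proxy. The X-Remote-User header name is an assumption about that
# proxy's contract, not an established convention.
from jupyterhub.auth import Authenticator


class HeaderAuthenticator(Authenticator):
    """Authenticate based on a header set by an authenticating proxy."""

    async def authenticate(self, handler, data=None):
        # Returning None tells JupyterHub that authentication failed.
        return handler.request.headers.get("X-Remote-User") or None
```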

As for specifying resources:

> There is no easy way for end users to specify K8s-native resources

I don't see why you couldn't simply have a custom options form. In my case, all users have a default home drive mapped, but other network folders are mapped into the container based on their AD group membership. The user doesn't have the option of not having a mapped folder available, but why would they care if they had more access than they needed?
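
For what it's worth, a custom options form in jupyterhub_config.py can be quite small. A sketch, with illustrative field names and images (the image trait name may differ across KubeSpawner versions):

```python
# jupyterhub_config.py sketch: a custom spawn form whose fields flow into
# the spawned pod. Field names and images are illustrative.
from kubespawner import KubeSpawner


class FormSpawner(KubeSpawner):
    def _options_form_default(self):
        return """
        <label for="image">Notebook image</label>
        <select name="image">
          <option value="jupyter/base-notebook:latest">base</option>
          <option value="jupyter/tensorflow-notebook:latest">tensorflow</option>
        </select>
        """

    def options_from_form(self, formdata):
        # Form values arrive as lists of strings.
        return {"image": formdata["image"][0]}

    def start(self):
        if "image" in self.user_options:
            # 'image' is the trait name in recent KubeSpawner releases.
            self.image = self.user_options["image"]
        return super().start()


c.JupyterHub.spawner_class = FormSpawner
```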

I just don't see that any of the listed problems aren't easily fixable with very minimal code, rather than by replacing the entire application. I think both communities would benefit from working together to fix the problems rather than going their separate ways.

Anyway, just my 2c. There may well be some K8s-specific issues I'm unaware of that would make solving the stated problems more difficult than simply starting from scratch; I'm not knowledgeable enough about K8s to say, but I'd be very interested to hear @yuvipanda's opinion on the matter.

@minrk commented Oct 4, 2018

Since it was asked, a bit of history on why JupyterHub has its current design and scaling characteristics: The target use case was a single machine with 5-50 users, and several design decisions were taken with user-space installability, maintainability, and simplicity in mind, while scalability was explicitly out of scope as something we knew we didn't have the resources to tackle. Since then, our user community has developed in a different direction than initially expected, and we have worked on scaling, but running more than 5k concurrent active users still isn't supported without deploying multiple independent Hubs (as @yuvipanda has done). If we had built JupyterHub to be more kube/cloud/scalable-native, it would look quite different.

On the proxy as single-point-of-failure and scaling bottleneck: The default proxy implementation does now support external storage for its routing table, and there is an implementation for redis, which means the proxy should be able to scale reasonably well. I have not tested this in production, though. You do seem to need several thousand users before proxy performance becomes an issue, so exposing this hasn't been a high priority, yet. A better fit, especially in the k8s community, is probably the current plan to make an etcd-backed traefik implementation, as @yuvipanda mentioned.

With that said, JupyterHub is not the way to deploy notebooks on behalf of users. It is one way, and meant to simplify deploying notebooks with one particular pattern. If your pattern/integrations are strongly divergent from the design of JupyterHub, it may well be more work to coerce JupyterHub into behaving how you like than to implement your own solution, tailored to your needs. Ultimately, all JupyterHub+KubeSpawner does is launch pods and provide routing/authentication. As with any shared infrastructure, JupyterHub and KubeSpawner have accumulated loads of fixes and helpers for corner cases, etc. that people have faced over the years. I'd say this is the main benefit to using JupyterHub for a case like yours, and the main thing you lose when rolling your own.

For Kubernetes experts like you folks, deploying notebooks in pods in a kube-native application, it is very likely simpler to forego JupyterHub altogether. The target audience for JupyterHub is pretty much the complement of kubeflow developers: we are attempting to simplify deploying notebooks on behalf of users for folks who don't understand Kubernetes, rather than the other way around. Of course, we're very happy to have kubeflow use JupyterHub, but I would never argue that it's always the right choice, and for deployment experts with a given technology, there's a very good chance it's not. Getting feedback from you folks on exactly why/how JupyterHub isn't working well is super useful for us in guiding future development, whether you stick with it or not.

Even if/while you stick with JupyterHub, here are some ideas on reducing friction and accomplishing the listed goals piecemeal:

  • If you have plans to use JWT/kube/other authentication mechanisms, do that. Then, for JupyterHub, write a lightweight Authenticator that defers to your existing system (via OAuth or headers coming from a proxied request), rather than talking all the way up to GitHub/Google/etc. An example of this is the REMOTE_USER authenticator, which only looks at the REMOTE_USER header set by the Apache Shibboleth plugin, which does all the real auth.
  • If you want highly customizable UI, consider using JupyterHub purely as a backend REST API, rather than trying to serve increasingly complex UI through JupyterHub itself via Spawner.options_form or custom jinja templates. This is how Binder works, for instance. At no point do users see JupyterHub UI in that application, even though all deployments/monitoring/etc. go through JupyterHub.

If you take these two approaches (together or separately), it may reduce friction with JupyterHub in the short term, and should you choose to move away from JupyterHub in the future, it should make that transition simpler, since fewer components are really relying on JupyterHub.
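
A sketch of the second approach, driving JupyterHub purely through its REST API (the Hub URL and API token are placeholders):

```python
# Sketch: use JupyterHub as a backend REST API, the way Binder does.
import requests

HUB_API = "http://jupyterhub:8081/hub/api"             # placeholder URL
HEADERS = {"Authorization": "token SECRET_ADMIN_TOKEN"}  # placeholder token

# Ask the Hub to start a server for user "alice".
r = requests.post(f"{HUB_API}/users/alice/server", headers=HEADERS)
r.raise_for_status()  # 201 = started, 202 = spawn still pending

# Poll the user model for spawn state and the server's URL path.
user = requests.get(f"{HUB_API}/users/alice", headers=HEADERS).json()
print(user.get("pending"), user.get("server"))
```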

@jlewi (Contributor, Author) commented Oct 4, 2018

Thanks @minrk for the detailed info.

I also see you just added multi-user jupyterhub/jupyterhub#2154; very nice.

I also see @yuvipanda has some in-progress work to support per-user namespaces: jupyterhub/kubespawner#76.

@jlewi (Contributor, Author) commented Oct 4, 2018

Here's some more context for the JupyterHub folks.

Here's a diagram of Kubeflow's current architecture:
[diagram: Kubeflow Jupyter ingress]

Ambassador is a programmable reverse proxy.

  • You add routes just by adding an annotation to a K8s service.
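
For instance, a notebook's Service could carry its own route (a sketch; the Mapping uses Ambassador's v0-era annotation syntax, and the names/prefix are illustrative):

```python
# Sketch: create a Service with an Ambassador Mapping embedded in its
# annotations, which adds a reverse-proxy route to the notebook.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ambassador_config = """\
---
apiVersion: ambassador/v0
kind: Mapping
name: notebook_alice_mapping
prefix: /notebook/alice/
service: notebook-alice.kubeflow:8888
"""

svc = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="notebook-alice",
        annotations={"getambassador.io/config": ambassador_config},
    ),
    spec=client.V1ServiceSpec(
        selector={"app": "jupyter-notebook", "user": "alice"},
        ports=[client.V1ServicePort(port=8888, target_port=8888)],
    ),
)
v1.create_namespaced_service(namespace="kubeflow", body=svc)
```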

Ambassador supports external auth.

  • It supports basic auth
  • I'd love to see someone add support for an OAuth plugin that handles authenticating against services like GitHub, Google, Facebook, Firebase, etc.

For comparison, here's the JupyterHub diagram that I pulled from the JupyterHub docs:

[diagram: JupyterHub architecture]

So in our case Ambassador replaces the reverse proxy in JupyterHub, and Ambassador's external auth support replaces JupyterHub's authenticator plugins.

As the diagram shows, we have many web apps that might require authentication, so we'd really like to do authentication outside JupyterHub (e.g. via Ambassador external auth) so that we don't have to reimplement it for each app.

This is what we do right now on GCP, where we use IAP to attach JWTs to requests, and we just configure JupyterHub to do JWT checking.
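
The JWT check itself is small. A sketch using PyJWT against IAP's published keys (the audience value is deployment-specific and a placeholder here):

```python
# Sketch: verify the ES256 JWT that Google Cloud IAP places in the
# x-goog-iap-jwt-assertion request header.
import jwt  # PyJWT
import requests

IAP_PUBLIC_KEYS = requests.get("https://www.gstatic.com/iap/verify/public_key").json()
EXPECTED_AUDIENCE = "/projects/123/global/backendServices/456"  # placeholder


def verify_iap_jwt(assertion: str) -> str:
    """Return the authenticated user's email, or raise on an invalid token."""
    key_id = jwt.get_unverified_header(assertion)["kid"]
    claims = jwt.decode(
        assertion,
        IAP_PUBLIC_KEYS[key_id],
        algorithms=["ES256"],
        audience=EXPECTED_AUDIENCE,
        issuer="https://cloud.google.com/iap",
    )
    return claims["email"]
```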

If we move to Istio for JWT checking, then we can manage this centrally for all the web apps, which we will most likely need to do anyway.

So the two pieces we care most about in JupyterHub are:

  • The UI for managing notebooks
  • The API for launching notebooks

We could use JupyterHub as a REST API (and that is pretty much the plan of record, I think, per @vkoukis). As a long-term solution, though, I think a K8s custom resource might give us a simpler CRUD server with a more K8s-native API.

In particular, (following the pattern of TFJob/PyTorchJob) we could surface the K8s object e.g. the PodTemplateSpec directly in the spec. This would eliminate a lot of the spawner_options that are just a layer of indirection around K8s fields like labels, annotations, and sidecars.

We can also take advantage of Admission controllers to dynamically inject common configuration (e.g. a PVC that should be attached to all pods).

I suspect we don't need the user database because we can just use K8s metadata to track which notebooks belong to which users.
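
A sketch of what that looks like, assuming notebooks are labeled with their owner at creation time (the label keys are assumptions, not an established convention):

```python
# Sketch: replace the user database with K8s metadata. Ownership queries
# become plain label-selector lookups against the API server.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="kubeflow",
    label_selector="app=jupyter-notebook,user=alice",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```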

So long term, I think we'd want to move away from the JupyterHub spawner API to a more K8s-native API. When that happens, how easy would it be to share a front-end implementation?

It would be great to collaborate with the Jupyter/JupyterHub communities on a K8s CRD for managing jupyter notebooks on K8s. But would that be of interest to the Jupyter/JupyterHub communities? Would you want to pull in a dependency like metacontroller?

@jlzhao27 commented Oct 5, 2018

Thanks for the context @minrk and agree with @jlewi on the long term plan for a k8s based CRD. At my company, we rolled our own k8s API based notebook launch system instead of JupyterHub for a lot of the same reasons:

  1. We did not need JupyterHub auth.
  2. We wanted to use our own ingress/networking setup.
  3. We wanted a high level of customizability on the notebook templates themselves (PVCs, labels, annotations, volumes, secrets).

I think the JupyterHub project is moving in a direction where all these pieces are pluggable, but it requires learning the JupyterHub API instead of using native K8s APIs. We could definitely have made JupyterHub work, but we ultimately decided it was easier to write our own API on top of raw K8s, since it gave us maximum flexibility and we already had the K8s domain knowledge.

For Kubeflow in general, I would bias towards the native K8s API whenever possible, since that is what the user base is comfortable with. It requires a non-trivial amount of effort to set up and configure a JupyterHub server with all the right options if you don't want the default auth and networking stack. I think switching KubeSpawner to a K8s CRD for notebooks is a great middle ground, since it would pick up upstream fixes from the JupyterHub community while avoiding the need to run a heavy server or learn a new technology for many K8s-native users.

We might want a lightweight UI/REST endpoint around the CRD, since talking to the K8s API from a browser is pretty painful, but I think that can be solved independently.

@jlewi (Contributor, Author) commented Oct 6, 2018

@jlzhao27 Thanks!

Can you share more about how you are controlling notebooks? In particular which K8s controllers you are using? Did you have to implement any control logic beyond what the built in controllers provide?

My initial thought is that all we need is a statefulset and service. So a CRD would mostly be a small convenience. Longer term it would allow us to enable features like culling idle pods without exposing that to clients.

@jlzhao27 commented Oct 6, 2018

We actually were able to get away with only using Jobs, Services, and Ingresses as the main components. We mount a bunch of external resources like ConfigMaps and Secrets, and use annotations/labels heavily for kiam and other dependencies.

StatefulSets were not ideal because they require a PVC, whereas we sometimes wanted to manage volumes externally and attach the same PV to multiple notebook containers (we are using an NFS-based filesystem).

We started initially with Deployments but found activeDeadlineSeconds super helpful for automatically terminating notebooks, so we switched to the Job controller instead. I think between StatefulSet, Deployment, and Job, any would probably be okay as long as the pod template is configurable to a high degree.
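
A sketch of that Job-based pattern (values and names are illustrative):

```python
# Sketch: run a notebook as a Job so activeDeadlineSeconds gives it an
# automatic termination time, with no custom culling logic.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="notebook-alice"),
    spec=client.V1JobSpec(
        active_deadline_seconds=8 * 60 * 60,  # kill the notebook after 8 hours
        backoff_limit=0,                      # don't restart a terminated notebook
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"user": "alice"}),
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="notebook",
                    image="jupyter/base-notebook:latest",
                )],
            ),
        ),
    ),
)
batch.create_namespaced_job(namespace="kubeflow", body=job)
```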

@jlewi (Contributor, Author) commented Nov 26, 2018

Filed some more specific issues #1994 and #1995.

I think the next step would be to write a design doc summarizing this issue and providing a complete picture of what a solution that doesn't involve JupyterHub would look like.

Let's use this issue to track writing the doc and close it when we have such a doc.

@kkasravi (Contributor) commented

Sounds great, thanks @jlewi

@jlewi (Contributor, Author) commented Dec 26, 2018

I've drafted an initial design doc:
http://bit.ly/kf_jupyter_design_doc

Feel free to comment on the doc.

Closing this issue now that we have the design doc.

@jlewi closed this as completed Dec 26, 2018
@mlushpenko commented

@jlewi we have a working setup of JupyterHub with Istio and an external authentication service that is integrated with our OIDC provider. This setup allows separation of multiple tenants/teams and of users within each team. I will post a diagram, and if you find it interesting, I can try to describe it in more detail.
[diagram: jupyter-k8s setup]

@jlewi (Contributor, Author) commented Sep 12, 2019

@mlushpenko this looks valuable, but at this time Kubeflow has fully migrated off JupyterHub and I don't think we are going back. We want OIDC support and multi-tenancy for multiple applications, not just Jupyter, so it doesn't make sense to go with a JupyterHub-centric approach.

@mlushpenko commented

@jlewi we did a test with pure notebooks as well; I can draw another diagram for that. It's close to this one, but path-based routing is used for user separation, and you need some component to pre-create notebooks with specific naming/labels (Kubeflow, in your case). We actually want to use our setup for more than notebooks as well.

And we wanted to use Kubeflow, or parts of it, at first, but it had too much stuff hardcoded, so it wasn't flexible enough for us when we wanted to start with only some Kubeflow components and integrate them with Istio in our own way (Kubeflow had a hardcoded Istio gateway and some other things).

@mlushpenko commented

You are right, you have a solution already; I thought maybe some of these ideas could be useful.
