
Fire off TFJob from Jupyter Notebook #1240

Closed
jlewi opened this issue Jul 19, 2018 · 67 comments

Comments

@jlewi
Contributor

jlewi commented Jul 19, 2018

We'd like to make it super easy to go from writing code in a notebook to training that model distributed.

The experience might be something like:

  • User writes code in notebook and executes in Jupyter lab
  • User clicks a button that lets them fill in various settings, e.g. the number of GPUs
  • User clicks train

Under the hood this would cause

  • A docker image to be built
  • A TFJob/PyTorch/K8s Job to be created and fired off.

I think the biggest challenge is that we probably don't want to execute all code in the notebook. Typically, there's some amount of refactoring that needs to be done to convert a notebook into a python module suitable for execution as a batch job.

As a concrete example

Here's the notebook for our GitHub Issue summarization example

Here's the corresponding python module used when training in a K8s job.

The python module only executes a subset of the cells, in particular those that:

  1. Define the model architecture
  2. Train the model

Rather than try to auto-convert a notebook like the GitHub issue example, I think we should require users to structure their code to facilitate the conversion.

My suggestion would be to allow any functions defined in the notebook to be used as entry points. So for the GitHub issue summarization example, the user would have a cell like the following:

from keras.callbacks import CSVLogger, ModelCheckpoint
import numpy as np

# seq2seq_Model, encoder_input_data, decoder_input_data and decoder_target_data
# are assumed to be defined in earlier notebook cells.
def train_model(output):
    script_name_base = 'tutorial_seq2seq'
    csv_logger = CSVLogger('{:}.log'.format(script_name_base))
    model_checkpoint = ModelCheckpoint(
        '{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
        save_best_only=True)

    batch_size = 1200
    epochs = 7
    history = seq2seq_Model.fit(
        [encoder_input_data, decoder_input_data],
        np.expand_dims(decoder_target_data, -1),
        batch_size=batch_size,
        epochs=epochs,
        validation_split=0.12,
        callbacks=[csv_logger, model_checkpoint])

    seq2seq_Model.save(output)

train_model('seq2seq_model_tutorial.h5')

If the user structures their code this way, we should be able to manually create and invoke a suitable container entry point. Something like the following (a rough sketch of this flow appears after the list):

  • Use nbconvert to convert from ipynb to python code
  • Post process the python code
    • Strip out any statements not inside a function (except imports)
    • Create a CLI for the functions using a library like PyFire
  • Build a Docker image that is Notebook image + code
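For illustration, here is a minimal sketch of that post-processing step. It assumes nbconvert's Python API, Python 3.9+ (for ast.unparse), and Google's python-fire package (presumably what "PyFire" refers to); the helper name notebook_to_module is made up for this example.

import ast

from nbconvert import PythonExporter


def notebook_to_module(ipynb_path, py_path):
    # 1. Convert the .ipynb into plain Python source, in cell order.
    source, _ = PythonExporter().from_filename(ipynb_path)

    # 2. Strip any top-level statements that are not imports or
    #    function/class definitions.
    tree = ast.parse(source)
    tree.body = [node for node in tree.body
                 if isinstance(node, (ast.Import, ast.ImportFrom,
                                      ast.FunctionDef, ast.ClassDef))]

    # 3. Expose the remaining functions as a CLI entry point via fire.
    module_source = ast.unparse(tree) + (
        "\n\nif __name__ == '__main__':\n"
        "    import fire\n"
        "    fire.Fire()\n")
    with open(py_path, 'w') as f:
        f.write(module_source)

The resulting module could then be baked into the image and invoked as, e.g., python train.py train_model --output=seq2seq_model_tutorial.h5.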

A variant of this idea would be to use metaml (by @wbuchwalter). metaml uses metaparticle to allow people to annotate their python code with information needed to then run it on K8s (e.g. distributed using TFJob). If we went with this approach, I think the flow would be

  • Run nbconvert to go from ipynb -> py
  • Use metaparticle/metaml tool chain to build the docker image and submit the job.

@willingc @yuvipanda Is there existing tooling in the Jupyter community other than nbconvert to convert notebooks to code suitable for asynchronous batch execution?

/cc @wbuchwalter @gaocegege @yuvipanda @willingc

@jlewi jlewi added the priority/p2 and area/jupyter labels Jul 19, 2018
@wbuchwalter
Contributor

👍 Booking some time to spike MetaML for this use case in the coming days. Will report back on my findings.

@wbuchwalter
Contributor

So it seems like it could work well.
Still very much alpha, but you can check what it looks like here: https://github.com/wbuchwalter/metaml/blob/jupyter/examples/jupyter-notebook/TfJob.ipynb
The main limitation is that the model has to be defined as a class; the reason is two-fold:

  • This is an inherent requirement of MetaML for more advanced training strategies (e.g. population based training) where we need several entry points at different stages of the model lifecycle.
  • For notebooks this ensures that the code runs in the proper order. The issue with nbconvert is that it is like executing all the cells sequentially, which might not be the way users interact with their notebooks. Forcing the user to define a class instead mitigates this issue, at least in part (a rough sketch of the idea follows this list).
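Purely to illustrate the idea (this is not the actual MetaML interface; see the linked notebook for that), a model-as-a-class might look like:

class GithubIssueSummarizer:
    def build(self):
        # Construct seq2seq_Model and load the encoder/decoder data here,
        # instead of relying on the order in which cells were executed.
        ...

    def train(self, output='seq2seq_model_tutorial.h5'):
        # Fit the model and save it to `output`.
        ...

Giving the lifecycle well-known method names is what gives the conversion tooling predictable entry points.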

Other remarks:

@jlewi
Contributor Author

jlewi commented Jul 24, 2018

This looks very promising.

Were you running the notebook on your local machine or in the K8s cluster? On K8s any idea how we would build the container?

@jlewi
Contributor Author

jlewi commented Jul 25, 2018

The newly released build CRD (https://github.com/knative/build) looks promising.

@jlewi
Contributor Author

jlewi commented Jul 25, 2018

Some offline discussion with @wbuchwalter.

Using metaml might be more than we need, and it's blocked because metaparticle doesn't support CRDs.
So a custom script might be simpler.

Especially if we take advantage of knative to do the container build, we don't need that aspect of the metaparticle tool chain.

@inc0

inc0 commented Jul 26, 2018 via email

@wbuchwalter
Contributor

@jlewi Note that the build is happening in MetaML, not in metaparticle, so we can easily plug knative in there.

@jlewi
Contributor Author

jlewi commented Jul 27, 2018

@wbuchwalter I think people would really love this, so my priority would be getting something people could play with as soon as possible. So w.r.t. using MetaML/MetaParticle or custom scripts: whichever is faster SGTM.

W.r.t. the larger question about whether MetaParticle is the right solution:

Metaparticle takes a DSL as input (which basically just describes your desired state at a higher level than plain k8s objects) and then talks directly to the cluster.

Metaparticle is not involved in communication at runtime; it's really just for lifecycle. For example, for PBT I'm deploying a Redis instance at the same time that is used for communication.

What does the DSL look like? How would this compare to a Python object that has the same structure as the underlying K8s resource, maybe with some helper functions? For example, suppose we add a TFJob Python class with property fields that match the TFJob spec, so you could do things like

tfjob.tfReplicas["Workers"].replicas = 5

And maybe some sugar methods to make common modifications easier e.g.

tfJob.setEnvironmentVariable("some_var", "some_secret")

Maybe what we really need is for K8s to provide better tooling for auto-generating native Python libraries.
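To make that concrete, a hypothetical sketch of such a wrapper (none of these names exist today; the v1alpha2 apiVersion is just what TFJob used at the time) might look like:

class TFJob(object):
    def __init__(self, name, namespace='default'):
        # The object is just a thin shell around the raw TFJob resource.
        self.body = {
            'apiVersion': 'kubeflow.org/v1alpha2',
            'kind': 'TFJob',
            'metadata': {'name': name, 'namespace': namespace},
            'spec': {'tfReplicaSpecs': {}},
        }

    def set_replicas(self, replica_type, count):
        self.body['spec']['tfReplicaSpecs'].setdefault(
            replica_type, {})['replicas'] = count

    def set_env(self, name, value):
        # Sugar: apply an environment variable to every replica's first container.
        for replica in self.body['spec']['tfReplicaSpecs'].values():
            containers = (replica.setdefault('template', {})
                                 .setdefault('spec', {})
                                 .setdefault('containers', [{}]))
            containers[0].setdefault('env', []).append(
                {'name': name, 'value': value})

tfjob = TFJob('github-issue-summarization')
tfjob.set_replicas('Worker', 5)
tfjob.set_env('SOME_VAR', 'some_value')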

@jlewi
Contributor Author

jlewi commented Jul 27, 2018

@wbuchwalter I took a closer look and I think you're right and we should start with MetaML.

train.py looks like what I had in mind. So I don't see any reason not to start with that.

I'm still unsure about metaparticle long term. It looks like metaml just uses metaparticle here to generate this dictionary matching the TFJob spec.

Why not just bypass metaparticle and create the spec directly? It seems like it might just be a layer of indirection that we don't need since we don't want to target non-K8s architectures.

@wbuchwalter
Contributor

@jlewi This is where I specify the TfJob specs to pass to metaparticle: https://github.com/wbuchwalter/metaml/blob/jupyter/metaml/architectures/kubeflow/distributed.py#L13

I don't think you should see metaparticle's abstraction as a way to target multiple architectures.
The main role of metaparticle is to make it easier to express higher-level distributed concepts; the 'cross-platform' aspect is mostly a by-product of having an intermediate representation language. (To be clear, for MetaML I only intend to support Kubernetes, as I don't see any real use case for native Docker. But if Docker support comes for free with Metaparticle I won't block it either.)

So for simple jobs, metaparticle doesn't bring much to the table since you are really only deploying a single resource, but for more complex use cases it allows you to deploy components pretty succinctly. For example, for PBT, I am also deploying a Redis server: https://github.com/wbuchwalter/metaml/blob/master/metaml/strategies/pbt/pbt.py#L70

From the perspective of MetaML, since metaparticle supports a subset (at least today) of the native Kubernetes objects, it allows users to write their own custom training strategies (just implement an exec_user_code method, see: https://github.com/wbuchwalter/metaml/blob/jupyter/metaml/strategies/hp.py#L15) and have them work without having to wait for the underlying platform to provide specific objects for this kind of training.

So overall I think MetaML solves two separate pain-points today:

  1. It's hard to deploy a training job (from a notebook or not): MetaML allows data scientists to do that using the tool they are most familiar with (Python).
  2. It's hard to deploy complex training strategies: How would you deploy population based training on top of Kubeflow today? Or complex hyper-parameter search? MetaML, by operating directly at the language level, lets you have hooks at different stages of the model lifecycle; this would be hard to achieve from outside.

So while this issue only really cares about 1., I think it would be a shame to not also get the benefits of 2. when it's already there.

@jlewi
Contributor Author

jlewi commented Jul 29, 2018

@wbuchwalter SGTM.

What are the next steps to allow users to start submitting jobs from a notebook as in your example?

  • Can users easily clone or pip install MetaML in their notebooks in order to start using it?
  • I'm guessing the biggest problem is building the container from within the cluster. How do we go about adding support for mechanisms to build the container from within cluster?

@wbuchwalter
Contributor

wbuchwalter commented Jul 30, 2018

@jlewi

  1. I will rename the project. MetaML is already taken (http://segatalab.cibio.unitn.it/tools/metaml/) (and is a bad name anyway, since it alludes to meta-learning).
  2. As soon as this is done I will create an official pip package so that it's easier to install.
  3. I will look into that; knative or img seem to be the best bets currently, but any suggestions are welcome as I am definitely no expert here.

@jlewi
Contributor Author

jlewi commented Jul 30, 2018

@wbuchwalter Thanks; I think the build CRD is the way to go because to some extent it abstracts away the builder. The Build CRD basically takes a YAML spec describing a workflow to build the image. As part of the spec you can specify which (supported) builder to use, e.g. kaniko. So by using the Build CRD we should be able to sidestep the debate of buildah vs. kaniko vs. img. If someone wants to use a particular tool, they just have to integrate it with the Build CRD.
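For reference, a rough sketch of what such a Build resource might look like (written as a Python dict to stay in notebook-land; field names follow the knative Build docs of the time and may have shifted since, and the repo/registry values are placeholders):

build = {
    'apiVersion': 'build.knative.dev/v1alpha1',
    'kind': 'Build',
    'metadata': {'name': 'notebook-image-build'},
    'spec': {
        'source': {'git': {'url': 'https://github.com/example/repo.git',
                           'revision': 'master'}},
        'steps': [{
            'name': 'build-and-push',
            # kaniko builds the image from /workspace and pushes it.
            'image': 'gcr.io/kaniko-project/executor',
            'args': ['--dockerfile=/workspace/Dockerfile',
                     '--destination=gcr.io/example-project/notebook-image:latest'],
        }],
    },
}

Submitting it would then just be another custom-object call, e.g. kubernetes.client.CustomObjectsApi().create_namespaced_custom_object('build.knative.dev', 'v1alpha1', 'default', 'builds', build).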

@aronchick
Contributor

The other half of this, I think, is offering a template "notebook" which could be auto-stripped/broken down to ease building it into the trainable image. I've talked with a few different folks who are interested in this; lmk if you'd like to take it on.

@wbuchwalter
Contributor

wbuchwalter commented Jul 31, 2018

@aronchick currently I'm using nbconvert, which converts the .ipynb into a .py file in the order the cells are defined. What would be a better solution in your opinion? I wasn't planning on going further on this topic, so if anyone has a better idea, feel free.

@jlewi
Contributor Author

jlewi commented Aug 6, 2018

@willb FYI since you had chimed in a while back on #110

@willb
Contributor

willb commented Aug 6, 2018

@jlewi thanks! I've been following this issue but will take the opportunity to join the discussion.

@wbuchwalter I think nbconvert is actually the right approach. What I've been working towards is a pipeline that will go from a notebook that trains a model to an image to serve that model (specifically using OpenShift's source-to-image). As Jeremy mentioned, here's some analysis work to solve part of the problem. I also have a minimal source-to-image builder to put Python models behind a basic API in a pod. I need to glue the analysis and builder parts together (and make it more general, e.g., to work with a different image builder or with another model server), but it would be good to sync up and see if there are some opportunities for collaboration.

@wbuchwalter
Contributor

@willb Thanks, I actually spent some time trying to play with inspect, as in your article, a few months ago but couldn't make it work for some edge cases (that I don't remember). However, your article seems far more extensive than what I did, so if it works well I think it's an interesting approach.

So far with MetaML (just renamed to fairing) I'm forcing users to use a class for their model, which mitigates a big part of this problem since it means the code is structured in a more predictable way.

I am aiming to deliver an official pip package plus a way to easily install everything in a notebook this week. If you have time to try it out, I would be very interested in hearing your feedback and how s2i could help here.

@jlewi
Contributor Author

jlewi commented Aug 7, 2018

I like @wbuchwalter's approach of simplifying the problem by forcing users to structure their code a certain way. That seems very reasonable to me.

@ukclivecox
Contributor

We use s2i at Seldon, so I'm also interested to learn of techniques to package models from a notebook easily.

@wbuchwalter
Contributor

wbuchwalter commented Aug 8, 2018

@jlewi I looked into Knative's Build CRD. The main issue is that we need to somehow pass the source to the builder to be used as the build context; 3½ ways are currently supported by the CRD:

  • git: Since we want users to be able to launch training directly from a notebook, we can't expect them to commit from the notebook every time they make a change. The only solution I can think of is running a local temporary git server and committing automatically on every call to Train (that would also assume we have a service exposing it). We would then reference this git server in the Knative Build definition. This seems a bit brittle and complex, though.
  • gcs: only works on GKE.
  • custom: This allows us to specify a custom image that would be responsible for gathering the source. We could use this to talk to the notebook's API: using /api/contents and /api/contents/<filename> we should be able to extract all the sources, but how can we get the credentials to access the API? (A rough sketch of this option follows the list.)
  • There is also a way to mount volumes into the builder (I think), but how can we ensure that every K8s cluster where we are running has access to a PV that allows ReadWriteMany?
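As a rough sketch of the custom option (assuming we can get hold of a token, which is exactly the open question above), pulling a notebook out of the running server over its REST API could look like:

import json

import requests


def fetch_notebook(base_url, token, path, out_path):
    # /api/contents/<path> returns a "file model"; for type == 'notebook'
    # its 'content' field holds the notebook JSON itself.
    resp = requests.get('{}/api/contents/{}'.format(base_url, path),
                        headers={'Authorization': 'token {}'.format(token)})
    resp.raise_for_status()
    with open(out_path, 'w') as f:
        json.dump(resp.json()['content'], f)

fetch_notebook('http://localhost:8888', 'NOTEBOOK_TOKEN',
               'github_issue_summarization.ipynb', '/workspace/notebook.ipynb')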

@jlewi
Contributor Author

jlewi commented Aug 9, 2018

@wbuchwalter Why isn't volume support sufficient?
https://github.com/knative/docs/blob/master/build/builds.md#using-an-extra-volume

There are lots of ways for people to create solutions for their particular environment. For example, you can use the NFS Provisioner to run NFS within the cluster.

You can use an emptyDir volume as a cache across steps. So one solution is to use object storage and just copy the build context to the cache in a step that runs before the build (rough sketch below).
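A rough sketch of that suggestion, again as a Build spec fragment in dict form (field names approximate; bucket and image names are placeholders): an extra emptyDir volume acts as the cache, the first step copies the context tarball out of object storage, and the kaniko step builds from it.

build_spec = {
    'volumes': [{'name': 'build-cache', 'emptyDir': {}}],
    'steps': [
        {
            'name': 'fetch-context',
            'image': 'google/cloud-sdk:alpine',
            'command': ['sh', '-c',
                        'gsutil cp gs://example-bucket/context.tar.gz /cache/'
                        ' && tar -xzf /cache/context.tar.gz -C /cache'],
            'volumeMounts': [{'name': 'build-cache', 'mountPath': '/cache'}],
        },
        {
            'name': 'build-and-push',
            'image': 'gcr.io/kaniko-project/executor',
            'args': ['--context=dir:///cache',
                     '--destination=gcr.io/example-project/notebook-image:latest'],
            'volumeMounts': [{'name': 'build-cache', 'mountPath': '/cache'}],
        },
    ],
}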

@wbuchwalter
Contributor

@jlewi, if we are OK with introducing a new requirement (a ReadWriteMany PV), then yes, I don't see any issue with that.
Ideally, a solution with no extra requirements would still be better IMHO, but I don't have one...
For emptyDir, yes, it could work, but then we would need to ensure the notebook and the builder are running on the same node.
Since users can have nodes with arbitrary capacities and can request arbitrary resources for their notebook, we can't really guarantee that the builder will fit on the same node. What do you think?

@inc0

inc0 commented Aug 10, 2018

@wbuchwalter an alternative is S3 - it's a fairly popular protocol and I think every major cloud (not sure about Azure) plus on-prem has good solutions for it. RWX is an expensive thing to maintain, and NFS isn't particularly great.
That comes back to the idea I keep talking about - a storage backend for Kubeflow...

@jlewi
Contributor Author

jlewi commented Aug 14, 2018

@wbuchwalter I think something that works with ReadWriteMany is fine.

@inc0 as noted above I think anyone with object storage can use that to pass around a tarball containing the build context.

@jlewi
Contributor Author

jlewi commented Oct 14, 2018

@wbuchwalter any update on this?

/cc @r2d4

@wbuchwalter
Contributor

@jlewi Licence approval should be done in the next few days, and then I will be able to transfer the repository.

Also, with JupyterHub, I cannot find a proper way to infer the name of the currently running notebook (which is needed to know which notebook should be started).
With a standalone notebook I can just call the API to get that.

I have a PR open to allow users to specify the name of the notebook themselves when using JupyterHub; it works, but it's obviously not the best UX.

#1630 would be a good way to avoid this issue entirely.

@lresende
Member

Why does JupyterHub play a part here? Can't you just configure it to launch a custom notebook image that has your extension installed, and then use the JavaScript API to see the contents and name of the running notebook?

@wbuchwalter
Contributor

@lresende I am no expert in Jupyter, so there may be an easy solution to do this that I am not aware of (that would be great).

Currently, to get the name of the notebook I am using the notebook API (see https://github.com/wbuchwalter/fairing/blob/master/fairing/notebook_helper.py#L9). This works well with standalone notebooks using token auth.
However, when using JupyterHub, authentication is based on user/password, which prevents me from accessing the API.

I haven't looked at the JavaScript API; I will investigate.

@jlewi
Contributor Author

jlewi commented Oct 23, 2018

@r2d4 how did you solve this problem?

@wbuchwalter
Contributor

I have a PR open that adds an argument to the decorator where users can pass the name of the notebook.
If not running in JupyterHub, this argument can be omitted and fairing will infer the name by itself.

@r2d4
Member

r2d4 commented Oct 23, 2018

I solved it in a similar way. I'm not sure if it will work in more advanced scenarios.

@chrisheecho

/remove-priority p1

@jlewi
Contributor Author

jlewi commented Oct 25, 2018

@chrisheecho How come you removed P1? I think we really want to get this done in 0.4.0.

I filed a couple of issues to split this up:

#1857 Tooling to go from notebook to docker container
#1858 Widget to train/deploy

/label priority/p1

@chrisheecho

I removed the priority on all the issues that we have not talked about together with eng+PM, thinking we would go through them. I agree, though, that it should be at least P1 if not P0.

@chrisheecho

/priority p1

@richardsliu
Contributor

/assign @r2d4

@richardsliu richardsliu removed their assignment Nov 7, 2018
@jlewi
Contributor Author

jlewi commented Nov 19, 2018

@r2d4 What are the next steps here? Can we just use the K8s client library to fire off TFJobs from a notebook? Do we need a higher level SDK to make this easy?
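For reference, firing off a TFJob with nothing but the Kubernetes client library is already possible; a minimal sketch (assuming the v1alpha2 TFJob API that was current at the time, an in-cluster notebook whose service account can create TFJobs, and placeholder image/name values):

from kubernetes import client, config

config.load_incluster_config()  # use the notebook pod's service account

tfjob = {
    'apiVersion': 'kubeflow.org/v1alpha2',
    'kind': 'TFJob',
    'metadata': {'name': 'github-issue-summarization', 'namespace': 'kubeflow'},
    'spec': {
        'tfReplicaSpecs': {
            'Worker': {
                'replicas': 2,
                'template': {
                    'spec': {
                        'containers': [{
                            # v1alpha2 expects the container to be named "tensorflow".
                            'name': 'tensorflow',
                            'image': 'gcr.io/example-project/notebook-image:latest',
                            'command': ['python', 'train.py', 'train_model',
                                        '--output=seq2seq_model_tutorial.h5'],
                        }],
                    },
                },
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group='kubeflow.org', version='v1alpha2', namespace='kubeflow',
    plural='tfjobs', body=tfjob)

A higher-level SDK would mostly be about hiding this boilerplate (and the image build) behind something friendlier.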

@r2d4
Member

r2d4 commented Nov 26, 2018

The fairing code is a small Python library that needs to be installed in the notebook image. It handles both building the image (through kaniko, docker, or the “append” strategy) and deploying it as a TFJob.

@jlewi
Contributor Author

jlewi commented Dec 3, 2018

@r2d4 So can we go ahead and close this issue?

@r2d4 r2d4 closed this as completed Dec 3, 2018
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022