
Fire off TFJob from Jupyter Notebook #1240

Closed
jlewi opened this issue Jul 19, 2018 · 67 comments

Comments

@jlewi
Contributor

jlewi commented Jul 19, 2018

We'd like to make it super easy to go from writing code in a notebook to training that model distributed.

The experience might be something like:

  • User writes code in notebook and executes in Jupyter lab
  • User clicks a button that lets them fill in various settings, e.g. the number of GPUs
  • User clicks train

Under the hood this would cause

  • A docker image to be built
  • A TFJob/PyTorch/K8s Job to be created and fired off.

I think the biggest challenge is that we probably don't want to execute all code in the notebook. Typically, there's some amount of refactoring that needs to be done to convert a notebook into a python module suitable for execution as a batch job.

As a concrete example

Here's the notebook for our GitHub Issue summarization example

Here's the corresponding python module used when training in a K8s job.

The python module only executes a subset of the cells, in particular those that:

  1. Define the model architecture
  2. Train the model

Rather than try to auto-convert a notebook like the GitHub issue example, I think we should require users to structure their code to facilitate the conversion.

My suggestion would be to allow any functions defined in the notebook to be used as entry points. So for the GitHub issue summarization example, the user would have a cell like the following:

from keras.callbacks import CSVLogger, ModelCheckpoint
import numpy as np

# seq2seq_Model, encoder_input_data, decoder_input_data and decoder_target_data
# are assumed to be defined in earlier notebook cells.
def train_model(output):
    script_name_base = 'tutorial_seq2seq'
    csv_logger = CSVLogger('{:}.log'.format(script_name_base))
    model_checkpoint = ModelCheckpoint(
        '{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
        save_best_only=True)

    batch_size = 1200
    epochs = 7
    history = seq2seq_Model.fit(
        [encoder_input_data, decoder_input_data],
        np.expand_dims(decoder_target_data, -1),
        batch_size=batch_size,
        epochs=epochs,
        validation_split=0.12,
        callbacks=[csv_logger, model_checkpoint])

    seq2seq_Model.save(output)

train_model('seq2seq_model_tutorial.h5')

If the user structures their code this way, we should be able to manually create and invoke a suitable container entry point. Something like the following (a rough sketch of this flow appears after the list):

  • Use nbconvert to convert from ipynb to python code
  • Post process the python code
    • Strip out any statements not inside a function (except imports)
    • Create a CLI for the functions using a library like PyFire
  • Build a Docker image that is Notebook image + code
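For illustration, here is a minimal sketch of that post-processing step. It assumes nbconvert's Python API, Python 3.9+ (for ast.unparse), and Google's python-fire package (presumably what "PyFire" refers to); the helper name notebook_to_module is made up for this example.

import ast

from nbconvert import PythonExporter


def notebook_to_module(ipynb_path, py_path):
    # 1. Convert the .ipynb into plain Python source, in cell order.
    source, _ = PythonExporter().from_filename(ipynb_path)

    # 2. Strip any top-level statements that are not imports or
    #    function/class definitions.
    tree = ast.parse(source)
    tree.body = [node for node in tree.body
                 if isinstance(node, (ast.Import, ast.ImportFrom,
                                      ast.FunctionDef, ast.ClassDef))]

    # 3. Expose the remaining functions as a CLI entry point via fire.
    module_source = ast.unparse(tree) + (
        "\n\nif __name__ == '__main__':\n"
        "    import fire\n"
        "    fire.Fire()\n")
    with open(py_path, 'w') as f:
        f.write(module_source)

The resulting module could then be baked into the image and invoked as, e.g., python train.py train_model --output=seq2seq_model_tutorial.h5.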

A variant of this idea would be to use metaml (by @wbuchwalter). metaml uses metaparticle to allow people to annotate their python code with information needed to then run it on K8s (e.g. distributed using TFJob). If we went with this approach, I think the flow would be

  • Run nbconvert to go from ipynb -> py
  • Use metaparticle/metaml tool chain to build the docker image and submit the job.

@willingc @yuvipanda Is there existing tooling in the Jupyter community other than nbconvert to convert notebooks to code suitable for asynchronous batch execution?

/cc @wbuchwalter @gaocegege @yuvipanda @willingc

@jlewi jlewi added the priority/p2 and area/jupyter labels Jul 19, 2018
@wbuchwalter
Contributor

👍 Booking some time to spike MetaML for this use case in the coming days. Will report back on my findings.

@wbuchwalter
Contributor

So it seems like it could work well.
Still very much alpha, but you can check what it looks like here: https://github.com/wbuchwalter/metaml/blob/jupyter/examples/jupyter-notebook/TfJob.ipynb
The main limitation is that the model has to be defined as a class; the reason is two-fold:

  • This is an inherent requirement of MetaML for more advanced training strategies (e.g. population based training) where we need several entry points at different stages of the model lifecycle.
  • For notebooks this ensures that the code runs in the proper order. The issue with nbconvert is that it is like executing all the cells sequentially, which might not be the way users interact with their notebooks. Forcing the user to define a class instead mitigates this issue, at least in part (a rough sketch of the idea follows this list).
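Purely to illustrate the idea (this is not the actual MetaML interface; see the linked notebook for that), a model-as-a-class might look like:

class GithubIssueSummarizer:
    def build(self):
        # Construct seq2seq_Model and load the encoder/decoder data here,
        # instead of relying on the order in which cells were executed.
        ...

    def train(self, output='seq2seq_model_tutorial.h5'):
        # Fit the model and save it to `output`.
        ...

Giving the lifecycle well-known method names is what gives the conversion tooling predictable entry points.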

Other remarks:

@jlewi
Contributor Author

jlewi commented Jul 24, 2018

This looks very promising.

Were you running the notebook on your local machine or in the K8s cluster? On K8s any idea how we would build the container?

@jlewi
Contributor Author

jlewi commented Jul 25, 2018

The newly released build CRD (https://github.com/knative/build) looks promising.

@jlewi
Contributor Author

jlewi commented Jul 25, 2018

Some offline discussion with @wbuchwalter.

Using metaml might be more than we need, and it's blocked because metaparticle doesn't support CRDs.
So a custom script might be simpler.

Especially if we take advantage of knative to do the container build, we don't need that aspect of the metaparticle tool chain.

@inc0

inc0 commented Jul 26, 2018 via email

@wbuchwalter
Contributor

@jlewi Note that the build is happening in MetaML, not in metaparticle, so we can easily plug knative in there.

@jlewi
Contributor Author

jlewi commented Jul 27, 2018

@wbuchwalter I think people would really love this, so my priority would be getting something people could play with as soon as possible. So w.r.t. using MetaML/MetaParticle or custom scripts: whichever is faster SGTM.

W.r.t. the larger question about whether MetaParticle is the right solution:

Metaparticle takes a DSL as input (which basically just describes your desired state at a higher level than plain k8s objects) and then talks directly to the cluster.

Metaparticle is not involved in communication at runtime; it's really just for lifecycle. For example, for PBT I'm deploying a Redis instance at the same time that is used for communication.

What does the DSL look like? How would this compare to a Python object that has the same structure as the underlying K8s resource, maybe with some helper functions? For example, suppose we add a TFJob Python class with property fields that match the TFJob spec, so you could do things like

tfjob.tfReplicas["Workers"].replicas = 5

And maybe some sugar methods to make common modifications easier e.g.

tfJob.setEnvironmentVariable("some_var", "some_secret")

Maybe what we really need is for K8s to provide better tooling for auto-generating native Python libraries.
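To make that concrete, a hypothetical sketch of such a wrapper (none of these names exist today; the v1alpha2 apiVersion is just what TFJob used at the time) might look like:

class TFJob(object):
    def __init__(self, name, namespace='default'):
        # The object is just a thin shell around the raw TFJob resource.
        self.body = {
            'apiVersion': 'kubeflow.org/v1alpha2',
            'kind': 'TFJob',
            'metadata': {'name': name, 'namespace': namespace},
            'spec': {'tfReplicaSpecs': {}},
        }

    def set_replicas(self, replica_type, count):
        self.body['spec']['tfReplicaSpecs'].setdefault(
            replica_type, {})['replicas'] = count

    def set_env(self, name, value):
        # Sugar: apply an environment variable to every replica's first container.
        for replica in self.body['spec']['tfReplicaSpecs'].values():
            containers = (replica.setdefault('template', {})
                                 .setdefault('spec', {})
                                 .setdefault('containers', [{}]))
            containers[0].setdefault('env', []).append(
                {'name': name, 'value': value})

tfjob = TFJob('github-issue-summarization')
tfjob.set_replicas('Worker', 5)
tfjob.set_env('SOME_VAR', 'some_value')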

@jlewi
Contributor Author

jlewi commented Jul 27, 2018

@wbuchwalter I took a closer look and I think you're right and we should start with MetaML.

train.py looks like what I had in mind. So I don't see any reason not to start with that.

I'm still unsure about metaparticle long term. It looks like metaml just uses metaparticle here to generate this dictionary matching the TFJob spec.

Why not just bypass metaparticle and create the spec directly? It seems like it might just be a layer of indirection that we don't need since we don't want to target non-K8s architectures.

@wbuchwalter
Contributor

@jlewi This is where I specify the TfJob specs to pass to metaparticle: https://github.com/wbuchwalter/metaml/blob/jupyter/metaml/architectures/kubeflow/distributed.py#L13

I don't think you should see metaparticle's abstraction as a way to target multiple architectures.
The main role of metaparticle is to make it easier to express higher-level distributed concepts; the 'cross-platform' aspect is mostly a by-product of having an intermediate representation language. (To be clear, for MetaML I only intend to support Kubernetes, as I don't see any real use case for native Docker. But if Docker support comes for free with Metaparticle I won't block it either.)

So for simple jobs, metaparticle doesn't bring much to the table since you are really only deploying a single resource, but for more complex use cases it allows you to deploy components pretty succinctly. For example, for PBT, I am also deploying a Redis server: https://github.com/wbuchwalter/metaml/blob/master/metaml/strategies/pbt/pbt.py#L70

From the perspective of MetaML, since metaparticle supports a subset (at least today) of the native Kubernetes objects, it allows users to write their own custom training strategies (just implement an exec_user_code method, see: https://github.com/wbuchwalter/metaml/blob/jupyter/metaml/strategies/hp.py#L15) and have them work without having to wait for the underlying platform to provide specific objects for this kind of training.

So overall I think MetaML solves two separate pain-points today:

  1. It's hard to deploy a training job (from a notebook or not): MetaML allows data scientists to do that using the tool they are most familiar with (Python).
  2. It's hard to deploy complex training strategies: How would you deploy population based training on top of Kubeflow today? Or complex hyper-parameter search? MetaML, by operating directly at the language level, lets you have hooks at different stages of the model lifecycle; this would be hard to achieve from outside.

So while this issue only really cares about 1., I think it would be a shame to not also get the benefits of 2. when it's already there.

@jlewi
Contributor Author

jlewi commented Jul 29, 2018

@wbuchwalter SGTM.

What are the next steps to allow users to start submitting jobs from a notebook as in your example?

  • Can users easily clone or pip install MetaML in their notebooks in order to start using it?
  • I'm guessing the biggest problem is building the container from within the cluster. How do we go about adding support for mechanisms to build the container from within cluster?

@wbuchwalter
Contributor

wbuchwalter commented Jul 30, 2018

@jlewi

  1. I will rename the project. MetaML is already taken (http://segatalab.cibio.unitn.it/tools/metaml/) (and is a bad name anyway, since it alludes to meta-learning).
  2. As soon as this is done I will create an official pip package so that it's easier to install.
  3. I will look into that; knative or img seem to be the best bets currently, but any suggestions are welcome as I am definitely no expert here.

@jlewi
Contributor Author

jlewi commented Jul 30, 2018

@wbuchwalter Thanks; I think the build CRD is the way to go because to some extent it abstracts away the builder. The Build CRD basically takes a YAML spec describing a workflow to build the image. As part of the spec you can specify which (supported) builder to use, e.g. kaniko. So by using the Build CRD we should be able to sidestep the debate of buildah vs. kaniko vs. img. If someone wants to use a particular tool, they just have to integrate it with the Build CRD.
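For reference, a rough sketch of what such a Build resource might look like (written as a Python dict to stay in notebook-land; field names follow the knative Build docs of the time and may have shifted since, and the repo/registry values are placeholders):

build = {
    'apiVersion': 'build.knative.dev/v1alpha1',
    'kind': 'Build',
    'metadata': {'name': 'notebook-image-build'},
    'spec': {
        'source': {'git': {'url': 'https://github.com/example/repo.git',
                           'revision': 'master'}},
        'steps': [{
            'name': 'build-and-push',
            # kaniko builds the image from /workspace and pushes it.
            'image': 'gcr.io/kaniko-project/executor',
            'args': ['--dockerfile=/workspace/Dockerfile',
                     '--destination=gcr.io/example-project/notebook-image:latest'],
        }],
    },
}

Submitting it would then just be another custom-object call, e.g. kubernetes.client.CustomObjectsApi().create_namespaced_custom_object('build.knative.dev', 'v1alpha1', 'default', 'builds', build).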

@aronchick
Contributor

The other half of this, I think, is offering a template "notebook" which could be auto-stripped/broken down to ease building it into the trainable image. I've talked with a few different folks who are interested in this; lmk if you'd like to take it on.

@wbuchwalter
Contributor

wbuchwalter commented Jul 31, 2018

@aronchick currently I'm using nbconvert, which converts the .ipynb into a .py file in the order the cells are defined. What would be a better solution in your opinion? I wasn't planning on going further on this topic, so if anyone has a better idea, feel free.

@jlewi
Contributor Author

jlewi commented Aug 6, 2018

@willb FYI since you had chimed in a while back on #110

@willb
Contributor

willb commented Aug 6, 2018

@jlewi thanks! I've been following this issue but will take the opportunity to join the discussion.

@wbuchwalter I think nbconvert is actually the right approach. What I've been working towards is a pipeline that will go from a notebook that trains a model to an image to serve that model (specifically using OpenShift's source-to-image). As Jeremy mentioned, here's some analysis work to solve part of the problem. I also have a minimal source-to-image builder to put Python models behind a basic API in a pod. I need to glue the analysis and builder parts together (and make it more general, e.g., to work with a different image builder or with another model server), but it would be good to sync up and see if there are some opportunities for collaboration.

@wbuchwalter
Contributor

@willb Thanks, I actually spent some time trying to play with inspect, as in your article, a few months ago but couldn't make it work for some edge cases (that I don't remember). However, your article seems far more extensive than what I did, so if it works well I think it's an interesting approach.

So far with MetaML (just renamed to fairing) I'm forcing users to use a class for their model, which mitigates a big part of this problem since it means the code is structured in a more predictable way.

I am aiming to deliver an official pip package plus a way to easily install everything in a notebook this week. If you have time to try it out, I would be very interested in hearing your feedback and how s2i could help here.

@jlewi
Contributor Author

jlewi commented Aug 7, 2018

I like @wbuchwalter's approach of simplifying the problem by forcing users to structure their code a certain way. That seems very reasonable to me.

@ukclivecox
Contributor

We use s2i at Seldon, so I'm also interested to learn of techniques to package models from a notebook easily.

@wbuchwalter
Contributor

wbuchwalter commented Aug 8, 2018

@jlewi I looked into Knative's Build CRD. The main issue is that we need to somehow pass the source to the builder to be used as the build context; 3½ ways are currently supported by the CRD:

  • git: Since we want users to be able to launch training directly from a notebook, we can't expect them to commit from the notebook every time they make a change. The only solution I can think of is running a local temporary git server and committing automatically on every call to Train (that would also assume we have a service exposing it). We would then reference this git server in the Knative Build definition. This seems a bit brittle and complex, though.
  • gcs: only works on GKE.
  • custom: This allows us to specify a custom image that would be responsible for gathering the source. We could use this to talk to the notebook's API: using /api/contents and /api/contents/<filename> we should be able to extract all the sources, but how can we get the credentials to access the API? (A rough sketch of this option follows the list.)
  • There is also a way to mount volumes into the builder (I think), but how can we ensure that every K8s cluster where we are running has access to a PV that allows ReadWriteMany?
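As a rough sketch of the custom option (assuming we can get hold of a token, which is exactly the open question above), pulling a notebook out of the running server over its REST API could look like:

import json

import requests


def fetch_notebook(base_url, token, path, out_path):
    # /api/contents/<path> returns a "file model"; for type == 'notebook'
    # its 'content' field holds the notebook JSON itself.
    resp = requests.get('{}/api/contents/{}'.format(base_url, path),
                        headers={'Authorization': 'token {}'.format(token)})
    resp.raise_for_status()
    with open(out_path, 'w') as f:
        json.dump(resp.json()['content'], f)

fetch_notebook('http://localhost:8888', 'NOTEBOOK_TOKEN',
               'github_issue_summarization.ipynb', '/workspace/notebook.ipynb')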

@jlewi
Contributor Author

jlewi commented Aug 9, 2018

@wbuchwalter Why isn't volume support sufficient?
https://github.com/knative/docs/blob/master/build/builds.md#using-an-extra-volume

There are lots of ways for people to create solutions for their particular environment. For example, you can use the NFS Provisioner to run NFS within the cluster.

You can use an emptyDir volume as a cache across steps. So one solution is to use object storage and just copy the build context to the cache in a step that runs before the build (rough sketch below).
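A rough sketch of that suggestion, again as a Build spec fragment in dict form (field names approximate; bucket and image names are placeholders): an extra emptyDir volume acts as the cache, the first step copies the context tarball out of object storage, and the kaniko step builds from it.

build_spec = {
    'volumes': [{'name': 'build-cache', 'emptyDir': {}}],
    'steps': [
        {
            'name': 'fetch-context',
            'image': 'google/cloud-sdk:alpine',
            'command': ['sh', '-c',
                        'gsutil cp gs://example-bucket/context.tar.gz /cache/'
                        ' && tar -xzf /cache/context.tar.gz -C /cache'],
            'volumeMounts': [{'name': 'build-cache', 'mountPath': '/cache'}],
        },
        {
            'name': 'build-and-push',
            'image': 'gcr.io/kaniko-project/executor',
            'args': ['--context=dir:///cache',
                     '--destination=gcr.io/example-project/notebook-image:latest'],
            'volumeMounts': [{'name': 'build-cache', 'mountPath': '/cache'}],
        },
    ],
}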

@wbuchwalter
Contributor

@jlewi, if we are OK with introducing a new requirement (a ReadWriteMany PV), then yes, I don't see any issue with that.
Ideally, a solution with no extra requirements would still be better IMHO, but I don't have one...
For emptyDir, yes, it could work, but then we would need to ensure the notebook and the builder are running on the same node.
Since users can have nodes with arbitrary capacities and can request arbitrary resources for their notebook, we can't really guarantee that the builder will fit on the same node. What do you think?

@inc0

inc0 commented Aug 10, 2018

@wbuchwalter an alternative is S3 - it's a fairly popular protocol and I think every major cloud (not sure about Azure) plus on-prem has good solutions for it. RWX is an expensive thing to maintain, and NFS isn't particularly great.
That comes back to the idea I keep talking about - a storage backend for Kubeflow...

@jlewi
Contributor Author

jlewi commented Aug 14, 2018

@wbuchwalter I think something that works with ReadWriteMany is fine.

@inc0 as noted above I think anyone with object storage can use that to pass around a tarball containing the build context.

@jlewi
Contributor Author

jlewi commented Oct 14, 2018

@wbuchwalter any update on this?

/cc @r2d4

@wbuchwalter
Contributor

@jlewi Licence approval should be done in the next few days, and then I will be able to transfer the repository.

Also, with JupyterHub, I cannot find a proper way to infer the name of the currently running notebook (which is needed to know which notebook should be started).
With a standalone notebook I can just call the API to get that.

I have a PR open to allow users to specify the name of the notebook themselves when using JupyterHub; it works, but it's obviously not the best UX.

#1630 would be a good way to avoid this issue entirely.

@lresende
Member

Why does JupyterHub play a part here? Can't you just configure it to launch a custom notebook image that has your extension installed, and then use the JavaScript API to see the contents and name of the running notebook?

@wbuchwalter
Contributor

@lresende I am no expert in Jupyter, so there may be an easy solution to do this that I am not aware of (that would be great).

Currently, to get the name of the notebook I am using the notebook API (see https://github.com/wbuchwalter/fairing/blob/master/fairing/notebook_helper.py#L9). This works well with standalone notebooks using token auth.
However, when using JupyterHub, authentication is based on user/password, which prevents me from accessing the API.

I haven't looked at the JavaScript API; I will investigate.

@jlewi
Contributor Author

jlewi commented Oct 23, 2018

@r2d4 how did you solve this problem?

@wbuchwalter
Contributor

I have a PR open that adds an argument to the decorator where users can pass the name of the notebook.
If not running in JupyterHub, this argument can be omitted and fairing will infer the name by itself.

@r2d4
Member

r2d4 commented Oct 23, 2018

I solved it in a similar way. I'm not sure if it will work in more advanced scenarios.

@chrisheecho

/remove-priority p1

@jlewi
Contributor Author

jlewi commented Oct 25, 2018

@chrisheecho How come you removed P1? I think we really want to get this done in 0.4.0.

I filed a couple of issues to split this up:

#1857 Tooling to go from notebook to docker container
#1858 Widget to train/deploy

/label priority/p1

@chrisheecho

I removed the priority on all the issues that we have not talked about together with eng+PM, thinking we would go through them. I agree, though, that it should be at least P1 if not P0.

@chrisheecho

/priority p1

@richardsliu
Contributor

/assign @r2d4

@richardsliu richardsliu removed their assignment Nov 7, 2018
@jlewi
Contributor Author

jlewi commented Nov 19, 2018

@r2d4 What are the next steps here? Can we just use the K8s client library to fire off TFJobs from a notebook? Do we need a higher level SDK to make this easy?
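For reference, firing off a TFJob with nothing but the Kubernetes client library is already possible; a minimal sketch (assuming the v1alpha2 TFJob API that was current at the time, an in-cluster notebook whose service account can create TFJobs, and placeholder image/name values):

from kubernetes import client, config

config.load_incluster_config()  # use the notebook pod's service account

tfjob = {
    'apiVersion': 'kubeflow.org/v1alpha2',
    'kind': 'TFJob',
    'metadata': {'name': 'github-issue-summarization', 'namespace': 'kubeflow'},
    'spec': {
        'tfReplicaSpecs': {
            'Worker': {
                'replicas': 2,
                'template': {
                    'spec': {
                        'containers': [{
                            # v1alpha2 expects the container to be named "tensorflow".
                            'name': 'tensorflow',
                            'image': 'gcr.io/example-project/notebook-image:latest',
                            'command': ['python', 'train.py', 'train_model',
                                        '--output=seq2seq_model_tutorial.h5'],
                        }],
                    },
                },
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group='kubeflow.org', version='v1alpha2', namespace='kubeflow',
    plural='tfjobs', body=tfjob)

A higher-level SDK would mostly be about hiding this boilerplate (and the image build) behind something friendlier.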

@r2d4
Member

r2d4 commented Nov 26, 2018

The fairing code is a small Python library that needs to be installed in the notebook image. It handles both building the image (through kaniko, docker, or the “append” strategy) and deploying it as a TFJob.

@jlewi
Contributor Author

jlewi commented Dec 3, 2018

@r2d4 So can we go ahead and close this issue?

@r2d4 r2d4 closed this as completed Dec 3, 2018
yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022