Fire off TFJob from Jupyter Notebook #1240
Comments
👍 Booking some time to spike MetaML for this use case in the coming days. Will report back on my findings.
So it seems like it could work well.
Other remarks:
This looks very promising. Were you running the notebook on your local machine or in the K8s cluster? On K8s, any idea how we would build the container?
The newly released build CRD looks promising.
Some offline discussion with @wbuchwalter. Using metaml might be more than we need, and it's blocked because metaparticle doesn't support CRDs. Especially if we take advantage of knative to do the container build, we don't need that aspect of the metaparticle tool chain.
My testing so far with estimators looks promising in the context of using the same code for single-threaded (notebook) and distributed training. I think we'll need users to follow certain code standards though:
1. We need to provide the ability to have two different datasets for notebook and distributed runs.
2. We may need different locations for things like tf.event files, logs, or the saved model.
I think both can be solved by using env variables and communicating how to use them.
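For example, a minimal sketch of that env-variable convention (the variable names here are invented, not an agreed standard):

```python
import os

# Small local defaults for the notebook; the TFJob pod spec would override
# these env variables when the same code runs distributed.
DATA_DIR = os.environ.get("DATA_DIR", "./data/sample")
MODEL_DIR = os.environ.get("MODEL_DIR", "./outputs")

# The same estimator code then works in both settings, e.g.:
# estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=MODEL_DIR)
```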
@jlewi Note that the build is happening in MetaML, not in metaparticle, so we can easily plug knative in there.
@wbuchwalter I think people would really love this, so my priority would be getting something people could play with as soon as possible. So w.r.t. using MetaML/MetaParticle or custom scripts: whichever is faster SGTM. W.r.t. the larger question about whether MetaParticle is the right solution:
What does the DSL look like? How would this compare to a Python object that has the same structure as the underlying K8s resource, maybe with some helper functions? For example, suppose we add a TFJob Python class with properties/fields that match the TFJob spec, so you could do things like
And maybe some sugar methods to make common modifications easier, e.g.
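The inline snippets in this comment did not survive formatting; as a purely hypothetical stand-in for both ideas, a spec-shaped class plus a sugar method (none of these names are an existing API):

```python
# Hypothetical sketch only -- not an existing Kubeflow API.
class TFJob:
    def __init__(self, name, namespace="default"):
        self.name = name
        self.namespace = namespace
        # Mirrors the TFJob spec: replica type -> settings.
        self.replica_specs = {"Worker": {"replicas": 1, "image": None}}

    # "Sugar" helper for a common modification.
    def scale_workers(self, replicas, image=None):
        self.replica_specs["Worker"]["replicas"] = replicas
        if image:
            self.replica_specs["Worker"]["image"] = image
        return self

job = TFJob("issue-summarization", namespace="kubeflow")
job.replica_specs["Worker"]["replicas"] = 4                    # direct, spec-shaped access
job.scale_workers(8, image="gcr.io/my-project/train:latest")   # sugar method
```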
Maybe what we really need is for K8s to provide better tooling for auto-generating native Python libraries.
@wbuchwalter I took a closer look and I think you're right; we should start with MetaML. train.py looks like what I had in mind, so I don't see any reason not to start with that. I'm still unsure about metaparticle long term. It looks like metaml just uses metaparticle here to generate this dictionary matching the TFJob spec. Why not just bypass metaparticle and create the spec directly? It seems like it might just be a layer of indirection that we don't need, since we don't want to target non-K8s architectures.
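A rough sketch of "create the spec directly and submit it" using only the standard Kubernetes Python client; the TFJob apiVersion and spec layout shown are assumptions (they have changed across operator versions), and the image name is a placeholder:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

tfjob = {
    "apiVersion": "kubeflow.org/v1",  # assumption: depends on the installed operator version
    "kind": "TFJob",
    "metadata": {"name": "notebook-train", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "tensorflow",
                            # Image built from the notebook code.
                            "image": "gcr.io/my-project/train:latest",
                        }]
                    }
                },
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob)
```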
@jlewi This is where I specify the TFJob specs to pass to metaparticle: https://github.com/wbuchwalter/metaml/blob/jupyter/metaml/architectures/kubeflow/distributed.py#L13

I don't think you should see metaparticle's abstraction as a way to target multiple architectures. For simple jobs, metaparticle doesn't bring much to the table since you are really only deploying a single resource, but for more complex use cases it allows you to deploy components pretty succinctly. For example, for PBT, I am also deploying a redis server: https://github.com/wbuchwalter/metaml/blob/master/metaml/strategies/pbt/pbt.py#L70

From the perspective of MetaML, since metaparticle supports a subset (at least today) of the native Kubernetes objects, it allows users to write their own custom training strategies (just implement an exec_user_code method, see: https://github.com/wbuchwalter/metaml/blob/jupyter/metaml/strategies/hp.py#L15), and it should work without having to wait for the underlying platform to provide specific objects for this kind of training.

So overall, I think MetaML solves two separate pain points today:
So while this issue only really cares about 1., I think it would be a shame to not also get the benefits of 2. when it's already there.
@wbuchwalter SGTM. What are the next steps to allow users to start submitting jobs from a notebook as in your example?
@wbuchwalter Thanks; I think the build CRD is the way to go because to some extent it abstracts away the builder. The build CRD basically takes a YAML spec describing a workflow to build the image. As part of the spec you can specify which (supported) builder to use, e.g. kaniko. So by using the build CRD we should be able to sidestep the debate of buildah vs. kaniko vs. img. If someone wants to use a particular tool, they just have to integrate it into the build CRD.
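For illustration, keeping with Python dicts, a Build resource along those lines might look like the sketch below; the build.knative.dev/v1alpha1 schema shown is from memory of the early Build CRD and may not match later versions, and the repo/image names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()

build = {
    "apiVersion": "build.knative.dev/v1alpha1",  # assumed early Build CRD API version
    "kind": "Build",
    "metadata": {"name": "notebook-image-build", "namespace": "kubeflow"},
    "spec": {
        "source": {"git": {"url": "https://github.com/example/repo.git",
                           "revision": "master"}},
        "steps": [{
            # The chosen builder is just another step image, e.g. kaniko.
            "name": "build-and-push",
            "image": "gcr.io/kaniko-project/executor",
            "args": ["--dockerfile=/workspace/Dockerfile",
                     "--destination=gcr.io/my-project/train:latest"],
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="build.knative.dev", version="v1alpha1",
    namespace="kubeflow", plural="builds", body=build)
```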
The other half of this, I think, is offering a template "notebook" which could be auto-stripped/broken down to ease building it into the trainable image. I've talked with a few different folks who are interested in this; lmk if you'd like to take it on.
@aronchick currently I'm using nbconvert, which converts the .ipynb into a .py file in the order the cells are defined. What would be a better solution in your opinion? I wasn't planning on going further on this topic, so if anyone has a better idea, feel free.
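For reference, the nbconvert step described here is roughly the following (filenames are placeholders):

```python
# Equivalent of `jupyter nbconvert --to script my_notebook.ipynb`, from Python.
from nbconvert import PythonExporter

source, _resources = PythonExporter().from_filename("my_notebook.ipynb")
with open("my_notebook.py", "w") as f:
    f.write(source)  # cells are emitted in the order they appear in the notebook
```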
@jlewi thanks! I've been following this issue but will take the opportunity to join the discussion. @wbuchwalter I think
@willb Thanks, I actually spent some time trying to play with it. So far with MetaML (just renamed to fairing), I am targeting to deliver an official pip package + a way to easily install everything in a notebook this week; if you have time to try it out, I would be very interested in hearing your feedback and how it compares.
I like @wbuchwalter's approach of simplifying the problem by forcing users to structure their code a certain way. That seems very reasonable to me.
We use s2i at Seldon, so I'm also interested to learn of techniques to package models from a notebook easily.
@jlewi I looked into Knative's Build CRD. The main issue is that we need to somehow pass the source to the builder to be used as context; 3½ ways are currently supported by the CRD:
@wbuchwalter Why isn't volume support sufficient? There are lots of ways for people to create solutions for their particular environment. For example, you can use the NFS provisioner to run NFS within the cluster, or use an emptyDir volume as a cache across steps. So one solution is to use object storage and just copy it to the cache in a step that runs before the build.
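A sketch of that pre-build copy step, under the assumption that build steps share a /workspace volume (as in the early Build CRD); the bucket and file names are placeholders:

```python
# Hypothetical extra build step: pull the build context from object storage
# into the shared /workspace before the builder step (e.g. kaniko) runs.
fetch_context_step = {
    "name": "fetch-context",
    "image": "google/cloud-sdk:slim",
    "command": ["sh", "-c"],
    "args": ["gsutil cp gs://my-bucket/context.tar.gz /workspace/ && "
             "tar -xzf /workspace/context.tar.gz -C /workspace"],
}
# This dict would be prepended to the Build resource's spec["steps"] list,
# ahead of the image-build step itself.
```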
@jlewi, if we are ok with introducing a new requirement (a ReadWriteMany volume), that could work.
@wbuchwalter An alternative is S3 - it's a fairly popular protocol, and I think every major cloud (not sure about Azure) + on-prem have good solutions for it. RWX is an expensive thing to maintain, and NFS isn't particularly great.
@wbuchwalter I think something that works with ReadWriteMany is fine. @inc0 as noted above, I think anyone with object storage can use that to pass around a tarball containing the build context.
@wbuchwalter any update on this? /cc @r2d4
@jlewi License approval should be done in the next few days, and then I will be able to transfer the repository. Also, with JupyterHub, I cannot find a proper way to infer the name of the currently running notebook (which is needed to know which notebook should be started). I have a PR open to allow users to specify the name of the notebook themselves when using JupyterHub; it works, but it's obviously not the best UX. #1630 would be a good way to avoid this issue entirely.
Why does JupyterHub play a part here? Can't you just configure it to launch a custom notebook image that has your extension installed, and then use the Javascript API to see the contents and name of the running notebook?
@lresende I am no expert in Jupyter, so there may be an easy solution for this that I am not aware of (that would be great). Currently, to get the name of the notebook I am using the notebook API (see https://github.com/wbuchwalter/fairing/blob/master/fairing/notebook_helper.py#L9). This works well with a standalone notebook with token auth. I haven't looked at the Javascript API; I will try to investigate.
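For reference, a common way to do this against a standalone notebook server (not necessarily what the linked helper does) is to match the current kernel id against the server's /api/sessions endpoint; a sketch assuming the classic notebook package:

```python
import re
import requests
from ipykernel.connect import get_connection_file
from notebook.notebookapp import list_running_servers

def current_notebook_path():
    """Best-effort lookup of the running notebook's path via the sessions API."""
    # The kernel connection file is named kernel-<id>.json.
    kernel_id = re.search(r"kernel-(.+)\.json", get_connection_file()).group(1)
    for server in list_running_servers():
        sessions = requests.get(server["url"] + "api/sessions",
                                params={"token": server.get("token", "")}).json()
        for session in sessions:
            if session["kernel"]["id"] == kernel_id:
                return session["notebook"]["path"]
    return None
```

This breaks down behind JupyterHub or other auth setups, which matches the limitation described above.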
@r2d4 how did you solve this problem?
I have a PR opened that adds an argument to the decorator where users can pass the name of the notebook.
I solved it in a similar way. I'm not sure if it will work in more advanced scenarios.
/remove-priority p1
@chrisheecho How come you removed P1? I think we really want to get this done in 0.4.0. I filed a couple of issues to split this up: #1857 Tooling to go from notebook to docker container. /label priority/p1
I removed the priority on all the issues that we have not talked about together with eng+pm, thinking we would go through them. I agree though that it should be at least P1, if not P0.
/priority p1
/assign @r2d4
@r2d4 What are the next steps here? Can we just use the K8s client library to fire off TFJobs from a notebook? Do we need a higher-level SDK to make this easy?
The fairing code is a small Python library that needs to be installed in the notebook image. It handles both building the image (through kaniko, docker, or the “append” strategy) and deploying it as a TFJob.
@r2d4 So can we go ahead and close this issue?
We'd like to make it super easy to go from writing code in a notebook to training that model distributed.
Experience might be something like
Under the hood this would cause
I think the biggest challenge is that we probably don't want to execute all the code in the notebook. Typically, there's some amount of refactoring that needs to be done to convert a notebook into a python module suitable for execution in a batch job.
As a concrete example
Here's the notebook for our GitHub Issue summarization example
Here's the corresponding python module used when training in a K8s job.
The python module only executes a subset of cells in particular those to
Rather than trying to auto-convert a notebook like the GitHub issue example, I think we should require users to structure their code to facilitate the conversion.
My suggestion would be to allow any functions defined in the notebook to be used as entry points. So for the GitHub issue summarization, the user would have a cell like the following:
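The cell shown in the original comment was lost in formatting; a hypothetical stand-in (function and argument names invented) might be:

```python
# Notebook cell: wrap the training logic in a function so it can later be
# used as a container entry point.
def train(data_dir="./data/sample", output_dir="./outputs", num_epochs=1):
    # ... load the GitHub issues data, build the seq2seq model, fit, save ...
    print("training with", data_dir, output_dir, num_epochs)

# Still easy to smoke-test interactively in the notebook on a tiny dataset:
train(num_epochs=1)
```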
If users structure their code this way, we should be able to manually create and invoke a suitable container entry point. Something like the following:
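The original snippet here was also lost; a hypothetical sketch of such a generated entry point (module, function, and flag names invented) could be:

```python
# entrypoint.py: generated alongside the nbconvert output; it imports the
# converted module and invokes the user-defined function.
import argparse
import issue_summarization  # the .py file produced from the .ipynb

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", default="/data")
    parser.add_argument("--output_dir", default="/outputs")
    parser.add_argument("--num_epochs", type=int, default=10)
    args = parser.parse_args()
    issue_summarization.train(args.data_dir, args.output_dir, args.num_epochs)
```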
A variant of this idea would be to use metaml (by @wbuchwalter). metaml uses metaparticle to allow people to annotate their python code with the information needed to run it on K8s (e.g. distributed using TFJob). If we went with this approach, I think the flow would be:
@willingc @yuvipanda Is there existing tooling in the Jupyter community other than nbconvert to convert notebooks to code suitable for asynchronous batch execution?
/cc @wbuchwalter @gaocegege @yuvipanda @willingc