[GH Issue Summarization] Upgrade to kf v0.4.0-rc.2 (kubeflow#450)
* Update tfjob components to v1beta1

Remove old version of tensor2tensor component

* Combine UI into a single jsonnet file

* Upgrade GH issue summarization to kf v0.4.0-rc.2

Use latest ksonnet v0.13.1
Use latest seldon v1alpha2
Remove ksonnet app with full kubeflow platform & replace with components specific to this example.
Remove outdated scripts
Add cluster creation links to Click-to-deploy & kfctl
Add warning not to use the Training with an Estimator guide
Replace commandline with bash for better syntax highlighting
Replace messy port-forwarding commands with svc/ambassador
Add modelUrl param to ui component
Modify teardown instructions to remove the deployment
Fix grammatical mistakes

* Rearrange tfjob instructions
texasmichelle authored and k8s-ci-robot committed Dec 31, 2018
1 parent 7990408 commit 70a22d6
Showing 107 changed files with 385 additions and 86,534 deletions.
134 changes: 85 additions & 49 deletions github_issue_summarization/01_setup_a_kubeflow_cluster.md
@@ -1,45 +1,52 @@
 # Setup Kubeflow

-In this part, you will setup kubeflow on an existing kubernetes cluster.
+In this section, you will set up Kubeflow on an existing Kubernetes cluster.

 ## Requirements

-* A kubernetes cluster
-* To create a managed cluster run
-```commandline
-gcloud container clusters create kubeflow-examples-cluster
-```
-or use kubeadm: [docs](https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/)
-* `kubectl` CLI (command line interface) pointing to the kubernetes cluster
+* A Kubernetes cluster
+* To create a cluster, follow the instructions on the
+[Set up Kubernetes](https://www.kubeflow.org/docs/started/getting-started/#set-up-kubernetes)
+section of the Kubeflow Getting Started guide. We recommend using a
+managed service such as Google Kubernetes Engine (GKE).
+[This link](https://www.kubeflow.org/docs/started/getting-started-gke/)
+guides you through the process of using either
+[Click-to-Deploy](https://deploy.kubeflow.cloud/#/deploy) (a web-based UI) or
+[`kfctl`](https://github.com/kubeflow/kubeflow/blob/master/scripts/kfctl.sh)
+(a CLI tool) to generate a GKE cluster with all Kubeflow components
+installed. Note that there is no need to complete the Deploy Kubeflow steps
+below if you use either of these two tools.
+* The Kubernetes CLI `kubectl` pointing to the Kubernetes cluster
 * Make sure that you can run `kubectl get nodes` from your terminal
 successfully
-* The ksonnet CLI, v0.9.2 or higher: [ks](https://ksonnet.io/#get-started)
+* The ksonnet CLI [`ks`](https://ksonnet.io/#get-started), v0.9.2 or higher:
 * In case you want to install a particular version of ksonnet, you can run

-```commandline
-export KS_VER=ks_0.11.0_linux_amd64
-wget -O /tmp/$KS_VER.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v0.11.0/$KS_VER.tar.gz
+```bash
+export KS_VER=0.13.1
+export KS_BIN=ks_${KS_VER}_linux_amd64
+wget -O /tmp/${KS_BIN}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_BIN}.tar.gz
 mkdir -p ${HOME}/bin
-tar -xvf /tmp/$KS_VER.tar.gz -C ${HOME}/bin
-export PATH=$PATH:${HOME}/bin/$KS_VER
+tar -xvf /tmp/${KS_BIN}.tar.gz -C ${HOME}/bin
+export PATH=$PATH:${HOME}/bin/${KS_BIN}
 ```

 ## Kubeflow setup

-Refer to the [
-guide](https://www.kubeflow.org/docs/started/getting-started/) for
-detailed instructions on how to setup kubeflow on your kubernetes cluster.
+Refer to the [guide](https://www.kubeflow.org/docs/started/getting-started/) for
+detailed instructions on how to set up Kubeflow on your Kubernetes cluster.
 Specifically, complete the following sections:

-* [Deploy
-Kubeflow](https://www.kubeflow.org/docs/started/getting-started/)
-* The [ks-kubeflow](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow)
-directory can be used instead of creating a ksonnet app from scratch.
-* If you run into
-[API rate limiting errors](https://github.com/ksonnet/ksonnet/blob/master/docs/troubleshooting.md#github-rate-limiting-errors), ensure you have a `${GITHUB_TOKEN}` environment variable set.
-* If you run into [RBAC permissions issues](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md#rbac-clusters)
-running `ks apply` commands, be sure you have created a `cluster-admin` ClusterRoleBinding for your username.
+* [Deploy Kubeflow](https://www.kubeflow.org/docs/started/getting-started/)
+* The latest version that was tested with this walkthrough was v0.4.0-rc.2.
+* The [`kfctl`](https://github.com/kubeflow/kubeflow/blob/master/scripts/kfctl.sh)
+CLI tool can be used to install Kubeflow on an existing cluster. Follow
+[this guide](https://www.kubeflow.org/docs/started/getting-started/#kubeflow-quick-start)
+to use `kfctl` to generate a ksonnet app, create Kubeflow manifests, and
+install all default components onto an existing Kubernetes cluster. Note
+that you can likely skip this step if you used
+[Click-to-Deploy](https://deploy.kubeflow.cloud/#/deploy)
+or `kfctl` to generate your cluster.

 * [Setup a persistent disk](https://www.kubeflow.org/docs/guides/advanced/)

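For reference, the `kfctl` flow mentioned above follows an init/generate/apply pattern. A minimal sketch, assuming the v0.4-era `kfctl.sh` interface and a GCP deployment; the app directory and project names are placeholders:

```bash
# Sketch of the kfctl deployment flow; names are illustrative, not prescriptive.
export KFAPP=kf-app              # directory kfctl.sh creates for the ksonnet app
export PROJECT=my-gcp-project    # placeholder GCP project ID

kfctl.sh init ${KFAPP} --platform gcp --project ${PROJECT}
cd ${KFAPP}
kfctl.sh generate all            # generate the ksonnet app and platform configs
kfctl.sh apply all               # create GCP resources and deploy Kubeflow
```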
@@ -49,9 +56,9 @@ Kubeflow](https://www.kubeflow.org/docs/started/getting-started/)
 * For this example, provision a `10GB` cluster-wide shared NFS mount with the
 name `github-issues-data`.

-* After the NFS is ready, delete the `tf-hub-0` pod so that it gets recreated and
+* After the NFS is ready, delete the `jupyter-0` pod so that it gets recreated and
 picks up the NFS mount. You can delete it by running `kubectl delete pod
-tf-hub-0 -n=${NAMESPACE}`
+jupyter-0 -n=${NAMESPACE}`

 * [Bringing up a
 Notebook](https://www.kubeflow.org/docs/guides/components/jupyter/)
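To confirm that the recreated pod actually picked up the NFS volume, a quick check along these lines can help. A sketch, assuming `kubectl` v1.11+ (for `kubectl wait`) and that the mount path contains the `github-issues-data` name; adjust the grep target to your deployment:

```bash
# Recreate the pod, wait until it is Ready, then look for the NFS mount inside it.
kubectl delete pod jupyter-0 -n=${NAMESPACE}
kubectl wait --for=condition=Ready pod/jupyter-0 -n=${NAMESPACE} --timeout=300s
kubectl exec jupyter-0 -n=${NAMESPACE} -- df -h | grep github-issues-data
```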
@@ -62,19 +69,44 @@ Notebook](https://www.kubeflow.org/docs/guides/components/jupyter/)

 After completing that, you should have the following ready:

-* A ksonnet app in a directory named `ks-kubeflow`
-* An output similar to this for `kubectl get pods` command
-```commandline
-NAME                                  READY     STATUS    RESTARTS   AGE
-ambassador-75bb54594-dnxsd            2/2       Running   0          3m
-ambassador-75bb54594-hjj6m            2/2       Running   0          3m
-ambassador-75bb54594-z948h            2/2       Running   0          3m
-jupyter-chasm                         1/1       Running   0          49s
-spartakus-volunteer-565b99cd69-knjf2  1/1       Running   0          3m
-tf-hub-0                              1/1       Running   0          3m
-tf-job-dashboard-6c757d8684-d299l     1/1       Running   0          3m
-tf-job-operator-77776c8446-lpprm      1/1       Running   0          3m
+* A ksonnet app in a directory named `ks_app`
+* An output similar to this for `kubectl -n kubeflow get pods` command
+
+```bash
+NAME                                                      READY     STATUS      RESTARTS   AGE
+ambassador-5cf8cd97d5-6qlpz                               1/1       Running     0          3m
+ambassador-5cf8cd97d5-rqzkx                               1/1       Running     0          3m
+ambassador-5cf8cd97d5-wz9hl                               1/1       Running     0          3m
+argo-ui-7c9c69d464-xpphz                                  1/1       Running     0          3m
+centraldashboard-6f47d694bd-7jfmw                         1/1       Running     0          3m
+cert-manager-5cb7b9fb67-qjd9p                             1/1       Running     0          3m
+cm-acme-http-solver-2jr47                                 1/1       Running     0          3m
+ingress-bootstrap-x6whr                                   1/1       Running     0          3m
+jupyter-0                                                 1/1       Running     0          3m
+jupyter-chasm                                             1/1       Running     0          49s
+katib-ui-54b4667bc6-cg4jk                                 1/1       Running     0          3m
+metacontroller-0                                          1/1       Running     0          3m
+minio-7bfcc6c7b9-qrshc                                    1/1       Running     0          3m
+ml-pipeline-b59b58dd6-bwm8t                               1/1       Running     0          3m
+ml-pipeline-persistenceagent-9ff99498c-v4k8f              1/1       Running     0          3m
+ml-pipeline-scheduledworkflow-78794fd86f-4tzxp            1/1       Running     0          3m
+ml-pipeline-ui-9884fd997-7jkdk                            1/1       Running     0          3m
+ml-pipelines-load-samples-668gj                           0/1       Completed   0          3m
+mysql-6f6b5f7b64-qgbkz                                    1/1       Running     0          3m
+pytorch-operator-6f87db67b7-nld5h                         1/1       Running     0          3m
+spartakus-volunteer-7c77dc796-7jgtd                       1/1       Running     0          3m
+studyjob-controller-68c6fc5bc8-jkc9q                      1/1       Running     0          3m
+tf-job-dashboard-5f986cf99d-kb6gp                         1/1       Running     0          3m
+tf-job-operator-v1beta1-5876c48976-q96nh                  1/1       Running     0          3m
+vizier-core-78f57695d6-5t8z7                              1/1       Running     0          3m
+vizier-core-rest-7d7dd7dbb8-dbr7n                         1/1       Running     0          3m
+vizier-db-777675b958-c46qh                                1/1       Running     0          3m
+vizier-suggestion-bayesianoptimization-7f46d8cb47-wlltt   1/1       Running     0          3m
+vizier-suggestion-grid-64c5f8bdf-2bznv                    1/1       Running     0          3m
+vizier-suggestion-hyperband-8546bf5885-54hr6              1/1       Running     0          3m
+vizier-suggestion-random-c4c8d8667-l96vs                  1/1       Running     0          3m
+whoami-app-7b575b555d-85nb8                               1/1       Running     0          3m
+workflow-controller-5c95f95f58-hprd5                      1/1       Running     0          3m
 ```

 * A Jupyter Notebook accessible at http://127.0.0.1:8000
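If the notebook is not reachable at that address yet, port-forwarding is one way to get there. A sketch, assuming JupyterHub runs in the `jupyter-0` pod and listens on port 8000; both are assumptions that may vary by deployment:

```bash
# Forward local port 8000 to the JupyterHub pod, then browse to http://127.0.0.1:8000
kubectl port-forward jupyter-0 8000:8000 -n=${NAMESPACE}
```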
@@ -83,10 +115,14 @@ tf-job-operator-77776c8446-lpprm 1/1 Running 0

 ## Summary

-* We created a ksonnet app for our kubeflow deployment
-* We deployed the kubeflow-core component to our kubernetes cluster
-* We created a disk for storing our training data
-* We connected to JupyterHub and spawned a new Jupyter notebook
-* For additional details and self-paced learning scenarios check `Resources` section of the [getting started guide](https://www.kubeflow.org/docs/started/getting-started/)
-
-*Next*: [Training the model](02_training_the_model.md)
+* We created a ksonnet app for our Kubeflow deployment: `ks_app`.
+* We deployed the default Kubeflow components to our Kubernetes cluster.
+* We created a disk for storing our training data.
+* We connected to JupyterHub and spawned a new Jupyter notebook.
+* For additional details and self-paced learning scenarios related to this
+example, see the
+[Resources](https://www.kubeflow.org/docs/started/getting-started/#resources)
+section of the
+[Getting Started Guide](https://www.kubeflow.org/docs/started/getting-started/).
+
+*Next*: [Training the model with a notebook](02_training_the_model.md)
21 changes: 14 additions & 7 deletions github_issue_summarization/02_distributed_training.md
@@ -1,23 +1,26 @@
 # Distributed training using Estimator

-Distributed training with keras currently doesn't work; see
+Distributed training with Keras currently does not work. Do not follow this guide
+until these issues have been resolved:

-* kubeflow/examples#280
-* kubeflow/examples#96
+* [kubeflow/examples#280](https://github.com/kubeflow/examples/issues/280)
+* [kubeflow/examples#196](https://github.com/kubeflow/examples/issues/196)

-Requires Tensorflow 1.9 or later.
+Requires TensorFlow 1.9 or later.
 Requires [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.

 On GKE you can follow [GCFS documentation](https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow) to enable it.

-Estimator and Keras are both part of Tensorflow. These high level APIs are designed
-to make building models easier. In our distributed training example we will show how both
+Estimator and Keras are both part of TensorFlow. These high-level APIs are designed
+to make building models easier. In our distributed training example, we will show how both
 APIs work together to help build models that will be trainable in both single node and
 distributed manner.

 ## Keras and Estimators

-Code required to run this example can be found in [distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed) directory.
+Code required to run this example can be found in the
+[distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed)
+directory.

 You can read more about Estimators [here](https://www.tensorflow.org/guide/estimators).
 In our example we will leverage `model_to_estimator` function that allows to turn existing tf.keras model to estimator, and therefore allow it to
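Before starting a training run, it can save time to verify the ReadWriteMany requirement by provisioning a small RWX volume. A sketch; `nfs-storage` is an assumed StorageClass name, so substitute whatever RWX-capable class your cluster provides:

```bash
# Attempt to provision a small ReadWriteMany volume as a smoke test.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-probe
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: nfs-storage   # assumed name; use your RWX-capable class
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc rwx-probe    # should eventually report STATUS Bound
kubectl delete pvc rwx-probe # clean up after the check
```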
@@ -93,3 +96,7 @@ tool for us. Please refer to [documentation](https://www.tensorflow.org/guide/pr
 ## Model

 After training is complete, our model can be found in "model" PVC.
+
+*Next*: [Serving the model](03_serving_the_model.md)
+
+*Back*: [Setup a kubeflow cluster](01_setup_a_kubeflow_cluster.md)
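To browse the files on that PVC after training, one option is a throwaway pod that mounts it. A sketch; the claim name `model` follows the text above, and everything else is illustrative:

```bash
# Mount the "model" PVC in a temporary busybox pod and list its contents.
kubectl run model-inspect --rm -it --restart=Never --image=busybox \
  --overrides='{
    "apiVersion": "v1",
    "spec": {
      "containers": [{
        "name": "model-inspect",
        "image": "busybox",
        "command": ["ls", "-lR", "/model"],
        "volumeMounts": [{"name": "model", "mountPath": "/model"}]
      }],
      "volumes": [{"name": "model", "persistentVolumeClaim": {"claimName": "model"}}]
    }
  }'
```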
18 changes: 9 additions & 9 deletions github_issue_summarization/02_training_the_model.md
@@ -1,14 +1,14 @@
-# Training the model
+# Training the model with a notebook

-By this point, you should have a Jupyter Notebook running at http://127.0.0.1:8000.
+By this point, you should have a Jupyter notebook running at http://127.0.0.1:8000.

 ## Download training files

-Open the Jupyter Notebook interface and create a new Terminal by clicking on
-menu, *New -> Terminal*. In the Terminal, clone this git repo by executing: `
+Open the Jupyter notebook interface and create a new Terminal by clicking on
+the menu, *New -> Terminal*. In the Terminal, clone this git repo by executing:

-```commandline
-git clone https://github.com/kubeflow/examples.git`
+```bash
+git clone https://github.com/kubeflow/examples.git
 ```

 Now you should have all the code required to complete training in the `examples/github_issue_summarization/notebooks` folder. Navigate to this folder.
@@ -19,7 +19,7 @@ Here you should see two files:

 ## Perform training

-Open th `Training.ipynb` notebook. This contains a complete walk-through of
+Open the `Training.ipynb` notebook. This contains a complete walk-through of
 downloading the training data, preprocessing it, and training it.

 Run the `Training.ipynb` notebook, viewing the output at each step to confirm
@@ -44,9 +44,9 @@ kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issu
 kubectl --namespace=${NAMESPACE} cp ${PODNAME}:/home/jovyan/examples/github_issue_summarization/notebooks/title_pp.dpkl .
 ```

-For information on:
+_(Optional)_ You can also perform training with two alternate methods:
 - [Training the model using TFJob](02_training_the_model_tfjob.md)
-- [Distributed training using tensor2tensor](02_tensor2tensor_training.md)
+- [Distributed training using Estimator](02_distributed_training.md)

 *Next*: [Serving the model](03_serving_the_model.md)

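The `${PODNAME}` in the `kubectl cp` commands above refers to the notebook pod spawned by JupyterHub, which is typically named `jupyter-<username>` (for example, `jupyter-chasm` in the pod listing earlier). One way to set it, sketched under that naming assumption:

```bash
# Pick the first spawned notebook pod, skipping the JupyterHub pod itself (jupyter-0).
PODNAME=$(kubectl get pods -n=${NAMESPACE} -o=name \
  | grep "jupyter-" | grep -v "jupyter-0" | head -n 1 | cut -d/ -f2)
echo ${PODNAME}
```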
74 changes: 39 additions & 35 deletions github_issue_summarization/02_training_the_model_tfjob.md
@@ -1,32 +1,35 @@
 # Training the model using TFJob

-Kubeflow offers a TensorFlow job controller for kubernetes. This allows you to run your distributed Tensorflow training
-job on a kubernetes cluster. For this training job, we will read our training data from GCS and write our output model
+Kubeflow offers a TensorFlow job controller for Kubernetes. This allows you to run your distributed TensorFlow training
+job on a Kubernetes cluster. For this training job, we will read our training
+data from Google Cloud Storage (GCS) and write our output model
 back to GCS.

 ## Create the image for training

-The [notebooks](notebooks) directory contains the necessary files to create a image for training. The [train.py](notebooks/train.py) file contains the training code. Here is how you can create an image and push it to gcr.
+The [notebooks](notebooks) directory contains the necessary files to create an
+image for training. The [train.py](notebooks/train.py) file contains the
+training code. Here is how you can create an image and push it to Google
+Container Registry (GCR):

-```commandline
+```bash
 cd notebooks/
 make PROJECT=${PROJECT} set-image
 ```
 ## Train Using PVC

-If you don't have access to GCS or don't want to use GCS you
-can use a persistent volume to store the data and model.
+If you don't have access to GCS or do not wish to use GCS, you
+can use a Persistent Volume Claim (PVC) to store the data and model.

-Create a pvc
+Note: your cluster must have a default storage class defined for this to work.
+Create a PVC:

 ```
 ks apply --env=${KF_ENV} -c data-pvc
 ```

-* Your cluster must have a default storage class defined for
-this to work.
-
-Run the job to download the data to the PVC.
+Run the job to download the data to the PVC:

 ```
 ks apply --env=${KF_ENV} -c data-downloader
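Before submitting the training job, it is worth confirming that the download finished. A sketch; the job and PVC names below are taken from the component names above and may differ, so check `ks show ${KF_ENV} -c data-downloader` if they do:

```bash
# Confirm the PVC is bound and the downloader job ran to completion.
kubectl get pvc -n=${NAMESPACE}
kubectl get jobs -n=${NAMESPACE}
kubectl logs -n=${NAMESPACE} job/data-downloader   # assumes the job is named data-downloader
```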
@@ -38,24 +41,24 @@ Submit the training job
 ks apply --env=${KF_ENV} -c tfjob-pvc
 ```

-The resulting model will be stored on PVC so to access it you will
-need to run a pod and attach the PVC. For serving you can just
-attach it the pod serving the model.
+The resulting model will be stored on the PVC, so to access it you will
+need to run a pod and attach the PVC. For serving, you can just
+attach it to the pod serving the model.

 ## Training Using GCS

-If you are running on GCS you can train using GCS to store the input
+If you are using GCS, you can train using GCS to store the input
 and the resulting model.

-### GCS Service account
+### GCS service account

-* Create a service account which will be used to read and write data from the GCS Bucket.
+* Create a service account that will be used to read and write data from the GCS bucket.

-* Give the storage account `roles/storage.admin` role so that it can access GCS Buckets.
+* Give the storage account `roles/storage.admin` role so that it can access GCS buckets.

 * Download its key as a json file and create a secret named `user-gcp-sa` with the key `user-gcp-sa.json`

-```commandline
+```bash
 SERVICE_ACCOUNT=github-issue-summarization
 PROJECT=kubeflow-example-project # The GCP Project name
 gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} \
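Once the secret exists, a quick sanity check confirms that the service account was created and the key landed under the expected name. A sketch reusing the variables above:

```bash
# The secret should contain a single key named user-gcp-sa.json.
kubectl get secret user-gcp-sa -n=${NAMESPACE} -o jsonpath='{.data}' | cut -c1-80; echo
# Verify the service account exists in the project.
gcloud iam service-accounts describe \
  ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --project=${PROJECT}
```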
@@ -74,12 +77,12 @@ kubectl --namespace=${NAMESPACE} create secret generic user-gcp-sa --from-file=u

 ### Run the TFJob using your image

-[ks-kubeflow](ks-kubeflow) contains a ksonnet app to deploy the TFJob.
+[ks_app](ks_app) contains a ksonnet app to deploy the TFJob.

-Set the appropriate params for the tfjob component
+Set the appropriate params for the tfjob component:

-```commandline
-cd ks-kubeflow
+```bash
+cd ks_app
 ks param set tfjob namespace ${NAMESPACE} --env=${KF_ENV}

 # The image pushed in the previous step
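Before deploying, you can review what the component is configured with. A sketch; the `ks param list` output format varies a bit across ksonnet versions:

```bash
# Show the parameters currently set on the tfjob component for this environment.
ks param list tfjob --env=${KF_ENV}
```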
@@ -97,30 +100,31 @@ ks param set tfjob output_model_gcs_path "github-issue-summarization-data/output

 Deploy the app:

-```commandline
+```bash
 ks apply ${KF_ENV} -c tfjob
 ```

 In a while you should see a new pod with the label `tf_job_name=tf-job-issue-summarization`
-```commandline
-kubectl get pods -n=${NAMESPACE} -ltf_job_name=tf-job-issue-summarization
+```bash
+kubectl get pods -n=${NAMESPACE} tfjob-issue-summarization-master-0
 ```

-You can view the logs of the tf-job operator using
+You can view the training logs using

-```commandline
-kubectl logs -f $(kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator -o=jsonpath='{.items[0].metadata.name}')
+```bash
+kubectl logs -f -n=${NAMESPACE} tfjob-issue-summarization-master-0
 ```

-You can view the actual training logs using
+You can view the logs of the tf-job operator using

-```commandline
-kubectl logs -f $(kubectl get pods -n=${NAMESPACE} -ltf_job_name=tf-job-issue-summarization -o=jsonpath='{.items[0].metadata.name}')
+```bash
+kubectl logs -f -n=${NAMESPACE} $(kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator -o=jsonpath='{.items[0].metadata.name}')
 ```

-For information on:
-- [Training the model](02_training_the_model.md)
-- [Distributed training using tensor2tensor](02_tensor2tensor_training.md)

+_(Optional)_ You can also perform training with two alternate methods:
+- [Training the model with a notebook](02_training_the_model.md)
+- [Distributed training using Estimator](02_distributed_training.md)

 *Next*: [Serving the model](03_serving_the_model.md)

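Beyond pod logs, the TFJob custom resource itself records training status and events. A sketch of checking it; the resource name `tfjob-issue-summarization` mirrors the pod names above and is an assumption:

```bash
# Inspect the TFJob custom resource; the Status and Events sections show progress and failures.
kubectl get tfjobs -n=${NAMESPACE}
kubectl describe tfjob tfjob-issue-summarization -n=${NAMESPACE}
```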