In this example we are going to convert this generic notebook based on the Kaggle JPX Tokyo Stock Exchange Prediction competition into a Kubeflow pipeline.
The objective of this task is to model the real future returns of around 2,000 stocks. The stocks are ranked from highest to lowest expected return, and submissions are evaluated on the difference in returns between the top 200 and bottom 200 stocks.
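As a rough illustration of the evaluation idea described above, the sketch below ranks stocks by predicted return and takes the difference between the mean realised return of the top-k and bottom-k stocks. It is a simplified, unweighted, single-day version; the function name `spread_return` and its arguments are hypothetical, and the competition's official metric may differ in its details.

```python
def spread_return(predicted, actual, k=200):
    """Mean actual return of the k highest-ranked stocks minus that of the k lowest-ranked.

    predicted: model scores, one per stock (higher means higher expected return)
    actual:    realised returns, aligned with `predicted`
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if k <= 0 or 2 * k > len(predicted):
        raise ValueError("need at least 2 * k stocks")
    # Indices sorted from highest to lowest predicted return.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    top = [actual[i] for i in order[:k]]
    bottom = [actual[i] for i in order[-k:]]
    return sum(top) / k - sum(bottom) / k


if __name__ == "__main__":
    # Toy data: four stocks and k=1 (the task itself uses about 2,000 stocks and k=200).
    scores = [0.3, 0.1, 0.2, -0.1]
    returns = [0.05, 0.00, 0.02, -0.03]
    print(spread_return(scores, returns, k=1))
```

A positive value means the stocks the model ranked highest outperformed the ones it ranked lowest.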
Environment:

| Name | Version |
|---|---|
| Kubeflow | v1.4 |
| kfp | 1.8.11 |
| kubeflow-kale | 0.6.0 |
| pip | 21.3.1 |
| kaggle | 1.5.12 |
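
To confirm that a Notebook Server matches the versions in the table, a small check along the following lines may help. This is only a sketch: the helper name `find_mismatches` is hypothetical, and Kubeflow itself is left out because it is a cluster deployment rather than a Python package.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the environment table above.
EXPECTED = {
    "kfp": "1.8.11",
    "kubeflow-kale": "0.6.0",
    "pip": "21.3.1",
    "kaggle": "1.5.12",
}


def find_mismatches(expected, get_version=version):
    """Return {package: (expected, installed)} for packages that differ or are missing."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {installed or 'not installed'}")
```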
- Vanilla KFP Pipeline: Kubeflow lightweight component method
  To get started, visit the Kubeflow Pipelines documentation to get acquainted with what pipelines are, their components, pipeline metrics, and how to pass data between components in a pipeline. There are several ways to build a pipeline component, as that documentation describes. In the following example, we use lightweight Python function-based components to build our Kubeflow pipeline.
- Kale KFP Pipeline
  To get started, visit Kale's documentation to get acquainted with the Kale user interface (UI) in a Jupyter Notebook, notebook cell annotation, and how to create a machine learning pipeline using Kale. In the following example, we use the Kale JupyterLab extension to build our Kubeflow pipeline.
- Open your Kubeflow cluster, create a Notebook Server, and connect to it.
- Download the JPX dataset using Kaggle's API. To do this:
- Log in to Kaggle and click on your user profile picture.
- Click on ‘Account’.
- Under ‘Account’, navigate to the ‘API’ section.
- Click ‘Create New API token’.
- After you create a new API token, a kaggle.json file is downloaded automatically. It contains the username and API key needed to download the dataset.
- Create a Kubernetes secret to hold the API credentials, so that they are not passed to the pipeline notebook in plain text:

    !kubectl create secret generic -n kubeflow-user kaggle-secret --from-literal=username=<"username"> --from-literal=password=<"api-key">
- Create a secret PodDefault YAML file in your Kubeflow namespace:

    apiVersion: "kubeflow.org/v1alpha1"
    kind: PodDefault
    metadata:
      name: kaggle-secret
      namespace: kubeflow-user
    spec:
      selector:
        matchLabels:
          kaggle-secret: "true"
      desc: "kaggle-secret"
      volumeMounts:
      - name: secret-volume
        mountPath: /secret/kaggle-secret
        readOnly: false
      volumes:
      - name: secret-volume
        secret:
          secretName: kaggle-secret
- Apply the PodDefault YAML file:

    kubectl apply -f kaggle_pod.yaml
- After successfully deploying the PodDefault, create a new Notebook Server and add the kaggle-secret configuration to it. This is the Notebook Server that runs the Kale or KFP pipeline.
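
Once the secret is mounted, the notebook can read the credentials from files instead of hard-coding them. The sketch below assumes the mount path from the PodDefault above and the `username` and `password` keys from the `kubectl create secret` command. The function names are hypothetical, and the competition slug in the usage comment is an assumption based on the competition name.

```python
import os
from pathlib import Path

SECRET_DIR = "/secret/kaggle-secret"  # mountPath from the PodDefault above


def load_kaggle_credentials(secret_dir=SECRET_DIR):
    """Copy the mounted secret into the environment variables the Kaggle client reads."""
    secret_path = Path(secret_dir)
    # Kubernetes mounts each secret key as a file named after the key.
    os.environ["KAGGLE_USERNAME"] = (secret_path / "username").read_text().strip()
    os.environ["KAGGLE_KEY"] = (secret_path / "password").read_text().strip()


def download_competition_data(competition, target_dir="data"):
    """Download a competition's files as a zip archive into target_dir."""
    # Imported lazily so that the credentials are already in the environment
    # by the time the kaggle package looks for them.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.competition_download_files(competition, path=target_dir)


# Typical use inside the notebook (slug is an assumption):
#   load_kaggle_credentials()
#   download_competition_data("jpx-tokyo-stock-exchange-prediction")
```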
In the vanilla KFP pipeline, each task is written as a Python function, and that function is passed to the kfp component method create_component_from_func.
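
A minimal sketch of this pattern, assuming the kfp 1.8 SDK, is shown below. The task `count_rows`, the wrapper `build_count_rows_op`, and the `python:3.8` base image are hypothetical stand-ins, not the notebook's actual components.

```python
def count_rows(csv_text: str) -> int:
    """Hypothetical task: count the data rows of a CSV string, excluding the header."""
    # Imports a component needs go inside the function, because only the
    # function's own source is shipped to the container.
    import csv
    import io

    rows = list(csv.reader(io.StringIO(csv_text)))
    return max(len(rows) - 1, 0)


def build_count_rows_op(base_image="python:3.8"):
    """Wrap the function as a KFP component (requires the kfp 1.8 SDK)."""
    # Imported here so that count_rows stays testable without the SDK installed.
    from kfp.components import create_component_from_func

    return create_component_from_func(count_rows, base_image=base_image)
```

Keeping the task a plain function means it can be checked locally, for example `count_rows("a,b\n1,2\n")`, before it is wrapped and run on the cluster.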
The different components used in this example are:
- Load data
- Transform data
- Feature Engineering
- Modelling
- Prediction
A Kubeflow pipeline connects all the components together to form a directed acyclic graph (DAG). The kfp dsl.pipeline decorator is used to create the pipeline function.
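
The following sketch shows how a decorated pipeline function wires two steps into a small DAG, assuming the kfp 1.8 SDK. The `add` step, the pipeline name `toy-dag`, and the base image are hypothetical; the real notebook chains the five components listed above instead.

```python
def add(a: float, b: float) -> float:
    """Hypothetical stand-in for a real pipeline step."""
    return a + b


def build_pipeline():
    """Return a two-step pipeline function (requires the kfp 1.8 SDK)."""
    from kfp import dsl
    from kfp.components import create_component_from_func

    add_op = create_component_from_func(add, base_image="python:3.8")

    @dsl.pipeline(name="toy-dag", description="Two chained additions")
    def toy_pipeline(x: float = 1.0, y: float = 2.0):
        first = add_op(a=x, b=y)
        # Consuming first.output makes the second step depend on the first,
        # which is how the edges of the DAG are declared.
        add_op(a=first.output, b=3.0)

    return toy_pipeline
```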
The kfp.components annotations InputPath and OutputPath are used to pass data between components in the pipeline.
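
As a sketch of file-based data passing, assuming the kfp 1.8 SDK: one step writes to a path that KFP provides through `OutputPath`, and a downstream step reads the same data through `InputPath`. The step names, the CSV contents, and the base image are hypothetical.

```python
def build_components(base_image="python:3.8"):
    """Return (write_op, mean_op); requires the kfp 1.8 SDK."""
    from kfp.components import InputPath, OutputPath, create_component_from_func

    def write_prices(prices_path: OutputPath("CSV")):
        # KFP supplies prices_path; the step only has to write the file.
        with open(prices_path, "w") as f:
            f.write("close\n100.0\n102.0\n")

    def mean_close(prices_path: InputPath("CSV")) -> float:
        # KFP makes the upstream file available at prices_path before this runs.
        with open(prices_path) as f:
            values = [float(line) for line in f.read().split()[1:]]
        return sum(values) / len(values)

    return (
        create_component_from_func(write_prices, base_image=base_image),
        create_component_from_func(mean_close, base_image=base_image),
    )
```

Inside a pipeline function the two steps would typically be connected as `mean_op(prices=write_op().outputs["prices"])`. This relies on kfp 1.8 dropping the `_path` suffix when it derives input and output names, which is worth verifying against the installed SDK version.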
Finally, create_run_from_pipeline_func from the KFP SDK Client is used to submit the pipeline directly from the pipeline function.
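
A hedged sketch of the submission step is shown below. The wrapper `submit_pipeline`, the experiment name `jpx-demo`, and the optional `client` parameter are assumptions added for illustration; only `kfp.Client` and `create_run_from_pipeline_func` come from the SDK.

```python
def submit_pipeline(pipeline_func, arguments=None, experiment_name="jpx-demo", client=None):
    """Submit a pipeline function as a run and return the SDK's run result."""
    if client is None:
        # Inside a Kubeflow Notebook Server the client can usually be created
        # without arguments; elsewhere a host has to be supplied.
        import kfp

        client = kfp.Client()
    return client.create_run_from_pipeline_func(
        pipeline_func,
        arguments=arguments or {},
        experiment_name=experiment_name,
    )
```

Accepting an optional client keeps the helper easy to exercise with a stub object when no cluster is available.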
- Open your Kubeflow cluster, create a new Notebook Server, and add the kaggle-secret configuration to it.
- Create a new terminal and clone this repo. After cloning, navigate to this directory.
- Open the jpx-tokyo-stock-exchange-prediction-kfp notebook.
- Run the jpx-tokyo-stock-exchange-prediction-kfp notebook from start to finish.
- View the run details immediately after submitting the pipeline.
To create a KFP pipeline using the Kale JupyterLab extension:
- Open your Kubeflow cluster, create a new Notebook Server, and add the kaggle-secret configuration to it.
- Create a new terminal and clone this repo. After cloning, navigate to this directory.
- Launch the jpx-tokyo-stock-exchange-prediction-kale.ipynb notebook.
- Install the packages listed in requirements.txt. After installation, restart the kernel.
- Enable the Kale extension in JupyterLab.
- The notebook's cells are automatically annotated with Kale tags. To fully understand the different Kale tags available, visit the Kale documentation.
The following Kale tags were used in this example:
- Imports
- Pipeline Parameters
- Pipeline Metrics
- Pipeline Step
- Skip Cell
With the use of Kale tags we define the following:
- Pipeline parameters are assigned using the "pipeline parameters" tag
- The necessary libraries that need to be used throughout the Pipeline are passed through the "imports" tag
- Notebook cells are assigned to specific Pipeline components (download data, load data, etc.) using the "pipeline step" tag
- Cell dependencies are defined between the different pipeline steps with the "depends on" flag
- Pipeline metrics are assigned using the "pipeline metrics" tag
The pipeline steps created in this example:
- Load data
- Transform data
- Feature Engineering
- Modelling
- Prediction
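
To see how the tags and "depends on" links described above translate into a step graph, the sketch below reads a notebook's cell tags with the standard library. It assumes that Kale stores its annotations as ordinary cell tags of the form `block:<step>` and `prev:<step>`; that format and the function names are assumptions to check against your own notebook's metadata.

```python
import json


def load_notebook(path):
    """Read an .ipynb file, which is plain JSON, into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def step_dependencies(notebook):
    """Map each pipeline step to the steps it depends on, based on cell tags."""
    steps = {}
    for cell in notebook.get("cells", []):
        tags = cell.get("metadata", {}).get("tags", [])
        # Assumed tag format: "block:<step>" names the step, "prev:<step>" a dependency.
        names = [tag.split(":", 1)[1] for tag in tags if tag.startswith("block:")]
        prevs = [tag.split(":", 1)[1] for tag in tags if tag.startswith("prev:")]
        for name in names:
            deps = steps.setdefault(name, [])
            for prev in prevs:
                if prev not in deps:
                    deps.append(prev)
    return steps
```

For example, `step_dependencies(load_notebook("jpx-tokyo-stock-exchange-prediction-kale.ipynb"))` would return a dictionary keyed by step name, with cells that carry other tags (such as imports or skip) ignored.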
- Compile and run the notebook by clicking "Compile & Run" in Kale's left panel.
- View the pipeline by clicking "View" in Kale's left panel.