{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 1.2: Artifact Lineage\n", "\n", "[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zenml-io/zenbytes/blob/main/1-2_Artifact_Lineage.ipynb)\n", "\n", "***Key Concepts:*** *Artifacts, Artifact Stores, Metadata, Versioning, Caching*\n", "\n", "In this lesson we will learn about one of the coolest features of ML pipelines: automated artifact versioning and tracking. This will give us tremendous insights into how exactly each of our models was created. Furthermore, it enables artifact caching, allowing us to switch out parts of our ML pipelines without having to rerun any previous steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, if you have not done so already, run the following cell to install ZenML\n", "and it's sklearn integration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install \"zenml[server]\"\n", "!zenml integration install sklearn -y\n", "!rm -rf .zen\n", "!zenml init\n", "%pip install pyparsing==2.4.2 # required for Colab\n", "\n", "import IPython\n", "\n", "# automatically restart kernel\n", "IPython.Application.instance().kernel.do_shutdown(restart=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Colab Note:** On Colab, you need an [ngrok account](https://dashboard.ngrok.com/signup) to view some of the visualizations later. Please set up an account, then set your user token below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "NGROK_TOKEN = \"\" # TODO: set your ngrok token if you are working on Colab" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from zenml.environment import Environment\n", "\n", "if Environment.in_google_colab(): # Colab only setup\n", "\n", " # clone zenbytes repo to get source code of previous lessons\n", " !git clone https://github.com/zenml-io/zenbytes.git # noqa\n", " !mv zenbytes/steps .\n", " !mv zenbytes/pipelines .\n", "\n", " # install and authenticate ngrok\n", " !pip install pyngrok\n", " !ngrok authtoken {NGROK_TOKEN}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we dive into any versioning and caching, let's clarify what exactly **Artifacts** are. \n", "To illustrate, let us first rebuild our digits pipeline from the previous lesson:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from zenml.pipelines import pipeline\n", "\n", "from steps.evaluator import evaluator\n", "from steps.importer import importer\n", "from steps.sklearn_trainer import svc_trainer\n", "\n", "\n", "@pipeline\n", "def digits_pipeline(importer, trainer, evaluator):\n", " \"\"\"Links all the steps together in a pipeline\"\"\"\n", " X_train, X_test, y_train, y_test = importer()\n", " model = trainer(X_train=X_train, y_train=y_train)\n", " evaluator(X_test=X_test, y_test=y_test, model=model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The artifacts of this pipeline are simply the local variables we defined: `X_train`, `X_test`, `y_train`, `y_test`, and `model`. These make up the data that flows in and out of our steps. ZenML automatically saves, tracks, and versions these\n", "artifacts for you, so you can more easiy reproduce your ML workflows in the\n", "future." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline Visualization in ZenML\n", "\n", "To see how the steps connect the different artifacts, you can check the\n", "pipeline run visualization in the ZenML dashboard, as you learned in the\n", "previous lesson.\n", "\n", "Let's do so again by running the following code cells and navigating to the\n", "pipeline \"Runs\" tab in the ZenML dashboard." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "digits_svc_pipeline = digits_pipeline(\n", " importer=importer(), trainer=svc_trainer(), evaluator=evaluator()\n", ")\n", "digits_svc_pipeline.run(unlisted=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from zenml.environment import Environment\n", "\n", "def start_zenml_dashboard(port=8237):\n", " if Environment.in_google_colab():\n", " from pyngrok import ngrok\n", "\n", " public_url = ngrok.connect(port)\n", " print(f\"\\x1b[31mIn Colab, use this URL instead: {public_url}!\\x1b[0m\")\n", " !zenml up --blocking --port {port}\n", "\n", " else:\n", " !zenml up --port {port}\n", "\n", "start_zenml_dashboard()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** If you're running on Colab, you will not be able to access the regular dashboard link. Instead, use the `ngrok.io` link printed above!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should now see an interactive visualization in the ZenML dashboard, as shown below. \n", "The rectangles represent your pipeline steps and the circles your pipeline artifacts. \n", "Also, note that the different nodes are color-coded, so if your pipeline ever \n", "fails or runs for too long, you can find the responsible step at a glance!\n", "\n", "![Dash Visualization](_assets/1-2/dashboard_initial.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Artifact Caching\n", "As mentioned in the beginning, tracking which exact artifact went into what \n", "steps allows us to cache and reuse artifacts. Let's see this in action by\n", "rerunning our pipeline without modifications:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "digits_svc_pipeline.run(unlisted=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the output above, you should now see that each step of the run was cached: \n", "```\n", "Creating unlisted run ... (Caching enabled)\n", "...\n", "Using cached version of importer.\n", "...\n", "Using cached version of svc_trainer.\n", "...\n", "Using cached version of evaluator.\n", "```\n", "\n", "In the dashboard you should see the same. \n", "Navigate to the pipeline \"Runs\" tab again, click on the newest run, and view\n", "its DAG." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_zenml_dashboard()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should now see a visualization as shown below. Note how the icons of all\n", "steps has changed. This means they were cached from a previous run.\n", "\n", "![Dashboard Run Visualization Cached](_assets/1-2/dashboard_cached.png)\n", "\n", "Let's now replace the SVC model in our ML pipeline with a decision tree and see what happens." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.base import ClassifierMixin\n", "from sklearn.tree import DecisionTreeClassifier\n", "from zenml.steps import step\n", "\n", "\n", "@step()\n", "def tree_trainer(\n", " X_train: np.ndarray,\n", " y_train: np.ndarray,\n", ") -> ClassifierMixin:\n", " \"\"\"Train an sklearn decision tree classifier.\"\"\"\n", " model = DecisionTreeClassifier()\n", " model.fit(X_train, y_train)\n", " return model\n", "\n", "\n", "# redefine and rerun our pipeline, this time with tree_trainer()\n", "digits_tree_pipeline = digits_pipeline(\n", " importer=importer(), trainer=tree_trainer(), evaluator=evaluator()\n", ")\n", "digits_tree_pipeline.run(unlisted=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the output above, you should now see that the `importer` step was still \n", "cached since the underlying data did not change. However, the trainer and\n", "evaluator had to be executed again.\n", "\n", "In the dashboard you should again see the same. \n", "Navigate to the pipeline \"Runs\" tab again, click on the newest run, and view\n", "its DAG." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_zenml_dashboard()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The visualization should now look as shown below. Since we changed the trainer, \n", "the corresponding node and all subsequent nodes were executed and created fresh\n", "artifacts. However, note how the dataset artifacts are still cached. \n", "They did not have to be recreated. \n", "In an actual production setting, this might save us a tremendous amount of time \n", "and resources as those data artifacts might have resulted from some complex, \n", "expensive preprocessing job.\n", "\n", "![Dash Visualization Partly Cached](_assets/1-2/dashboard_partly_cached.png)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Artifact Storage\n", "\n", "You might now wonder how our ML pipelines can keep track of which artifacts changed and which did not. This requires several additional MLOps components that you would typically have to set up and configure yourself. Luckily, ZenML automatically set this up for us.\n", "\n", "Under the hood, all the artifacts in our ML pipeline are automatically stored in an [Artifact Store](https://docs.zenml.io/user-guide/starter-guide/understand-stacks#artifact-store). By default, this is simply a place in your local file system, but we could also configure ZenML to store this data in a cloud bucket like [Amazon S3](https://docs.zenml.io/component-gallery/artifact-stores/s3) or any other place instead. We will see this in more detail when we migrate our MLOps stack to the cloud in a later chapter." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Orchestrators\n", "\n", "In addition to the artifact store, ZenML automatically set an\n", "[Orchestrator](https://docs.zenml.io/user-guide/starter-guide/understand-stacks#orchestrator) for you,\n", "which is the component that defines how and where each pipeline step is executed \n", "when calling `pipeline.run()`. 
\n", "\n", "This component is not of much interest to us right now, but we will learn more \n", "about it in later chapters, when we will run our pipelines on a \n", "[Kubernetes](https://kubernetes.io/) cluster using the \n", "[Kubeflow](https://docs.zenml.io/stacks-and-components/component-guide/orchestrators/kubeflow) orchestrator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ZenML MLOps Stacks\n", "\n", "Artifact stores, together with orchestrators, build the backbone of a ZenML \n", "**Stack**, which defines all of the infrastructure and tools that your ML\n", "workflows are running on.\n", "\n", "![Local MLOps Stack](_assets/1-2/local_stack_redesigned.png)\n", "\n", "If you have the ZenML dashboard running, you can see a list of all your MLOps\n", "stacks under the \"Stacks\" section. Currently, you will only see the \"default\"\n", "stack there, which consists of a local artifact store and local orchestrator.\n", "\n", "Under the \"Stack Components\" tab you can browse all stack components that you\n", "have currently registered with ZenML. You can combine those in any way you like\n", "into new stacks. Currently, you should only see a single \"default\" component for \n", "both \"Orchestrator\" and \"Artifact Store\", but we are going to register more\n", "stack components in subsequent lessons.\n", "\n", "![Dashboard Stack List](_assets/1-2/dashboard_stack_list.png)\n", "\n", "If you click on the \"default\" artifact store and navigate to the \"Runs\" tab,\n", "you will see all runs that were executed on this component.\n", "\n", "![Dashboard Stack List](_assets/1-2/dashboard_artifact_store_run_list.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will add several more components to our MLOps stack throughout the subsequent chapters, including model deployment tools, experiment trackers, data and model monitoring tools, and more. Let's start with experiment tracking in the [next lesson](2-1_Experiment_Tracking.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 64-bit ('zenbytes-dev')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "ec45946565c50b1d690aa5a9e3c974f5b62b9cc8d8934e441e52186140f79402" } } }, "nbformat": 4, "nbformat_minor": 2 }