# Lesson 1.2: Artifact Lineage

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zenml-io/zenbytes/blob/main/1-2_Artifact_Lineage.ipynb)

***Key Concepts:*** *Artifacts, Artifact Stores, Metadata, Versioning, Caching*

In this lesson, we will learn about one of the coolest features of ML pipelines: automated artifact versioning and tracking. This gives us detailed insight into exactly how each of our models was created. It also enables artifact caching, which lets us swap out parts of our ML pipelines without rerunning the upstream steps that did not change.

First, if you have not done so already, run the following cell to install ZenML
and its sklearn integration.

In [None]:
%pip install "zenml[server]"
!zenml integration install sklearn -y
!rm -rf .zen
!zenml init
%pip install pyparsing==2.4.2 # required for Colab

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

**Colab Note:** On Colab, you need an [ngrok account](https://dashboard.ngrok.com/signup) to view some of the visualizations later. Please set up an account, then set your user token below:

In [None]:
NGROK_TOKEN = "" # TODO: set your ngrok token if you are working on Colab

In [None]:
from zenml.environment import Environment

if Environment.in_google_colab():  # Colab only setup

    # clone zenbytes repo to get source code of previous lessons
    !git clone https://github.com/zenml-io/zenbytes.git  # noqa
    !mv zenbytes/steps .
    !mv zenbytes/pipelines .

    # install and authenticate ngrok
    !pip install pyngrok
    !ngrok authtoken {NGROK_TOKEN}

Before we dive into any versioning and caching, let's clarify what exactly **Artifacts** are. 
To illustrate, let us first rebuild our digits pipeline from the previous lesson:

In [None]:
from zenml.pipelines import pipeline

from steps.evaluator import evaluator
from steps.importer import importer
from steps.sklearn_trainer import svc_trainer


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

The artifacts of this pipeline are simply the local variables we defined: `X_train`, `X_test`, `y_train`, `y_test`, and `model`. These make up the data that flows in and out of our steps. ZenML automatically saves, tracks, and versions these
artifacts for you, so you can more easily reproduce your ML workflows in the
future.
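
To make the idea of lineage concrete, here is a minimal, purely illustrative sketch. It is not ZenML's internal data model: the `PRODUCED_BY` and `CONSUMES` dictionaries and the `upstream` helper are hypothetical names invented for this example. It records which step produced each artifact of the digits pipeline and which artifacts each step consumed, then walks those records backwards to find everything a given artifact was derived from.

```python
# Hypothetical lineage records for the digits pipeline (illustration only).
# Which step produced each artifact:
PRODUCED_BY = {
    "X_train": "importer",
    "X_test": "importer",
    "y_train": "importer",
    "y_test": "importer",
    "model": "trainer",
}
# Which artifacts each step consumed:
CONSUMES = {
    "importer": [],
    "trainer": ["X_train", "y_train"],
    "evaluator": ["X_test", "y_test", "model"],
}


def upstream(artifact):
    """Return every artifact that `artifact` was (transitively) derived from."""
    seen = set()
    stack = list(CONSUMES[PRODUCED_BY[artifact]])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(CONSUMES[PRODUCED_BY[current]])
    return seen


print(sorted(upstream("model")))  # ['X_train', 'y_train']
```

With records like these, the question "which data was this model trained on?" becomes a simple lookup, which is the kind of question artifact tracking is meant to answer.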

## Pipeline Visualization in ZenML

To see how the steps connect the different artifacts, you can check the
pipeline run visualization in the ZenML dashboard, as you learned in the
previous lesson.

Let's do so again by running the following code cells and navigating to the
pipeline "Runs" tab in the ZenML dashboard.

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run(unlisted=True)

In [None]:
from zenml.environment import Environment

def start_zenml_dashboard(port=8237):
    if Environment.in_google_colab():
        from pyngrok import ngrok

        public_url = ngrok.connect(port)
        print(f"\x1b[31mIn Colab, use this URL instead: {public_url}!\x1b[0m")
        !zenml up --blocking --port {port}

    else:
        !zenml up --port {port}

start_zenml_dashboard()

**Note:** If you're running on Colab, you will not be able to access the regular dashboard link. Instead, use the `ngrok.io` link printed above!

You should now see an interactive visualization in the ZenML dashboard, as shown below. 
The rectangles represent your pipeline steps and the circles your pipeline artifacts. 
Also, note that the different nodes are color-coded, so if your pipeline ever 
fails or runs for too long, you can find the responsible step at a glance!

![Dash Visualization](_assets/1-2/dashboard_initial.png)

## Artifact Caching
As mentioned in the beginning, tracking exactly which artifacts went into which
steps allows us to cache and reuse those artifacts. Let's see this in action by
rerunning our pipeline without modifications:

In [None]:
digits_svc_pipeline.run(unlisted=True)

In the output above, you should now see that each step of the run was cached: 
```
Creating unlisted run ... (Caching enabled)
...
Using cached version of importer.
...
Using cached version of svc_trainer.
...
Using cached version of evaluator.
```

In the dashboard you should see the same. 
Navigate to the pipeline "Runs" tab again, click on the newest run, and view
its DAG.

In [None]:
start_zenml_dashboard()

You should now see a visualization as shown below. Note how the icons of all
steps have changed. This means they were cached from a previous run.

![Dashboard Run Visualization Cached](_assets/1-2/dashboard_cached.png)

Let's now replace the SVC model in our ML pipeline with a decision tree and see what happens.

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from zenml.steps import step


@step()
def tree_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train an sklearn decision tree classifier."""
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model


# redefine and rerun our pipeline, this time with tree_trainer()
digits_tree_pipeline = digits_pipeline(
    importer=importer(), trainer=tree_trainer(), evaluator=evaluator()
)
)
digits_tree_pipeline.run(unlisted=True)

In the output above, you should now see that the `importer` step was still 
cached since the underlying data did not change. However, the trainer and
evaluator had to be executed again.

In the dashboard you should again see the same. 
Navigate to the pipeline "Runs" tab again, click on the newest run, and view
its DAG.

In [None]:
start_zenml_dashboard()

The visualization should now look as shown below. Since we changed the trainer, 
the corresponding node and all subsequent nodes were executed and created fresh
artifacts. However, note how the dataset artifacts are still cached. 
They did not have to be recreated. 
In an actual production setting, this might save us a tremendous amount of time 
and resources as those data artifacts might have resulted from some complex, 
expensive preprocessing job.

![Dash Visualization Partly Cached](_assets/1-2/dashboard_partly_cached.png)
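
The partial caching seen here can be illustrated with a small toy model. The sketch below is an assumption-laden illustration, not ZenML's actual caching implementation: it assumes a step's cache key is a hash of its source code plus the keys of its inputs, and the helpers `cache_key` and `pipeline_keys` as well as the placeholder source strings are hypothetical. Under that assumption, changing the trainer changes the trainer's key and the key of everything downstream of it, while the importer's key stays the same.

```python
import hashlib


def cache_key(step_source, input_keys):
    """Hash a step's source code together with the keys of its inputs."""
    digest = hashlib.sha256(step_source.encode())
    for key in input_keys:
        digest.update(key.encode())
    return digest.hexdigest()


def pipeline_keys(trainer_source):
    """Compute one key per step for an importer, trainer, evaluator chain."""
    importer = cache_key("def importer(): ...", [])
    trainer = cache_key(trainer_source, [importer])
    evaluator = cache_key("def evaluator(): ...", [importer, trainer])
    return {"importer": importer, "trainer": trainer, "evaluator": evaluator}


svc_run = pipeline_keys("def svc_trainer(): ...")
tree_run = pipeline_keys("def tree_trainer(): ...")

# A step can be served from the cache only if its key is unchanged.
for name in svc_run:
    status = "cached" if svc_run[name] == tree_run[name] else "re-executed"
    print(f"{name}: {status}")
```

Because the evaluator's key includes the trainer's key, the evaluator is invalidated even though its own code did not change. That matches the behavior observed in the run above.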


## Artifact Storage

You might now wonder how our ML pipelines can keep track of which artifacts changed and which did not. This requires several additional MLOps components that you would typically have to set up and configure yourself. Luckily, ZenML automatically set this up for us.

Under the hood, all the artifacts in our ML pipeline are automatically stored in an [Artifact Store](https://docs.zenml.io/user-guide/starter-guide/understand-stacks#artifact-store). By default, this is simply a place in your local file system, but we could also configure ZenML to store this data in a cloud bucket like [Amazon S3](https://docs.zenml.io/component-gallery/artifact-stores/s3) or any other place instead. We will see this in more detail when we migrate our MLOps stack to the cloud in a later chapter.
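
As a rough mental model of a local artifact store, consider the sketch below. It is a toy written for this lesson and does not reflect how ZenML lays out or serializes files: the `LocalArtifactStore` class, its `save` and `load` methods, and the folder layout are all hypothetical. It writes every saved object to a new versioned file on disk, so earlier versions remain available.

```python
import pickle
import tempfile
from pathlib import Path


class LocalArtifactStore:
    """Hypothetical toy store: one pickle file per (step, output, version)."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, step, name, obj):
        """Write `obj` as the next version of the given step output."""
        folder = self.root / step / name
        folder.mkdir(parents=True, exist_ok=True)
        version = len(list(folder.glob("*.pkl"))) + 1
        with open(folder / f"{version}.pkl", "wb") as f:
            pickle.dump(obj, f)
        return version

    def load(self, step, name, version):
        """Read a specific version back from disk."""
        with open(self.root / step / name / f"{version}.pkl", "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    store = LocalArtifactStore(tmp)
    v1 = store.save("trainer", "model", {"kind": "svc"})
    v2 = store.save("trainer", "model", {"kind": "tree"})
    print(v1, v2, store.load("trainer", "model", v1))
```

Pointing such a store at a cloud bucket instead of a local folder would change where the bytes live, but not the code of the steps that produce them.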

## Orchestrators

In addition to the artifact store, ZenML automatically set up an
[Orchestrator](https://docs.zenml.io/user-guide/starter-guide/understand-stacks#orchestrator) for you,
which is the component that defines how and where each pipeline step is executed 
when calling `pipeline.run()`. 

This component is not of much interest to us right now, but we will learn more 
about it in later chapters, when we will run our pipelines on a 
[Kubernetes](https://kubernetes.io/) cluster using the 
[Kubeflow](https://docs.zenml.io/stacks-and-components/component-guide/orchestrators/kubeflow) orchestrator.
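
To give a feel for what "how and where each step is executed" means in the simplest case, here is a minimal sketch of a local, sequential orchestrator. It is an illustration under stated assumptions, not ZenML's orchestrator code: the `DEPENDENCIES` mapping and the `run_locally` function are hypothetical, and the sketch only decides the execution order instead of actually running any step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose outputs it needs.
DEPENDENCIES = {
    "importer": set(),
    "trainer": {"importer"},
    "evaluator": {"importer", "trainer"},
}


def run_locally(dependencies):
    """Visit steps one by one in an order that respects their dependencies."""
    executed = []
    for step_name in TopologicalSorter(dependencies).static_order():
        executed.append(step_name)  # a real orchestrator would run the step here
    return executed


print(run_locally(DEPENDENCIES))  # ['importer', 'trainer', 'evaluator']
```

A remote orchestrator would follow the same dependency order but could run each step in its own container or on a separate machine.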

## ZenML MLOps Stacks

Artifact stores, together with orchestrators, form the backbone of a ZenML 
**Stack**, which defines all of the infrastructure and tools that your ML
workflows run on.

![Local MLOps Stack](_assets/1-2/local_stack_redesigned.png)

If you have the ZenML dashboard running, you can see a list of all your MLOps
stacks under the "Stacks" section. Currently, you will only see the "default"
stack there, which consists of a local artifact store and local orchestrator.
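
Conceptually, a stack is just a named combination of components. The sketch below models that idea with a plain dataclass. It is not ZenML's `Stack` API: the class, its fields, and the names `"cloud"` and `"s3_store"` are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Stack:
    """Hypothetical stand-in for a stack: a named set of components."""

    name: str
    orchestrator: str
    artifact_store: str


default = Stack(name="default", orchestrator="default", artifact_store="default")

# Swapping one component yields a new stack; the pipeline code stays untouched.
# "cloud" and "s3_store" are made-up names for illustration.
cloud = replace(default, name="cloud", artifact_store="s3_store")
print(cloud)
```

The point of the design is separation of concerns: pipelines describe what to compute, while the stack describes where it runs and where its artifacts are stored.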

Under the "Stack Components" tab you can browse all stack components that you
have currently registered with ZenML. You can combine those in any way you like
into new stacks. Currently, you should only see a single "default" component for 
both "Orchestrator" and "Artifact Store", but we are going to register more
stack components in subsequent lessons.

![Dashboard Stack List](_assets/1-2/dashboard_stack_list.png)

If you click on the "default" artifact store and navigate to the "Runs" tab,
you will see all runs that were executed on this component.

![Dashboard Artifact Store Run List](_assets/1-2/dashboard_artifact_store_run_list.png)

We will add several more components to our MLOps stack throughout the subsequent chapters, including model deployment tools, experiment trackers, data and model monitoring tools, and more. Let's start with experiment tracking in the [next lesson](2-1_Experiment_Tracking.ipynb).