Out-of-sample predictions for BART intro notebook (pymc-devs#507)
* add out-of-sample section

* add regression example
juanitorduz authored Jan 26, 2023
1 parent 951c0ef commit dbcce72
Showing 2 changed files with 952 additions and 38 deletions.
796 changes: 762 additions & 34 deletions examples/case_studies/BART_introduction.ipynb


194 changes: 190 additions & 4 deletions examples/case_studies/BART_introduction.myst.md
@@ -5,7 +5,7 @@ jupytext:
    format_name: myst
    format_version: 0.13
kernelspec:
-  display_name: Python 3 (ipykernel)
+  display_name: pymc-examples-env
  language: python
  name: python3
---
@@ -27,6 +27,11 @@ import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
import seaborn as sns
from sklearn.model_selection import train_test_split
%config InlineBackend.figure_format = "retina"
print(f"Running on PyMC v{pm.__version__}")
```
@@ -86,7 +91,7 @@ In PyMC a BART variable can be defined very similar to other random variables. O
```{code-cell} ipython3
with pm.Model() as model_coal:
    μ_ = pmb.BART("μ_", X=x_data, Y=y_data, m=20)
-    μ = pm.Deterministic("μ", np.abs(μ_))
+    μ = pm.Deterministic("μ", pm.math.abs(μ_))
    y_pred = pm.Poisson("y_pred", mu=μ, observed=y_data)
    idata_coal = pm.sample(random_seed=RANDOM_SEED)
```
@@ -137,15 +142,17 @@ try:
except FileNotFoundError:
    bikes = pd.read_csv(pm.get_data("bikes.csv"))
-X = bikes[["hour", "temperature", "humidity", "workingday"]]
+features = ["hour", "temperature", "humidity", "workingday"]
+X = bikes[features]
Y = bikes["count"]
```

```{code-cell} ipython3
with pm.Model() as model_bikes:
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
-    y = pm.NegativeBinomial("y", mu=np.abs(μ), alpha=α, observed=Y)
+    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y)
    idata_bikes = pm.sample(random_seed=RANDOM_SEED)
```

@@ -179,11 +186,190 @@ Additionally, PyMC-BART provides a novel method to assess the variable importanc
pmb.plot_variable_importance(idata_bikes, μ, X, samples=100);
```

### Out-of-Sample Predictions

In this section we show how to generate out-of-sample predictions with BART. We use the same dataset as before, but this time we split the data into a training set and a test set: we fit the model on the training data and evaluate its predictions on the test data.

+++

#### Regression

Let's start by modelling this data as a regression problem. In this context we randomly split the data into a training and a test set.

```{code-cell} ipython3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=RANDOM_SEED)
```
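
A quick look at the shapes (a one-line sketch, reusing the variables just defined) confirms the 80/20 split:

```{code-cell} ipython3
# Sketch: verify the 80/20 split sizes.
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
```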

Now, we fit the same model as above, but this time using a *shared variable* for the covariates so that we can later replace them to generate the out-of-sample predictions. Note the `shape=μ.shape` argument in the likelihood: it ties the shape of the observed variable to the shape of `μ` (and hence to the shared covariates), so the model can generate predictions for a test set with a different number of rows.

```{code-cell} ipython3
with pm.Model() as model_oos_regression:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
    idata_oos_regression = pm.sample(random_seed=RANDOM_SEED)
    posterior_predictive_oos_regression_train = pm.sample_posterior_predictive(
        trace=idata_oos_regression, random_seed=RANDOM_SEED
    )
```

Next, we replace the data in the model and sample from the posterior predictive distribution.

```{code-cell} ipython3
with model_oos_regression:
    X.set_value(X_test)
    posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
        trace=idata_oos_regression, random_seed=RANDOM_SEED
    )
```

Finally, we can compare the posterior predictive distribution with the observed data.

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
)
az.plot_ppc(
    data=posterior_predictive_oos_regression_train, kind="cumulative", observed_rug=True, ax=ax[0]
)
ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
az.plot_ppc(
    data=posterior_predictive_oos_regression_test, kind="cumulative", observed_rug=True, ax=ax[1]
)
ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
```

Yay! The results look quite reasonable 🙂!
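
The PPC plots above are a visual check. As a complementary numeric check, here is a minimal sketch that scores both splits with the mean absolute error, using the posterior-predictive mean of `y` as the point prediction (all names come from the cells above):

```{code-cell} ipython3
# Minimal sketch: point-estimate error on both splits.
# We summarize each observation's posterior predictive by its mean over chains and draws.
y_hat_train = (
    posterior_predictive_oos_regression_train.posterior_predictive["y"]
    .mean(dim=["chain", "draw"])
    .to_numpy()
)
y_hat_test = (
    posterior_predictive_oos_regression_test.posterior_predictive["y"]
    .mean(dim=["chain", "draw"])
    .to_numpy()
)

print(f"MAE (train): {np.mean(np.abs(y_hat_train - Y_train.to_numpy())):.1f} rentals")
print(f"MAE (test):  {np.mean(np.abs(y_hat_test - Y_test.to_numpy())):.1f} rentals")
```

Comparable errors on both splits would confirm that the model generalizes well to randomly held-out data.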

+++

#### Time Series

We can view the same data from a *time series* perspective using the `hour` feature. From this point of view, we need to make sure we do not shuffle the data, so that we do not leak information. Thus, we define the train-test split using the `hour` feature.

```{code-cell} ipython3
train_test_hour_split = 19
train_bikes = bikes.query("hour <= @train_test_hour_split")
test_bikes = bikes.query("hour > @train_test_hour_split")
X_train = train_bikes[features]
Y_train = train_bikes["count"]
X_test = test_bikes[features]
Y_test = test_bikes["count"]
```
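
Before fitting, a quick sanity check (a minimal sketch using the variables just defined) confirms that the split is leakage-free along the `hour` axis:

```{code-cell} ipython3
# Minimal sketch: the training set must not contain any hour above the threshold,
# and the test set must only contain hours strictly above it.
assert X_train["hour"].max() <= train_test_hour_split < X_test["hour"].min()
print(f"train hours: [{X_train['hour'].min()}, {X_train['hour'].max()}]")
print(f"test hours:  [{X_test['hour'].min()}, {X_test['hour'].max()}]")
```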

We can then run the same model (but with different input data!) and generate out-of-sample predictions as above.

```{code-cell} ipython3
with pm.Model() as model_oos_ts:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1 / 10)
    μ = pmb.BART("μ", X, Y)
    y = pm.NegativeBinomial("y", mu=pm.math.abs(μ), alpha=α, observed=Y, shape=μ.shape)
    idata_oos_ts = pm.sample(random_seed=RANDOM_SEED)
    posterior_predictive_oos_ts_train = pm.sample_posterior_predictive(
        trace=idata_oos_ts, random_seed=RANDOM_SEED
    )
```

As before, we generate the out-of-sample predictions by replacing the training covariates with the test covariates.

```{code-cell} ipython3
with model_oos_ts:
    X.set_value(X_test)
    posterior_predictive_oos_ts_test = pm.sample_posterior_predictive(
        trace=idata_oos_ts, random_seed=RANDOM_SEED
    )
```

As above, we can compare the posterior predictive distribution with the observed data.

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
)
az.plot_ppc(data=posterior_predictive_oos_ts_train, kind="cumulative", observed_rug=True, ax=ax[0])
ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))
az.plot_ppc(data=posterior_predictive_oos_ts_test, kind="cumulative", observed_rug=True, ax=ax[1])
ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
```

Wow! This does not look right! The predictions on the test set look very odd 🤔. To better understand what is going on, we can plot the predictions as time series:

```{code-cell} ipython3
:tags: [hide-input]

fig, ax = plt.subplots(figsize=(12, 6))
az.plot_hdi(
    x=X_train.index,
    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
    hdi_prob=0.94,
    color="C0",
    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (train)"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=X_train.index,
    y=posterior_predictive_oos_ts_train.posterior_predictive["y"],
    hdi_prob=0.5,
    color="C0",
    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (train)"},
    smooth=False,
    ax=ax,
)
ax.plot(X_train.index, Y_train, label="train (observed)")
az.plot_hdi(
    x=X_test.index,
    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
    hdi_prob=0.94,
    color="C1",
    fill_kwargs={"alpha": 0.2, "label": r"94$\%$ HDI (test)"},
    smooth=False,
    ax=ax,
)
az.plot_hdi(
    x=X_test.index,
    y=posterior_predictive_oos_ts_test.posterior_predictive["y"],
    hdi_prob=0.5,
    color="C1",
    fill_kwargs={"alpha": 0.4, "label": r"50$\%$ HDI (test)"},
    smooth=False,
    ax=ax,
)
ax.plot(X_test.index, Y_test, label="test (observed)")
ax.axvline(X_train.shape[0], color="k", linestyle="--", label="train/test split")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.set(
    title="BART model predictions for bike rentals",
    xlabel="observation index",
    ylabel="number of rentals",
);
```

This plot helps us understand the reason behind the poor performance on the test set: recall that in the variable importance ranking from the initial model we saw that `hour` was the most important predictor. On the other hand, our training data only contains `hour` values up to $19$ (since that is our train-test threshold). Because BART learns by partitioning the (training) data, it cannot distinguish between `hour` values of, say, $20$ and $22$; it only registers that both values are greater than $19$. This is very important to understand when using BART! It also explains why one should not use BART for time series forecasting when there is a trend component: in such cases it is better to detrend the data first, model the remainder with BART, and model the trend with a different model.
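
To see this partitioning behaviour concretely, here is a minimal sketch (reusing the variables defined above) that averages the posterior-predictive draws per test-set hour and compares them with the observed hourly means. Since the trees only learned that `hour` is greater than $19$, the predicted column should stay roughly flat across hours $20$ to $23$ (up to the influence of the other covariates), while the observed counts keep changing:

```{code-cell} ipython3
# Minimal sketch: per-hour posterior-predictive means vs observed means on the test set.
y_hat_test = posterior_predictive_oos_ts_test.posterior_predictive["y"].mean(
    dim=["chain", "draw"]
)
per_hour = pd.DataFrame(
    {
        "hour": X_test["hour"].to_numpy(),
        "predicted": y_hat_test.to_numpy(),
        "observed": Y_test.to_numpy(),
    }
).groupby("hour").mean()
print(per_hour)
```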

+++

## Authors
* Authored by Osvaldo Martin in Dec, 2021 ([pymc-examples#259](https://github.com/pymc-devs/pymc-examples/pull/259))
* Updated by Osvaldo Martin in May, 2022 ([pymc-examples#323](https://github.com/pymc-devs/pymc-examples/pull/323))
* Updated by Osvaldo Martin in Sep, 2022
* Updated by Osvaldo Martin in Nov, 2022
* Updated by Juan Orduz in Jan, 2023 (added out-of-sample predictions section)

+++

