Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update glm rolling regression #197

Merged
merged 5 commits into from
Sep 22, 2021

Conversation

chiral-carbon
Copy link
Collaborator

Addresses issue #89 and aims to use bambi instead of pymc3 GLM module.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 1, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-01T09:16:16Z
----------------------------------------------------------------

Line #2.        trace_rw = pm.sample(tune=2000, cores=4, target_accept=0.9)

This should use return_inferencedata=True


@review-notebook-app
Copy link

review-notebook-app bot commented Aug 3, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-03T19:28:01Z
----------------------------------------------------------------

just adding what we already commented here in the PR too. Here use coords when defining the model


@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-03T19:28:02Z
----------------------------------------------------------------

slice at dataset level and choose variables afterwards plus try to remove the double loop. I think no loop will be needed, but maybe 1 loop will be to get the right coloring.


@twiecki
Copy link
Member

twiecki commented Aug 4, 2021

I feel like we should just do the linear regression by hand here instead of using bambi, that'd be more obvious.

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 6, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-06T15:35:05Z
----------------------------------------------------------------

try fixing xticklabels


OriolAbril commented on 2021-08-26T15:45:55Z
----------------------------------------------------------------

try

fig.autofmt_xdate()

reference: https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html

@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-08-10T22:42:58Z
----------------------------------------------------------------

need help here on how to convert this to a single loop. I sliced it at the dataset level and converted xi to a DataArray but after we still need each value in the selection to be multiplied with xi , as we can't multiply xi with alpha or beta since they have different shapes (50 and 9 respectively).

I'm afraid it's still unclear to me after you explained it :/


@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-11T07:12:15Z
----------------------------------------------------------------

Line #15.    xi = xr.DataArray(np.linspace(prices_zscored.GFI.min(), prices_zscored.GFI.max(), 50))

The DataArray needs to have it's dimension named. Then you can do alpha + beta * xi and xarray will know how to align and broadcast the result. If the dimension is not named, xarray doesn't know if the dimension of xi is supposed to be one of the ones in alpha or beta and can't know if it has to align xi with one of the current dimensions in alpha/beta or create a new dimension in alpha/beta and align with this new one. (note, this sum and product will work in xarray both if being done before and after stacking, so I would stack afterwards to make sure the sample dimension is the last one which will be needed for plotting)

To reduce the code to a single looping, I would do something like (this is pseudocode, it won't work as is, it's only illustrational):

dataarray = (a + b * x).stack(...)

for t in dataarray.time:
ax.plot(x, dataarray.sel(time=t), color=color_fun(t), ...)

I think this will differ slightly from the current plot because the current plot has a color slighly different for every line, however, the color should represent the time according to the colormap on the plot, so multiple samples from the same time should have the exactly the same color not a similar one.



@review-notebook-app
Copy link

review-notebook-app bot commented Aug 17, 2021

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-08-17T13:05:36Z
----------------------------------------------------------------

replaced the glm with manual linear regression here


@review-notebook-app
Copy link

review-notebook-app bot commented Aug 17, 2021

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-08-17T13:05:37Z
----------------------------------------------------------------

the kernel dies here when I run this code, need some help with fixing this


OriolAbril commented on 2021-08-24T16:28:23Z
----------------------------------------------------------------

where and how does the kernel die? Have you tried for example only looping and subetting the data, no stacking nor plotting? or stacking before looping with something like

reg_eq = (trace_rw.posterior["alpha"] + trace_rw.posterior["beta"] * xi).stack(sample=("chain", "draw"))

chiral-carbon commented on 2021-08-25T06:03:04Z
----------------------------------------------------------------

it dies here:

for i in prices.index.values[::15]:
   reg = reg_eq.sel(time=i).stack(pooled_chain=("chain", "draw"))
   plt.plot(xi, reg)

will try stacking before looping, should reduce the complexity somewhat.

Copy link
Member

where and how does the kernel die? Have you tried for example only looping and subetting the data, no stacking nor plotting? or stacking before looping with something like

reg_eq = (trace_rw.posterior["alpha"] + trace_rw.posterior["beta"] * xi).stack(sample=("chain", "draw"))

View entire conversation on ReviewNB

@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-24T16:32:16Z
----------------------------------------------------------------

what are the dimensions and coordinates of alpha? have you tried using xarray plotting to get labels generated automatically?


Copy link
Collaborator Author

it dies here:

for i in prices.index.values[::15]:
   reg = reg_eq.sel(time=i).stack(pooled_chain=("chain", "draw"))
   plt.plot(xi, reg)

will try stacking before looping, should reduce the complexity somewhat.


View entire conversation on ReviewNB

Copy link
Member

try `fig.autofmt_xdate()

reference: https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 26, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-08-26T15:54:56Z
----------------------------------------------------------------

now plotting works, but still needs some work

  1. double check slicing. If I haven't messed up checking the above cells or while operating, the slicing on the sample dimension is not working. There are 4 chains and 1000 draws -> 4000 samples. If we take one every 200 samples we should have 20 lines per time. I think I see more than 20 lines of the same color, or maybe the lines are very thick?
  2. I think lines should be a bit transparent and be below the scatterplot (play with zorder). Moreover, it looks like unlike in the above plot the scatterplot doesn't have the right coloring. This should be fixed now that we are updating and correcting the notebook.
  3. The colormap is now broken, it looks like it does a max mix between the old code and the code above to fix the locator warnings, but unlike above, here it's not working.

chiral-carbon commented on 2021-08-28T08:12:16Z
----------------------------------------------------------------

we are plotting 23 lines, and that's because we are slicing prices (which has length 1136) every 50 entries. so I think I should slice the sample dimension < 200 times?

should I use different colormaps for the scatter plot and the lines? currently both use plt.get_cmap("winter")

OriolAbril commented on 2021-08-28T20:06:21Z
----------------------------------------------------------------

We have 4 chains, 1000 draws and 1136 times (note we are slicing the index of prices which is the time, not the prices themselves). So we have 4544000 lines that we could plot which is way too much if we plan to interpret the plot and not have it be a mess of overlapping lines.

So we loop over 1 out of every 10 times and for each of those we plot multiple lines (we could plot 4000 lines per time!). I think a handful of lines ~10-30 is enough to show that the model predictions take some uncertainty into account even within a single time. Note also that we want to encode the time each line corresponds to into its color. We will then have multiple lines with the same color because we are plotting multiple samples that correspond to the same time.

The 23 therefore is the number of times we are plotting, not the number of lines. The number of lines is 23 * 4000 / 200 ~= 450 If my mental math is right.

As I said before, we want the color to encode the time information, so we need both scatter and line plots to use the same encoding -> they should use the same colormap and the same scale to transform time values to 0-1 values used by the colormap.

Copy link
Collaborator Author

we are plotting 23 lines, and that's because we are slicing prices (which has length 1136) every 50 entries. so I think I should slice the sample dimension < 200 times?

should use different colormaps for the scatter plot and the lines? currently both use plt.get_cmap("winter")


View entire conversation on ReviewNB

@chiral-carbon chiral-carbon changed the title update glm rolling regression to use bambi update glm rolling regression Aug 28, 2021
@review-notebook-app
Copy link

review-notebook-app bot commented Aug 28, 2021

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-08-28T08:58:05Z
----------------------------------------------------------------

using FixedLocator removes this warning:

UserWarning: FixedFormatter should only be used together with FixedLocator

but the yticklabels in the colorbar end up not getting printed properly. this happens in the plots below in the notebook as well.


OriolAbril commented on 2021-08-28T19:41:29Z
----------------------------------------------------------------

The suggestions provided by warnings can be useful, but they do what they can in trying to help avoid the warning. It may only cover the most common case or we might be triggering the warning via an edge case the developers did not consider. Here using this fixedlocator approach is not a solution because while it's true it prevents the warning from being printed, it completely breaks the plot, you have to try something else. This example from matplotlib docs might be worth playing with: https://matplotlib.org/stable/gallery/ticks_and_spines/colorbar_tick_labelling_demo.html

chiral-carbon commented on 2021-08-30T11:34:42Z
----------------------------------------------------------------

I tried the examples provided in the link, mainly by passing a ticks argument to plt.colorbar() but that does the same thing as before; in which the warning does get removed but the labels on the colorbar disappear like we see here and in the other plots below.

OriolAbril commented on 2021-08-30T16:48:37Z
----------------------------------------------------------------

This works for me:

@review-notebook-app
Copy link

review-notebook-app bot commented Aug 28, 2021

View / edit / reply to this conversation on ReviewNB

chiral-carbon commented on 2021-08-28T08:58:17Z
----------------------------------------------------------------

using FixedLocator removes this warning:

UserWarning: FixedFormatter should only be used together with FixedLocator

but the yticklabels in the colorbar end up not getting printed properly. this happens in the plots below in the notebook as well.


Copy link
Member

The suggestions provided by warnings can be useful, but they do what they can in trying to help avoid the warning. It may only cover the most common case or we might be triggering the warning via an edge case the developers did not consider. Here using this fixedlocator approach is not a solution because while it's true it prevents the warning from being printed, it completely breaks the plot, you have to try something else. This example from matplotlib docs might be worth playing with: https://matplotlib.org/stable/gallery/ticks_and_spines/colorbar_tick_labelling_demo.html


View entire conversation on ReviewNB

Copy link
Member

We have 4 chains, 1000 draws and 1136 times (note we are slicing the index of prices which is the time, not the prices themselves). So we have 4544000 lines that we could plot which is way too much if we plan to interpret the plot and not have it be a mess of overlapping lines.

So we loop over 1 out of every 10 times and for each of those we plot multiple lines (we could plot 4000 lines per time!). I think a handful of lines ~10-30 is enough to show that the model predictions take some uncertainty into account even within a single time. Note also that we want to encode the time each line corresponds to into its color. We will then have multiple lines with the same color because we are plotting multiple samples that correspond to the same time.

The 23 therefore is the number of times we are plotting, not the number of lines. The number of lines is 23 * 4000 / 200 ~= 450 If my mental math is right.

As I said before, we want the color to encode the time information, so we need both scatter and line plots to use the same encoding -> they should use the same colormap and the same scale to transform time values to 0-1 values used by the colormap.


View entire conversation on ReviewNB

Copy link
Collaborator Author

I tried the examples provided in the link, mainly by passing a ticks argument to plt.colorbar() but that does the same thing as before; in which the warning does get removed but the labels on the colorbar disappear like we see here and in the other plots below.


View entire conversation on ReviewNB

Copy link
Member

OriolAbril commented Aug 30, 2021

This works for me:

image

I have also realized that we should change the colormap to one of the perceptually uniform sequential colormaps, see https://matplotlib.org/stable/tutorials/colors/colormaps.html (for all plots)


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 8, 2021

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-08T15:32:12Z
----------------------------------------------------------------

Line #4.    mymap = plt.get_cmap("winter")

change this colormap to a perceptually sequential colormap (see link from previous comment)


chiral-carbon commented on 2021-09-08T17:10:08Z
----------------------------------------------------------------

does this colormap look better? there are of course others I could change this with

OriolAbril commented on 2021-09-08T17:20:35Z
----------------------------------------------------------------

It's not an appearance issue. Any of the perceptually sequential colormaps is fine by me. Here we are encoding the time which is sequential in our case, so with our aim of promoting best practices in notebooks as much as possible, not only with pymc3 related things we should use a perceptually sequential colormap. If instead we were interested in a diverging quantity such as values being above or below the mean or timestamps being earlier or later than a given date, we would use a diverging colormap.

The link I shared in 2 comments ago from matplotlib docs (https://matplotlib.org/stable/tutorials/colors/colormaps.html) explains it much better, the above paragraph is a barebones overview

@review-notebook-app
Copy link

View / edit / reply to this conversation on ReviewNB

OriolAbril commented on 2021-09-08T15:32:12Z
----------------------------------------------------------------

Line #9.    colors = np.linspace(0.1, 1, len(prices))

start linspaces at 0

Line #12.    mymap = plt.get_cmap("winter")

Line #13. mymap_sc = plt.get_cmap("winter")

there is no need to create two colormaps nor to overwrite the original one. plt.get_cmap("winter") returns a callable which takes values between 0-1 and converts them to a color



Copy link
Collaborator Author

does this colormap look better? there are of course others I could change this with


View entire conversation on ReviewNB

Copy link
Member

It's not an appearance issue. Any of the perceptually sequential colormaps is fine by me. Here we are encoding the time which is sequential in our case, so with our aim of promoting best practices in notebooks as much as possible, not only with pymc3 related things we should use a perceptually sequential colormap. If instead we were interested in a diverging quantity such as values being above or below the mean or timestamps being earlier or later than a given date, we would use a diverging colormap.

The link I shared in 2 comments ago from matplotlib docs (https://matplotlib.org/stable/tutorials/colors/colormaps.html) explains it much better, the above paragraph is a barebones overview


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 11, 2021

View / edit / reply to this conversation on ReviewNB

Sayam753 commented on 2021-09-11T05:24:27Z
----------------------------------------------------------------

Pairs trading link is not working


chiral-carbon commented on 2021-09-11T10:15:50Z
----------------------------------------------------------------

ooh thanks a lot for pointing it out

@review-notebook-app
Copy link

review-notebook-app bot commented Sep 11, 2021

View / edit / reply to this conversation on ReviewNB

Sayam753 commented on 2021-09-11T05:24:28Z
----------------------------------------------------------------

And stochastic volatility model link is also dead

Copy link
Collaborator Author

ooh thanks a lot for pointing it out


View entire conversation on ReviewNB

@OriolAbril
Copy link
Member

There is also an issue with many broken links throughout the notebooks linked in https://github.com/pymc-devs/pymc-examples/wiki/Notebook-updates-overview

correction: use return_inferencedata=True in pm.sample

commits: using coords and removing double loops

change plotting code

correction in plotting

fix warnings

change colormap

revised: colormap
@chiral-carbon
Copy link
Collaborator Author

not sure if this needs a references section? the link to pairs trading in the beginning of the notebook is the only external reference, should I include it in references.bib file?

@OriolAbril OriolAbril merged commit 1f40572 into pymc-devs:main Sep 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants