Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Labeling restructure #1201

Merged
merged 20 commits into from
Feb 21, 2021
Merged

Labeling restructure #1201

merged 20 commits into from
Feb 21, 2021

Conversation

OriolAbril
Copy link
Member

@OriolAbril OriolAbril commented May 22, 2020

Description

Enforcing and maintaining index_origin at label display has proved to be plenty of work and to not give the best of results. I propose to use index_origin at InferenceData creation and then always display coordinate values. I think this will simplify the code significantly without many drawbacks. Moreover, this will make all representations of the data coherent, from xarray html repr to ArviZ plots.

Therefore, the aims of this PR are:

image

  • Remove index_origin from stats and plotting functions. I basically affects 2 functions.
    • az.summary
    • utils.flatten_inference_data_to_dict and to_cds -> do in another PR, probably better to wait until we have sound converters for dict and dataframe? @percygautam @ahartikainen
  • Use index_origin in default coord generation. Closes integrate index_origin rcParam #984
  • Think (maybe implement) ways for users to customize generated labels options that seem interesting to have are var_name: coord_val, var_name [coord_val], var_name [dim: coord_val] and the like. Maybe a label_str.format(var_name=var_name, dim=dim, coord_val=coord_val) is enough? Proposal below

Proposal: Labeler class

I propose to use a labeler class with a handful of methods to cover the different labeling tasks. The base task is done with 2 functions, 1st apply the dim+coord labeling function looping over all present dims and 2nd combining the result of this function with the variable name. This could be a single make_label function, but I think a class will be better for extensibility, moreover, a class can also handle labeling when multiple models are present (i.e. plot_forest) and the different needs between plot labels and summary labels.

Checklist

  • Follows official PR format
  • New features are properly documented (with an example if appropriate)?
  • Includes new or updated tests to cover the new feature
  • Code style correct (follows pylint and black guidelines)
  • Changes are listed in changelog

@ahartikainen
Copy link
Contributor

I'm not sure about this.

I kind of want to use 0-based indexing with pystan, and just have a summary with "correct" index.

Would this force me to use 1-based indexing?

@OriolAbril
Copy link
Member Author

Do you only want to see 1 based indexing in summary? or in plots too?

Under the proposal above the coordinate values would be 1-base indexes, to use 0-based indexing you'd have to use isel instead of sel. Otherwise, to me it feels very weird to not be able to use the labels shown in summary to index which is what currently happens with index_origin=1.

@ahartikainen
Copy link
Contributor

Yeah, I forget isel. Yeah let's go forward with this

@codecov
Copy link

codecov bot commented May 26, 2020

Codecov Report

Merging #1201 (423d8a1) into main (29b35ce) will decrease coverage by 0.24%.
The diff coverage is 90.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1201      +/-   ##
==========================================
- Coverage   90.25%   90.00%   -0.25%     
==========================================
  Files         105      107       +2     
  Lines       11419    11528     +109     
==========================================
+ Hits        10306    10376      +70     
- Misses       1113     1152      +39     
Impacted Files Coverage Δ
arviz/plots/backends/bokeh/compareplot.py 90.56% <ø> (-0.51%) ⬇️
arviz/plots/backends/bokeh/distcomparisonplot.py 100.00% <ø> (ø)
arviz/plots/backends/matplotlib/compareplot.py 86.04% <ø> (-0.92%) ⬇️
arviz/plots/khatplot.py 94.87% <ø> (ø)
arviz/plots/plot_utils.py 86.59% <ø> (-2.78%) ⬇️
arviz/plots/separationplot.py 73.68% <ø> (ø)
arviz/utils.py 89.25% <ø> (-0.04%) ⬇️
arviz/plots/backends/matplotlib/loopitplot.py 86.36% <36.36%> (-12.41%) ⬇️
arviz/labels.py 70.37% <70.37%> (ø)
arviz/sel_utils.py 88.23% <88.23%> (ø)
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 29b35ce...d64a5fd. Read the comment docs.

@OriolAbril OriolAbril force-pushed the labels branch 2 times, most recently from d6a85ff to 9175645 Compare June 3, 2020 21:06
@OriolAbril
Copy link
Member Author

There are still some from_xyz functions to update, but I think the bulk of the PR is done. I think it is particularly relevant for users of index_origin which would mostly be @ahartikainen and @sethaxen. I think this will force ArviZ.jl to do a couple updates (I can help) and I do not expect too many things to break but it would be great to double check and more importantly to make sure this approach will help better integrate index_origin argument.


Tasks that will be left pending for other PRs are:

  • update utils.flatten_inference_data_to_dict and to_cds to use coord labels (if possible)
  • add function to customize label format
  • testing?
    • Not sure if a couple tests on generate_dims_coords with index origin would be enough or if from_xyz should also be tested. any preference/idea @canyon289 ?

@canyon289
Copy link
Member

@OriolAbril Can we create issue tickets for the remaining things and merge? Would hate for you to get stuck with a merge conflict as so many files have changed

@OriolAbril
Copy link
Member Author

I could be merged, but this introduces quite some changes in labeling and index_origin usage that affect most of the library. I think it would be better to wait until it has been properly reviewed.

Most of the changes are reorganizing imports, so rebasing should be automatic or not too difficult, I don't mind a couple rebases until it gets some reviews

Copy link
Contributor

@ahartikainen ahartikainen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, added some comments.

Let's discuss how can we handle those comments

arviz/data/base.py Outdated Show resolved Hide resolved
arviz/data/base.py Outdated Show resolved Hide resolved
arviz/data/base.py Outdated Show resolved Hide resolved
arviz/sel_utils.py Show resolved Hide resolved
arviz/sel_utils.py Show resolved Hide resolved
arviz/sel_utils.py Show resolved Hide resolved
arviz/sel_utils.py Outdated Show resolved Hide resolved
arviz/sel_utils.py Outdated Show resolved Hide resolved
@@ -1045,6 +1040,11 @@ def summary(
Dimensions specification for the variables to be used if the ``fmt`` is ``'xarray'``.
credible_interval: float, optional
deprecated: Please see hdi_prob
order
deprecated: order is now ignored.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is order removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or is there a way to select order for unwrapped values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that having informative labels, order is not really necessary (also it is not present in any plot and nobody has complained). High dimension and confusing cases should probably better use dim: coord_value format or something like this.

One of the goals here is to make summary and plots use the same labeling and between removing order from summary or adding order to plots I went for removing order from summary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is fine.

At some point, we probably want to have some way to sort the results (plots/summary etc)
For summary basic pandas works, so I think we don't need to implement anything, but for plots maybe we do need

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weren't we talking about returning xarray's of axes (or other backend entities) as a way to do this? TBH, I don't know if it's possible to move around the subplots in matplotlib -- if this can be done, then the xarray solution would work.

@canyon289
Copy link
Member

Hi, are there things we can do to help get this one merged? Looks like a lot of work

@OriolAbril
Copy link
Member Author

I think the current code is mostly ok and is an improvement over the actual one, not sure it's the way to go though, it still has some clear limitations and it's hard to customize the label formatting.

@ahartikainen
Copy link
Contributor

this can wait, right

@OriolAbril
Copy link
Member Author

Yes

@OriolAbril
Copy link
Member Author

I have rebased the PR and continued the work started here. I proposed to use a Labeller class. The latest commit adds the base class and some subcasses I think will be interesting. I have added some screenshots below, please review.

BaseLabeller (used by default)

image

DimCoordLabeller

image

NoRepeatLabeller

image

MapLabeller

image

arviz/labels.py Outdated Show resolved Hide resolved
arviz/sel_utils.py Outdated Show resolved Hide resolved
arviz/plots/bpvplot.py Outdated Show resolved Hide resolved
arviz/plots/backends/bokeh/bpvplot.py Outdated Show resolved Hide resolved
@OriolAbril
Copy link
Member Author

Updated several plots to see how labeling behaved in different ranges, and while I was at it I added a legend argument to plot_forest. I have set it to true by default, which combined with #1527 leaves this as default:

image

The rest of the plots work more or less like summary above but with plots, note that the screenshots have the wrong default though, we are now using var_name[coord], var_name[dim: coord] or var_name[dim#idx].

@sethaxen
Copy link
Member

I think it is particularly relevant for users of index_origin which would mostly be @ahartikainen and @sethaxen. I think this will force ArviZ.jl to do a couple updates (I can help) and I do not expect too many things to break but it would be great to double check and more importantly to make sure this approach will help better integrate index_origin argument.

What kinds of things do you think might need to be updated on the ArviZ.jl side?

@OriolAbril
Copy link
Member Author

What kinds of things do you think might need to be updated on the ArviZ.jl side?

There are two changes that I think will affect the Julia side, on the user side it should be backwards compatible in the sense that the noticeable changes will be fixing issues.

  1. index_origin things. Up until this PR, when no coord values were provided, the converters generated automatic integer values starting from 0. Then, plots and stats had an index_origin parameter that added 1 to interger coord values so that even though the coord values started at 0, plot labels would start at 1.

From here on, users can choose what to use when labeling, coords, indexes, nothing at all... and the default is set to always use coord values. index_origin is then enforced at idata creation. Therefore, stan users will set index_origin=1 and will get coordinate values starting from 1. They can then subset their data, chains, draws... with .sel using 1-based indexing or with .isel using 0-based indexing.

To keep showing 1 based indexing in plots, the julia converters should be updated to enforce index_origin at creation.

  1. is about index based labeling. Up until now, the info was simply not available to plotting functions, now it is, which means that xarray_var_iter and xarray_sel_iter return 4 elements now instead of 3.

I don't know if these are used at all on the julia side, but if they are everything will break :/

If you point me to the relevant pieces of code I can help with updating

@sethaxen
Copy link
Member

  1. index_origin things. Up until this PR, when no coord values were provided, the converters generated automatic integer values starting from 0. Then, plots and stats had an index_origin parameter that added 1 to interger coord values so that even though the coord values started at 0, plot labels would start at 1.

From here on, users can choose what to use when labeling, coords, indexes, nothing at all... and the default is set to always use coord values. index_origin is then enforced at idata creation. Therefore, stan users will set index_origin=1 and will get coordinate values starting from 1. They can then subset their data, chains, draws... with .sel using 1-based indexing or with .isel using 0-based indexing.

To keep showing 1 based indexing in plots, the julia converters should be updated to enforce index_origin at creation.

Nice! Currently ArviZ.jl handles this by setting the data.index_origin rcParam upon loading arviz:
https://github.com/arviz-devs/ArviZ.jl/blob/v0.4.11/src/setup.jl#L55
If the user doesn't supply index_origin upon creation, does it default to the rcParam, or is the rcParam going away?

  1. is about index based labeling. Up until now, the info was simply not available to plotting functions, now it is, which means that xarray_var_iter and xarray_sel_iter return 4 elements now instead of 3.

I don't know if these are used at all on the julia side, but if they are everything will break :/

No, we don't use these on the Julia side, though ArviZPlots currently does. Should be easy to update though; I'll follow how you do that for arviz's plots implementations.

@OriolAbril
Copy link
Member Author

OriolAbril commented Feb 19, 2021

If the user doesn't supply index_origin upon creation, does it default to the rcParam, or is the rcParam going away?

The rcParam is kept so users can set the default there and get the right coord values without having to set index_origin every single time they call the converter.

No, we don't use these on the Julia side, though ArviZPlots currently does. Should be easy to update though; I'll follow how you do that for arviz's plots implementations.

They are mostly called to then loop over them so I have been updating the for var_name, sel, values in xarray_... to for var_name, sel, isel, values in xarray_.... Then var_names, sel, isel are passed to the labeller class so it has everything available for labeling.

@OriolAbril
Copy link
Member Author

Ready to merge as soon as tests pass!

@OriolAbril OriolAbril merged commit 226d141 into main Feb 21, 2021
@OriolAbril OriolAbril deleted the labels branch February 21, 2021 22:59
utkarsh-maheshwari pushed a commit to utkarsh-maheshwari/arviz that referenced this pull request May 27, 2021
* Base index_origin and make_label changes

* Labeller proposal commit

Adds Labeller base class and some useful subclasses
Updated code in summary to use as example of Labeller class capabilities

* update ess and autocorr plots

* update tests

* update density and bpv plots

* update mpl forestplot

* update more plots

* update more plots and add some docs

* finish rough update of everything

* finetuning

* fix some tests

* more fixes

* last fixes? well... until docs

* more test fixes

* fix summary test

* add some documentation

* black

* update changelog

* doc tweaks

* update changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants