
JOSS review #76

Closed
2 tasks done
theorashid opened this issue Jun 15, 2022 · 14 comments
Labels
enhancement New feature or request


@theorashid
Contributor

theorashid commented Jun 15, 2022

This is for openjournals/joss-reviews#4455.

Thanks for the submission – it's a nice package with a clean codebase. Beyond some edits to the script, for which I opened a PR, there are some other points to consider (happy to discuss).

Before accepting:

  • Looks like jejjohnson (4,286 ++ 1,779 --) committed far more than Daniel-Dodd. The former is not on the paper, the latter is. Does their contribution warrant authorship?
  • the information in CONTRIBUTING.md should be linked in the repo README and the documentation. For those who aren't into software, it's especially important to visibly link to where users can get support (looks like the discussions section to me). I hope this will increase usage of the package beyond the ML community and into applied fields too.

Not essential, but would be nice:

  • API which you can click through on the menu bar rather than one big dump at https://gpjax.readthedocs.io/en/latest/api.html
  • it would be good to automate linting and formatting, ideally using a pre-commit and a workflow
  • numpyro integration. numpyro is much easier to use than tfp. I understand that perhaps distrax works better with tfp, so it would therefore be nice to see distrax in numpyro and gpjax with numpyro – it's all jax
  • some more likelihoods: binomial, Poisson, Student-t, etc.
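For concreteness, a Poisson likelihood with a log link boils down to the following (a standalone sketch, not a proposed GPJax API):

```python
import math

def poisson_log_likelihood(y, f):
    """Log-likelihood of count y under a Poisson observation model with a
    log link, i.e. rate = exp(f) where f is the latent GP value.
    (Illustrative only -- function name and signature are made up.)"""
    rate = math.exp(f)
    # log Poisson pmf: y*log(rate) - rate - log(y!)
    return y * math.log(rate) - rate - math.lgamma(y + 1)

# e.g. observed count y = 2 with latent value f = log(3)
ll = poisson_log_likelihood(2, math.log(3.0))
```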
@theorashid theorashid added the enhancement New feature or request label Jun 15, 2022
@thomaspinder thomaspinder mentioned this issue Jun 15, 2022
@thomaspinder
Collaborator

Hey @theorashid, thanks for giving us such a helpful set of comments. I'll address them in order below.

API which you can click through on the menu bar rather than one big dump at https://gpjax.readthedocs.io/en/latest/api.html

The API page has a right-hand menu which is structured as module > class. May I ask what benefit you feel having a menu bar would give over the current right-hand menu?

numpyro integration. numpyro is much easier to use than tfp. I understand that perhaps distrax works better with tfp, so it would therefore be nice to see distrax in numpyro and gpjax with numpyro – it's all jax

I do not have a great deal of experience with NumPyro. We chose to offload MCMC sampling into BlackJax and optimisation to Optax as they were both quite "research focussed" packages which aligned with the principles in GPJax. I'm curious, would you mind expanding on the advantages that NumPyro integration would give in comparison to Optax and BlackJax?

the information in CONTRIBUTING.md should be linked in the repo README and the documentation. For those who aren't into software, it's especially important to visibly link to where users can get support (looks like the discussions section to me). I hope this will increase usage of the package beyond the ML community and into applied fields too.

Good point - this has been resolved here.

it would be good to automate linting and formatting, ideally using a pre-commit and a workflow

Cool suggestion. A pre-commit automation that uses the black formatter has been added here.
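For reference, a minimal `.pre-commit-config.yaml` along these lines would run black on every commit (the hook revision below is illustrative, not necessarily what we pinned):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0   # illustrative pin; use the current release
    hooks:
      - id: black
```

Contributors then run `pre-commit install` once, after which the hook fires on each commit.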

some more likelihoods: binomial, Poisson, Student-t etc

This is on our radar. As is a guide detailing custom implementation. We would like to give it a bit more thought though as our end goal is to have an API such that one can simply inherit a Distrax distribution and define one or two methods.

Looks like jejjohnson (4,286 ++ 1,779 --) committed far more than Daniel-Dodd. The former is not on the paper, the latter is. Does their contribution warrant authorship?

A major reason for jejjohnson's large number of committed lines of code is that we originally used .ipynb files for notebooks, not .pct.py files. Each time a notebook was run, the corresponding commit would therefore contain hundreds of lines of notebook output. This also explains why I have such a large number of committed lines. jejjohnson contributed a series of functions that enabled integration with NumPyro. However, this was in a very early version of GPJax (last commit was March 2021) and the package has evolved significantly since then.

Many of the commits made by Daniel-Dodd have been attributed to the user Daniel Dodd due to a changing email address. The true number of lines committed by Daniel-Dodd is actually ++2813 --2012. Even this line count does not truly reflect his contributions, though, as he is the main contributor to the sparse framework within GPJax and was heavily influential in curating the example notebooks within the documentation. I am very happy to discuss the criteria for authorship more if you disagree.

@theorashid
Contributor Author

Many of those points are fair and I've ticked them off. I don't have much experience with pre-commit but make sure it runs for every PR, and maybe add it to the CONTRIBUTING.md. Glad it was easy enough to implement.

May I ask what benefit you feel having a menu bar would give over the current right-hand menu?

Ah okay, I didn't notice the toggle for the menu in the top right. I see it now and this is all good. If there's a way to set it as open by default, that would be great.

I do not have a great deal of experience with NumPyro. We chose to offload MCMC sampling into BlackJax and optimisation to Optax as they were both quite "research focussed" packages which aligned with the principles in GPJax. I'm curious, would you mind expanding on the advantages that NumPyro integration would give in comparison to Optax and BlackJax?

It's more for the cases when you want to use GPs as part of a larger model. In spatial modelling, perhaps you only want a Gaussian process over space within a larger model that has several covariates added as $\beta \times \text{covariate}$ terms. Although, now searching for this, I'm not sure how many examples there are of these beyond forum posts or custom GP implementations. Perhaps the editor @dfm has better ideas about this and whether it is achievable.
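To sketch the kind of model I mean (pure NumPy, all names and kernel choices illustrative):

```python
import numpy as np

# A "GP + covariates" spatial model: the linear predictor is
# eta = X @ beta + f, where f ~ GP(0, k) over spatial coordinates.

def rbf_kernel(s, lengthscale=0.25, variance=1.0):
    """Squared-exponential kernel on a set of coordinates s (n, d)."""
    d2 = np.sum((s[:, None, :] - s[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
n, p = 50, 3
coords = rng.uniform(0.0, 1.0, size=(n, 2))  # spatial locations
X = rng.normal(size=(n, p))                  # covariates
beta = np.array([0.5, -1.0, 2.0])            # fixed effects

K = rbf_kernel(coords) + 1e-5 * np.eye(n)    # jitter for stability
f = np.linalg.cholesky(K) @ rng.normal(size=n)  # draw f ~ N(0, K)
eta = X @ beta + f                           # combined linear predictor
```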

As for NumPyro specifically, I just prefer using it for model building to tfp. And it's written in jax. So, taking from your tfp example, it would be nice if you could replace

-priors["kernel"]["lengthscale"] = tfd.Gamma(concentration=jnp.array(1.0), rate=jnp.array(1.0))
+priors["kernel"]["lengthscale"] = numpyro.sample("rho", numpyro.distributions.Gamma(1.0, 10.0))

and equally the numpyro MCMC interface is much simpler imo than tfp with its confusing trace_fn argument – you can just use mcmc.run(). This would encourage users from applied fields with less MCMC experience.

Personally, I would always use blackjax for mcmc over tfp, especially with their new API.

We would like to give it a bit more thought though as our end goal is to have an API such that one can simply inherit a Distrax distribution and define one or two methods.

I guess you could look towards how the tinygp library has done this with numpyro.

@thomaspinder
Collaborator

Ah okay, I didn't notice the toggle for the menu in the top right. I see it now and this is all good. If there's a way to set it as open by default, that would be great.

I will investigate this.

It's more for the cases when you want to use GPs as part of a larger model.

Yes, very cool. I really like this idea. If you were keen, I'd be very interested to talk more about the logistics of doing this with a nice real-world example in GPJax.

To your points on NumPyro and likelihood design. I think to do them cleanly and correctly will be quite a large piece of work. Over the next month, I'm away at conferences quite a bit and writing up my thesis, so will struggle to properly sit down and work out how best to do this. How much of a priority are they to you with regards to accepting GPJax? If they're crucial, then I can look at re-jigging my schedule a little to make time for them.

@theorashid
Contributor Author

Sorry if it wasn't clear: only the two points on the checklist at the top and the paper were essential to accepting the submission. In my view, the codebase is more than ready to be accepted – I just need to read and edit the latest version of the manuscript. The numpyro stuff and menu are not essential at all. Obviously there's another reviewer but great job.

Yes, very cool. I really like this idea. If you were interested, I'd be very interested to talk more about the logistics of doing this with a nice real-world example in GPJax.

I'd be really interested – particularly for my own work. I think a spatiotemporal problem would be a nice way to involve that community with your package. A good example dataset would be the Hungary Chickenpox dataset. I would like to do this with a graph GP over space combined via a Kronecker product with a GP over time (see Fast Hierarchical GPs) if that's possible. It's also a Poisson likelihood and we could try to add a covariate in for a larger model. Perhaps when you're more free, we could sit down for a chat about implementing this – many in my spatiotemporal epidemiology crowd would be interested.

Also, leave this issue open for now and I'll put any comments I have about the paper here before closing it.

@thomaspinder
Collaborator

I'd be really interested – particularly for my own work.

Great. I'm very interested to explore this more, particularly the combination of a graph component with a temporal component. I have opened discussion thread #80 for us to continue discussing this idea publicly.

Also, leave this issue open for now and I'll put any comments I have about the paper here before closing it.

Sure! I look forward to any more thoughts you have on the package.

@theorashid
Contributor Author

theorashid commented Jun 20, 2022

Paper looks in good shape. Notes:

  • The two R packages you cited are very niche and I think you can drop them. I still can't believe the lack of GP software for R. Having done some digging, I found a paper for GPfit in R, which cites two older packages, tgp and mlegp. GPfit hasn't been updated for 3 years, and the other two look niche and not widely used. I asked my supervisor and he said GPstuff has a port for R, and people also use INLA and mgcv. I would probably say "...include GPML and GPstuff in Matlab, the latter of which is also available in R, GaussianProcess.jl..." (word it however you want – how I did it there is quite clunky). Or maybe ask some R users what they do – there must be a lot of custom GP code.
  • Stheno also has a python version with jax functionality. This is worth mentioning in the bayesnewton/tinygp section, because Stheno does have inducing point functionality. I haven't used Stheno much before, and I'm sure there are plenty of things GPJax does better.
  • Up to you who you acknowledge – perhaps Erik Bodin's 2 lines were life-changing. But you can do what you want in this section

@thomaspinder
Collaborator

Hey @theorashid. I've just updated the paper to reflect your most recent comments. My labmate works exclusively in R and tells me all of his GP code is written from scratch. Besides Stan, he was also unaware of any popular R packages for GPs. With that in mind, I've limited the R acknowledgements to the GPstuff port that you flagged above.

@theorashid
Contributor Author

Looks like a minor typo "theGaussianProcesses,jl". Worth doing a final check that the references are rendering as you expect too

Once that is fixed, I think we just need to wait for the other reviewer's comments.

@daniel-dodd
Member

Hey @theorashid. Many thanks for your constructive comments. The NumPyro idea is cool!

@theorashid, @thomaspinder, I have conducted a pass through the paper with fresh eyes and have condensed it down. All changes are on the master branch (apologies - forgot to create a PR - the commits in order are 6c26920, dc20335 & 11806f6). The major change was collapsing the Summary section into a single paragraph (to alleviate repetitions) and I addressed the typo you mentioned above @theorashid. Keen to hear both your thoughts on this.

@thomaspinder, are you able to correct the formatting of (e.g., (Hensman et al., 2013)) to (e.g., Hensman et al., 2013), please? (I am unsure how to do this...)

@thomaspinder
Collaborator

Looks like a minor typo "theGaussianProcesses,jl". Worth doing a final check that the references are rendering as you expect too

Once that is fixed, I think we just need to wait for the other reviewer's comments.

@theorashid Nice catch. The typo has been fixed above by Daniel.

@daniel-dodd happy with your changes and the references are now updated, as per your request.

@gpleiss

gpleiss commented Jun 30, 2022

Hi @thomaspinder and @daniel-dodd, here's some additional items from my review:

  • Issue: GPU setup is a bit challenging. To get GPU support, I have to pip install --upgrade "jax[cuda]" before installing GPJax. However, GPJax's requirements lock the JAX version to a specific release (0.3.2). I got installation to work, but it's a bit clunky. I would recommend that the authors:
    • Create a CUDA-specific install (e.g. pip install "gpjax[cuda]") that would install jax[cuda]
    • Make the requirements less strict (unless there is a good reason to pin them?)
    • It's possible that there's an easier way to get CUDA support (I haven't used JAX much!). @thomaspinder I'm curious to hear your thoughts.
  • Nice to have: Since the software is designed to pair well with the GPML book, it would be beneficial to include more of the kernels discussed in that book. Right now the set of kernels is limited (although the included kernels are arguably the most useful). Off the top of my head, it would be useful to include the RQ kernel, the periodic kernel, the linear kernel, and maybe something more advanced like the Gibbs kernel.
  • Nice to have: It would be good to have at least one code example with higher dimensional data (e.g. a UCI dataset). I recognize that 1D data is much nicer to visualize, but having a "real dataset" would make for a more comprehensive example, especially for very new users.
  • Issue: You should set up a GitHub action to automatically run your tests for PRs and commits to master.
  • Suggestion: It could be useful to have some larger end-to-end tests (beyond the unit tests) to make sure that all the pieces stay glued together. The pyro folks have a very clever way of doing this: they run all of their example notebooks as a smoke test.
  • Nit: the "statement of need" in the paper doesn't actually address the need; that comes in the "wider software ecosystem" section. I would suggest that the authors reorganize these two sections?

Overall, I'm really excited about this library, excellent work! I think the things marked as issue are important, but everything else is merely suggestion.
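For concreteness, the standard textbook (GPML) forms of two of the kernels I mentioned, as a standalone NumPy sketch (parameter names are illustrative, not GPJax's):

```python
import numpy as np

def rational_quadratic(r, variance=1.0, lengthscale=1.0, alpha=1.0):
    """RQ kernel on distance r: sigma^2 * (1 + r^2 / (2*alpha*l^2))^(-alpha).
    A scale mixture of RBF kernels; recovers the RBF as alpha -> inf."""
    return variance * (1.0 + r**2 / (2.0 * alpha * lengthscale**2)) ** (-alpha)

def periodic(r, variance=1.0, lengthscale=1.0, period=1.0):
    """Periodic kernel: sigma^2 * exp(-2 * sin^2(pi * r / p) / l^2)."""
    return variance * np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / lengthscale**2)
```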

@thomaspinder
Collaborator

Thanks @gpleiss for such helpful comments! I'm currently away at ISBA, so I'll address your comments on Monday once I'm home. Having read them though, I'm confident that we can resolve them all promptly.

@thomaspinder
Collaborator

Hi @gpleiss.
I have resolved your issues now with a few follow up questions/comments.

  • I have added a [cuda] flag to GPJax's requirements that installs jax[cuda], as you suggested. Looking at other popular Jax-based libraries though, it seems as though most simply install regular Jax e.g., jraph and BrainPy, the latter having a cuda flag that isn't actioned. Despite this though, I agree with your suggestion of adding a cuda flag, however, in your experience with Torch/GPyTorch, has supporting such functionality ever caused headaches?
  • Additional kernels and a high-dimensional example are currently being developed here
  • My understanding was that the Github action in .github/workflows/workflow-master.yml runs unit tests on push and PR to master branch. Did you have something different in mind?
  • I have now included a UCI regression benchmark task for the yacht dataset (notebook here).
  • The paper has been refactored to better respond to the statement of need.
  • I very much like your suggestion for larger end-to-end tests. I will sit down with Dan and sketch out what these may look like.

Thanks again for providing helpful comments as a reviewer.
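As a first sketch of what such an end-to-end test might assert (pure NumPy, not GPJax code): fit exact GP regression to synthetic data and check that the posterior mean tracks the underlying signal.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    """Squared-exponential kernel between 1D input vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
signal = np.sin(2 * np.pi * x)
y = signal + 0.05 * rng.normal(size=x.size)  # noisy observations

noise = 0.05**2
K = rbf(x, x) + noise * np.eye(x.size)       # noisy Gram matrix
alpha = np.linalg.solve(K, y)
mean = rbf(x, x) @ alpha                     # posterior mean at the inputs

rmse = np.sqrt(np.mean((mean - signal) ** 2))
assert rmse < 0.1  # the fit should track the underlying signal
```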

@gpleiss

gpleiss commented Jul 20, 2022

Thanks for addressing my comments!

I have added a [cuda] flag to GPJax's requirements that installs jax[cuda], as you suggested. Looking at other popular Jax-based libraries though, it seems as though most simply install regular Jax e.g., jraph and BrainPy, the latter having a cuda flag that isn't actioned. Despite this though, I agree with your suggestion of adding a cuda flag, however, in your experience with Torch/GPyTorch, has supporting such functionality ever caused headaches?

Hmm... I personally haven't noticed a headache with Torch/GPyTorch, and I don't think we've ever had any GitHub issues about cuda installation. There probably is a better pattern out there for cuda installation, but I don't think that this project has to figure it out 😄

My understanding was that the Github action in .github/workflows/workflow-master.yml runs unit tests on push and PR to master branch. Did you have something different in mind?

Oops! I must have missed this. The GitHub action is exactly what I had in mind.

I have now included a UCI regression benchmark task for the yacht dataset (notebook here).

The notebook looks awesome! I really like the diagnostic plots (and I am definitely stealing the idea for our gpytorch tutorials 😉)

I very much like your suggestion for larger end-to-end tests. I will sit down with Dan and sketch out what these may look like.

Awesome!
