
Add Bayesian Optimisation to AB testing example #1250

Merged
merged 19 commits into dev from ab-test-bayes-opt
Jul 25, 2018
Conversation

@ae-foster ae-foster (Contributor) commented Jul 20, 2018

To complete our first OED example, I've added a GP-based Bayesian optimizer based on http://pyro.ai/examples/bo.html. It uses Thompson sampling to acquire new data in batches.

Summary of changes:

  • ab_test.py - optimizing the APE using GPBayesOptimizer
  • gp_bayes_opt.py - new code implementing the BO example with Thompson sampling: probably need to move this somewhere else or can we merge it with existing BO code?
  • gpr.py - added a method to lazily create a multivariate sample from a Gaussian process posterior
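Since the acquisition strategy may be unfamiliar: in batch Thompson sampling we draw independent sample paths from the GP posterior and minimize each one, yielding a batch of new evaluation points. A minimal numpy sketch over a finite candidate grid (a toy stand-in for the PR's lazy, continuous-domain sampler; all names here are illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Squared-exponential kernel between two sets of 1-d points
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def thompson_batch(X_obs, y_obs, candidates, num_draws=4, jitter=1e-6, seed=0):
    # Posterior mean and covariance of the GP over the candidate grid
    rng = np.random.default_rng(seed)
    Kff = rbf_kernel(X_obs, X_obs) + jitter * np.eye(len(X_obs))
    Kfs = rbf_kernel(X_obs, candidates)
    Kss = rbf_kernel(candidates, candidates)
    A = np.linalg.solve(Kff, Kfs)            # Kff^{-1} Kfs
    mean = A.T @ y_obs
    cov = Kss - Kfs.T @ A
    # Symmetrize and add jitter so the Cholesky factorization succeeds
    cov = 0.5 * (cov + cov.T) + jitter * np.eye(len(candidates))
    L = np.linalg.cholesky(cov)
    # Each independent posterior draw proposes its own point:
    # the minimizer of that draw over the candidate set
    batch = []
    for _ in range(num_draws):
        draw = mean + L @ rng.standard_normal(len(candidates))
        batch.append(candidates[np.argmin(draw)])
    return np.array(batch)

X_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([1.0, 0.2, 0.8])
cands = np.linspace(0.0, 2.0, 21)
batch = thompson_batch(X_obs, y_obs, cands)
```

The PR instead samples the GP lazily (iter_sample) over a continuous domain and minimizes each draw with LBFGS, but the acquisition logic is the same.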

@ae-foster ae-foster added the WIP label Jul 20, 2018
@ae-foster ae-foster requested review from eb8680 and fehiepsi July 20, 2018 01:48
@fritzo fritzo (Member) commented Jul 20, 2018

@ae-foster to get CI tests to run, you'll need to merge dev into your branch and then push.

@ae-foster ae-foster force-pushed the ab-test-bayes-opt branch from d746259 to c08851a on July 20, 2018 18:26
@ae-foster ae-foster left a comment:

There are some arbitrary numerics decisions in here. It would also be nice to make the GP-BO and iter_sample more "pyronic".

    # transform x to an unconstrained domain
    unconstrained_x_init = transform_to(self.constraints).inv(x_init)
    unconstrained_x = torch.tensor(unconstrained_x_init, requires_grad=True)
    minimizer = optim.LBFGS([unconstrained_x], max_eval=20)
ae-foster (Contributor Author):

Cap evals at 20 to avoid some numerical instability
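For context, the transform-to-unconstrained trick in the snippet above can be illustrated without torch at all: reparametrize the constrained variable through an invertible map and optimize the unconstrained one. A dependency-free sketch, using a sigmoid for a hypothetical (0, 1) interval constraint in place of Pyro's transform_to:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Minimize f(x) = (x - 0.8)^2 subject to 0 < x < 1 by setting
# x = sigmoid(z) and running unconstrained gradient descent on z.
def minimize_constrained(lr=0.5, steps=300, z0=0.0):
    z = z0
    for _ in range(steps):
        x = sigmoid(z)
        dfdx = 2.0 * (x - 0.8)   # gradient of the objective in x
        dxdz = x * (1.0 - x)     # derivative of the sigmoid
        z -= lr * dfdx * dxdz    # chain rule: df/dz
    # Note z itself is unbounded, which mirrors the overflow the
    # PR guards against with its log|x| > 25 cutoff in the closure.
    return sigmoid(z)

x_opt = minimize_constrained()
```

The optimum in x always satisfies the constraint by construction, which is the whole point of optimizing in the unconstrained space.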

Member:

Should we add a remind comment here to use LBFGS with line_search when pytorch/pytorch#8824 is reviewed and merged? It might be more stable.

ae-foster (Contributor Author):

Sure :) I didn't realize that was in the pipeline


    def closure():
        minimizer.zero_grad()
        if (torch.log(torch.abs(unconstrained_x)) > 25.).any():
ae-foster (Contributor Author):

Somewhat arbitrary cutoff, but necessary to prevent overflow during the LBFGS run.

    # ax1.set_xlim(-1, 101)
    # ax1.set_title("Find {}".format(xlabel))
    # if with_title:
    #     ax1.set_ylabel("Gaussian Process Regression")
ae-foster (Contributor Author):

Commented code is unsightly but useful for debugging. test_examples will not admit matplotlib in example code.

Member:

Feel free to use local imports for plotting dependencies:

def plot(self, gs, ...):
    from matplotlib import pyplot as plt
    ...

That's not perfect, but it's better than commented-out code 😉

Member:

You should probably just add a separate test for this and remove the test code here

ae-foster (Contributor Author):

I'll remove the code, keep it on a local branch

    # No noise, just jitter for numerical stability
    Kffnew[N, N] = end + self.jitter
    # Heuristic to avoid adding degenerate points
    if Kffnew.logdet() > -15.:
ae-foster (Contributor Author):

Arbitrary cutoff to prevent Kff from becoming singular.

    # Todo use pyro.sample
    d = normal.Normal(torch.tensor(0.), torch.tensor(1.))
    # Reparametrize explicitly - aids autograd
    ynew = (loc + self.mean_function(xnew)) + d.sample()*cov.sqrt()
ae-foster (Contributor Author):

It would be nice to make this a true pyronic sampler if possible

Member:

Why can't you write torch.distributions.Normal(loc + self.mean_function(xnew), cov.sqrt()).rsample()?

ae-foster (Contributor Author):

Yes, I had it in my mind that that didn't work, but it does.

    X = torch.cat([self.gpmodel.X, X])
    y = torch.cat([self.gpmodel.y, y])
    self.gpmodel.set_data(X, y)
    self.gpmodel.optimize()
ae-foster (Contributor Author):

Not very pyronic

Member:

Can we do some GP magic to make these updates cheaper than recomputing the full posterior?

ae-foster (Contributor Author):

Yes, it depends how much magic you want. Here https://github.com/uber/pyro/blob/dev/pyro/contrib/gp/models/gpr.py#L80 we could cache Kff and just add the kernel computations we need. One step further is to use the magic of Schur complements https://en.wikipedia.org/wiki/Schur_complement to avoid inverting the whole kernel matrix. I actually started implementing a Schur complement method for iter_sample, but in the end the existing solution works and reuses more existing code. In either case, we would have to refactor the internals of GPRegression.
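The Schur complement idea mentioned above can be sketched concretely: if the inverse of the old kernel matrix A is already known, the inverse of the grown matrix [[A, B], [B.T, C]] only requires inverting the small Schur complement S = C - B.T A^{-1} B, not the whole matrix. A numpy illustration (names hypothetical):

```python
import numpy as np

def grow_inverse(A_inv, B, C):
    # Block inverse via the Schur complement of A:
    # only S (which is small when few points are added) is inverted.
    S = C - B.T @ A_inv @ B
    S_inv = np.linalg.inv(S)
    top_left = A_inv + A_inv @ B @ S_inv @ B.T @ A_inv
    top_right = -A_inv @ B @ S_inv
    return np.block([[top_left, top_right],
                     [top_right.T, S_inv]])

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
K = M @ M.T + 5.0 * np.eye(5)   # SPD stand-in for a kernel matrix
A, B, C = K[:3, :3], K[:3, 3:], K[3:, 3:]
K_inv = grow_inverse(np.linalg.inv(A), B, C)
```

Here the result matches np.linalg.inv(K) while only ever inverting 3x3 and 2x2 blocks.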

Member:

Let's open a separate issue about this, it seems like generally useful functionality for contrib.gp

fehiepsi (Member), Jul 21, 2018:

I guess it is not expensive for Bayesian Optimization because num_data in BO is small. When num_data is large, I guess the most expensive operation is Cholesky(Kff).

ae-foster (Contributor Author):

I think the other important factor is how you plan to add points to the GP: all at once or drip by drip. In the all-at-once setting, the current code is optimal; you want to avoid explicitly inverting Kff, so using Cholesky plus trtrs is best. On the other hand, if you are going drip by drip, you can invert the small kernel matrices formed from the new points and then get the overall precision matrix using Schur complements. But yes, I'll open an issue about this.
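The "Cholesky plus trtrs" pattern referred to above is worth spelling out: to compute alpha = Kff^{-1} y, factor Kff = L L^T once and do two triangular solves rather than forming the explicit (less stable) inverse. A numpy sketch, where np.linalg.solve stands in for a dedicated triangular solve such as torch.trtrs or scipy.linalg.solve_triangular:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
Kff = M @ M.T + 4.0 * np.eye(4)   # SPD stand-in for a kernel matrix
y = rng.standard_normal(4)

L = np.linalg.cholesky(Kff)       # Kff = L @ L.T
w = np.linalg.solve(L, y)         # forward solve:  L w = y
alpha = np.linalg.solve(L.T, w)   # backward solve: L^T alpha = w
```

The two solves recover exactly Kff^{-1} y, and the Cholesky factor can be reused for every right-hand side.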


@eb8680 eb8680 left a comment:

Nice work! I think the algorithmic content is pretty solid, most of my comments are about code organization.


    @@ -115,7 +115,7 @@ def loss_and_grads(self, model, guide, *args, **kwargs):
        if trainable_params and getattr(surrogate_elbo_particle, 'requires_grad', False):
            surrogate_loss_particle = -surrogate_elbo_particle / self.num_particles
    -       surrogate_loss_particle.backward()
    +       surrogate_loss_particle.backward(retain_graph=True)
Member:

This change has some performance and memory implications. I'll push a fix today as part of #1227 that exposes a differentiable_loss function that you can use when you need retain_graph=True

ae-foster (Contributor Author):

OK, I'll rebase from dev once that is merged.

Member:

Added in #1252

Member:

Let's remove this change and switch to using TraceEnum_ELBO below

    :rtype: function
    """
    # Make these visible in the inner function
    global X, y, Kff, N
Member:

Instead of using global here and below, how about making sample_next take these variables as arguments and returning a curried function with functools.partial?

def sample_next(X, y, Kff, N, xnew):
    ...

return functools.partial(sample_next, X, y, Kff, N)

ae-foster (Contributor Author):

The reason I didn't do that was that we change X inside the inner function (by writing X = Xnew). I thought the scoping rules would mean that change wasn't saved, but honestly I didn't actually try it.

Member:

Will LBFGS add many samples to the globals X and y during its optimization of the sampler?

ae-foster (Contributor Author):

Yes it will, but I passed max_eval=20 to cap this.

Member:

You can also put these variables into a dictionary and mutate that instead of using global
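The dict suggestion works because, while a closure cannot rebind a name from the enclosing scope (without nonlocal), it can freely mutate a dict it closes over. A minimal sketch with hypothetical names (the 2.0 * xnew "GP draw" is a toy stand-in):

```python
def make_sampler():
    # Growing state like X and y lives in one shared dict
    # instead of module-level globals.
    state = {"X": [], "y": []}

    def sample_next(xnew):
        ynew = 2.0 * xnew           # hypothetical stand-in for a GP draw
        state["X"].append(xnew)     # mutation persists across calls
        state["y"].append(ynew)
        return ynew

    return sample_next, state

sampler, state = make_sampler()
sampler(1.0)
sampler(3.0)
print(state["X"])  # [1.0, 3.0]
```

Each call appends to the shared state, giving the same accumulate-as-you-sample behavior as the global version without polluting the module namespace.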


    :rtype: torch.Tensor
    """

    if method == "Thompson":
Member:

Let's split this out into an acquisition function parameter instead of implementing acquisition functions inside acquire, so it's easier to experiment with different Monte Carlo acquisition functions.
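A sketch of the suggested refactor (all names hypothetical): the acquisition strategy becomes a function passed into acquire, so new Monte Carlo acquisition functions can be swapped in without touching the acquisition loop:

```python
def acquire(candidates, posterior_draws, acquisition_fn):
    # Score every candidate under the supplied acquisition function
    # and return the minimizer (we minimize, as with APE in the example).
    scores = [acquisition_fn(d) for d in posterior_draws]
    best = min(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

def thompson(draw):
    # Thompson sampling scores a candidate by a single posterior draw
    return draw

cands = [0.0, 0.5, 1.0]
draws = [0.9, 0.1, 0.4]   # toy posterior draws at each candidate
best = acquire(cands, draws, thompson)
print(best)  # 0.5
```

Swapping in a different acquisition function is then a one-argument change rather than a new branch inside acquire.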

    from torch.distributions import transform_to


    class GPBayesOptimizer:
Member:

We should be able to refactor this a bit into a pyro.optim.multi.MultiOptimizer so that it looks more like other Pyro optimizers. That way when we experiment with other optimizers e.g. gradient-based optimizers it'll be easy to swap them out

ae-foster (Contributor Author):

I think it's best to talk this over together: we need to implement a step function which would replace my run function. Would we also need to refactor the example to use pyro.param? Right now, it's not really true that after a step the params are updated to new near-optimal values, but obviously I could add self.opt_differentiable(lambda x: self.gpmodel(x)[0]) into each step (minimizing the current GP mean function).


    class GPBayesOptimizer:
        """Performs Bayesian Optimization using a Gaussian Process as an
        emulator for the unknown function.
Member:

Once the code is a little more settled down, you should expand this docstring and add a couple usage examples

    @@ -0,0 +1,164 @@
    # import matplotlib.gridspec as gridspec
    # import matplotlib.pyplot as plt
Member:

Remove these as discussed below

@fehiepsi fehiepsi left a comment:

@ae-foster Do I understand correctly that LBFGS will add many samples to the globals X and y during its optimization of the sampler, and that this is how Thompson sampling works?

    :rtype: tuple
    """

    x_init = self.gpmodel.X[-1:].new_empty(1).uniform_(self.constraints.lower_bound,
Member:

nit: no need for X[-1:] here, just X is enough

    candidates = []
    values = []
    for j in range(num_candidates):
        x, y = self.find_a_candidate(differentiable, x_init)
Member:

Should we use different init points for each candidate?

"""

# Initialize the return tensor
X = torch.zeros(num_acquisitions, *self.gpmodel.X.shape[1:])
Member:

nit: it is better to use self.gpmodel.new_empty here

"""
# transform x to an unconstrained domain
unconstrained_x_init = transform_to(self.constraints).inv(x_init)
unconstrained_x = torch.tensor(unconstrained_x_init, requires_grad=True)
Member:

nit: use new_tensor here


    conditioning on previously sampled values.
    """
    if torch.isnan(xnew).any():
        raise ValueError("Cannot evaluate GP at value: {}".format(xnew))
Member:

nit: you might want to use warn_if_nan https://github.com/uber/pyro/blob/dev/pyro/util.py#L49

    vi_parameters={
        "guide": guide,
        "optim": optim.Adam({"lr": 0.0025}),
        "loss": Trace_ELBO(),
Member:

Let's just use "loss": TraceEnum_ELBO(strict_enumeration_warning=False).differentiable_loss here instead

fehiepsi previously approved these changes Jul 24, 2018

@fehiepsi fehiepsi left a comment:

I have not reviewed the Thompson sampling algorithm because I am not familiar with it, but overall LGTM.

@ae-foster ae-foster (Contributor Author):

We'll need #1252 merged for the CI to pass (currently getting the pytorch error on example). After that I think we're good to go

@ae-foster ae-foster force-pushed the ab-test-bayes-opt branch from 059832a to 0fffde8 on July 25, 2018 01:45
@ae-foster ae-foster (Contributor Author):

@eb8680 the CI is passing now :)

@eb8680 eb8680 merged commit b99cdf4 into dev Jul 25, 2018
@eb8680 eb8680 deleted the ab-test-bayes-opt branch July 25, 2018 20:56
breakdawn pushed a commit to breakdawn/pyro that referenced this pull request Aug 15, 2018
* Add Bayesian optimisation to AB test example

* Fix bug- cannot mean-centre like that

* Move GP optim to separate file, add batch acquisition

* Thompson sampling - messy version

* Tidy up BO for OED

* Lint

* Remove comments

* Run both estimated and true APE in example

* Clean up from rebase

* Cap LBFGS max_evals, some numerics magic for benefit of example

* Renamed arg

* Remove commented code

* Don't need to manually reparametrise

* Don't offer keyword arguments to acquire, call it acquire_thompson instead

* Address further review comments

* Make GPBayesOptimizer a MultiOptimizer; warn_if_nan

* Remove global keyword, replace with dict

* Put retain_graph=True in svi

* GP uses right loss