Use a sparse matrix for group specific effects #545

Open
hans-ekbrand opened this issue Jul 12, 2022 · 0 comments

hans-ekbrand commented Jul 12, 2022

Background: I recently observed very large RAM usage when fitting a model and opened an issue on the PyMC Discourse. In that thread Tomás Capretto responded:

"On the other hand, Bambi relies on formulae to generate design matrices, which turns out to generate a regular matrix for group-specific effects. In this case, it’s a very large matrix with almost all zeros, so a sparse matrix would have been better. So I think this explains the large memory comsumption. It is something we still need to improve on our end. However, this matrix is not directly used in the PyMC model, we use slicing to select only non-zero values."

Today this issue hit me again, and this time the RAM requirements were absurd:

```python
my_model = bmb.Model(
    "deprived.of.education ~ (sex|country) + (sex|cluster.id.unique)"
    " + per.cent.muslim.in.country*sex*gdp.log"
    " + per.cent.hindu.in.country*sex*gdp.log"
    " + per.cent.muslim.in.cluster*sex*wealth.at.cluster.level"
    " + per.cent.hindu.in.cluster*sex*wealth.at.cluster.level"
    " + wealth * sex * religion + urbrur",
    df,
    family="bernoulli",
    dropna=True,
)
```

```
Automatically removing 739811/2772473 rows from the dataset.
Unexpected error while trying to evaluate a Variable. <class 'numpy.core._exceptions._ArrayMemoryError'>

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\hanse\miniconda3\lib\site-packages\bambi\models.py", line 147, in __init__
    self._design = design_matrices(formula, data, na_action, 1, extra_namespace)
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\matrices.py", line 523, in design_matrices
    design = DesignMatrices(description, data, env)
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\matrices.py", line 54, in __init__
    self.model.eval(data, env)
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\terms\terms.py", line 1261, in eval
    term.set_data(encoding)
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\terms\terms.py", line 665, in set_data
    self.factor.set_data(True)  # Factor is a categorical term that always spans the intercept
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\terms\terms.py", line 468, in set_data
    component.set_data(spans_intercept_)
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\terms\variable.py", line 109, in set_data
    self.eval_categoric(self._intermediate_data, spans_intercept)
  File "C:\Users\hanse\miniconda3\lib\site-packages\formulae\terms\variable.py", line 166, in eval_categoric
    value = self.contrast_matrix.matrix[x.codes]
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 857. GiB for an array with shape (2032662, 113158) and data type int32
```

There are two random slopes in the model: the first grouping factor has 88 levels and the second has 145,861 levels, so the model itself is not very big. I am happy to provide the data if that would be useful.
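Just to sanity-check the numbers: the shape reported in the error message fully accounts for the 857 GiB on its own, while a sparse layout of the same indicator structure would be several orders of magnitude smaller. The back-of-the-envelope below is only a sketch, and the assumed number of non-zeros per row is a guess rather than something measured from the data:

```python
# Shapes taken from the error message above; the sparse estimate assumes
# only a handful of non-zero entries per row (hypothetical, but typical
# for a one-hot-style group encoding).
n_rows, n_cols = 2_032_662, 113_158
itemsize = 4  # int32

dense_gib = n_rows * n_cols * itemsize / 1024**3
print(f"dense allocation: {dense_gib:,.0f} GiB")  # ~857 GiB, matching the error

nnz_per_row = 4  # assumed: a few indicator/slope columns active per row
sparse_mib = n_rows * nnz_per_row * (itemsize + 4) / 1024**2  # value + column index
print(f"sparse (CSR-ish): {sparse_mib:,.0f} MiB")  # ~62 MiB under these assumptions
```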
