MemoryError issue in predict method for ConformalPredictiveSystem() #9

Closed
christopherjluke opened this issue Apr 3, 2023 · 8 comments


@christopherjluke

Hello, has anyone run into an issue with using the predict method with CPS? I am using this on a simple linear regression toy model with about 480,000 observations, since this will eventually need to scale to a much larger dataset.

I fit the normalized Conformal Predictive System following the same steps as the notebook, but when I run the predict method I keep getting this error:

MemoryError: Unable to allocate 939. KiB for an array with shape (120249,) and data type float64

The error was traced back to this portion of the base.py code:
```
--> 344 cpds = np.array([y_hat[i]+sigmas[i]*self.alphas
    345                  for i in range(len(y_hat))])
```

The CPS was fit with the residuals for the calibration set, so I am wondering if that is the issue? I have plenty of memory available to run something like this locally, gigabytes' worth, so the inability to allocate such an insignificant amount of memory is strange to me.
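
For reference, here is a minimal reconstruction of my setup (not the exact notebook code; the data split, the unit difficulty estimates, and the fit/predict keyword arguments are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from crepes import ConformalPredictiveSystem

# Toy data roughly matching the sizes described above (assumed split).
rng = np.random.default_rng(0)
X = rng.normal(size=(480_000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=480_000)
X_train, X_cal, X_test = X[:240_000], X[240_000:360_000], X[360_000:]
y_train, y_cal = y[:240_000], y[240_000:360_000]

learner = LinearRegression().fit(X_train, y_train)
sigmas_cal = np.ones(len(X_cal))     # stand-in difficulty estimates
sigmas_test = np.ones(len(X_test))

cps = ConformalPredictiveSystem()
cps.fit(residuals=y_cal - learner.predict(X_cal), sigmas=sigmas_cal)

y_hat_test = learner.predict(X_test)
# This call internally builds a (len(X_test), len(X_cal)) float64 array,
# which is where the MemoryError is raised:
cps.predict(y_hat_test, sigmas=sigmas_test,
            lower_percentiles=2.5, higher_percentiles=97.5)
```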

@christopherjluke
Author

I have tried running the deployment in Azure with far more memory, and the .predict method for the Conformal Predictive System kills the kernel. I know it is not an issue with the dataset or the code, since I tried it on a trimmed-down version of the data and it worked fine. Has anyone else run into this?

@henrikbostrom
Owner

Thanks for pointing this out! I am not sure what the problem could be, but it would be helpful if you could show exactly what the call to the method looks like. It would also be great if you could try the most recent version (0.3.0), released after you raised this issue, and report any errors with respect to the new line numbers.

@SebastianLeborg

What is the size of your test set? I've struggled quite a bit with exactly these lines myself, and if we have the same issue, this comes down to memory complexity.

```python
cpds = np.array([y_hat[i]+sigmas[i]*self.alphas for i in range(len(y_hat))])
```

self.alphas has the same shape as your calibration set, whilst y_hat has the same shape as your test set. This line of code creates a copy of the entire self.alphas array for every single sample in your test/forecasting set. So a calibration set and a test set with 100k samples each would result in roughly 74.5 GiB being allocated (100,000 × 100,000 × 8 bytes for the float64 datatype). This is far too much for most machines, causing an OOM crash.
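
To make the arithmetic concrete, a quick back-of-envelope in plain numpy (the sizes are hypothetical):

```python
import numpy as np

n_cal, n_test = 100_000, 100_000   # hypothetical calibration/test sizes

# The list comprehension materializes one calibration-sized row per test
# sample, i.e. an (n_test, n_cal) float64 matrix:
bytes_needed = n_test * n_cal * np.dtype(np.float64).itemsize
print(f"{bytes_needed / 2**30:.1f} GiB")   # 74.5 GiB
```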

@henrikbostrom
Owner

The total number of elements in the resulting matrix equals the product of the number of calibration and test objects, unless a Mondrian approach is employed; for the latter, this number is divided by the number of bins. So if you indeed want to use on the order of a hundred thousand instances for calibration, together with large test sets, a Mondrian approach is strongly suggested, using as many bins as possible while keeping the number of calibration instances in each bin sufficiently large. I think 1000 or so calibration instances per bin would usually give enough granularity, but this, of course, depends on the application.
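
To illustrate the effect of that division (the bin count and set sizes below are hypothetical):

```python
n_cal, n_test, n_bins = 100_000, 100_000, 100   # hypothetical sizes

# Without bins, every test object is compared against all calibration alphas;
# with Mondrian bins, only against the alphas in its own bin, i.e. about
# n_cal / n_bins of them (assuming roughly balanced bins).
full_gib = n_cal * n_test * 8 / 2**30
mondrian_gib = (n_cal // n_bins) * n_test * 8 / 2**30
print(f"without bins: {full_gib:.1f} GiB, "
      f"with {n_bins} bins: {mondrian_gib:.2f} GiB")   # 74.5 vs 0.75
```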

@SebastianLeborg

SebastianLeborg commented May 29, 2023

~100k datapoints in the forecasting set is "small potatoes" in many industry applications. For the calibration set, 100k samples might be a bit much without Mondrian binning, as you point out. But in many demand-modeling applications, the sales data we use to train the model can be exponentially distributed: lots of items that sell very little, and a few that sell extremely well. In these cases we want to make sure that the high sellers are properly represented in the calibration set, which could require a larger sample size.

I've looked through the code a bit, and I see that the complete cpds array is only really needed (correct me if I'm wrong) if the y argument is supplied to the predict method (in order to calculate p(y)), or if the return_cpds argument is True. Perhaps we could skip computing the complete cpds array when we're only interested in forecasting for a set of lower/higher_percentiles? Then we could predict by calculating the alpha index(es), similar to how ConformalRegressor does it, saving a ton of memory and time.
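
A rough sketch of what I mean (this is not crepes code, and the percentile-to-index rule is simplified for illustration):

```python
import numpy as np

def percentile_without_cpds(y_hat, sigmas, alphas, p):
    """Compute the p-th percentile of each CPD without materializing the
    (n_test, n_cal) cpds matrix. Since sigmas > 0, the mapping
    alpha -> y_hat[i] + sigmas[i] * alpha is increasing, so one sorted
    copy of alphas serves every test sample."""
    alphas_sorted = np.sort(alphas)
    # Simplified index rule; the exact conformal convention differs slightly.
    k = int(p / 100 * (len(alphas_sorted) + 1)) - 1
    k = min(max(k, 0), len(alphas_sorted) - 1)
    return y_hat + sigmas * alphas_sorted[k]   # O(n_test), not O(n_test * n_cal)
```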

@henrikbostrom
Owner

Thanks for looking into how the code could potentially be improved for specific use cases! Unfortunately, the cpds array is currently also used when extracting percentiles, so handling this particular case in a more memory-efficient way would require a bit of refactoring.

@henrikbostrom
Owner

Hi again,

This issue has now been addressed in version 0.5.0; instead of always generating the cpds array, it is now generated only when requested (through return_cpds=True) or when "CRPS" is included among the metrics to evaluate (which currently is the default). This indeed required a bit of refactoring. You are most welcome to try it out on your (large) calibration and test sets!
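
A sketch of the resulting call pattern, reusing the (assumed) variable names from the reconstruction earlier in the thread:

```python
# Percentiles only: as of v0.5.0, no cpds matrix is materialized for this.
intervals = cps.predict(y_hat_test, sigmas=sigmas_test,
                        lower_percentiles=2.5, higher_percentiles=97.5)

# Requesting the full distributions still allocates the (n_test, n_cal) matrix:
cpds = cps.predict(y_hat_test, sigmas=sigmas_test, return_cpds=True)
```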

Best regards,
Henrik

@henrikbostrom
Owner

I hereby close the issue, as I consider it fixed by v0.5.0, but you are welcome to open a new one if you experience other limitations.

Best regards,
Henrik
