MemoryError issue in predict method for ConformalPredictiveSystem() #9

Closed
christopherjluke opened this issue Apr 3, 2023 · 8 comments


@christopherjluke

Hello, has anyone run into an issue with using the predict method with CPS? I am using this on a simple linear regression toy model with about 480,000 observations, since this will eventually need to scale to a much larger dataset.

I fit the normalized Conformal Predictive System following the same steps as the notebook, but when I run the predict method I keep getting this error:

MemoryError: Unable to allocate 939. KiB for an array with shape (120249,) and data type float64

The error was traced back to this portion of the base.py code:
```
--> 344 cpds = np.array([y_hat[i]+sigmas[i]*self.alphas
    345                  for i in range(len(y_hat))])
```

The CPS was fit with the residuals for the calibration set, so I am wondering if that is the issue? I have plenty of memory available to run something like this locally, gigabytes' worth, so the inability to allocate such an insignificant amount of memory is strange to me.
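
For reference, here is a minimal reconstruction of my setup (not the exact notebook code; the data split, the unit difficulty estimates, and the fit/predict keyword arguments are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from crepes import ConformalPredictiveSystem

# Toy data roughly matching the sizes described above (assumed split).
rng = np.random.default_rng(0)
X = rng.normal(size=(480_000, 5))
y = X @ rng.normal(size=5) + rng.normal(size=480_000)
X_train, X_cal, X_test = X[:240_000], X[240_000:360_000], X[360_000:]
y_train, y_cal = y[:240_000], y[240_000:360_000]

learner = LinearRegression().fit(X_train, y_train)
sigmas_cal = np.ones(len(X_cal))     # stand-in difficulty estimates
sigmas_test = np.ones(len(X_test))

cps = ConformalPredictiveSystem()
cps.fit(residuals=y_cal - learner.predict(X_cal), sigmas=sigmas_cal)

y_hat_test = learner.predict(X_test)
# This call internally builds a (len(X_test), len(X_cal)) float64 array,
# which is where the MemoryError is raised:
cps.predict(y_hat_test, sigmas=sigmas_test,
            lower_percentiles=2.5, higher_percentiles=97.5)
```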

@christopherjluke
Author

I have tried running the deployment in Azure with far more memory, and the .predict method for the Conformal Predictive System kills the kernel. I know it is not an issue with the dataset or the code, since I tried it on a trimmed-down version of the data and it worked fine. Has anyone else run into this?

@henrikbostrom
Owner

Thanks for pointing this out! I am not sure what the problem could be, but it would be helpful if you could show exactly what the call to the method looks like. It would also be great if you could try the most recent version (0.3.0), released after you raised this issue, and report any errors with respect to the new line numbers.

@SebastianLeborg

What is the size of your test set? I've struggled quite a bit with exactly these lines myself, and if we have the same issue, this comes down to memory complexity.

```python
cpds = np.array([y_hat[i]+sigmas[i]*self.alphas for i in range(len(y_hat))])
```

self.alphas has the same shape as your calibration set, whilst y_hat has the same shape as your test set. This line of code creates a copy of the entire self.alphas array for every single sample in your test/forecasting set. So a calibration set and a test set with 100k samples each would result in roughly 74.5 GiB being allocated (100,000 × 100,000 × 8 bytes for the float64 datatype). This is far too much for most machines, causing an OOM crash.
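
To make the arithmetic concrete, a quick back-of-envelope in plain numpy (the sizes are hypothetical):

```python
import numpy as np

n_cal, n_test = 100_000, 100_000   # hypothetical calibration/test sizes

# The list comprehension materializes one calibration-sized row per test
# sample, i.e. an (n_test, n_cal) float64 matrix:
bytes_needed = n_test * n_cal * np.dtype(np.float64).itemsize
print(f"{bytes_needed / 2**30:.1f} GiB")   # 74.5 GiB
```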

@henrikbostrom
Owner

The total number of elements in the resulting matrix equals the product of the number of calibration and test objects, unless a Mondrian approach is employed; for the latter, this number is divided by the number of bins. So if you indeed want to use on the order of a hundred thousand instances for calibration, together with large test sets, a Mondrian approach is strongly suggested, using as many bins as possible while keeping the number of calibration instances in each bin sufficiently large. I think 1000 or so calibration instances per bin would usually give enough granularity, but this, of course, depends on the application.
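
To illustrate the effect of that division (the bin count and set sizes below are hypothetical):

```python
n_cal, n_test, n_bins = 100_000, 100_000, 100   # hypothetical sizes

# Without bins, every test object is compared against all calibration alphas;
# with Mondrian bins, only against the alphas in its own bin, i.e. about
# n_cal / n_bins of them (assuming roughly balanced bins).
full_gib = n_cal * n_test * 8 / 2**30
mondrian_gib = (n_cal // n_bins) * n_test * 8 / 2**30
print(f"without bins: {full_gib:.1f} GiB, "
      f"with {n_bins} bins: {mondrian_gib:.2f} GiB")   # 74.5 vs 0.75
```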

@SebastianLeborg

SebastianLeborg commented May 29, 2023

~100k datapoints in the forecasting set is "small potatoes" in many industry applications. For the calibration set, 100k samples might be a bit much without Mondrian binning, as you point out. But in many demand-modeling applications, the sales data we use to train the model can be exponentially distributed: lots of items that sell very little, and a few that sell extremely well. In these cases we want to make sure that the high sellers are properly represented in the calibration set, which could require a larger sample size.

I've looked through the code a bit, and I see that the complete cpds array is only really needed (correct me if I'm wrong) if the y argument is supplied to the predict method (in order to calculate p(y)), or if the return_cpds argument is True. Perhaps we could skip computing the complete cpds array when we're only interested in forecasting for a set of lower/higher_percentiles? Then we could predict by calculating the alpha index(es), similar to how ConformalRegressor does it, saving a ton of memory and time.
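
A rough sketch of what I mean (this is not crepes code, and the percentile-to-index rule is simplified for illustration):

```python
import numpy as np

def percentile_without_cpds(y_hat, sigmas, alphas, p):
    """Compute the p-th percentile of each CPD without materializing the
    (n_test, n_cal) cpds matrix. Since sigmas > 0, the mapping
    alpha -> y_hat[i] + sigmas[i] * alpha is increasing, so one sorted
    copy of alphas serves every test sample."""
    alphas_sorted = np.sort(alphas)
    # Simplified index rule; the exact conformal convention differs slightly.
    k = int(p / 100 * (len(alphas_sorted) + 1)) - 1
    k = min(max(k, 0), len(alphas_sorted) - 1)
    return y_hat + sigmas * alphas_sorted[k]   # O(n_test), not O(n_test * n_cal)
```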

@henrikbostrom
Owner

Thanks for looking into how the code could potentially be improved for specific use cases! Unfortunately, the cpds array is currently also used when extracting percentiles, so handling this particular case in a more memory-efficient way would require a bit of refactoring.

@henrikbostrom
Owner

Hi again,

This issue has now been addressed in version 0.5.0; instead of always generating the cpds array, it is now generated only when requested (through return_cpds=True) or when "CRPS" is included among the metrics to evaluate (which currently is the default). This indeed required a bit of refactoring. You are most welcome to try it out on your (large) calibration and test sets!
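
A sketch of the resulting call pattern, reusing the (assumed) variable names from the reconstruction earlier in the thread:

```python
# Percentiles only: as of v0.5.0, no cpds matrix is materialized for this.
intervals = cps.predict(y_hat_test, sigmas=sigmas_test,
                        lower_percentiles=2.5, higher_percentiles=97.5)

# Requesting the full distributions still allocates the (n_test, n_cal) matrix:
cpds = cps.predict(y_hat_test, sigmas=sigmas_test, return_cpds=True)
```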

Best regards,
Henrik

@henrikbostrom
Owner

I hereby close the issue, as I consider it fixed by v0.5.0, but you are welcome to open a new one if you experience other limitations.

Best regards,
Henrik
