Efficient bootstrapping of data arrays? #9299
joshdorrington asked this question in Q&A (unanswered)
Replies: 1 comment, 1 reply
-
It's usually better (as in: easier to respond to) to paste the code in markdown instead of attaching a file:

````
```python
# code
```
````

I've taken the liberty of editing your post. I can see two things that might be slow:

```python
ix = np.concatenate([np.zeros(950, dtype=bool), np.ones(50, dtype=bool)], axis=0)
```

I'd expect (2) to have the biggest impact.
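For context, here is a sketch of a cheaper way to build that boolean mask, assuming (this is not shown in the thread) that the `np.concatenate` line runs once per bootstrap iteration: allocate the mask once and reshuffle it in place, so no new arrays are created inside the loop. The loop body and variable names are illustrative, not from the original attachment.

```python
import numpy as np

n, k = 1000, 50

# Allocate the mask once instead of concatenating two arrays per iteration.
ix = np.zeros(n, dtype=bool)
ix[-k:] = True  # last k entries True, same mask as the concatenate version

rng = np.random.default_rng(0)
for _ in range(3):  # stand-in for the per-bootstrap loop
    rng.shuffle(ix)                  # reshuffle in place, no new allocations
    sample_idx = np.flatnonzero(ix)  # k positions, already sorted ascending
```

A side benefit of `np.flatnonzero` is that the integer indices come out sorted, so the subsequent fancy-indexing gather walks memory mostly forward rather than jumping around.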
-
Hi, I am working to make my big-data analysis more efficient, and I am facing a frequent bottleneck with bootstrap resampling which I'd like to improve. I think it is so slow because of all the out-of-order memory accesses. I attach a simple test case showing what I mean. For a realistic dataset it takes about 3 seconds per bootstrap, which is about 20 minutes for 400 resamples. The rest of my pipeline runs in 10 minutes, so this is a major delay.
If I try using dask for this, I end up with massive graphs with thousands of layers, which is very inefficient. Any advice or clever tricks are much appreciated!
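The attached test case is not included in the thread, so here is a minimal sketch of the pattern being described, with one trick aimed at the out-of-order access problem: draw all bootstrap indices up front with a NumPy `Generator`, then sort the indices within each resample. Sorting does not change the bootstrap distribution (sampling with replacement is order-free), but it turns each gather into a mostly forward pass over memory. Array shapes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal((365, 100, 100))  # toy (time, y, x) stand-in

n_boot = 10          # 400 in the real pipeline
n_time = data.shape[0]

# Draw every resample's indices in one call, then sort along each row so
# the fancy-indexing gather below reads the time axis in ascending order.
idx = rng.integers(0, n_time, size=(n_boot, n_time))
idx.sort(axis=1)

boot_means = np.empty((n_boot,) + data.shape[1:])
for b in range(n_boot):
    # One ordered gather + reduction per resample.
    boot_means[b] = data[idx[b]].mean(axis=0)
```

This also sidesteps the dask graph-size issue, since each resample is a single indexing operation rather than a chain of per-sample tasks.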