
xarray.open_dataset uses more than double the size of the file itself #9946

Closed

ahmedshaaban1 opened this issue Jan 13, 2025 · 6 comments

@ahmedshaaban1
What is your issue?

All,
I am opening NetCDF files from the Copernicus data center with xarray version 2024.11.0, using the open_dataset function as follows:

import xarray as xr
file1 = xr.open_dataset("2021-04.nc")
tem = file1['t2m']

The NetCDF file is available on Box; the reader can also download a sample file from the aforementioned data center.

Although the file size is 16.6 MB, the tem variable takes more than double that, as can be seen below (end of the first line) or monitored using the top command:

<xarray.DataArray 't2m' (valid_time: 30, latitude: 411, longitude: 791)> Size: 39MB
[9753030 values with dtype=float32]
Coordinates:
    number      int64 8B ...
  * latitude    (latitude) float64 3kB 38.0 37.9 37.8 37.7 ... -2.8 -2.9 -3.0
  * longitude   (longitude) float64 6kB -18.0 -17.9 -17.8 ... 60.8 60.9 61.0
  * valid_time  (valid_time) datetime64[ns] 240B 2021-04-01 ... 2021-04-30
Attributes: (12/32)
    GRIB_paramId:                             167
    GRIB_dataType:                            fc
    GRIB_numberOfPoints:                      325101
    GRIB_typeOfLevel:                         surface
    GRIB_stepUnits:                           1
    GRIB_stepType:                            instant
                                      ...
    GRIB_totalNumber:                         0
    GRIB_units:                               K
    long_name:                                2 metre temperature
    units:                                    K
    standard_name:                            unknown
    GRIB_surface:                             0.0

Any idea why xarray uses all that memory? This is not a problem for small files, but it becomes one for large files and heavy computation, when many copies of the same variable are created.

I can use file1['t2m'].astype('float16'), which halves the size, but I found that most values are then rounded to the first decimal place, so I am losing actual precision. I want to read the actual data without using memory beyond the size of the data file.

This is how the data looks when read as float32:

<xarray.DataArray 't2m' (valid_time: 30)> Size: 120B
array([293.87134, 296.0669 , 299.4065 , 302.60474, 305.29443, 306.87646,
       301.10645, 302.47388, 299.23267, 294.26587, 295.239  , 299.19238,
       302.20923, 307.48193, 307.2202 , 310.6953 , 315.64746, 312.76416,
       305.2173 , 299.25488, 299.9475 , 302.3435 , 306.32422, 312.75342,
       299.99878, 300.59155, 303.36475, 307.11768, 308.49292, 310.6853 ],
      dtype=float32)
Coordinates:

and this is how it looks under float16:

<xarray.DataArray 't2m' (valid_time: 30)> Size: 60B
array([293.8, 296. , 299.5, 302.5, 305.2, 307. , 301. , 302.5, 299.2,
       294.2, 295.2, 299.2, 302.2, 307.5, 307.2, 310.8, 315.8, 312.8,
       305.2, 299.2, 300. , 302.2, 306.2, 312.8, 300. , 300.5, 303.2,
       307. , 308.5, 310.8], dtype=float16)
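
For reference, float16 carries only a 10-bit mantissa (about three significant decimal digits), so near 300 the adjacent representable values are about 0.25 apart; a minimal sketch of the rounding, using a value from the output above:

import numpy as np
x = np.float32(293.87134)
print(np.float16(x))                  # prints 293.8 (stored exactly as 293.75)
print(np.spacing(np.float16(300.0)))  # 0.25, the gap between adjacent float16 values near 300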

Moreover, when I load the data into RAM and trace the amount of memory being used, it is several times the actual size of the file's data.

import psutil
process = psutil.Process()
print("memory used in MB =", process.memory_info().rss / 1024**2)
tem.data  # accessing .data loads the full array into memory
print("memory used in MB =", process.memory_info().rss / 1024**2)

NB: I posted this question on Stack Overflow but have not received any response.
Thanks

ahmedshaaban1 added the needs triage label Jan 13, 2025
@kmuehlbauer
Contributor

@ahmedshaaban1 The data in the file is compressed:

This is an excerpt of the output of h5dump -Hp 2021-04.nc:

DATASET "t2m" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 30, 411, 791 ) / ( 30, 411, 791 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 15, 206, 396 )
         SIZE 12481395 (3.126:1 COMPRESSION)
      }
      FILTERS {
         PREPROCESSING SHUFFLE
         COMPRESSION DEFLATE { LEVEL 1 }
      }
      ...
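
The 39 MB in the repr is simply the uncompressed in-memory footprint: every float32 value takes 4 bytes, regardless of how well the file compresses on disk. A quick check from the numbers above:

n_values = 30 * 411 * 791       # (valid_time, latitude, longitude) = 9,753,030 values
print(n_values * 4 / 1e6)       # ~39.0 MB in memory, matching the repr
print(n_values * 4 / 12481395)  # ~3.126, the compression ratio h5dump reports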

kmuehlbauer added the usage question label and removed the needs triage label Jan 13, 2025
@kmuehlbauer
Contributor

If your datasets are very large, you might consider using dask to enable larger-than-memory computing.

See https://tutorial.xarray.dev/intermediate/xarray_and_dask.html for some examples.
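
A minimal sketch of lazy, chunked reading (the pattern, not a prescription):

import xarray as xr

# chunks="auto" backs each variable with a dask array; nothing is read into
# memory until a computation actually needs it.
ds = xr.open_dataset("2021-04.nc", chunks="auto")
monthly_mean = ds["t2m"].mean("valid_time").compute()  # loads chunk by chunk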

@ahmedshaaban1
Author

ahmedshaaban1 commented Jan 13, 2025

Nice catch.
I checked the NetCDF attributes and found no mention of any compression. I thought this would be declared explicitly.

@kmuehlbauer
Contributor

I revised the NetCDF attributes and found no mention of any compression used.

Yes, this is not visible in the normal attributes; in xarray it shows up in the variable's encoding:

ds = xr.open_dataset("2021-04.nc")
ds["t2m"].encoding
{'dtype': dtype('float32'),
 'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (15, 206, 396),
 'preferred_chunks': {'valid_time': 15, 'latitude': 206, 'longitude': 396},
 'source': 'xarray/2021-04.nc',
 'original_shape': (30, 411, 791),
 '_FillValue': np.float32(nan),
 'coordinates': 'number'}

We see that shuffle and zlib are True, with complevel set to 1. The dataset was created with HDF5 compression filters, not with CF convention packed data.
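
For contrast, CF convention packed data would show up in the encoding as a scale_factor/add_offset pair with an integer on-disk dtype, along the lines of this hypothetical example (the values are illustrative):

# Hypothetical CF packing: the file stores int16 and xarray unpacks on read via
#   unpacked = packed * scale_factor + add_offset
ds.to_netcdf("packed.nc", encoding={"t2m": {
    "dtype": "int16",
    "scale_factor": 0.01,
    "add_offset": 273.15,
    "_FillValue": -32767,
}})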

ncdump with the -s flag shows the special (hidden) attributes:

!ncdump -hs 2021-04.nc
variables:
	float t2m(valid_time, latitude, longitude) ;
		t2m:_FillValue = NaNf ;
		...
		t2m:_Storage = "chunked" ;
		t2m:_ChunkSizes = 15, 206, 396 ;
		t2m:_Shuffle = "true" ;
		t2m:_DeflateLevel = 1 ;
		t2m:_Endianness = "little" ;

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 13, 2025

@ahmedshaaban1 Let us know if this helps you get along with your data! You are encouraged to self-answer on SO using the information from this issue. If there is nothing more to ponder, can we close this issue?

@ahmedshaaban1
Author

Yes, I am good, thanks.
You can close this issue.
