
xarray.open_dataset uses more than double the size of the file itself #9946

Closed

ahmedshaaban1 opened this issue Jan 13, 2025 · 6 comments

@ahmedshaaban1
What is your issue?

All,
I am opening NetCDF files from the Copernicus data center with xarray version 2024.11.0, using the open_dataset function as follows:

import xarray as xr
file1 = xr.open_dataset("2021-04.nc")
tem = file1['t2m']

The NetCDF file is available on Box; the reader can also download a sample file from the aforementioned data center.

Although the file size is 16.6 MB, the tem variable takes more than double that, as can be seen below (end of the first line) or monitored using the top command:

<xarray.DataArray 't2m' (valid_time: 30, latitude: 411, longitude: 791)> Size: 39MB
[9753030 values with dtype=float32]
Coordinates:
    number      int64 8B ...
  * latitude    (latitude) float64 3kB 38.0 37.9 37.8 37.7 ... -2.8 -2.9 -3.0
  * longitude   (longitude) float64 6kB -18.0 -17.9 -17.8 ... 60.8 60.9 61.0
  * valid_time  (valid_time) datetime64[ns] 240B 2021-04-01 ... 2021-04-30
Attributes: (12/32)
    GRIB_paramId:                             167
    GRIB_dataType:                            fc
    GRIB_numberOfPoints:                      325101
    GRIB_typeOfLevel:                         surface
    GRIB_stepUnits:                           1
    GRIB_stepType:                            instant
                                      ...
    GRIB_totalNumber:                         0
    GRIB_units:                               K
    long_name:                                2 metre temperature
    units:                                    K
    standard_name:                            unknown
    GRIB_surface:                             0.0

Any idea why xarray uses all that memory? This is not a problem for small files, but it becomes one for large files and heavy computation, when many copies of the same variable are created.

I can use file1['t2m'].astype('float16'), which halves the size, but I found that most values are then rounded to the first decimal place, so I am losing actual precision. I want to read the actual data without using memory beyond the size of the data file.

This is how the data looks when read as float32:

<xarray.DataArray 't2m' (valid_time: 30)> Size: 120B
array([293.87134, 296.0669 , 299.4065 , 302.60474, 305.29443, 306.87646,
       301.10645, 302.47388, 299.23267, 294.26587, 295.239  , 299.19238,
       302.20923, 307.48193, 307.2202 , 310.6953 , 315.64746, 312.76416,
       305.2173 , 299.25488, 299.9475 , 302.3435 , 306.32422, 312.75342,
       299.99878, 300.59155, 303.36475, 307.11768, 308.49292, 310.6853 ],
      dtype=float32)
Coordinates:

and this is how it looks under float16:

<xarray.DataArray 't2m' (valid_time: 30)> Size: 60B
array([293.8, 296. , 299.5, 302.5, 305.2, 307. , 301. , 302.5, 299.2,
       294.2, 295.2, 299.2, 302.2, 307.5, 307.2, 310.8, 315.8, 312.8,
       305.2, 299.2, 300. , 302.2, 306.2, 312.8, 300. , 300.5, 303.2,
       307. , 308.5, 310.8], dtype=float16)
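
For reference, float16 carries only a 10-bit mantissa (about three significant decimal digits), so near 300 the adjacent representable values are about 0.25 apart; a minimal sketch of the rounding, using a value from the output above:

import numpy as np
x = np.float32(293.87134)
print(np.float16(x))                  # prints 293.8 (stored exactly as 293.75)
print(np.spacing(np.float16(300.0)))  # 0.25, the gap between adjacent float16 values near 300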

Moreover, when I load the data into RAM and trace the amount of memory being used, it is several times the actual size of the file's data.

import psutil
process = psutil.Process()
print("memory used in MB =", process.memory_info().rss / 1024**2)
tem.data  # accessing .data loads the full array into memory
print("memory used in MB =", process.memory_info().rss / 1024**2)

NB: I posted this question on Stack Overflow but have not received any response.
Thanks

ahmedshaaban1 added the needs triage label Jan 13, 2025
@kmuehlbauer
Contributor

@ahmedshaaban1 The data in the file is compressed:

This is an excerpt of the output of h5dump -Hp 2021-04.nc:

DATASET "t2m" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 30, 411, 791 ) / ( 30, 411, 791 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 15, 206, 396 )
         SIZE 12481395 (3.126:1 COMPRESSION)
      }
      FILTERS {
         PREPROCESSING SHUFFLE
         COMPRESSION DEFLATE { LEVEL 1 }
      }
      ...
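
The 39 MB in the repr is simply the uncompressed in-memory footprint: every float32 value takes 4 bytes, regardless of how well the file compresses on disk. A quick check from the numbers above:

n_values = 30 * 411 * 791       # (valid_time, latitude, longitude) = 9,753,030 values
print(n_values * 4 / 1e6)       # ~39.0 MB in memory, matching the repr
print(n_values * 4 / 12481395)  # ~3.126, the compression ratio h5dump reports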

kmuehlbauer added the usage question label and removed the needs triage label Jan 13, 2025
@kmuehlbauer
Contributor

If your datasets are very large, you might consider using dask to enable larger-than-memory computing.

See https://tutorial.xarray.dev/intermediate/xarray_and_dask.html for some examples.
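
A minimal sketch of lazy, chunked reading (the pattern, not a prescription):

import xarray as xr

# chunks="auto" backs each variable with a dask array; nothing is read into
# memory until a computation actually needs it.
ds = xr.open_dataset("2021-04.nc", chunks="auto")
monthly_mean = ds["t2m"].mean("valid_time").compute()  # loads chunk by chunk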

@ahmedshaaban1
Author

ahmedshaaban1 commented Jan 13, 2025

Nice catch.
I checked the NetCDF attributes and found no mention of any compression. I thought this would be declared explicitly.

@kmuehlbauer
Contributor

I revised the NetCDF attributes and found no mention of any compression used.

Yes, this is not visible in the normal attributes; in xarray it shows up in the variable's encoding:

ds = xr.open_dataset("2021-04.nc")
ds["t2m"].encoding
{'dtype': dtype('float32'),
 'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (15, 206, 396),
 'preferred_chunks': {'valid_time': 15, 'latitude': 206, 'longitude': 396},
 'source': 'xarray/2021-04.nc',
 'original_shape': (30, 411, 791),
 '_FillValue': np.float32(nan),
 'coordinates': 'number'}

We see that shuffle and zlib are True, with complevel set to 1. The dataset was created with HDF5 compression filters, not with CF convention packed data.
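
For contrast, CF convention packed data would show up in the encoding as a scale_factor/add_offset pair with an integer on-disk dtype, along the lines of this hypothetical example (the values are illustrative):

# Hypothetical CF packing: the file stores int16 and xarray unpacks on read via
#   unpacked = packed * scale_factor + add_offset
ds.to_netcdf("packed.nc", encoding={"t2m": {
    "dtype": "int16",
    "scale_factor": 0.01,
    "add_offset": 273.15,
    "_FillValue": -32767,
}})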

ncdump with the -s flag shows the special (hidden) attributes:

!ncdump -hs 2021-04.nc
variables:
	float t2m(valid_time, latitude, longitude) ;
		t2m:_FillValue = NaNf ;
		...
		t2m:_Storage = "chunked" ;
		t2m:_ChunkSizes = 15, 206, 396 ;
		t2m:_Shuffle = "true" ;
		t2m:_DeflateLevel = 1 ;
		t2m:_Endianness = "little" ;

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 13, 2025

@ahmedshaaban1 Let us know if this helps you get along with your data! You are encouraged to self-answer on SO using the information from this issue. If there is nothing more to ponder, can we close this issue?

@ahmedshaaban1
Author

Yes, I am good, thanks.
You can close this issue.
