xarray.open_dataset uses more than double the size of the file itself #9946
Comments
@ahmedshaaban1 The data in the file is compressed: that is why the decoded array in memory is larger than the file on disk.

If your datasets are very large you might think of using dask. See https://tutorial.xarray.dev/intermediate/xarray_and_dask.html for some examples.
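A minimal sketch of what that could look like (the chunk size here is an illustrative assumption; the dimension name is taken from the encoding shown further down):

```python
import xarray as xr

# Opening with chunks wraps the variables in lazy dask arrays instead of
# decompressing everything into memory up front.
ds = xr.open_dataset("2021-04.nc", chunks={"valid_time": 15})

# Nothing is loaded yet; data is only read (chunk by chunk) when computed.
monthly_mean = ds["t2m"].mean("valid_time").compute()
```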
Nice catch.
Yes, this is not seen in the normal attributes (for xarray):

```python
import xarray as xr

ds = xr.open_dataset("2021-04.nc")
ds["t2m"].encoding
```

```
{'dtype': dtype('float32'),
 'zlib': True,
 'szip': False,
 'zstd': False,
 'bzip2': False,
 'blosc': False,
 'shuffle': True,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (15, 206, 396),
 'preferred_chunks': {'valid_time': 15, 'latitude': 206, 'longitude': 396},
 'source': 'xarray/2021-04.nc',
 'original_shape': (30, 411, 791),
 '_FillValue': np.float32(nan),
 'coordinates': 'number'}
```

We see that zlib compression is enabled (complevel 1), so the file on disk is smaller than the decoded data. For a float32 array of shape (30, 411, 791), the decoded variable takes roughly 39 MB in memory, more than double the 16.6 MB file size.
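For comparison, a small sketch of how one might check the on-disk vs. in-memory sizes (the file path is the one used above):

```python
import os
import xarray as xr

ds = xr.open_dataset("2021-04.nc")

# Compressed size of the file on disk (~16.6 MB here)
print(os.path.getsize("2021-04.nc") / 1e6, "MB on disk")

# Decoded size of t2m: 30 * 411 * 791 float32 values * 4 bytes ≈ 39 MB
print(ds["t2m"].nbytes / 1e6, "MB in memory")
```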
@ahmedshaaban1 Let us know if this helps you get along with your data! You are encouraged to self-answer on SO using the information from this issue. If there is nothing more to ponder on, can we close this issue?
Yes, I am good … thanks.
What is your issue?
All,
I am opening NetCDF files from the Copernicus data center with xarray version 2024.11.0, using the open_dataset function as follows:
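Presumably something along these lines (the original snippet was not captured; the filename is an assumption based on the rest of the thread):

```python
import xarray as xr

# Default open_dataset: variables are decoded to float32 when accessed,
# so the in-memory array is larger than the zlib-compressed file on disk.
file1 = xr.open_dataset("2021-04.nc")
print(file1)
```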
The NetCDF file is available on Box; the reader can also download any sample file from the aforementioned data center.
Although the file size is 16.6 MB, the t2m variable seems to take double the size of the actual file, as can be seen below (end of the first line) or monitored using the top command.
Any idea why xarray uses all that memory? This is not a problem for small files, but it is a serious problem for large files and for heavy computation, when many copies of the same variable are created.
I can use `file1["t2m"].astype("float16")`, which reduces the size by half, but I found that most values are rounded to the first decimal place, so I am losing actual data. I want to read the actual data without using memory beyond the size of the data file. This is what the data looks like when read as float32, and this is what it looks like under float16.
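A small illustration of that precision loss (the temperature value is made up): float16 has a 10-bit mantissa, so values around 285 K can only be represented in steps of 0.25 K.

```python
import numpy as np

t = np.float32(285.43)         # a typical 2 m temperature in kelvin
print(t.astype(np.float16))    # 285.5 -- the decimals are rounded away
```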
Moreover, when I load the data into RAM and trace the amount of memory being used, it is several times the actual size of the file data.
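One way this shows up: loading decompresses every variable, and each intermediate result of a computation allocates another full-size array. A rough sketch, assuming the same file:

```python
import xarray as xr

ds = xr.open_dataset("2021-04.nc")
ds.load()                                     # decompress all variables into memory
print(ds.nbytes / 1e6, "MB held by the Dataset")

# This intermediate result allocates yet another full-size float32 array,
# so peak memory grows well beyond the 16.6 MB on-disk size.
anomaly = ds["t2m"] - ds["t2m"].mean("valid_time")
```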
NB: I posted this question on Stack Overflow but have not received any response.
Thanks