Experiment with using lossless JPEG-XL (colorspace=YUV 400) #67
Comments
Awesome, thanks so much for your help!
Some initial results compressing a single full-resolution, full-geo-extent satellite image (5576x3560, 8 bits per pixel, grayscale):

It's notable that "visually lossless" means

Some other conversions done using ImageMagick:

So, in conclusion, our hunch was correct: JPEG-XL can create the smallest lossless files, but it takes a while! But maybe that's OK if Dask can compress in parallel. And maybe compute cost goes up with, say, the square of the image size, so maybe it's way faster to compress multiple small chunks? (I'm totally guessing here!) And it looks like JPEG-XL can create a file that's 64% of the size of a bz2-compressed file (7.5 MB for bz2 vs 4.8 MB for JXL).

TODO: Check if "visually lossless" is good enough.
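The "is visually lossless good enough?" question can be checked quantitatively with PSNR, which is what later comments in this thread report. A minimal numpy sketch (the `psnr` helper is my own, not code from this thread):

```python
import numpy as np

def psnr(original: np.ndarray, compressed: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the original."""
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return np.inf  # identical images
    return 10 * np.log10(max_value**2 / mse)

# A perfect reconstruction gives infinite PSNR:
img = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
print(psnr(img, img))  # -> inf
```

Values in the mid-50s dB (as reported later in this thread) are generally considered excellent for 8- to 10-bit imagery.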
Reading through the JPEG-XL code in imagecodecs, it looks like all we have to do is cast our
I'm probably doing something stupid, but I'm failing to get
OK, some initial findings about using JPEG-XL with Zarr:

Using lossless mode is a little fiddly to engage.

Expanding the range of values to [0, 2**16 - 1] didn't look right compared with the raw uncompressed image, which I have a hunch is because this bit of code uses SRGB if the image is uint8, else uses LinearSRGB. The uint8 compressed image looks a lot better.

Here are some results, for four timesteps:
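As an aside, expanding 10-bit data into the full uint16 range (as mentioned above) can be done like this. This is a sketch of the idea only; the exact scaling code used in the experiment isn't shown in this thread:

```python
import numpy as np

# 10-bit satellite values in [0, 1023], held in a float array.
data = np.array([0, 512, 1023], dtype=np.float32)

# Rescale to fill the full uint16 range [0, 2**16 - 1].
uint16_max = 2**16 - 1
expanded = np.round(data * (uint16_max / 1023)).astype(np.uint16)
print(expanded)  # -> [    0 32800 65535]
```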
bz2 with uint8 is very impressive, tbh. jpeg-xl uint8 is even better, of course 🙂 If the full dataset is 25 TB using bz2, then it'd be more like 5 TB using jpeg-xl lossy uint8 :) or 15 TB using uint8 bz2. A very nice thing about 5 TB is that it'll fit onto a consumer 8 TB SSD 🙂 For a public dataset, bz2 uint8 might be the way to go because it's easier to install.
imagecodecs' JPEG-XL support can simply be installed through pip (nice!). There's no need for libjxl to be installed manually as well. That's great news, because it means that enabling JPEG-XL compression in Python is as simple as pip-installing imagecodecs.
Some conclusions:
Detailed results, for four timesteps:
Some plots of jpeg-xl, distance=0.6, effort=8, float16. The notebook for creating these plots is here.
Some notes from today's experiments...

JPEG-XL compression for our pre-prepared batches

Our "pre-prepared batches" have five dimensions: example, time_index, y_geostationary_index, x_geostationary_index and channels_index.
AFAICT, NetCDF cannot use the numcodecs API, so we cannot use JPEG-XL (or any other imagecodecs codec) to compress data into NetCDF files. (Which is a shame, because it might have been nice to use JPEG-XL to compress our pre-prepared batches.) We could use Zarr for our pre-prepared batches instead. But we need a small wrapper class:

import numpy as np
from imagecodecs.numcodecs import JpegXl
from numcodecs.registry import register_codec


class JpegXlFuture(JpegXl):
    """Simple hack to make the JpegXl compressor in the currently released
    version of imagecodecs (version 2021.11.20) look like the version in development.

    (We can't simply use the version in development because the imagecodecs author
    does not develop on GitHub. The imagecodecs author just uses GitHub as one of
    several mechanisms to release imagecodecs.)

    See https://github.com/cgohlke/imagecodecs/issues/31#issuecomment-1026179413
    """

    codec_id = 'imagecodecs_jpegxl'

    def __init__(self, lossless=None, decodingspeed=None, level=None, distance=None, *args, **kwargs):
        """
        Args:
            distance: Lowest settings are 0.00 or 0.01. If 0.0 then also set lossless to True.
            level: DON'T SET THIS WITH THIS JpegXlFuture wrapper!
                In imagecodecs version 2021.11.20, level is mapped (incorrectly) to the decoding speed tier.
                Minimum is 0 (highest quality), and maximum is 4 (lowest quality). Default is 0.
            decodingspeed: DON'T SET THIS WITH THIS JpegXlFuture wrapper!
        """
        assert decodingspeed is None
        if lossless is not None:
            if lossless:
                assert level is None  # level must be None to enable lossless in imagecodecs 2021.11.20.
                assert distance is None or distance == 0
            else:
                # Enable lossy compression.
                level = 0  # level must be set to 0, 1, 2, 3, or 4 to enable lossy compression in imagecodecs 2021.11.20.
        super().__init__(level=level, distance=distance, *args, **kwargs)

    def encode(self, buf):
        # JPEG-XL compresses one image at a time, so insist that the example,
        # timestep and channel dimensions are all singletons, and strip the
        # leading "example" dimension before encoding.
        n_examples = buf.shape[0]
        n_timesteps = buf.shape[1]
        n_channels = buf.shape[-1]
        assert n_examples == 1
        assert n_timesteps == 1
        assert n_channels == 1
        buf = buf[0]
        return super().encode(buf)

    def decode(self, buf, out=None):
        assert out is None
        out = super().decode(buf, out)
        out = out[np.newaxis, ...]  # Restore the leading dimension stripped by encode().
        return out


register_codec(JpegXlFuture)

# Convert to uint8. Note that 1023 / 4 rounds to 256, which would overflow
# uint8, so clip to 255 after rounding.
satellite_dataset["data"] = (
    satellite_dataset["data"].clip(min=0, max=1023) / 4
).round().clip(max=255).astype(np.uint8)

lossless = False
distance = 0.6
effort = 8
encoding = {
    "data": {
        "compressor": JpegXlFuture(lossless=lossless, distance=distance, effort=effort),
    },
}
chunks = dict(example=1, time_index=1, y_geostationary_index=None, x_geostationary_index=None, channels_index=1)
satellite_dataset.chunk(chunks).to_zarr(
    DST_PATH / (filename.stem + ".zarr"),
    mode="w",
    encoding=encoding,
    compute=True,
)

But this takes a whole minute per batch, and you end up with lots of tiny Zarr chunks on disk (a few tens of bytes each!). The JPEG-XL spec allows for multiple images within a single JPEG-XL file. But I don't think
Using JPEG-XL for intermediate Zarrs

I'm much more optimistic about using JPEG-XL for our intermediate Zarrs. (In our current Zarrs, the "real" pixel values are in the range [0, 1023], and NaNs are encoded as a negative value.)

The slightly fiddly thing is how to encode "NaNs". Where I've landed is for "real" image values to be in the range [0.075, 1], with NaNs encoded as 0.025. To test this, I set a square "NaN box" in the image:

original_float32["data"][:, 100:200, 100:200] = 0.025

This does result in a bit of "ringing" (for want of a better term) around the edges of the "NaN box", but all values in the "NaN box" stay below the 0.05 threshold used on decode. (The ringing has a much larger amplitude if we use 0.95 as the "NaN marker".)

Here's the code I'm using:

original_float32 = original_float32.astype(np.float64)  # Use float64 for maths. Then turn to float32.

# Encode "NaN" as 0.025. Encode all other values in the range [0.075, 1].
# But, after JPEG-XL compression, there is slight "ringing" around the edges.
# So, after compression, use "image < 0.05" to find NaNs.
LOWER_BOUND_FOR_REAL_PIXELS = 0.075
NAN_VALUE = 0.025
original_float32 = original_float32.clip(min=0, max=1023)
original_float32 /= 1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS)
original_float32 += LOWER_BOUND_FOR_REAL_PIXELS
original_float32 = original_float32.where(
    cond=original >= 0,  # In `original`, NaNs are encoded as a negative value.
    other=NAN_VALUE,
)
original_float32 = original_float32.astype(np.float32)

lossless = False
distance = 0.4
effort = 8
encoding = {
    "data": {
        "compressor": JpegXlFuture(lossless=lossless, distance=distance, effort=effort),
    },
}
# Need to add a trailing "channel" dimension so that JpegXl interprets the first dimension as "time".
# original_float32 has chunks: `time=1, y=None, x=None`
zarr_store = original_float32.expand_dims(dim="channel", axis=-1).to_zarr(
    OUTPUT_ZARR_PATH,
    mode="w",
    encoding=encoding,
    compute=True,
)

Doing this, the PSNR is a very healthy 54.37. But beware that PSNR increases if you simply brighten the image (which is what we've done!). The size on disk for my four test timesteps is a total of 690 kB, which is 24% of the size of the same images compressed using bz2 🙂

Summary

Let's use JPEG-XL with distance=0.4, effort=8, lossless=False. Encode NaNs as 0.025. Encode "real" pixel values in the range [0.075, 1]. Then recover the original image with something like:

# Set all values below 0.05 to NaN:
recovered_image = decompressed_image.where(decompressed_image > 0.05)
recovered_image -= LOWER_BOUND_FOR_REAL_PIXELS
recovered_image *= 1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS)

Doing this gives a PSNR of 53.74.
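The encode and recover steps can be sanity-checked end to end in pure numpy. There's no compression in this loop, so the round trip is exact up to float precision; with lossy JPEG-XL in between, the 0.05 threshold absorbs the ringing around NaN regions:

```python
import numpy as np

LOWER_BOUND_FOR_REAL_PIXELS = 0.075
NAN_VALUE = 0.025

original = np.array([np.nan, 0.0, 511.5, 1023.0], dtype=np.float64)

# Encode: map real values into roughly [0.075, 1]; replace NaNs with the 0.025 marker.
encoded = original.clip(min=0, max=1023) / (1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS))
encoded += LOWER_BOUND_FOR_REAL_PIXELS
encoded = np.where(np.isnan(original), NAN_VALUE, encoded)

# Decode: values below 0.05 become NaN; rescale the rest back to [0, 1023].
recovered = np.where(encoded < 0.05, np.nan, encoded)
recovered -= LOWER_BOUND_FOR_REAL_PIXELS
recovered *= 1023 * (1 + LOWER_BOUND_FOR_REAL_PIXELS)

assert np.isnan(recovered[0])
assert np.allclose(recovered[1:], [0.0, 511.5, 1023.0])
```

(This uses plain `np.where` in place of the xarray `.where(...)` calls above, but the arithmetic is the same.)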
Detailed Description
JPEG-XL is the "new kid on the block" for image compression. And it can losslessly compress greyscale images (using colorspace YUV 400). It might be great for compressing satellite images into our Zarr datasets.
Context
See issue #45 for previous experiments and notes.
This great comparison of lossless compression using JPEG-XL, WebP, AVIF, and PNG suggests JPEG-XL wins.
Possible Implementation
imagecodecs includes an interface to JPEG-XL.

Next step is to try manually compressing images using the cjxl app (not ImageMagick). If that looks good, then create a stand-alone little adapter library to adapt imagecodecs to be used with Zarr. UPDATE: @cgohlke has already implemented this in imagecodecs! (See comment below.)

Here's a super-simple little Python library (just 51 lines of code!) which enables JPEG-2000 compression in Zarr using imagecodecs. Maybe it'd be possible to use the same pattern, but for JPEG-XL?

To use ImageMagick for quick experiments at the command line:
You need ImageMagick version > 7.0.10 to use JPEG-XL. See here for how to install ImageMagick from source. Then use the 'magick' command, not 'convert'.
I'll do some experiments later today or tomorrow 🙂
TODO: .jxl (can't see how to do this and, anyway, Chrome cannot currently open jxl files)