
Overhaul Zip end-of-data marker parsing #2042

Merged (2 commits) Mar 24, 2024

Conversation

@kientzle (Contributor) commented Dec 31, 2023

This significantly changes how end-of-data markers are parsed.

In particular, the spec allows the end-of-data marker to have either 32-bit or 64-bit size values, and there is basically no indication which is being used. (The spec mentions "Zip64 mode" in many places, but there is no definitive way for a Zip reader to know whether the writer is using this mode or not. My mis-reading of another part of the spec caused me to believe that the Zip64 Extra Data field was such a marker, but I've been patiently corrected. ;-)

So a Zip reader just has to guess: Try every possible end-of-data marker format and accept it if any of the four possible forms is correct. In libarchive's case, this required some non-trivial additional refactoring to ensure that the CRC32, compressed size, and uncompressed size statistics are always updated before we need to look for an end-of-data marker.

This generally follows the strategy outlined by Mark Adler for his sunzip streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches, rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry. Of course, the probability that more than one end-of-data marker matches is so small that this is almost certainly a non-issue in practice. (The more complex question is what to do when none of the formats matches exactly: I've chosen to interpret the "most matching" case but not to consume any bytes at all, so a resync for the next entry will start from the first byte of the alleged end-of-data marker.)
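The trial-and-match approach can be sketched in C. This is a hypothetical illustration, not libarchive's actual code (`match_end_of_data` and `le_read` are invented names): it tests the four data-descriptor layouts from the Zip spec (optional `PK\x07\x08` signature, CRC-32, then 32-bit or 64-bit size fields) against the running statistics, shortest first as this PR originally did.

```c
#include <stdint.h>

/* Read an n-byte little-endian integer from a buffer. */
static uint64_t le_read(const unsigned char *p, int n)
{
	uint64_t v = 0;
	while (n--)
		v = (v << 8) | p[n];
	return v;
}

/*
 * Try the four possible end-of-data marker layouts, shortest first.
 * `p` must point at at least 24 readable bytes.  Returns the length
 * of the first layout consistent with the computed statistics, or 0
 * if none matches.
 */
static int match_end_of_data(const unsigned char *p,
    uint32_t crc32, uint64_t csize, uint64_t usize)
{
	/* sig: leading "PK\x07\x08"?  w: width of each size field. */
	static const struct { int sig, w, len; } form[4] = {
		{ 0, 4, 12 },	/* CRC + 32-bit sizes */
		{ 1, 4, 16 },	/* signature + CRC + 32-bit sizes */
		{ 0, 8, 20 },	/* CRC + 64-bit sizes */
		{ 1, 8, 24 },	/* signature + CRC + 64-bit sizes */
	};
	for (int i = 0; i < 4; i++) {
		const unsigned char *q = p + (form[i].sig ? 4 : 0);
		if (form[i].sig && le_read(p, 4) != 0x08074b50)
			continue;
		if (le_read(q, 4) == crc32 &&
		    le_read(q + 4, form[i].w) == csize &&
		    le_read(q + 4 + form[i].w, form[i].w) == usize)
			return form[i].len;	/* shortest match wins */
	}
	return 0;
}
```

Reversing the iteration order of `form[]` would give the longest-match preference that Mark Adler's sunzip uses.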

While testing this, I played with pmqs/zipdetails which pointed out a discrepancy in how libarchive writes the UT extra field. I folded a fix for that in here as well.

Resolves #1834

TODO: It would be nice to augment the test suite with some static files created by Java's implementation to verify that we can read those when they hold entries just under and just over 4 GiB (the Zip64 threshold). The existing test_write_format_zip_large uses an ad hoc RLE encoding trick to exercise writing and reading back multi-gigabyte entries. I wonder if that could be generalized to support deflate-compressed Zip data stored in test files?

@kientzle (Author)

CC: @madler @pmqs

@kientzle (Author)

CC: @michalc

libarchive/test/test_write_format_zip64_stream.c (alert dismissed)
libarchive/test/test_write_format_zip_stream.c (alert dismissed)
@michalc commented Dec 31, 2023

> This generally follows the strategy outlined by Mark Adler for his sunzip streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches, rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry.

Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC, when actually it shouldn't have?

@kientzle (Author)

> Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC, when actually it shouldn't have?

It probably does, though if so, it's a very tiny effect, and it's inherent to the Zip design in any case.

For example, one ambiguity here is that the end-of-data marker is allowed to start with a "PK78" signature (0x08074B50) or it can start with the 32-bit CRC value. If the CRC happens to be exactly that value (a 1 in 2^32 chance) at the same time that several other things happen to match exactly, that could lead to a misinterpretation here. But all the effects I can come up with seem to be in this same range, so even when you add them all up, I don't see a significantly increased chance of a false-positive acceptance.
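The collision described above can be made concrete with a deliberately contrived sketch (hypothetical, not from the PR): if an entry's CRC32, compressed size, and uncompressed size all happened to equal 0x08074b50, then sixteen repeated signature bytes would parse validly as both a 12-byte unsigned marker and a 16-byte signed one.

```c
#include <stdint.h>

/* 4-byte little-endian read. */
static uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Contrived worst case: crc == csize == usize == 0x08074b50, the
 * "PK\x07\x08" signature value.  Returns how many of the two 32-bit
 * layouts (unsigned 12-byte, signed 16-byte) match the same bytes.
 */
static int ambiguous_layouts(void)
{
	static const unsigned char p[16] = {
		0x50,0x4b,0x07,0x08, 0x50,0x4b,0x07,0x08,
		0x50,0x4b,0x07,0x08, 0x50,0x4b,0x07,0x08,
	};
	const uint32_t stat = 0x08074b50;	/* crc == csize == usize */
	int n = 0;

	/* Unsigned layout: crc, csize, usize at offsets 0, 4, 8. */
	if (le32(p) == stat && le32(p + 4) == stat && le32(p + 8) == stat)
		n++;
	/* Signed layout: signature, then crc, csize, usize at 4, 8, 12. */
	if (le32(p) == 0x08074b50 && le32(p + 4) == stat &&
	    le32(p + 8) == stat && le32(p + 12) == stat)
		n++;
	return n;
}
```

Both layouts match here, but note how many coincidences it takes: not just the 1-in-2^32 CRC value, but the sizes as well, which is why the practical risk is negligible.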

@madler commented Dec 31, 2023

> > This generally follows the strategy outlined by Mark Adler for his sunzip streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches, rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry.
>
> Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC, when actually it shouldn't have?

So long as the zip file has no gaps (the components like local entries and the central directory are adjacent), then picking the longest data descriptor that works unambiguously determines the data descriptor type, and there is zero increase in false positives. Unfortunately though, picking the shortest matching data descriptor that works does not.

@kientzle (Author) commented Jan 5, 2024

CC: @jvreelanda

I went back and re-read Mark Adler's justification for preferring
longest match.  I'm convinced that his strategy does in fact always do
the right thing if there are no errors in the archive and there is no
padding or garbage between entries.
@kientzle (Author)

@madler Thanks for pushing on this. I went back and studied your arguments in sunzip again and I now agree with you. I've changed this to accept the longest match.
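The change amounts to reversing the trial order over the same four layouts. A minimal sketch (hypothetical naming; the field descriptions follow the Zip data-descriptor record, not the merged code):

```c
/*
 * Longest-first trial order: with no gaps between Zip components,
 * the longest data-descriptor layout that matches is unambiguously
 * the real one (per Mark Adler's sunzip argument).
 * sig: has leading "PK\x07\x08"?  size_width: bytes per size field.
 */
static const struct { int sig, size_width, total_len; } trial_order[4] = {
	{ 1, 8, 24 },	/* signature + CRC + 64-bit sizes */
	{ 0, 8, 20 },	/* CRC + 64-bit sizes */
	{ 1, 4, 16 },	/* signature + CRC + 32-bit sizes */
	{ 0, 4, 12 },	/* CRC + 32-bit sizes */
};
```

The matching logic itself is unchanged; only the order in which candidates are accepted differs.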

@kientzle kientzle merged commit 8acb738 into libarchive:master Mar 24, 2024
18 of 22 checks passed
@kientzle kientzle deleted the kientzle-zip64 branch July 6, 2024 22:45