
Overhaul Zip end-of-data marker parsing #2042

Merged (2 commits) Mar 24, 2024

Conversation

@kientzle (Contributor) commented Dec 31, 2023

This significantly changes how end-of-data markers are parsed.

In particular, the spec allows the end-of-data marker to have either 32-bit or 64-bit size values, and there is basically no indication which is being used. (The spec mentions "Zip64 mode" in many places, but there is no definitive way for a Zip reader to know whether the writer is using this mode or not. My mis-reading of another part of the spec caused me to believe that the Zip64 Extra Data field was such a marker, but I've been patiently corrected. ;-)

So a Zip reader just has to guess: Try every possible end-of-data marker format and accept it if any of the four possible forms is correct. In libarchive's case, this required some non-trivial additional refactoring to ensure that the CRC32, compressed size, and uncompressed size statistics are always updated before we need to look for an end-of-data marker.

This generally follows the strategy outlined by Mark Adler for his sunzip streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches, rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry. Of course, the probability that more than one end-of-data marker matches is so small that this is almost certainly a non-issue in practice. (The more complex question is what to do when none of the formats matches exactly: I've chosen to interpret the "most matching" case but not to consume any bytes at all, so a resync for the next entry will start from the first byte of the alleged end-of-data marker.)
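The trial-and-match approach can be sketched in C. This is a hypothetical illustration, not libarchive's actual code (`match_end_of_data` and `le_read` are invented names): it tests the four data-descriptor layouts from the Zip spec (optional `PK\x07\x08` signature, CRC-32, then 32-bit or 64-bit size fields) against the running statistics, shortest first as this PR originally did.

```c
#include <stdint.h>

/* Read an n-byte little-endian integer from a buffer. */
static uint64_t le_read(const unsigned char *p, int n)
{
	uint64_t v = 0;
	while (n--)
		v = (v << 8) | p[n];
	return v;
}

/*
 * Try the four possible end-of-data marker layouts, shortest first.
 * `p` must point at at least 24 readable bytes.  Returns the length
 * of the first layout consistent with the computed statistics, or 0
 * if none matches.
 */
static int match_end_of_data(const unsigned char *p,
    uint32_t crc32, uint64_t csize, uint64_t usize)
{
	/* sig: leading "PK\x07\x08"?  w: width of each size field. */
	static const struct { int sig, w, len; } form[4] = {
		{ 0, 4, 12 },	/* CRC + 32-bit sizes */
		{ 1, 4, 16 },	/* signature + CRC + 32-bit sizes */
		{ 0, 8, 20 },	/* CRC + 64-bit sizes */
		{ 1, 8, 24 },	/* signature + CRC + 64-bit sizes */
	};
	for (int i = 0; i < 4; i++) {
		const unsigned char *q = p + (form[i].sig ? 4 : 0);
		if (form[i].sig && le_read(p, 4) != 0x08074b50)
			continue;
		if (le_read(q, 4) == crc32 &&
		    le_read(q + 4, form[i].w) == csize &&
		    le_read(q + 4 + form[i].w, form[i].w) == usize)
			return form[i].len;	/* shortest match wins */
	}
	return 0;
}
```

Reversing the iteration order of `form[]` would give the longest-match preference that Mark Adler's sunzip uses.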

While testing this, I played with pmqs/zipdetails which pointed out a discrepancy in how libarchive writes the UT extra field. I folded a fix for that in here as well.

Resolves #1834

TODO: It would be nice to augment the test suite with some static files created by Java's implementation to verify that we can read those when they hold entries just under and just over 4 GiB (the Zip64 threshold). The existing test_write_format_zip_large uses an ad hoc RLE encoding trick to exercise writing and reading back multi-gigabyte entries. I wonder if that could be generalized to support deflate-compressed Zip data stored in test files?

@kientzle (Author)

CC: @madler @pmqs

@kientzle (Author)

CC: @michalc

libarchive/test/test_write_format_zip64_stream.c (alert dismissed)
libarchive/test/test_write_format_zip_stream.c (alert dismissed)
@michalc commented Dec 31, 2023

> This generally follows the strategy outlined by Mark Adler for his sunzip streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches, rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry.

Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC, when actually it shouldn't have?

@kientzle (Author)

> Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC, when actually it shouldn't have?

It probably does, though if so, it's a very tiny effect, and it's inherent to the Zip design in any case.

For example, one ambiguity here is that the end-of-data marker is allowed to start with a "PK78" signature (0x08074B50) or it can start with the 32-bit CRC value. If the CRC happens to be exactly that value (a 1 in 2^32 chance) at the same time that several other things happen to match exactly, that could lead to a misinterpretation here. But all the effects I can come up with seem to be in this same range, so even when you add them all up, I don't see a significantly increased chance of a false-positive acceptance.
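The collision described above can be made concrete with a deliberately contrived sketch (hypothetical, not from the PR): if an entry's CRC32, compressed size, and uncompressed size all happened to equal 0x08074b50, then sixteen repeated signature bytes would parse validly as both a 12-byte unsigned marker and a 16-byte signed one.

```c
#include <stdint.h>

/* 4-byte little-endian read. */
static uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Contrived worst case: crc == csize == usize == 0x08074b50, the
 * "PK\x07\x08" signature value.  Returns how many of the two 32-bit
 * layouts (unsigned 12-byte, signed 16-byte) match the same bytes.
 */
static int ambiguous_layouts(void)
{
	static const unsigned char p[16] = {
		0x50,0x4b,0x07,0x08, 0x50,0x4b,0x07,0x08,
		0x50,0x4b,0x07,0x08, 0x50,0x4b,0x07,0x08,
	};
	const uint32_t stat = 0x08074b50;	/* crc == csize == usize */
	int n = 0;

	/* Unsigned layout: crc, csize, usize at offsets 0, 4, 8. */
	if (le32(p) == stat && le32(p + 4) == stat && le32(p + 8) == stat)
		n++;
	/* Signed layout: signature, then crc, csize, usize at 4, 8, 12. */
	if (le32(p) == 0x08074b50 && le32(p + 4) == stat &&
	    le32(p + 8) == stat && le32(p + 12) == stat)
		n++;
	return n;
}
```

Both layouts match here, but note how many coincidences it takes: not just the 1-in-2^32 CRC value, but the sizes as well, which is why the practical risk is negligible.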

@madler commented Dec 31, 2023

> > This generally follows the strategy outlined by Mark Adler for his sunzip streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches, rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry.
>
> Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC, when actually it shouldn't have?

So long as the zip file has no gaps (the components like local entries and the central directory are adjacent), then picking the longest data descriptor that works unambiguously determines the data descriptor type, and there is zero increase in false positives. Unfortunately though, picking the shortest matching data descriptor that works does not.

@kientzle (Author) commented Jan 5, 2024

CC: @jvreelanda

I went back and re-read Mark Adler's justification for preferring
longest match.  I'm convinced that his strategy does in fact always do
the right thing if there are no errors in the archive and there is no
padding or garbage between entries.
@kientzle (Author)

@madler Thanks for pushing on this. I went back and studied your arguments in sunzip again and I now agree with you. I've changed this to accept the longest match.
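The change amounts to reversing the trial order over the same four layouts. A minimal sketch (hypothetical naming; the field descriptions follow the Zip data-descriptor record, not the merged code):

```c
/*
 * Longest-first trial order: with no gaps between Zip components,
 * the longest data-descriptor layout that matches is unambiguously
 * the real one (per Mark Adler's sunzip argument).
 * sig: has leading "PK\x07\x08"?  size_width: bytes per size field.
 */
static const struct { int sig, size_width, total_len; } trial_order[4] = {
	{ 1, 8, 24 },	/* signature + CRC + 64-bit sizes */
	{ 0, 8, 20 },	/* CRC + 64-bit sizes */
	{ 1, 4, 16 },	/* signature + CRC + 32-bit sizes */
	{ 0, 4, 12 },	/* CRC + 32-bit sizes */
};
```

The matching logic itself is unchanged; only the order in which candidates are accepted differs.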

@kientzle kientzle merged commit 8acb738 into libarchive:master Mar 24, 2024
18 of 22 checks passed
@kientzle kientzle deleted the kientzle-zip64 branch July 6, 2024 22:45