Overhaul Zip end-of-data marker parsing #2042
Conversation
This significantly changes how end-of-data markers are parsed. In particular, the spec allows the end-of-data marker to have either 32-bit or 64-bit size values, and there is basically no indication of which is being used. (The spec mentions "Zip64 mode" in many places, but there is no definitive way for a Zip reader to know whether the writer is using this mode or not. My mis-reading of another part of the spec caused me to believe that the Zip64 Extra Data field was such a marker, but I've been patiently corrected. ;-)

So a Zip reader basically just has to guess: try every possible end-of-data marker format and accept it if any of the four possible forms is correct. In libarchive's case, this required some non-trivial additional refactoring to ensure that the CRC32, compressed size, and uncompressed size statistics are always updated _before_ we need to look for an end-of-data marker.

This generally follows the strategy outlined by Mark Adler for his `sunzip` streaming unzip implementation, except that here I accept the shortest end-of-data marker that matches rather than the longest. Since libarchive has a pretty robust re-sync mechanism for skipping garbage data between entries, accepting the shortest helps ensure that we never overshoot the start of the next entry. Of course, the probability that more than one end-of-data marker matches is so small that this is almost certainly a non-issue in practice. (The more complex question is what to do when none of the formats matches exactly: I've chosen to interpret the "most matching" case but not to consume any bytes at all, so a re-sync for the next entry will start from the first byte of the alleged end-of-data marker.)

While testing this, I played with pmqs/zipdetails, which pointed out a discrepancy in how libarchive writes the `UT` extra field. I folded a fix for that in here as well.

Resolves #1834

TODO: It would be nice to augment the test suite with some static files created by Java's implementation to verify that we can read those when they hold entries of +/- 4GiB. The existing `test_write_format_zip_large` uses an ad hoc RLE encoding trick to exercise writing and reading back multi-gigabyte entries. I wonder whether that could be generalized to support deflate-compressed Zip data stored in test files?
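For illustration, here is a minimal sketch of what "try every possible form" can look like. This is not libarchive's actual code: the checks follow the four data descriptor layouts implied by the spec (optional `PK\x07\x08` signature, then CRC-32, then 32-bit or 64-bit sizes), and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DD_SIG 0x08074b50UL	/* optional "PK\x07\x08" signature */

/* Read little-endian values from a possibly unaligned buffer. */
static uint32_t le32(const unsigned char *p) {
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
	    | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static uint64_t le64(const unsigned char *p) {
	return (uint64_t)le32(p) | ((uint64_t)le32(p + 4) << 32);
}

/*
 * Given at least 24 bytes at `p` plus the CRC and sizes accumulated
 * while reading the entry body, return the length of the shortest
 * end-of-data marker that matches, or 0 if none does (in which case
 * the caller can start re-syncing from `p`).
 */
static size_t
match_end_of_data(const unsigned char *p,
    uint32_t crc, uint64_t csize, uint64_t usize)
{
	int fits32 = csize <= UINT32_MAX && usize <= UINT32_MAX;
	int has_sig = le32(p) == DD_SIG;

	/* 12 bytes: crc / csize32 / usize32, no signature */
	if (fits32 && le32(p) == crc &&
	    le32(p + 4) == (uint32_t)csize && le32(p + 8) == (uint32_t)usize)
		return 12;
	/* 16 bytes: sig / crc / csize32 / usize32 */
	if (fits32 && has_sig && le32(p + 4) == crc &&
	    le32(p + 8) == (uint32_t)csize && le32(p + 12) == (uint32_t)usize)
		return 16;
	/* 20 bytes: crc / csize64 / usize64, no signature */
	if (le32(p) == crc && le64(p + 4) == csize && le64(p + 12) == usize)
		return 20;
	/* 24 bytes: sig / crc / csize64 / usize64 */
	if (has_sig && le32(p + 4) == crc &&
	    le64(p + 8) == csize && le64(p + 16) == usize)
		return 24;
	return 0;
}
```

A shortest-match reader checks the candidates in increasing length and stops at the first hit; a longest-match reader such as `sunzip` would test the same four layouts in the reverse order.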
CC: @michalc
Does this make the error checking for this file ever so slightly less robust? It could match sizes and CRC when actually it shouldn't have?
It probably does, though if so, it's a very tiny effect, and it's inherent to the Zip design in any case. For example, one ambiguity here is that the end-of-data marker is allowed to start with a "PK78" signature (0x08074B50) or it can start with the 32-bit CRC value. If the CRC happens to be exactly that value (a 1-in-2^32 chance) at the same time that several other things happen to match exactly, that could lead to a mis-interpretation here. But all the effects I can come up with seem to be in this same range, so even when you add them all up, I don't see a significantly increased chance of a false-positive acceptance.
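As a contrived illustration (not drawn from any real archive): for both the 12-byte and 16-byte forms to match the same bytes, the entry's CRC-32, compressed size, and uncompressed size would all have to equal 0x08074B50, which makes the marker nothing but the signature bytes repeated four times:

```
bytes:            50 4B 07 08 | 50 4B 07 08 | 50 4B 07 08 | 50 4B 07 08
12-byte reading:  crc           csize         usize         (next data)
16-byte reading:  signature     crc           csize         usize
```

Any real entry breaks that coincidence at the first field that fails to match.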
So long as the zip file has no gaps (the components like local entries and the central directory are adjacent), picking the longest data descriptor that works unambiguously determines the data descriptor type, and there is zero increase in false positives. Unfortunately, picking the shortest matching data descriptor that works does not.
CC: @jvreelanda
I went back and re-read Mark Adler's justification for preferring longest match. I'm convinced that his strategy does in fact always do the right thing if there are no errors in the archive and there is no padding or garbage between entries.
@madler Thanks for pushing on this. I went back and studied your arguments in |