@base64d doesn't match base64 -d output #1931

mterron · 2019-06-21T03:30:10Z

Describe the bug
The output of jq's @base64d does not match the output of base64 -d for the same string

To Reproduce

$ echo "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=" | jq -Rr '@base64d' | xxd -p -c64
57432c2f68706f20efbfbd1acd8032efbfbd4029efbfbdefbfbd3e5f0eefbfbd7e0e580122efbfbdefbfbd19550a
$ echo "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=" | base64 -d | xxd -p -c64
57432c2f68706f209c1acd8032b14029a4a33e5f0eb97e0e580122a7e89d1955
$ echo "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=" | base64 -d | hexdump -C
00000000  57 43 2c 2f 68 70 6f 20  9c 1a cd 80 32 b1 40 29  |WC,/hpo ....2.@)|
00000010  a4 a3 3e 5f 0e b9 7e 0e  58 01 22 a7 e8 9d 19 55  |..>_..~.X."....U|
$ echo "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=" | jq -Rr '@base64d' | hexdump -C
00000000  57 43 2c 2f 68 70 6f 20  ef bf bd 1a cd 80 32 ef  |WC,/hpo ......2.|
00000010  bf bd 40 29 ef bf bd ef  bf bd 3e 5f 0e ef bf bd  |..@)......>_....|
00000020  7e 0e 58 01 22 ef bf bd  ef bf bd 19 55 0a        |~.X.".......U.|

Expected behavior
The outputs of echo "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=" | jq -Rr '@base64d' | xxd -p -c64 and echo "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=" | base64 -d | xxd -p -c64 should be equal.

Environment (please complete the following information):

OS and Version: Alpine Linux
jq version: jq-master-v3.8.0-3651-g18d55b6bda

Additional context
Seems to be related to the encoded data being binary rather than text.
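
For readers comparing the two dumps: the `ef bf bd` runs in the jq output are the UTF-8 encoding of U+FFFD, the Unicode replacement character. The sketch below is an illustration in Python, not jq's actual code path. It decodes the same sample with the standard `base64` module and then forces the bytes through a lossy UTF-8 decode. The helper name `lossy_utf8_hex` is made up for this example. For this particular input, the lossy result appears to line up with the jq hex dump (minus the trailing newline), and the raw result with the `base64 -d` dump.

```python
import base64

# Sample string taken from the report above.
SAMPLE = "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU="


def lossy_utf8_hex(b64_text: str) -> str:
    """Decode base64, then push the bytes through a lossy UTF-8 decode.

    Hypothetical helper for illustration only; it mimics the effect seen
    in the report, not the implementation inside jq.
    """
    raw = base64.b64decode(b64_text)
    # errors="replace" substitutes U+FFFD for each ill-formed sequence;
    # re-encoding then writes every U+FFFD as the three bytes ef bf bd.
    return raw.decode("utf-8", errors="replace").encode("utf-8").hex()


raw_hex = base64.b64decode(SAMPLE).hex()   # what a byte-exact decoder yields
lossy_hex = lossy_utf8_hex(SAMPLE)         # what a string-based decoder yields

print(raw_hex)
print(lossy_hex)
```

The two printed lines differ exactly where the original bytes are not well-formed UTF-8, which is why the corruption cannot be undone after the fact: several different invalid bytes all collapse to the same replacement character.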


pkoppstein · 2019-06-21T06:45:48Z

In brief, @base64d has a documented limitation:

Note: If the decoded string is not UTF-8, the results are undefined.

This could be restated as follows:

 Let $B be an arbitrary base64 string, then `$B | @base64d` is undefined
 if `base64 -D <<< $B` is not a valid UTF-8 string.

Here base64 refers to the command-line program of that name, and I'm assuming that base64 -D acts as the inverse of base64.

Using the moreutils program isutf8 to test this condition, we see
that the given base64 string does NOT satisfy it:

$ S="V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU="
$ base64 -D <<< "$S" | xxd -p
57432c2f68706f209c1acd8032b14029a4a33e5f0eb97e0e580122a7e89d
1955
$ base64 -D <<< "$S" | isutf8
(standard input): line 1, char 8, byte 8: Expecting bytes in the following ranges: 00..7F C2..F4.
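
If isutf8 is not at hand, the same condition can be checked with a strict UTF-8 decode. The following is a minimal sketch in Python under that assumption. The function name `first_invalid_utf8_offset` is invented for this example, and the offset it returns is zero-based, which may or may not match the numbering convention isutf8 uses.

```python
import base64


def first_invalid_utf8_offset(data: bytes):
    """Return the zero-based offset of the first ill-formed UTF-8
    sequence in data, or None if data is valid UTF-8.

    Hypothetical helper for illustration; not part of jq or moreutils.
    """
    try:
        data.decode("utf-8")  # strict mode raises on the first bad byte
    except UnicodeDecodeError as err:
        return err.start
    return None


sample = base64.b64decode("V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU=")
offset = first_invalid_utf8_offset(sample)

if offset is None:
    print("valid UTF-8: @base64d should be safe for this input")
else:
    print(f"not UTF-8: first bad byte 0x{sample[offset]:02x} at offset {offset}")
```

A check like this can be run before handing a base64 string to `@base64d`, so that inputs falling under the documented limitation are caught early.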

mterron · 2019-06-21T06:59:33Z

Fair enough. I'd suggest documenting that the result of @base64d is incorrect, rather than merely undefined, when decoding non-UTF-8 strings.

Having put this to rest, are there any plans to provide a way to "shell out" and call external utilities to help process these edge cases?

pkoppstein · 2019-06-21T07:52:27Z

@mterron - I believe the author or authors of this particular section of the official manual did not want to commit to any particular behavior at the time of writing, no doubt because, as you say, there is much to be said for raising an error condition.

As for shelling out -- yes, there are plans to support this (see e.g. #147 and #1614), but it won't help much in the present case, for the reason already stated.

mterron · 2019-06-21T08:08:15Z

@pkoppstein give me a way to shell out and I'll make it work :)

Thanks!

pkoppstein · 2019-06-21T09:22:34Z

@mterron wrote:

I'll make it work :)

It appears you are not quite grasping the fact that the implementation of the shell-out function will (of necessity) be designed to prevent what I understand you want to do.

The "j" in jq can be understood as a commitment that, with two exceptions, every jq filter should produce strictly valid JSON, the exceptions being the values for NaN and Infinite, but even these two values are ultimately converted to valid JSON on output (e.g. echo Nan | jq . #=> null).

This is not to say that every version of jq is guaranteed to reject non-JSON strings, but that's not for want of trying :-)

mterron · 2019-06-21T09:27:26Z

Oh I understand very well. xxd will convert the binary back to a hexadecimal string that is valid JSON.
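
The idea of carrying binary data through JSON as hexadecimal text can be sketched outside jq as well. The Python below is only an illustration of that workaround, assuming the goal is a lossless, JSON-safe representation. The function name `base64_to_hex_json` and the `"hex"` key are made up for the example.

```python
import base64
import json

SAMPLE = "V0MsL2hwbyCcGs2AMrFAKaSjPl8OuX4OWAEip+idGVU="


def base64_to_hex_json(b64_text: str) -> str:
    """Decode base64 to raw bytes and wrap them as a hex string in JSON.

    Hypothetical helper: hex digits are plain ASCII, so the resulting
    document is valid JSON even when the bytes are not valid UTF-8.
    """
    raw = base64.b64decode(b64_text)
    return json.dumps({"hex": raw.hex()})


doc = base64_to_hex_json(SAMPLE)
print(doc)

# The hex form is reversible, so no bytes are lost along the way.
recovered = bytes.fromhex(json.loads(doc)["hex"])
```

Unlike a lossy UTF-8 decode, this keeps every byte recoverable, at the cost of doubling the size of the payload.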

pkoppstein · 2019-06-21T14:43:14Z

@mterron - Excellent. You might want to mention your use case on one of the shell-out tracking issues.

itchyny · 2020-05-14T07:28:33Z

I think this is worth fixing, and fixing it does not break compatibility. It would be useful to be able to decode the image data from aws ec2 get-console-screenshot with jq alone.

pkoppstein · 2021-11-21T10:04:35Z

Congratulations to @itchyny on gojq's @base64d, which passes the test using https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html with flying colors. With B as a local copy of this file:

$ base64 < B > B.base64
$ diff <(gojq -Rrj '@base64d' B.base64) B
$

GwynethLlewelyn · 2022-08-15T23:43:04Z

@pkoppstein does that mean that the Go version of jq is now beating the 'original' jq at its own game? I have made a very simple test, and gojq surprisingly seemed to be much faster at decoding 118K of base64-encoded raw binary data, compared to jq (which, of course, will produce garbage).

Now I'd really love to see a benchmark comparing the two :) [note: I'm well aware that gojq does not fully implement everything that jq does...]

Just tested it on macOS. gojq worked beautifully.

pkoppstein · 2022-08-16T02:03:36Z

@GwynethLlewelyn - gojq is unquestionably better than jq in several important respects --
not least that it is being actively maintained -- but unfortunately
gojq has one major intrinsic drawback compared to jq, namely that in
many cases that jq can handle, it simply runs out of memory. My
understanding is that this is, in effect, by design -- that is, the
problem apparently cannot be addressed within the scope of gojq's
current design.

My experience has been that, in general, gojq tends to trade memory
for speed, at least compared to jq, and the following statistics
suggest that might be the case for @base64d too:

/usr/bin/time -lp gojq -Rrj '@base64d'  B.base64 > /dev/null
             3,436,544  maximum resident set size
             1,499,136  peak memory footprint

/usr/bin/time -lp jq -Rrj '@base64d'  B.base64 > /dev/null
             1,937,408  maximum resident set size
             1,204,224  peak memory footprint

There is also the matter of retaining the order of keys within
objects, which for some users and applications is unimportant, but
that is not always the case.

hachi · 2024-10-15T01:09:26Z

As a user of jq I'm sorely disappointed that a simple note on the limitation of this function isn't in the manpage with over 5 years of lead time.

I spent a good couple hours debugging why my base64 decoded strings were getting corrupted only to find that it's a known limitation of jq.

There is a note in the manpage about how interpolation behaves and nothing else.

pkoppstein · 2024-10-15T01:40:25Z

@hachi - In the section on @base64d, the man page (https://jqlang.github.io/jq/manual/) says:

The inverse of @base64, input is decoded as specified by RFC 4648. Note: If the decoded string is not UTF-8, the results are undefined.

mterron mentioned this issue Jun 21, 2019

base64 decoding function #47

Closed

Maxdamantus mentioned this issue May 20, 2021

Support binary strings, preserve UTF-8 and UTF-16 errors #2314

Open

itchyny added the bug label Jun 3, 2023

D3vil0p3r mentioned this issue Jun 8, 2023

[Request] gojq chaotic-aur/packages#2543

Closed

itchyny mentioned this issue Mar 9, 2024

@base64d fails on pdf files #3061

Closed
