Scala Common Enrich: Base64 decoding does not specify UTF-8 charset #1403

Closed
@alexanderdean

Description

When Base64-encoded JSONs (unstructured events and custom contexts) are decoded, the new String() constructor is called without a charset argument, so it falls back to the JVM's platform default encoding instead of UTF-8. Unfortunately, Hadoop on EMR defaults to US_ASCII, so non-ASCII characters are corrupted.

For any future reader: I initially thought the issue was that Hadoop was picking up an old version of Commons Base64, but that explanation can be ruled out for two reasons:

  1. The Base64 API was different before URL-safe Base64 support was added, so an old version on the classpath would have thrown a runtime exception.

  2. When running an early (pre-URL-safe) version of Base64 against a URL-safe string, the decoded string is corrupted from the first non-URLsafe character onwards, not just at the Unicode characters.

    Here is a demo:

scala> import org.apache.commons.codec.binary.Base64
import org.apache.commons.codec.binary.Base64

scala> val decoder = new Base64(true)
decoder: org.apache.commons.codec.binary.Base64 = org.apache.commons.codec.binary.Base64@2685334e

scala> val decodedBytes = decoder.decode("eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy91bnN0cnVjdF9ldmVudC9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy13ZWJzaXRlL3NpZ251cF9mb3JtX3N1Ym1pdHRlZC9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6eyJuYW1lIjoizqfOsc-BzrnPhM6vzr3OtyBORVcgVW5pY29kZSB0ZXN0IiwiZW1haWwiOiJhbGV4K3Rlc3RAc25vd3Bsb3dhbmFseXRpY3MuY29tIiwiY29tcGFueSI6IlNQIiwiZXZlbnRzUGVyTW9udGgiOiI8IDEgbWlsbGlvbiIsInNlcnZpY2VUeXBlIjoidW5zdXJlIn19fQ")
decodedBytes: Array[Byte] = Array(123, 34, 115, 99, 104, 101, 109, 97, 34, 58, 34, 105, 103, 108, 117, 58, 99, 111, 109, 46, 115, 110, 111, 119, 112, 108, 111, 119, 97, 110, 97, 108, 121, 116, 105, 99, 115, 46, 115, 110, 111, 119, 112, 108, 111, 119, 47, 117, 110, 115, 116, 114, 117, 99, 116, 95, 101, 118, 101, 110, 116, 47, 106, 115, 111, 110, 115, 99, 104, 101, 109, 97, 47, 49, 45, 48, 45, 48, 34, 44, 34, 100, 97, 116, 97, 34, 58, 123, 34, 115, 99, 104, 101, 109, 97, 34, 58, 34, 105, 103, 108, 117, 58, 99, 111, 109, 46, 115, 110, 111, 119, 112, 108, 111, 119, 97, 110, 97, 108, 121, 116, 105, 99, 115, 46, 115, 110, 111, 119, 112, 108, 111, 119, 45, 119, 101, 98, 115, 105, 116, 101, 47, 115, 105, 103, 110, 117, 112, 95, 102, 111, 114, 109, 95, 115, 117, 98, 109, 105, 116, 116, 101, 100,...

scala> new String(decodedBytes)
res0: String = {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow-website/signup_form_submitted/jsonschema/1-0-0","data":{"name":"Χαριτίνη NEW Unicode test","email":"alex+test@snowplowanalytics.com","company":"SP","eventsPerMonth":"< 1 million","serviceType":"unsure"}}}

scala> new String(decodedBytes, java.nio.charset.StandardCharsets.UTF_8)
res1: String = {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow-website/signup_form_submitted/jsonschema/1-0-0","data":{"name":"Χαριτίνη NEW Unicode test","email":"alex+test@snowplowanalytics.com","company":"SP","eventsPerMonth":"< 1 million","serviceType":"unsure"}}}

scala> new String(decodedBytes, java.nio.charset.StandardCharsets.US_ASCII)
res2: String = {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow-website/signup_form_submitted/jsonschema/1-0-0","data":{"name":"���������������� NEW Unicode test","email":"alex+test@snowplowanalytics.com","company":"SP","eventsPerMonth":"< 1 million","serviceType":"unsure"}}}
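
The fix implied by the demo above can be sketched as follows. This is a minimal illustration rather than the actual Scala Common Enrich code: it assumes java.util.Base64 (Java 8+) in place of Commons Codec so that it runs without extra dependencies, and the names Base64Utf8Sketch and decodeBase64Url are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object Base64Utf8Sketch {

  // Hypothetical helper: decodes a URL-safe Base64 string and always
  // interprets the resulting bytes as UTF-8, regardless of the JVM's
  // platform default charset (which may be US_ASCII, as on EMR).
  def decodeBase64Url(encoded: String): String = {
    // The JDK URL-safe decoder accepts input with or without '=' padding.
    val bytes: Array[Byte] = Base64.getUrlDecoder.decode(encoded)
    // Passing the charset explicitly is the whole fix: new String(bytes)
    // alone would use Charset.defaultCharset().
    new String(bytes, StandardCharsets.UTF_8)
  }
}
```

Calling Base64Utf8Sketch.decodeBase64Url on the payload from the demo should then give the same string on any host, because the result no longer depends on the file.encoding setting of the JVM.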

/cc @epantera, @dstendardi

Activity

self-assigned this
on Feb 9, 2015
added a commit that references this issue on Feb 9, 2015
3569249
epantera commented on Feb 9, 2015

You Rock, thanks @alexanderdean

added a commit that references this issue on Feb 9, 2015
de40ae2
added 3 commits that reference this issue on Feb 9, 2015
c81948f
2f1d726
5bf502b
added a commit that references this issue on May 29, 2020
5ff80ab
    Scala Common Enrich: Base64 decoding does not specify UTF-8 charset · Issue #1403 · snowplow/snowplow