Scala Common Enrich: Base64 decoding does not specify UTF-8 charset #1403
Description
When Base64-encoded JSONs (unstructured events and custom contexts) are decoded, the new String() constructor is called without a charset argument, so it falls back to the JVM's default charset instead of UTF-8. Unfortunately, Hadoop on EMR defaults to US_ASCII, causing non-ASCII characters to be corrupted.
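To illustrate why the missing charset argument matters, here is a minimal sketch using only the JDK. The object and method names (CharsetDemo, implicitDecode and so on) are hypothetical and not taken from Scala Common Enrich; the sample bytes are the UTF-8 encoding of the Greek name used in the demo further down.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical demo object, not part of Scala Common Enrich
object CharsetDemo {
  // UTF-8 bytes of the Greek name from the demo (8 characters, 2 bytes each)
  val name: String = "\u03A7\u03B1\u03C1\u03B9\u03C4\u03AF\u03BD\u03B7"
  val bytes: Array[Byte] = name.getBytes(StandardCharsets.UTF_8)

  // Result depends on the JVM default charset, so it varies by environment
  def implicitDecode(b: Array[Byte]): String = new String(b)

  // Result is the same in every environment
  def explicitDecode(b: Array[Byte]): String = new String(b, StandardCharsets.UTF_8)

  // Simulates a JVM whose default charset is US-ASCII
  def asciiDecode(b: Array[Byte]): String = new String(b, StandardCharsets.US_ASCII)

  def main(args: Array[String]): Unit = {
    println(s"Default charset: ${Charset.defaultCharset()}")
    println(implicitDecode(bytes))
    println(explicitDecode(bytes))
    println(asciiDecode(bytes))
  }
}
```

Under US-ASCII every byte above 0x7F is replaced with U+FFFD, which is why each two-byte Greek character shows up as two replacement characters.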
For any future reader: I initially thought the issue was that Hadoop was picking up an old version of Commons Codec Base64, but that theory can be disproved in two ways:
- The Base64 API was different before URLsafe Base64 support was added, so the code would have thrown a runtime exception.
- When an early (pre-URLsafe) version of Base64 is run against a URLsafe string, the decoded string is corrupted from the first non-URLsafe character onwards, not just at the Unicode characters.
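Here is a demo, run on a machine whose default charset evidently handles UTF-8, so the charset-less new String() happens to give the right result there: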
scala> import org.apache.commons.codec.binary.Base64
import org.apache.commons.codec.binary.Base64
scala> val decoder = new Base64(true)
decoder: org.apache.commons.codec.binary.Base64 = org.apache.commons.codec.binary.Base64@2685334e
scala> val decodedBytes = decoder.decode("eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy91bnN0cnVjdF9ldmVudC9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy13ZWJzaXRlL3NpZ251cF9mb3JtX3N1Ym1pdHRlZC9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6eyJuYW1lIjoizqfOsc-BzrnPhM6vzr3OtyBORVcgVW5pY29kZSB0ZXN0IiwiZW1haWwiOiJhbGV4K3Rlc3RAc25vd3Bsb3dhbmFseXRpY3MuY29tIiwiY29tcGFueSI6IlNQIiwiZXZlbnRzUGVyTW9udGgiOiI8IDEgbWlsbGlvbiIsInNlcnZpY2VUeXBlIjoidW5zdXJlIn19fQ")
decodedBytes: Array[Byte] = Array(123, 34, 115, 99, 104, 101, 109, 97, 34, 58, 34, 105, 103, 108, 117, 58, 99, 111, 109, 46, 115, 110, 111, 119, 112, 108, 111, 119, 97, 110, 97, 108, 121, 116, 105, 99, 115, 46, 115, 110, 111, 119, 112, 108, 111, 119, 47, 117, 110, 115, 116, 114, 117, 99, 116, 95, 101, 118, 101, 110, 116, 47, 106, 115, 111, 110, 115, 99, 104, 101, 109, 97, 47, 49, 45, 48, 45, 48, 34, 44, 34, 100, 97, 116, 97, 34, 58, 123, 34, 115, 99, 104, 101, 109, 97, 34, 58, 34, 105, 103, 108, 117, 58, 99, 111, 109, 46, 115, 110, 111, 119, 112, 108, 111, 119, 97, 110, 97, 108, 121, 116, 105, 99, 115, 46, 115, 110, 111, 119, 112, 108, 111, 119, 45, 119, 101, 98, 115, 105, 116, 101, 47, 115, 105, 103, 110, 117, 112, 95, 102, 111, 114, 109, 95, 115, 117, 98, 109, 105, 116, 116, 101, 100,...
scala> new String(decodedBytes)
res0: String = {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow-website/signup_form_submitted/jsonschema/1-0-0","data":{"name":"Χαριτίνη NEW Unicode test","email":"alex+test@snowplowanalytics.com","company":"SP","eventsPerMonth":"< 1 million","serviceType":"unsure"}}}
scala> new String(decodedBytes, java.nio.charset.StandardCharsets.UTF_8)
res1: String = {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow-website/signup_form_submitted/jsonschema/1-0-0","data":{"name":"Χαριτίνη NEW Unicode test","email":"alex+test@snowplowanalytics.com","company":"SP","eventsPerMonth":"< 1 million","serviceType":"unsure"}}}
scala> new String(decodedBytes, java.nio.charset.StandardCharsets.US_ASCII)
res2: String = {"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow-website/signup_form_submitted/jsonschema/1-0-0","data":{"name":"���������������� NEW Unicode test","email":"alex+test@snowplowanalytics.com","company":"SP","eventsPerMonth":"< 1 million","serviceType":"unsure"}}}
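The fix implied by the demo is to always pass the charset explicitly. The following is only a sketch of that idea, not the actual change made in Scala Common Enrich: it swaps in the JDK's java.util.Base64 URL decoder for Commons Codec so that it runs without extra dependencies, and the names Base64Utf8 and decodeUrlSafe are hypothetical.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64
import scala.util.Try

// Hypothetical helper, assuming URL-safe Base64 input that encodes UTF-8 text
object Base64Utf8 {

  /** Decodes a URL-safe Base64 string into text, always interpreting
    * the decoded bytes as UTF-8 regardless of the JVM default charset.
    * Returns Left with a message if the input is not valid Base64.
    */
  def decodeUrlSafe(encoded: String): Either[String, String] =
    Try(Base64.getUrlDecoder.decode(encoded))
      .map(bytes => new String(bytes, UTF_8)) // explicit charset is the key point
      .toEither
      .left
      .map(e => s"Could not decode Base64 input: ${e.getMessage}")
}
```

Returning an Either keeps malformed input from throwing inside a job; whether the real enrichment code reports errors this way is an assumption.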
/cc @epantera, @dstendardi
Activity
Scala Common Enrich: Base64 decoding now specifies UTF-8 charset (fixes
epantera commented on Feb 9, 2015
You Rock, thanks @alexanderdean
Base64 decoding now specifies UTF-8 charset (fixes snowplow/snowplow#…