Protocol tests with strings containing Unicode supplementary characters and length trait #1090
Description
Smithy defines string shapes to be UTF-8 encoded, and the length trait, when applied to strings, counts the number of Unicode code points (although I'd like that to be changed to say Unicode scalar values).
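For reference, a minimal model sketch showing the trait in question (the namespace and shape name are illustrative, not from any real model):

```smithy
namespace smithy.example

// Per the spec, this constrains the string to between 1 and 3
// Unicode code points, not bytes or UTF-16 code units.
@length(min: 1, max: 3)
string ShortString
```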
(I've only checked the restJson1 protocol tests, but the below most likely applies to all test suites.) However, I haven't been able to find any protocol tests that exercise enforcement of this trait with strings containing Unicode code points outside the Basic Multilingual Plane. I feel like such tests are very important, lest an SDK in a programming language whose canonical string type does not use UTF-8 encoding (e.g. Java uses UTF-16 in its String class) implement this trait "intuitively" but incorrectly (e.g. in Java, length() returns the number of UTF-16 code units, whereas the SDK implementer should probably use codePointCount() to be compliant with the Smithy spec).
Here's an example in Java with the first Unicode supplementary character, U+10000, highlighting how things could go subtly wrong:
```java
import java.nio.charset.StandardCharsets;

class Main {
    public static void main(String[] args) {
        // UTF-16 (big-endian) encoding of U+10000: the surrogate pair D800 DC00.
        byte[] b = {(byte) 0xd8, (byte) 0x00, (byte) 0xdc, (byte) 0x00};
        String s = new String(b, StandardCharsets.UTF_16);
        // System.out.println(s);
        System.out.println(s.codePointAt(0));                // 65536
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```
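A compliant length check therefore has to count code points, not code units. A minimal sketch of what that might look like in a Java SDK (satisfiesLength is a hypothetical helper, not an actual SDK API):

```java
// Hypothetical helper: validates a Smithy @length constraint on a string
// by counting Unicode code points rather than UTF-16 code units.
static boolean satisfiesLength(String value, long min, long max) {
    int codePoints = value.codePointCount(0, value.length());
    return codePoints >= min && codePoints <= max;
}
```

With the string above, satisfiesLength(s, 1, 1) holds even though s.length() == 2, which is exactly the case a naive length()-based check would get wrong.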