Protocol tests with strings containing Unicode supplementary characters and `length` trait

Smithy defines string shapes to be UTF-8 encoded, and the `length` trait, when applied to strings, [counts the number of Unicode code points](https://awslabs.github.io/smithy/1.0/spec/core/constraint-traits.html?highlight=enum#length-trait) (although I'd like that to be [changed to say Unicode scalar values](https://github.com/awslabs/smithy/pull/1089)).
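To illustrate why the code point vs. scalar value wording can matter, here is a minimal Java sketch. The class and method names are hypothetical, and treating "scalar value count" as "code points that are not surrogates" is my own assumption for illustration, not something the spec or the linked PR prescribes. A Java `String` can hold an unpaired surrogate, which is a code point but not a scalar value, so the two counts can diverge:

```java
final class ScalarValueSketch {
    // Counts Unicode code points; an unpaired surrogate counts as one code point.
    static long codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Assumption for illustration: counts only code points outside the
    // surrogate range D800..DFFF, i.e. Unicode scalar values.
    static long scalarValueCount(String s) {
        return s.codePoints().filter(cp -> cp < 0xD800 || cp > 0xDFFF).count();
    }

    public static void main(String[] args) {
        String loneSurrogate = "a" + (char) 0xD800; // 'a' followed by an unpaired high surrogate
        System.out.println(codePointCount(loneSurrogate));   // 2
        System.out.println(scalarValueCount(loneSurrogate)); // 1
    }
}
```

Since well-formed UTF-8 cannot encode surrogates, this difference should only show up for strings that were never valid on the wire in the first place.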

However, I haven't been able to find any protocol tests that exercise enforcement of this trait with strings containing Unicode code points outside the Basic Multilingual Plane. (I've only checked the restJson1 protocol tests, but this most likely applies to all test suites.) I think such tests are very important: in a programming language whose canonical string type does not use UTF-8 (e.g. Java uses UTF-16 in its `String` class), an SDK could implement this trait in a way that is "intuitive" but incorrect. In Java, for example, `length()` returns the number of UTF-16 _code units_, whereas the SDK implementer should probably use [`codePointCount()`](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#codePointCount(int,int)) to be compliant with the Smithy spec.

Here's an example in Java with the first Unicode supplementary character, U+10000, highlighting how things could go subtly wrong:

```java
import java.nio.charset.StandardCharsets;

class Main {
    public static void main(String[] args) {
        // Big-endian UTF-16 encoding of U+10000: the surrogate pair D800 DC00.
        byte[] b = {(byte) 0xd8, (byte) 0x00, (byte) 0xdc, (byte) 0x00};
        String s = new String(b, StandardCharsets.UTF_16); // no BOM, so decoded as big-endian
        System.out.println(s.codePointAt(0)); // 65536, i.e. U+10000
        System.out.println(s.length()); // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code points)
    }
}
```
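Building on the example above, here is a rough sketch of how a `length` check could go wrong. The class and both helpers are hypothetical and not taken from any Smithy SDK; the bounds are assumed to be inclusive `min`/`max` values as in the `length` trait. With `min = 1, max = 1`, a string consisting only of U+10000 should pass, but a check based on `length()` rejects it:

```java
final class LengthTraitSketch {
    // Hypothetical helper: checks inclusive bounds by counting Unicode code
    // points, which is what the Smithy spec describes for strings.
    static boolean satisfiesLength(String value, int min, int max) {
        int codePoints = value.codePointCount(0, value.length());
        return codePoints >= min && codePoints <= max;
    }

    // The "intuitive" variant: counts UTF-16 code units instead, so every
    // supplementary character is counted twice.
    static boolean satisfiesLengthNaive(String value, int min, int max) {
        int codeUnits = value.length();
        return codeUnits >= min && codeUnits <= max;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10000)); // U+10000 as a surrogate pair
        System.out.println(satisfiesLength(s, 1, 1));      // true
        System.out.println(satisfiesLengthNaive(s, 1, 1)); // false
    }
}
```

A protocol test along these lines (a one-code-point supplementary string against a `length` of exactly 1) would catch the naive variant.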

