Protocol tests with strings containing Unicode supplementary characters and length trait #1090

Closed
@david-perez

Description

Smithy defines string shapes to be UTF-8 encoded, and the length trait, when applied to strings, counts the number of Unicode code points (although I'd like that to be changed to say Unicode scalar values).
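To illustrate why the distinction between code points and scalar values matters, here is a minimal Java sketch. The class and method names (`ScalarValueDemo`, `codePoints`, `isAllScalarValues`) are hypothetical and exist only for this example. It shows that an unpaired surrogate counts as one code point even though it is not a Unicode scalar value.

```java
class ScalarValueDemo {
    // Counts Unicode code points; an unpaired surrogate counts as one code point.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns true if every code point in s is a Unicode scalar value,
    // i.e. the string contains no unpaired surrogate (U+D800..U+DFFF).
    static boolean isAllScalarValues(String s) {
        return s.codePoints().noneMatch(cp -> cp >= 0xD800 && cp <= 0xDFFF);
    }

    public static void main(String[] args) {
        String loneSurrogate = "\uD800";       // unpaired high surrogate
        String supplementary = "\uD800\uDC00"; // surrogate pair for U+10000

        System.out.println(codePoints(loneSurrogate));        // 1
        System.out.println(isAllScalarValues(loneSurrogate)); // false
        System.out.println(codePoints(supplementary));        // 1
        System.out.println(isAllScalarValues(supplementary)); // true
    }
}
```

Since surrogate code points cannot appear in well-formed UTF-8, counting scalar values would arguably match the "UTF-8 encoded" definition more closely.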

I haven't been able to find any protocol tests that exercise enforcement of this trait with strings containing Unicode code points outside the Basic Multilingual Plane. (I've only checked the restJson1 protocol tests, but this most likely applies to all test suites.) I think such tests are very important: without them, an SDK written in a programming language whose canonical string type does not use UTF-8 (e.g. Java's String class uses UTF-16) could implement this trait in a way that is "intuitive" but incorrect. In Java, for example, length() returns the number of UTF-16 code units, whereas the SDK implementer should probably use codePointCount() to be compliant with the Smithy spec.

Here's an example in Java with the first Unicode supplementary character, U+10000, highlighting how things could go subtly wrong:

import java.nio.charset.StandardCharsets;

class Main {
    public static void main(String[] args) {
        // UTF-16 (big-endian) encoding of U+10000: the surrogate pair D800 DC00.
        byte[] b = {(byte) 0xd8, (byte) 0x00, (byte) 0xdc, (byte) 0x00};
        // With no byte order mark, StandardCharsets.UTF_16 decodes as big-endian.
        String s = new String(b, StandardCharsets.UTF_16);
        System.out.println(s.codePointAt(0)); // 65536
        System.out.println(s.length()); // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
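For comparison, a length check that counts code points might look roughly like the sketch below. `LengthTraitCheck` and `satisfiesLength` are hypothetical names, not part of any Smithy SDK, and a `null` bound is assumed to mean "unbounded".

```java
class LengthTraitCheck {
    // Hypothetical helper: checks a string against optional min/max bounds,
    // counting Unicode code points rather than UTF-16 code units.
    // A null bound is treated as "no constraint" (an assumption of this sketch).
    static boolean satisfiesLength(String value, Long min, Long max) {
        long count = value.codePointCount(0, value.length());
        return (min == null || count >= min) && (max == null || count <= max);
    }

    public static void main(String[] args) {
        // "a", U+10000, "b": 3 code points, but 4 UTF-16 code units.
        String s = "a\uD800\uDC00b";

        System.out.println(satisfiesLength(s, null, 3L)); // true
        System.out.println(s.length() <= 3);              // false
    }
}
```

A protocol test with `max: 3` and an input like this one would catch an implementation that compares `length()` against the bound.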
