Lexical ordering of strings with surrogate pairs #1346

Open
@dlurton

Description

PartiQL can get the ordering of strings wrong if they contain surrogate pairs.

To Reproduce

This is a somewhat contrived example, but it demonstrates the point.

@Test
fun `lexical ordering of strings with surrogate pairs`() {
    // The code point of 'Ａ' (FULLWIDTH LATIN CAPITAL LETTER A) is U+FF21.
    // The code point of '💩' is U+1F4A9.
    // Therefore, 'Ａ' should be ordered first by PartiQL.

    // However, PartiQL currently falls back on the JVM to compare strings.  The JVM lexicographically
    // compares by UTF-16 code unit instead of full code point, and this can cause strings with characters
    // requiring surrogate pairs to sort incorrectly: '💩' is encoded as the surrogate pair 0xD83D 0xDCA9,
    // and 0xD83D < 0xFF21. The two orders only disagree for BMP characters in U+E000..U+FFFF, which sit
    // above the surrogate range (0xD800..0xDFFF).

    // Therefore this test fails.

    assertTrue(
        DEFAULT_COMPARATOR.compare(
            ExprValue.newString("\uFF21"),
            ExprValue.newString("💩")
        ) < 0,
        "'Ａ' should come before '💩'"
    )
}
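
The difference between code unit order and code point order can be seen with plain JVM strings, independent of PartiQL. The sketch below is illustrative only: `codeUnits` is a hypothetical helper, and the two characters are chosen because the orders disagree only when a supplementary character is compared with a BMP character in U+E000..U+FFFF (surrogate code units occupy 0xD800..0xDFFF).

```kotlin
// Illustrative sketch using plain JVM strings; no PartiQL types are involved.

// Hypothetical helper: renders each UTF-16 code unit of a string as hex.
fun codeUnits(s: String): List<String> =
    s.map { "0x%04X".format(it.code) }

fun main() {
    val fullwidthA = "\uFF21"        // U+FF21, a single UTF-16 code unit
    val pileOfPoo = "\uD83D\uDCA9"   // U+1F4A9, encoded as a surrogate pair

    println(codeUnits(fullwidthA))   // [0xFF21]
    println(codeUnits(pileOfPoo))    // [0xD83D, 0xDCA9]

    // String.compareTo compares code units, so 0xFF21 vs 0xD83D decides the result.
    println(fullwidthA.compareTo(pileOfPoo) > 0)                    // true

    // Comparing whole code points gives the opposite order: 0xFF21 < 0x1F4A9.
    println(fullwidthA.codePointAt(0) < pileOfPoo.codePointAt(0))   // true
}
```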

Expected Behavior

The test in the repro case should pass.
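
One possible direction, sketched here under the assumption that string comparison should follow Unicode code point order, is to walk both strings one code point at a time. `compareByCodePoint` is a hypothetical name and not part of PartiQL; how such a function would be wired into `DEFAULT_COMPARATOR` is not shown.

```kotlin
// Hypothetical helper, not part of PartiQL: orders strings by Unicode code point
// rather than by UTF-16 code unit.
fun compareByCodePoint(a: String, b: String): Int {
    var i = 0
    // While all code points so far are equal, both strings have consumed
    // identical code units, so a single index is valid for both.
    while (i < a.length && i < b.length) {
        val cpA = a.codePointAt(i)
        val cpB = b.codePointAt(i)
        if (cpA != cpB) return cpA.compareTo(cpB)
        i += Character.charCount(cpA) // 1 for BMP characters, 2 for surrogate pairs
    }
    // One string is a prefix of the other (or they are equal): shorter sorts first.
    return a.length.compareTo(b.length)
}

fun main() {
    val fullwidthA = "\uFF21"        // U+FF21
    val pileOfPoo = "\uD83D\uDCA9"   // U+1F4A9

    // Code unit order (the JVM default) places the surrogate pair first.
    println(fullwidthA.compareTo(pileOfPoo) > 0)            // true
    // Code point order places U+FF21 first.
    println(compareByCodePoint(fullwidthA, pileOfPoo) < 0)  // true

    val byCodePoint = listOf(pileOfPoo, fullwidthA, "a")
        .sortedWith(Comparator { x, y -> compareByCodePoint(x, y) })
    println(byCodePoint == listOf("a", fullwidthA, pileOfPoo)) // true
}
```

Comparing code points directly avoids decoding either string up front and stops at the first difference, which keeps the cost close to that of the code unit comparison.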

Additional Context

I can't think of anything else to add.

Metadata

Labels

bug (Something isn't working)
