Lexical ordering of strings with surrogate pairs #1346
Open
Description
PartiQL can order strings incorrectly if they contain characters that UTF-16 encodes as surrogate pairs, that is, code points above U+FFFF.
To Reproduce
This is a somewhat contrived example, but it demonstrates the point.
@Test
fun `lexical ordering of strings with surrogate pairs`() {
    // The code point of 'ﬁ' is U+FB01.
    // The code point of '💩' is U+1F4A9, which UTF-16 encodes as the surrogate pair U+D83D U+DCA9.
    // Therefore, 'ﬁ' should be ordered first by PartiQL.
    // However, PartiQL currently falls back on the JVM to compare strings. The JVM lexicographically
    // compares by UTF-16 code unit instead of full code point, so 0xFB01 is compared against 0xD83D
    // and '💩' is ordered first. Strings with characters requiring surrogate pairs can therefore
    // sort incorrectly relative to characters in U+E000..U+FFFF.
    // Therefore this test fails.
    assertTrue(
        DEFAULT_COMPARATOR.compare(
            ExprValue.newString("ﬁ"),
            ExprValue.newString("💩")
        ) < 0,
        "'ﬁ' should come before '💩'"
    )
}
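The JVM behavior can be observed without PartiQL. The sketch below uses only the Kotlin standard library; `codeUnits` is a hypothetical helper written for this illustration. It shows that `String.compareTo` orders by UTF-16 code unit, so a supplementary character (whose first code unit lies in U+D800..U+DBFF) sorts after a BMP character below the surrogate range, such as U+AB30, but before one in U+E000..U+FFFF, such as U+FB01.

```kotlin
// Hypothetical helper: hex UTF-16 code units of a string, space separated.
fun codeUnits(s: String): String =
    s.map { "%04X".format(it.code) }.joinToString(" ")

fun main() {
    val pileOfPoo = "\uD83D\uDCA9" // U+1F4A9, stored as a surrogate pair
    val barredAlpha = "\uAB30"     // U+AB30, below the surrogate range
    val fiLigature = "\uFB01"      // U+FB01, above the surrogate range

    println(codeUnits(pileOfPoo))                 // D83D DCA9

    // String.compareTo compares code unit by code unit.
    println(barredAlpha.compareTo(pileOfPoo) < 0) // true:  0xAB30 < 0xD83D
    println(fiLigature.compareTo(pileOfPoo) < 0)  // false: 0xFB01 > 0xD83D
}
```

Only the second comparison disagrees with code point order, so a repro needs a BMP character in U+E000..U+FFFF on one side.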
Expected Behavior
The test in the repro case should pass.
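One possible direction, sketched here as an assumption rather than as PartiQL's actual fix, is to compare strings code point by code point. `compareByCodePoint` is a hypothetical name; the sketch relies only on `String.codePointAt` and `Character.charCount` from the JVM.

```kotlin
// Hypothetical helper, not part of PartiQL: orders strings by Unicode code point.
fun compareByCodePoint(a: String, b: String): Int {
    var i = 0
    while (i < a.length && i < b.length) {
        val cpA = a.codePointAt(i)
        val cpB = b.codePointAt(i)
        if (cpA != cpB) return cpA.compareTo(cpB)
        // Equal code points occupy the same number of code units in both strings.
        i += Character.charCount(cpA)
    }
    // One string is a prefix of the other (or they are equal): shorter sorts first.
    return a.length.compareTo(b.length)
}

fun main() {
    val pileOfPoo = "\uD83D\uDCA9" // U+1F4A9
    val fiLigature = "\uFB01"      // U+FB01

    println(compareByCodePoint(fiLigature, pileOfPoo) < 0) // true: U+FB01 < U+1F4A9
    println(fiLigature.compareTo(pileOfPoo) < 0)           // false: JVM code unit order
}
```

Whether such a comparison belongs inside `DEFAULT_COMPARATOR` or in a separate string comparator is a design question this sketch does not address.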
Additional Context
I can't think of anything else to add.