docs: `length` trait counts number of Unicode scalar values for strings #1089

david-perez · 2022-02-14T18:08:35Z

Since Smithy defines string shapes to be UTF-8 encoded, it is more
correct for the spec to explicitly say the length trait counts the
number of Unicode scalar values when applied to strings.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


mtdowling · 2022-02-18T19:54:58Z

What makes this more correct? Is it saying the same thing as what JSON Schema's length validator in Java already does (i.e., `value.codePointCount(0, value.length())`), which matches how API Gateway interprets length constraints too? Smithy should be compatible with JSON Schema's interpretation of length (which seems vague to me, so I'm basing it on JSON Schema library implementations).

david-perez · 2022-02-19T00:26:02Z

The previous wording could be confusing: someone could misinterpret the `length` trait as also counting the Unicode code points in the surrogate range, U+D800 to U+DFFF, which exists only to support UTF-16. Since Smithy only contemplates "UTF-8 encoded string shapes"¹, and well-formed UTF-8 cannot encode code points from that range, there is nothing in that range to count. The set of Unicode scalar values is precisely the set of all Unicode code points minus that range.

So both wordings are technically correct given the context, but this new one is context-independent and in my opinion clearer.
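To make the distinction concrete, here is a minimal Python sketch of the counting rule described above. The helper names `is_scalar_value` and `scalar_value_length` are hypothetical, chosen for illustration; they are not part of Smithy or any Smithy library. The sketch assumes a `length` check that counts scalar values and rejects strings containing surrogates.

```python
def is_scalar_value(cp: int) -> bool:
    """A Unicode scalar value is any code point outside the surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def scalar_value_length(s: str) -> int:
    """Count Unicode scalar values; reject strings that hold surrogates.

    Hypothetical helper for illustration only. Python's len() counts code
    points, so it equals the scalar value count once surrogates are ruled out.
    """
    if not all(is_scalar_value(ord(ch)) for ch in s):
        raise ValueError("string contains a surrogate code point")
    return len(s)


# 'a', 'é' (U+00E9), and one supplementary character (U+1F600)
text = "a\u00e9\U0001F600"

print(scalar_value_length(text))           # 3 scalar values
print(len(text.encode("utf-8")))           # 7 UTF-8 bytes (1 + 2 + 4)
print(len(text.encode("utf-16-le")) // 2)  # 4 UTF-16 code units (1 + 1 + 2)
```

A lone surrogate such as `"\ud800"` cannot be encoded at all: `"\ud800".encode("utf-8")` raises `UnicodeEncodeError`, which is the sense in which a UTF-8 encoded string has no code points from that range to count.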

> Is it saying the same thing as what JSON schema's length validator in Java already does (i.e., `value.codePointCount(0, value.length())`), which matches how API Gateway interprets length constraints too?

Yes.
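As a rough check of this equivalence, the sketch below mimics in Python how Java's `String.codePointCount` walks UTF-16 code units: a high surrogate followed by a low surrogate counts as one code point, and an unpaired surrogate counts as one code point on its own. The function names `utf16_code_units` and `code_point_count` are hypothetical, and the code is an emulation for illustration, not the JSON Schema validator's or API Gateway's actual implementation.

```python
from typing import List


def utf16_code_units(s: str) -> List[int]:
    """Return the UTF-16 code units of a well-formed string."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_point_count(units: List[int]) -> int:
    """Count code points over UTF-16 code units, pairing surrogates."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A valid surrogate pair consumes two code units but is one code point.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


text = "a\u00e9\U0001F600"
units = utf16_code_units(text)

print(len(units))               # 4 UTF-16 code units
print(code_point_count(units))  # 3, the same as the number of scalar values
print(code_point_count([0xD800]))  # 1: an unpaired surrogate still counts
```

For well-formed text the two counts agree. They can only diverge for a UTF-16 sequence holding an unpaired surrogate, which a UTF-8 encoded string cannot represent.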

¹ Whatever that may mean; in my opinion, the spec does not clearly explain what that statement actually entails.

docs: length trait counts number of Unicode scalar values for strings

23deabc

Since Smithy defines string shapes to be UTF-8 encoded, it is more correct for the spec to explicitly say the `length` trait counts the number of Unicode scalar values when applied to strings.

david-perez requested a review from a team as a code owner February 14, 2022 18:08

This was referenced Feb 14, 2022

Protocol tests with strings containing Unicode supplementary characters and length trait #1090

Closed

Add server SDK constraint traits RFC smithy-lang/smithy-rs#1199

Merged

mtdowling approved these changes Feb 21, 2022


mtdowling merged commit 9cd4e10 into smithy-lang:main Feb 21, 2022
