docs: length trait counts number of Unicode scalar values for strings #1089

Merged

david-perez
Contributor

Since Smithy defines string shapes to be UTF-8 encoded, it is more
correct for the spec to explicitly say the length trait counts the
number of Unicode scalar values when applied to strings.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mtdowling
Member

What makes this more correct? Is it saying the same thing as what JSON Schema's length validator in Java already does (i.e., `value.codePointCount(0, value.length())`), which matches how API Gateway interprets length constraints too? Smithy should be compatible with JSON Schema's interpretation of length (which seems vague to me, so I'm basing it on JSON Schema library implementations).

@david-perez
Contributor Author

The previous wording could be confusing: someone could misinterpret the length trait as also counting Unicode code points in the surrogate range, U+D800 to U+DFFF, which exists only to support UTF-16. Since Smithy only contemplates "UTF-8 encoded string shapes" [1], code points from that range cannot appear in a string value, so there is nothing in that range to count. The set of Unicode scalar values is precisely the set of all Unicode code points minus that range.

So both wordings are technically correct given the context, but this new one is context-independent and in my opinion clearer.
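
To make the distinction concrete, here is a minimal Java sketch comparing the code point count (the expression quoted in this thread) with a scalar value count. The class name `LengthTraitSketch` and the helpers `codePoints` and `scalarValues` are hypothetical names chosen for illustration; they are not part of Smithy or of any JSON Schema library. The two counts agree for any string that can be encoded as well-formed UTF-8 and differ only when a Java `String` holds an unpaired surrogate.

```java
import java.nio.charset.StandardCharsets;

public class LengthTraitSketch {
    // Counts Unicode code points, using the expression quoted in the thread.
    // An unpaired surrogate counts as one code point here.
    static int codePoints(String value) {
        return value.codePointCount(0, value.length());
    }

    // Counts Unicode scalar values: code points outside U+D800..U+DFFF.
    // (Hypothetical helper, written only to illustrate the definition.)
    static long scalarValues(String value) {
        return value.codePoints()
                .filter(cp -> cp < 0xD800 || cp > 0xDFFF)
                .count();
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600, which UTF-16 stores as one surrogate pair.
        String emoji = "a\uD83D\uDE00";
        System.out.println(emoji.length());                                // 3 UTF-16 code units
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 5 UTF-8 bytes
        System.out.println(codePoints(emoji));                             // 2
        System.out.println(scalarValues(emoji));                           // 2

        // "a" followed by an unpaired high surrogate. This has no well-formed
        // UTF-8 encoding, so it could not be the value of a UTF-8 string shape.
        String lone = "a\uD800";
        System.out.println(codePoints(lone));                              // 2
        System.out.println(scalarValues(lone));                            // 1
    }
}
```

The second case is the one the new wording rules out: counting scalar values gives the same answer regardless of the string representation a given language happens to use.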

> Is it saying the same thing as what JSON Schema's length validator in Java already does (i.e., `value.codePointCount(0, value.length())`), which matches how API Gateway interprets length constraints too?

Yes.

Footnotes

  1. Whatever that may mean; in my opinion, the spec does not clearly explain what that statement actually entails.

@mtdowling mtdowling merged commit 9cd4e10 into smithy-lang:main Feb 21, 2022