-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make UTS 46 normalization non-experimental #4712
Conversation
* Bake ignored/disallow data into the normalization data after all. * Make public operations available via a dedicated wrapper type instead of the main normalizer types. Closes unicode-org#2850
The ICU PR is unicode-org/icu#2945 |
It seems that the remainin CI failures are due to the ICU4C part not yet contributing to the exported data, so it seems OK to start reviewing this. It doesn't make much sense to expose this API over FFI. If other languages want ICU4X-backed UTS 46, it makes sense to introduce FFI for https://github.com/hsivonen/rust-url/blob/icu4x/idna/src/uts46.rs . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
tables, | ||
None, | ||
0xC0, | ||
IgnorableBehavior::Unsupported, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: why is the default value for behavior of handling an ignored character not to ignore it (IgnorableBehavior::Ignored
)?
IgnorableBehavior::Ignored
is what UTS 46 would do by default, right?
Or do you default to Unsupported
because this is in the underlying Decomposition
struct, which can be used independently of UTS 46, and thus would want to be kept separate by default unless otherwise specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underlying struct can be used for non-UTS46 purposes: the usual normalizations. For those, we expose contiguous-buffer APIs, and the contiguous-buffer versions currently can't deal with ignorables, and since there is currently no use case for them to deal with ignorables, I think it's better to have a state that they can debug_assert!
against than to add support for ignorables is the contiguous-buffer entry points.
question + suggestion (optional): IIUC, at this point, UTS 46 is pretty much IDNA2008. Is that correct? If so, you could make a mention of that in the docs becuase UTS 46 covers not just IDNA2008 but also IDNA2003 and the transition rules between the two. If at this point, everyone has switched over to IDNA2008, including all major browsers, it would help the docs to add that extra specificity. |
No, UTS 46 non-transitional accepts a superset of what IDNA 2008 accepts and the major browsers use UTS 46 non-transitional specifically. (To use an example from the UTS 46 spec, switching to pure IDNA 2008 would break URLs like http://www.ÖBB.at/ .) Beyond the remark in the docs about the three major browser engines, I'd prefer to leave the characterization of the situation to the UTS 46 spec itself.
Thanks! Since unicode-org/icu#2945 looks ready to land, I'll open another ICU4C PR with a backport to the maintenance branch. Then we can make datagen over on the ICU4X side pull the updated export before merging this. |
Opened #4905 about using data exported with the backport of the ICU4C patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-approving after merge and clippy fix. Replying on @echeran's approval of business logic.
* Bake ignored/disallow data into the normalization data after all. * Make public operations available via a dedicated wrapper type instead of the main normalizer types. Closes unicode-org#2850
* Bake ignored/disallow data into the normalization data after all. * Make public operations available via a dedicated wrapper type instead of the main normalizer types. Closes #2850
Closes #2850