Petition the spec to consider non-ascii word boundaries #1225

Open
ShadowJonathan opened this issue Jun 24, 2022 · 5 comments

Comments

@ShadowJonathan
Member

I really don't like all of this stuff being defined on ASCII; it degrades the experience for non-English usage (especially where the language used doesn't use a variation of the Latin alphabet).

But that's what the spec says, and just deviating from it without trying to change it is also pretty bad and would be against existing policy.

Originally posted by @jplatte in #1224 (review)


I'm noting this in a separate issue to keep these thoughts organised, so they can be addressed at a later date.

@jplatte changed the title from "Petition the spec to consider non-ascii word boundries" to "Petition the spec to consider non-ascii word boundaries" on Jun 24, 2022
@jplatte
Member

jplatte commented Jun 24, 2022

Also worth noting that it's probably easy enough to make a Cargo feature for opting into this (currently not spec-compliant) behavior.
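
A minimal sketch of what such an opt-in could look like, assuming a hypothetical feature named `unicode-word-boundaries` declared under `[features]` in `Cargo.toml`. The feature name and the `is_word_char` helper are made up for illustration and are not part of the existing crate:

```rust
/// Returns true if `c` counts as a word character.
///
/// `unicode-word-boundaries` is a hypothetical feature name used only for
/// illustration. `char::is_alphanumeric` is a rough stand-in for a Unicode
/// word character, not a full UAX #29 implementation.
pub fn is_word_char(c: char) -> bool {
    if cfg!(feature = "unicode-word-boundaries") {
        // Opt-in (currently not spec-compliant): any Unicode letter or digit.
        c.is_alphanumeric() || c == '_'
    } else {
        // Default: the ASCII-only definition, [A-Za-z0-9_].
        c.is_ascii_alphanumeric() || c == '_'
    }
}
```

With the feature disabled, which would be the default, `is_word_char('é')` is false and the spec-compliant behavior is unchanged. Enabling the feature only widens what counts as a word character.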

@ShadowJonathan
Member Author

It's worth noting that Unicode defines word boundaries: https://unicode.org/reports/tr29/

@zecakeh
Contributor

zecakeh commented Jun 24, 2022

The regex crate also links to this section of the Unicode report on regular expressions: https://www.unicode.org/reports/tr18/#Compatibility_Properties

I think that to ask for this change in the spec, we have to link to a spec of some kind (like these from Unicode), instead of just saying "non-ASCII".

It should also be noted that the current Synapse implementation appears to already use Unicode word boundaries.
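
To make the difference concrete, here is a rough sketch that compares an ASCII-only and a Unicode-aware notion of "word character" when checking whether a pattern stands alone in a message. It is a simplification under stated assumptions: `matches_at_boundaries`, `ascii_word` and `unicode_word` are hypothetical helpers, `char::is_alphanumeric` only approximates a Unicode word character, and none of this is the actual Ruma or Synapse code or a UAX #29 implementation.

```rust
/// Returns true if `pattern` occurs in `text` with no word character directly
/// before or after it. `is_word` decides what counts as a word character.
fn matches_at_boundaries(text: &str, pattern: &str, is_word: fn(char) -> bool) -> bool {
    if pattern.is_empty() {
        return false;
    }
    text.match_indices(pattern).any(|(start, found)| {
        let end = start + found.len();
        // Neighbouring characters, if any, on each side of the match.
        let before = text[..start].chars().next_back();
        let after = text[end..].chars().next();
        !before.map_or(false, is_word) && !after.map_or(false, is_word)
    })
}

/// ASCII-only definition: [A-Za-z0-9_].
fn ascii_word(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Unicode-aware approximation: any Unicode letter or digit, or '_'.
fn unicode_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}
```

With these helpers, searching for `"рив"` in `"привет мир"` succeeds under `ascii_word`, because the surrounding Cyrillic letters are not ASCII word characters, but fails under `unicode_word`. Searching for the whole word `"привет"` succeeds under `unicode_word`, as expected.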

@ShadowJonathan
Member Author

Could you look at how long Synapse has used that word boundary definition? Then it might be eligible to be grandfathered into the spec, instead of requiring an MSC.

@zecakeh
Contributor

zecakeh commented Jun 24, 2022

It looks like this commit from January 5th made it support Unicode.

There was a comment saying it only supported ASCII before, although I'm not sure why, because the regex didn't change much between the two versions. Maybe it also changed with a Python version change?
