
Add utf8 support to tokenizer #945

Merged (3 commits) on Apr 26, 2022

Conversation

jgmdev
Member

jgmdev commented Apr 22, 2022

As reported by a user, @n4ts, on Discord, and discussed by other contributors, the tokenizer, which is based on Lua patterns, has issues with UTF-8 characters not getting matched, as shown in the screenshot below:

[screenshot: tokenizer failing to highlight UTF-8 characters]
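For illustration (this snippet is not from the PR): the failure comes down to Lua patterns being byte-based, so a class like %w only matches single-byte ASCII alphanumerics, while luautf8 provides UTF-8 aware pattern functions. The module name below is an assumption.

```lua
-- Lua patterns operate on bytes, so %w only matches single-byte ASCII
-- alphanumerics (in the default C locale); the two-byte "é" stops the match.
print(("café"):match("%w+"))       --> caf

-- luautf8's pattern functions are UTF-8 aware, so %w also covers
-- multi-byte alphanumeric code points.
local utf8 = require "lua-utf8"    -- module name is an assumption
print(utf8.match("café", "%w+"))   --> café
```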

This PR makes an initial integration of https://github.com/starwing/luautf8 (suggested by @TorchedSammy) into the tokenizer only, for discussion and testing; so far it seems to be working without breaking anything:

[screenshot: syntax highlighting correctly matching UTF-8 characters with luautf8 integrated]

Some of the points to cover are:

  • make use of the utf8 library in the tokenizer only?
  • replace the Lua string functions at startup with their utf8 counterparts and get UTF-8 support on everything for free? (a sketch of this option follows below)
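A hypothetical sketch of what the second option could look like (this is not code from the PR, and the module name is an assumption; note that wholesale replacement is risky, since the utf8 functions count code points where existing callers may expect byte offsets):

```lua
-- Hypothetical: shadow the byte-based string functions with luautf8's
-- counterparts at startup. Indices then mean code points, not bytes,
-- which is exactly why this option would need careful auditing.
local utf8 = require "lua-utf8"    -- module name is an assumption

for _, name in ipairs { "len", "sub", "find", "match", "gmatch",
                        "gsub", "byte", "char", "upper", "lower",
                        "rep", "reverse" } do
  if utf8[name] then string[name] = utf8[name] end
end

-- Every existing call site (and method call, via the shared string
-- metatable) now goes through the UTF-8 aware versions:
print(("héllo"):len())  --> 5 instead of 6
```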

@adamharrison
Member

I think this is important.

If we're going to support UTF-8 outside of text shaping, we should definitely do it in the tokenizer as well.

The issue, obviously, is that this is quite a lot of code. The problem is, there's not really a good alternative: it's not easy to fix the existing pattern-based system without leveraging something like this.

I feel like we could have some sort of rework of the tokenizer to fix the issue, but at this juncture, I don't know what that'd look like. It does need a rework anyway, though, in order to handle recursive languages better, and to speed it up a bit, but for now, this would get us what we need to work with non-ASCII languages.

So I'm not opposed to this as-is; though I would like us to keep the door open to doing a tokenizer rewrite at some point in the future. With that in mind, I'd love it if we could keep this library to use only in the tokenizer (unless there's a compelling need to use it elsewhere; as far as I know, most places don't really need it).

@jgmdev
Member Author

jgmdev commented Apr 24, 2022

I'm going to add some string.u* functions so it can be easily used, e.g.:

  • string.ulen()
  • string.ugmatch()
  • etc...

So one can do mystring:ulen() instead of utf8.len(mystring).
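A minimal sketch of how that could be wired up (my assumption, not the PR's actual code), relying on the fact that all Lua strings share one metatable whose __index is the string table:

```lua
-- Anything added to `string` becomes available as a method on every
-- string value through the shared metatable.
local utf8 = require "lua-utf8"    -- module name is an assumption

for name, fn in pairs(utf8) do
  if type(fn) == "function" then
    string["u" .. name] = fn       -- utf8.len becomes string.ulen, etc.
  end
end

print(("héllo"):ulen())  --> 5 code points (#"héllo" is 6 bytes)
```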

I introduced a UTF8String class as a wrapper, but I'm going to drop that in favor of the mentioned solution.

Also, it seems the utf8 lib has a char iterator (the utf8.next function, if I'm correct) that could potentially be faster than the common.utf8_chars iterator the core is using; when I get some time I'll try to write a benchmark to measure the difference.
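A rough sketch of such a benchmark (my own assumption of how it might look; utf8.codes stands in for whichever luautf8 iterator ends up fastest, and the gmatch pattern is, to the best of my knowledge, the one common.utf8_chars uses):

```lua
local utf8 = require "lua-utf8"    -- module name is an assumption

-- gmatch-based per-character iterator, as in lite-xl's core.common
local function utf8_chars(text)
  return text:gmatch("[\0-\x7f\xc2-\xfd][\x80-\xbf]*")
end

local s = string.rep("héllo wörld ", 50000)

local t0 = os.clock()
for _ in utf8_chars(s) do end
print(("gmatch iterator: %.3fs"):format(os.clock() - t0))

-- utf8.codes yields byte positions and code points rather than substrings
local t1 = os.clock()
for _ in utf8.codes(s) do end
print(("utf8.codes:      %.3fs"):format(os.clock() - t1))
```

Note the comparison isn't entirely apples-to-apples: the gmatch version allocates a substring per character, while utf8.codes yields integers, so the right choice depends on what the tokenizer actually needs per character.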

@adamharrison
Member

OK, I tested this, and it seems to work. I'll merge this now, as per jgm's request on Discord; but I just want to reiterate that I personally view this as a temporary solution until we get a proper tokenizer fix, as it is quite a lot of code to solve what is essentially the single issue that %w doesn't pick up on multi-byte characters in syntax files.
