
Add utf8 support to tokenizer #945

Merged (3 commits) on Apr 26, 2022

Conversation

jgmdev
Member

jgmdev commented Apr 22, 2022

As reported by a user, @n4ts, on Discord, and discussed by other contributors, the tokenizer, which is based on Lua patterns, has issues with UTF-8 characters not getting matched, as shown in the screenshot below:

[screenshot: tokenizer failing to highlight UTF-8 characters]
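For illustration (this snippet is not from the PR): the failure comes down to Lua patterns being byte-based, so a class like %w only matches single-byte ASCII alphanumerics, while luautf8 provides UTF-8 aware pattern functions. The module name below is an assumption.

```lua
-- Lua patterns operate on bytes, so %w only matches single-byte ASCII
-- alphanumerics (in the default C locale); the two-byte "é" stops the match.
print(("café"):match("%w+"))       --> caf

-- luautf8's pattern functions are UTF-8 aware, so %w also covers
-- multi-byte alphanumeric code points.
local utf8 = require "lua-utf8"    -- module name is an assumption
print(utf8.match("café", "%w+"))   --> café
```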

This PR makes an initial integration of https://github.com/starwing/luautf8 (suggested by @TorchedSammy) into the tokenizer only, for discussion and testing; so far it seems to be working without breaking anything:

[screenshot: syntax highlighting correctly matching UTF-8 characters with luautf8 integrated]

Some of the points to cover are:

  • make use of the utf8 library in the tokenizer only?
  • replace the Lua string functions at startup with their utf8 counterparts and get UTF-8 support on everything for free? (a sketch of this option follows below)
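A hypothetical sketch of what the second option could look like (this is not code from the PR, and the module name is an assumption; note that wholesale replacement is risky, since the utf8 functions count code points where existing callers may expect byte offsets):

```lua
-- Hypothetical: shadow the byte-based string functions with luautf8's
-- counterparts at startup. Indices then mean code points, not bytes,
-- which is exactly why this option would need careful auditing.
local utf8 = require "lua-utf8"    -- module name is an assumption

for _, name in ipairs { "len", "sub", "find", "match", "gmatch",
                        "gsub", "byte", "char", "upper", "lower",
                        "rep", "reverse" } do
  if utf8[name] then string[name] = utf8[name] end
end

-- Every existing call site (and method call, via the shared string
-- metatable) now goes through the UTF-8 aware versions:
print(("héllo"):len())  --> 5 instead of 6
```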

@adamharrison
Member

I think this is important.

If we're going to support UTF-8 outside of text shaping, we should definitely do it in the tokenizer as well.

The issue, obviously, is that this is quite a lot of code. The problem is, there's not really a good alternative: it's not easy to fix the existing pattern-based system without leveraging something like this.

I feel like we could have some sort of rework of the tokenizer to fix the issue, but at this juncture, I don't know what that'd look like. It does need a rework anyway, though, in order to handle recursive languages better, and to speed it up a bit, but for now, this would get us what we need to work with non-ASCII languages.

So I'm not opposed to this as-is; though I would like us to keep the door open to doing a tokenizer rewrite at some point in the future. With that in mind, I'd love it if we could keep this library to use only in the tokenizer (unless there's a compelling need to use it elsewhere; as far as I know, most places don't really need it).

@jgmdev
Member Author

jgmdev commented Apr 24, 2022

I'm going to add some string.u* functions so it can be easily used, e.g.:

  • string.ulen()
  • string.ugmatch()
  • etc...

So one can do mystring:ulen() instead of utf8.len(mystring).
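A minimal sketch of how that could be wired up (my assumption, not the PR's actual code), relying on the fact that all Lua strings share one metatable whose __index is the string table:

```lua
-- Anything added to `string` becomes available as a method on every
-- string value through the shared metatable.
local utf8 = require "lua-utf8"    -- module name is an assumption

for name, fn in pairs(utf8) do
  if type(fn) == "function" then
    string["u" .. name] = fn       -- utf8.len becomes string.ulen, etc.
  end
end

print(("héllo"):ulen())  --> 5 code points (#"héllo" is 6 bytes)
```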

I introduced a UTF8String class as a wrapper, but I'm going to drop that in favor of the mentioned solution.

Also, it seems the utf8 lib has a char iterator (the utf8.next function, if I'm correct) that could potentially be faster than the common.utf8_chars iterator the core is using; when I get some time I'll try to write a benchmark to measure the difference.
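A rough sketch of such a benchmark (my own assumption of how it might look; utf8.codes stands in for whichever luautf8 iterator ends up fastest, and the gmatch pattern is, to the best of my knowledge, the one common.utf8_chars uses):

```lua
local utf8 = require "lua-utf8"    -- module name is an assumption

-- gmatch-based per-character iterator, as in lite-xl's core.common
local function utf8_chars(text)
  return text:gmatch("[\0-\x7f\xc2-\xfd][\x80-\xbf]*")
end

local s = string.rep("héllo wörld ", 50000)

local t0 = os.clock()
for _ in utf8_chars(s) do end
print(("gmatch iterator: %.3fs"):format(os.clock() - t0))

-- utf8.codes yields byte positions and code points rather than substrings
local t1 = os.clock()
for _ in utf8.codes(s) do end
print(("utf8.codes:      %.3fs"):format(os.clock() - t1))
```

Note the comparison isn't entirely apples-to-apples: the gmatch version allocates a substring per character, while utf8.codes yields integers, so the right choice depends on what the tokenizer actually needs per character.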

@adamharrison
Member

OK, I tested this, and it seems to work. I'll merge this now, as per jgm's request on Discord; but I just want to reiterate that I personally view this as a temporary solution until we get a proper tokenizer fix, as it is quite a lot of code to solve what is essentially the single issue that %w doesn't pick up on multi-byte characters in syntax files.
