
multi-language tokenization #8

Open
inboxsphere opened this issue Nov 17, 2024 · 7 comments

@inboxsphere

I need to index documents in multiple languages, such as English, German, Russian, Japanese, Korean, and Chinese. May I ask if these languages are currently supported? Does the system support n-gram tokenization? I hope to enable language-independent search for any UTF-8 string. Is this currently supported?

@wolfgarbe
Member

English, German, and Russian are supported.
Japanese, Korean, and Chinese are currently supported only if both documents and queries are pre-tokenized in a pre-processing step by a tokenizer like https://github.com/messense/jieba-rs.
Integrating CJK word segmentation into SeekStorm's tokenizer is on our roadmap: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#roadmap
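For reference, a minimal pre-processing sketch using jieba-rs; joining the segments with spaces before handing the text to the indexer is an assumption for illustration, not something SeekStorm prescribes:

```rust
// Minimal pre-tokenization sketch: segment Chinese text with jieba-rs and
// re-join the segments with spaces so a whitespace/word-based tokenizer can
// index them. Apply the same pre-processing to documents and to queries.
use jieba_rs::Jieba;

fn pre_tokenize(jieba: &Jieba, text: &str) -> String {
    // `cut` returns the segmented words; `false` disables the HMM model
    // for out-of-vocabulary words.
    jieba.cut(text, false).join(" ")
}

fn main() {
    let jieba = Jieba::new();
    let doc = "自然语言处理很有趣 and mixed English text";
    println!("{}", pre_tokenize(&jieba, doc));
}
```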

@wolfgarbe
Member

SeekStorm v0.11.0 has been released. The new tokenizer UnicodeAlphanumericZH implements Chinese word segmentation.

@inboxsphere
Author

@wolfgarbe Cool, and are there any plans to add an n-gram tokenizer so that segmentation can be performed on any text?

@wolfgarbe
Member

wolfgarbe commented Nov 30, 2024

@inboxsphere What would be your use case? Prefix/substring search? Or something else? For word segmentation, a specialized word-segmentation algorithm is more efficient than n-gram tokenization.
I'm afraid the index size would explode if we indexed all possible n-grams of every word in a document. Would limiting the maximum n-gram length, or some other reduction, make sense?
See also: https://bigdataboutique.com/blog/dont-use-n-gram-in-elasticsearch-and-opensearch-6f0b48
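To put rough numbers on that: a word of L characters yields L - n + 1 character n-grams for each n-gram length n, so indexing every length multiplies the postings per word. An illustrative sketch (not SeekStorm code):

```rust
// Illustrative only: count the character n-grams produced for one word when
// every n-gram length from `min_n` to `max_n` is indexed.
fn char_ngrams(word: &str, min_n: usize, max_n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut grams = Vec::new();
    for n in min_n..=max_n.min(chars.len()) {
        for window in chars.windows(n) {
            grams.push(window.iter().collect());
        }
    }
    grams
}

fn main() {
    // "tokenization" has 12 characters: 11 bigrams + 10 trigrams + ... + 1
    // twelve-gram = 66 postings for a single word, versus 1 for the word itself.
    println!("{}", char_ngrams("tokenization", 2, 12).len());
}
```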

@inboxsphere
Author

@wolfgarbe We will index some relatively short fields that may contain multiple languages. The languages are unknown in advance, so we are not sure which tokenizer to use and hope it can adapt automatically. We are considering n-grams for indexing, but I'm wondering whether there is a better approach.

@wolfgarbe
Member

@inboxsphere I see. The Chinese tokenizer (UnicodeAlphanumericZH) already handles mixed Chinese/Latin text. We could extend it so that when unknown (not in the Chinese dictionary), non-Latin words (from different Unicode code blocks) are detected, we fall back to tokenizing only those words with n-grams. That limits n-gram generation to the instances where it is actually needed.

N-gram tokenizing not only increases the index size but also hurts query latency. Posting lists become longer for n-grams, and we need to intersect (AND) more n-grams than we would have with words.
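A rough sketch of that fallback idea; the `is_known_word` dictionary check, the n-gram length, and the lowercasing are placeholders for illustration and not part of SeekStorm's API:

```rust
// Sketch of the proposed fallback: keep ordinary word tokens for Latin/ASCII
// text and for words found in the segmentation dictionary, and emit character
// n-grams only for the remaining unknown, non-Latin tokens.
fn is_known_word(_token: &str) -> bool {
    // Placeholder: would consult the Chinese segmentation dictionary.
    false
}

fn tokenize_with_fallback(text: &str, n: usize) -> Vec<String> {
    let mut tokens = Vec::new();
    for raw in text.split_whitespace() {
        let latin = raw.chars().all(|c| c.is_ascii_alphanumeric());
        if latin || is_known_word(raw) {
            tokens.push(raw.to_lowercase());
            continue;
        }
        // Fallback: character n-grams only for this unknown token.
        let chars: Vec<char> = raw.chars().collect();
        if chars.len() < n {
            tokens.push(raw.to_string());
        } else {
            for window in chars.windows(n) {
                tokens.push(window.iter().collect());
            }
        }
    }
    tokens
}

fn main() {
    // Korean is not in a Chinese dictionary, so only those tokens get n-grammed.
    println!("{:?}", tokenize_with_fallback("SeekStorm 형태소 분석", 2));
}
```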

@inboxsphere
Author

Got it, looking forward to it.
