multi-language tokenization #8
English, German, and Russian are supported.
SeekStorm v0.11.0 has been released. The new tokenizer UnicodeAlphanumericZH implements Chinese word segmentation.
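For context, here is a minimal sketch (not SeekStorm's actual implementation) of what dictionary-based Chinese word segmentation does for mixed Chinese/Latin text: forward maximum matching against a dictionary, with ASCII runs kept as whole tokens. The toy dictionary and the maximum word length are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Forward-maximum-matching segmenter sketch: at each position, take the
/// longest dictionary entry; otherwise emit a single character.
fn segment(text: &str, dictionary: &HashSet<&str>, max_word_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        // Keep Latin/alphanumeric runs as single tokens.
        if c.is_ascii_alphanumeric() {
            let start = i;
            while i < chars.len() && chars[i].is_ascii_alphanumeric() {
                i += 1;
            }
            tokens.push(chars[start..i].iter().collect());
            continue;
        }
        // Skip whitespace and ASCII punctuation.
        if c.is_whitespace() || c.is_ascii_punctuation() {
            i += 1;
            continue;
        }
        // Dictionary lookup: try the longest candidate first, fall back to a single character.
        let mut matched = 1;
        for len in (1..=max_word_len.min(chars.len() - i)).rev() {
            let candidate: String = chars[i..i + len].iter().collect();
            if dictionary.contains(candidate.as_str()) {
                matched = len;
                break;
            }
        }
        tokens.push(chars[i..i + matched].iter().collect());
        i += matched;
    }
    tokens
}

fn main() {
    // Hypothetical dictionary entries, for illustration only.
    let dictionary: HashSet<&str> = ["自然", "语言", "处理"].into_iter().collect();
    let tokens = segment("Rust 自然语言处理", &dictionary, 4);
    assert_eq!(tokens, vec!["Rust", "自然", "语言", "处理"]);
}
```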
@wolfgarbe Cool, and are there any plans to add an n-gram tokenizer so that segmentation can be performed on any text?
@inboxsphere What would be your use case? Prefix/substring search? Or something else? For word segmentation, a specialized word segmentation algorithm is more efficient than n-gram tokenizing.
@wolfgarbe We will index some relatively short fields, which may contain multiple languages. The languages are unknown in advance, so we are not sure which tokenizer to use, and we would like it to adapt automatically. We are considering n-grams for indexing, but I'm wondering if there is a better approach.
@inboxsphere I see. The Chinese tokenizer (UnicodeAlphanumericZH) already handles mixed Chinese/Latin text. We could extend it so that when unknown words (not in the Chinese dictionary) from non-Latin Unicode code blocks are detected, we fall back to tokenizing only those unknown words with n-grams. This limits n-gram generation to the cases where it is actually needed. N-gram tokenizing not only increases the index size but also hurts query latency: posting lists become longer for n-grams, and we need to intersect (AND) more n-grams than we would have had to with words.
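A hedged sketch of that fallback idea (not SeekStorm code): keep Latin or dictionary-known words whole, and break only unknown non-Latin words into character bigrams. The whitespace-based word splitting, the bigram size, and the function names are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Emit overlapping character n-grams for a run of text; short runs stay whole.
fn ngrams(run: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = run.chars().collect();
    if chars.len() <= n {
        return vec![chars.iter().collect()];
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Fallback tokenizer sketch: whole tokens for Latin/known words,
/// character bigrams only for unknown non-Latin words.
fn tokenize_with_fallback(text: &str, known: &HashSet<&str>) -> Vec<String> {
    let mut tokens = Vec::new();
    for word in text.split_whitespace() {
        if word.is_ascii() || known.contains(word) {
            tokens.push(word.to_string());
        } else {
            tokens.extend(ngrams(word, 2));
        }
    }
    tokens
}

fn main() {
    // Empty dictionary, so every non-Latin word counts as unknown.
    let known: HashSet<&str> = HashSet::new();
    // "Tokyo" stays a single token; the unknown "東京都庁" becomes overlapping bigrams.
    let tokens = tokenize_with_fallback("Tokyo 東京都庁", &known);
    assert_eq!(tokens, vec!["Tokyo", "東京", "京都", "都庁"]);
}
```

At query time the same fallback would have to be applied to the query terms, and, as noted above, each n-gram has a longer posting list than a whole word, so more and longer lists must be intersected.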
Got it, and looking forward to it.
I need to index documents in multiple languages, such as English, German, Russian, Japanese, Korean, and Chinese. May I ask if these languages are currently supported? Does the system support n-gram tokenization? I hope to enable language-independent search for any UTF-8 string. Is this currently supported?