
multi-language tokenization #8

Open
inboxsphere opened this issue Nov 17, 2024 · 7 comments

@inboxsphere

I need to index documents in multiple languages, such as English, German, Russian, Japanese, Korean, and Chinese. May I ask if these languages are currently supported? Does the system support n-gram tokenization? I hope to enable language-independent search for any UTF-8 string. Is this currently supported?

@wolfgarbe
Member

English, German, and Russian are supported.
Japanese, Korean, and Chinese are currently supported only if both documents and queries are pre-tokenized in a pre-processing step by a tokenizer like https://github.com/messense/jieba-rs.
Integrating CJK word segmentation into SeekStorm's tokenizer is on our roadmap: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#roadmap
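For reference, a minimal pre-processing sketch using jieba-rs; joining the segments with spaces before handing the text to the indexer is an assumption for illustration, not something SeekStorm prescribes:

```rust
// Minimal pre-tokenization sketch: segment Chinese text with jieba-rs and
// re-join the segments with spaces so a whitespace/word-based tokenizer can
// index them. Apply the same pre-processing to documents and to queries.
use jieba_rs::Jieba;

fn pre_tokenize(jieba: &Jieba, text: &str) -> String {
    // `cut` returns the segmented words; `false` disables the HMM model
    // for out-of-vocabulary words.
    jieba.cut(text, false).join(" ")
}

fn main() {
    let jieba = Jieba::new();
    let doc = "自然语言处理很有趣 and mixed English text";
    println!("{}", pre_tokenize(&jieba, doc));
}
```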

@wolfgarbe
Member

SeekStorm v0.11.0 has been released. The new tokenizer UnicodeAlphanumericZH implements Chinese word segmentation.

@inboxsphere
Author

@wolfgarbe Cool, and are there any plans to add an n-gram tokenizer so that segmentation can be performed on any text?

@wolfgarbe
Member

wolfgarbe commented Nov 30, 2024

@inboxsphere What would be your use case? Prefix/substring search? Or something else? For word segmentation, a specialized word-segmentation algorithm is more efficient than n-gram tokenization.
I'm afraid the index size would explode if we indexed all possible n-grams of every word in a document. Would limiting the maximum n-gram length, or some other reduction, make sense?
See also: https://bigdataboutique.com/blog/dont-use-n-gram-in-elasticsearch-and-opensearch-6f0b48
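To put rough numbers on that: a word of L characters yields L - n + 1 character n-grams for each n-gram length n, so indexing every length multiplies the postings per word. An illustrative sketch (not SeekStorm code):

```rust
// Illustrative only: count the character n-grams produced for one word when
// every n-gram length from `min_n` to `max_n` is indexed.
fn char_ngrams(word: &str, min_n: usize, max_n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut grams = Vec::new();
    for n in min_n..=max_n.min(chars.len()) {
        for window in chars.windows(n) {
            grams.push(window.iter().collect());
        }
    }
    grams
}

fn main() {
    // "tokenization" has 12 characters: 11 bigrams + 10 trigrams + ... + 1
    // twelve-gram = 66 postings for a single word, versus 1 for the word itself.
    println!("{}", char_ngrams("tokenization", 2, 12).len());
}
```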

@inboxsphere
Author

@wolfgarbe We will index some relatively short fields that may contain multiple languages. The languages are unknown in advance, so we are not sure which tokenizer to use and hope it can adapt automatically. We are considering n-grams for indexing, but I'm wondering whether there is a better approach.

@wolfgarbe
Member

@inboxsphere I see. The Chinese tokenizer (UnicodeAlphanumericZH) already handles mixed Chinese/Latin text. We could extend it so that when unknown (not in the Chinese dictionary), non-Latin words (from different Unicode code blocks) are detected, we fall back to tokenizing only those words with n-grams. That limits n-gram generation to the instances where it is actually needed.

N-gram tokenizing not only increases the index size but also hurts query latency. Posting lists become longer for n-grams, and we need to intersect (AND) more n-grams than we would have with words.
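A rough sketch of that fallback idea; the `is_known_word` dictionary check, the n-gram length, and the lowercasing are placeholders for illustration and not part of SeekStorm's API:

```rust
// Sketch of the proposed fallback: keep ordinary word tokens for Latin/ASCII
// text and for words found in the segmentation dictionary, and emit character
// n-grams only for the remaining unknown, non-Latin tokens.
fn is_known_word(_token: &str) -> bool {
    // Placeholder: would consult the Chinese segmentation dictionary.
    false
}

fn tokenize_with_fallback(text: &str, n: usize) -> Vec<String> {
    let mut tokens = Vec::new();
    for raw in text.split_whitespace() {
        let latin = raw.chars().all(|c| c.is_ascii_alphanumeric());
        if latin || is_known_word(raw) {
            tokens.push(raw.to_lowercase());
            continue;
        }
        // Fallback: character n-grams only for this unknown token.
        let chars: Vec<char> = raw.chars().collect();
        if chars.len() < n {
            tokens.push(raw.to_string());
        } else {
            for window in chars.windows(n) {
                tokens.push(window.iter().collect());
            }
        }
    }
    tokens
}

fn main() {
    // Korean is not in a Chinese dictionary, so only those tokens get n-grammed.
    println!("{:?}", tokenize_with_fallback("SeekStorm 형태소 분석", 2));
}
```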

@inboxsphere
Author

Got it, looking forward to it.
