
LUCENE-9825: Hunspell: reverse the "words" trie for faster word lookup/suggestions #2459

Closed
wants to merge 8 commits

Conversation

@donnerpeter (Contributor) commented Mar 5, 2021

Description

FST lookup and especially traversal are too slow

Solution

Replace the FST with a similarly shaped trie-like data structure, but with nodes referring to their parents instead of their children, plus a hash map into its leaves. I suspect such a data structure has a name, but I couldn't find it.
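To illustrate the idea, here is a minimal sketch of such a structure. The class and field names are hypothetical (this is not the PR's WordStorage code), and it omits what the real implementation does for compactness: flattening nodes into a byte array and sharing common prefixes between consecutive sorted entries. It shows the two key points: nodes hold only parent links, and a hash map keyed by the word's hash jumps straight to a leaf, from which membership is verified by comparing characters last-to-first.

```java
import java.util.HashMap;
import java.util.Map;

class ReverseTrie {
  static final class Node {
    final Node parent; // null marks the start of a word
    final char ch;     // character on the edge from the parent
    Node nextLeaf;     // chain of leaves whose words hash to the same value
    Node(Node parent, char ch) { this.parent = parent; this.ch = ch; }
  }

  private final Map<Integer, Node> leaves = new HashMap<>();

  /** Adds a non-empty word; each node refers only to its parent. */
  void add(String word) {
    Node node = null;
    for (int i = 0; i < word.length(); i++) {
      node = new Node(node, word.charAt(i)); // parent link only, no child pointers
    }
    int hash = word.hashCode();
    node.nextLeaf = leaves.get(hash); // prepend to the collision chain
    leaves.put(hash, node);
  }

  boolean contains(String word) {
    if (word.isEmpty()) return false;
    for (Node leaf = leaves.get(word.hashCode()); leaf != null; leaf = leaf.nextLeaf) {
      Node n = leaf;
      int i = word.length() - 1;
      // walk parent links, comparing the query last char first
      while (n != null && i >= 0 && n.ch == word.charAt(i)) {
        n = n.parent;
        i--;
      }
      if (n == null && i < 0) return true; // both exhausted: exact match
    }
    return false;
  }
}
```

Unlike an FST, lookup never scans sibling arcs: it is one hash probe plus one backward walk per colliding leaf, which is what makes it fast at the price of extra memory.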

Tests

No new tests, but this makes some notable performance-related changes in existing ones:
  • ~25% speedup in TestAllDictionaries: no word FST compilation during dictionary loading now
  • ~35% speedup in TestPerformance: lookup/enumeration have become faster; stemming/spellchecking/suggestions are affected to varying degrees, e.g. 45-50% speedup for English
But this comes at a cost: ~70% more memory on average, as printed by TestAllDictionaries on 178 wooorm/libre dictionaries.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

/** A Bloom filter over {@link #words} to avoid unnecessary expensive FST traversals */
FixedBitSet wordHashes;
/** The entries in the .dic file, mapping to their set of flags */
WordStorage words;
@donnerpeter (author):

Moved the word information into a dedicated class

if (fst == null) {
return null;
}
private IntsRef lookup(FST<IntsRef> fst, char[] word) {
@donnerpeter (author):

Now queries are always on the entire array, and the FST is non-null. This code is called only from a single test class and might deserve moving there.

@@ -1275,76 +1258,6 @@ boolean isDotICaseChangeDisallowed(char[] word) {
return word[0] == 'İ' && !alternateCasing;
}

private class EntryGrouper {
@donnerpeter (author):

The same-word-grouping logic in this class has been moved to WordStorage.Builder and extended to also build the reverse trie.

@@ -234,6 +234,8 @@ private void tryLongSwap(String word) {
}

private void tryRemovingChar(String word) {
if (word.length() == 1) return;
@donnerpeter (author):

We don't accept empty words for lookup anymore

@@ -94,6 +94,10 @@ public Stemmer(Dictionary dictionary) {
}

List<CharsRef> list = new ArrayList<>();
if (length == 0) {
@donnerpeter (author):

We don't accept empty words for lookup anymore, again

Reviewer (Member):

LOL empty string.

@rmuir (Member) commented Mar 5, 2021

I like the idea: optimize for the operations Hunspell needs to do. Do you happen to have any more details on the memory usage increase (e.g. by language)? I imagine things are OK, since there is still prefix compression, but I just want to confirm nothing really explodes in a bad way.

@donnerpeter (author) commented Mar 5, 2021

The largest dictionaries in my collection:
el_GR: 1.7 MB -> 11.1 MB
gd: 14.5 MB -> 18.3 MB
fy: 36.7 MB -> 39.3 MB
rw: remains 18.8 MB (because it's all affixes, not words)

The most common (probably):
en_US: 409 KB -> 712 KB
de_DE: 2.5 MB -> 4.7 MB
fr: 830 KB -> 1.4 MB

Total for 178 dictionaries (some of them duplicate): 274 MB -> 463 MB

So it's increasing, but I wouldn't call it an explosion. Greek seems to grow the most, but for me that'd be acceptable. What about you?

@rmuir (Member) commented Mar 5, 2021

The Greek dictionary has 828,806 words, so the hash table alone is 3.2 MB (828,806 entries × 4-byte ints), right?

As far as encoding of the actual words goes (n * VINT: the word form data), maybe it is possible to explore n * ZINT, where you encode deltas from the previously encoded character? That way, writing systems like Greek and Russian would typically use only a single byte per char, because of how they are organized in Unicode blocks. See {{readZInt()}} for more information.
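A sketch of that delta idea (hypothetical class, not Lucene's actual VInt/ZInt code): each character is written as a zigzag-encoded varint delta from the previous character, so scripts clustered in one Unicode block cost mostly one byte per character after the first.

```java
import java.io.ByteArrayOutputStream;

class DeltaCharCodec {
  // zigzag maps small signed deltas to small unsigned ints: 0,-1,1,-2,... -> 0,1,2,3,...
  static int zigzag(int v)   { return (v << 1) ^ (v >> 31); }
  static int unzigzag(int v) { return (v >>> 1) ^ -(v & 1); }

  static byte[] encode(String word) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int i = 0; i < word.length(); i++) {
      int z = zigzag(word.charAt(i) - prev);
      prev = word.charAt(i);
      while ((z & ~0x7F) != 0) { // standard varint: 7 bits per byte, high bit = "more"
        out.write((z & 0x7F) | 0x80);
        z >>>= 7;
      }
      out.write(z);
    }
    return out.toByteArray();
  }

  static String decode(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    int prev = 0, pos = 0;
    while (pos < bytes.length) {
      int z = 0, shift = 0, b;
      do {
        b = bytes[pos++] & 0xFF;
        z |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += unzigzag(z);
      sb.append((char) prev);
    }
    return sb.toString();
  }
}
```

For a Greek word like "αβγ" (U+03B1..U+03B3) the first delta needs two bytes, but every following in-block character needs just one, versus two bytes per character with plain per-char VInts.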

@rmuir (Member) commented Mar 5, 2021

For Greek, if you analyze the distribution of the dictionary (I use https://scripts.sil.org/UnicodeCharacterCount ), you can see that the smallest character in the whole dictionary is 0x386 (decimal 902) and the largest is 0x3CE (decimal 974). So, even simpler, you could exploit that and encode a single int base = 0x386; // smallest char in use for this whole dictionary, and all characters would be single-byte encoded.
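That even simpler per-dictionary base scheme could look like this (hypothetical class; the range check is an assumption about how out-of-range chars would be handled). Any alphabet spanning fewer than 256 code points, as Greek does here (0x386..0x3CE), then encodes to exactly one byte per character.

```java
class BaseCharCodec {
  private final char base; // smallest char in use for the whole dictionary

  BaseCharCodec(char base) { this.base = base; }

  byte[] encode(String word) {
    byte[] out = new byte[word.length()];
    for (int i = 0; i < word.length(); i++) {
      int offset = word.charAt(i) - base;
      if (offset < 0 || offset > 0xFF) {
        throw new IllegalArgumentException("char outside the single-byte range");
      }
      out[i] = (byte) offset; // one byte per character
    }
    return out;
  }

  String decode(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length);
    for (byte b : bytes) sb.append((char) (base + (b & 0xFF)));
    return sb.toString();
  }
}
```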

@rmuir (Member) commented Mar 5, 2021

A lot of these dictionaries are encoded in legacy charsets, which explains why they conform so well to their Unicode block, e.g. zero stray ASCII or Chinese or whatever in the Greek one...

@donnerpeter (author):
A great idea about the base char! It's much simpler than what I had in mind if Greek becomes a problem :) Should I do this in this PR or in another one?

@donnerpeter (author):
I've found a regression not covered by unit tests, will fix it (probably on Monday), please don't merge yet :)

@rmuir (Member) commented Mar 5, 2021

We can definitely do this stuff in another PR! I'm just throwing out wacky ideas to try to help out.

@mikemccand (Member) left a comment:

This looks awesome!

Many of my comments are confused because I did not really understand the change and was commenting too soon ;) Sorry.

And none of my comments should block pushing this -- this looks like major performance improvement, and our tests should catch any regressions here. Progress not perfection!

This looks like a faster / less compact Map<char[],int[]> than FST, and looks maybe loosely inspired by FST, except that 1) it looks up entries in reverse order (last char first), 2) it has a hash map to jump directly to the tail of each entry, and 3) it does not share the suffixes of all words, so there is no RAM-hungry nodeHash (but it does share common prefixes).

I would love to see this (later) factored out as its own dedicated data structure under oal.util for a nice alternative to FST.

while (true) {
in.setPosition(pos);
char c = (char) in.readVInt();
int prevPos = pos - in.readVInt();
Reviewer (Member):

You might want to fail fast(er) here by checking if (c != lastChar && !collision) { return null; }? Might give a small speedup for words not in the dictionary; though that is likely the rare case (most lookups exist, or share a suffix that exists)?

@donnerpeter (author), Mar 8, 2021:

Thanks, but I've tried that and it doesn't seem to make any difference to performance.

skipVInts(in, formLength);
}

collision = in.readByte() == 1;
Reviewer (Member):

Instead of burning a whole byte to represent "next entry still has a collision", you could steal one bit from the delta pos you read next? It'd make this a bit more compact, and would be consistent with the up-front pos < 0 collision check, which is single-bit encoding.
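The bit-stealing suggestion amounts to something like this (hypothetical helper, not the PR's code): fold the one-bit collision flag into the low bit of the delta before VInt-encoding it, instead of spending a separate byte.

```java
class PackedDelta {
  /** Packs a non-negative position delta and a collision flag into one int. */
  static int pack(int delta, boolean collision) {
    return (delta << 1) | (collision ? 1 : 0);
  }

  /** Recovers the position delta (the upper bits). */
  static int delta(int packed) { return packed >>> 1; }

  /** Recovers the collision flag (the stolen low bit). */
  static boolean collision(int packed) { return (packed & 1) != 0; }
}
```

The cost is one extra bit in the VInt: deltas up to 63 (rather than 127) still fit in a single encoded byte.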

@donnerpeter (author):

(delta is a great word, I'll use it, thanks!)

Oh, appending the lowest bit to the delta looks nice, I hadn't thought of that! However, I have another idea for using that byte: storing the word length, which might speed up lookup and especially enumeration. I plan to test whether this is true, and if not, I'll append the bit.

Reviewer (Member):

Good idea (storing word length) -- that'd make lookups (that hashed to the same mod location but have different lengths) fail faster.

chars.chars[wordStart] = (char) in.readVInt();
int prevPos = pos - in.readVInt();

int dataLength = in.readVInt();
Reviewer (Member):

If you are willing to say "no more than 256 forms" you could make this a .readByte() which goes a bit faster than .readVInt, maybe.

@donnerpeter (author):

Unfortunately some crazy dictionaries have numerous homonyms, like this:

Alberto po:antropónimo [n-grama: Alberto Arnoldi]
Alberto po:antropónimo [n-grama: Alberto Avendaño Prieto]
Alberto po:antropónimo [n-grama: Alberto Bazzoni]
Alberto po:antropónimo [n-grama: Alberto Benito]
Alberto po:antropónimo [n-grama: Alberto Burri]
...

The funny thing is that we don't even store/support this custom n-grama thing. But Hunspell stores it, so we might, too, in the future. So I'm not sure we should collapse the duplicates just for this case right now. Your opinion?

}
}

int lastPos = commonPrefixPos;
Reviewer (Member):

Maybe add // write new suffix for this entry comment?

@donnerpeter (author):

Added something else :)

}

@Test
public void en_suggest() throws Exception {
checkSuggestionPerformance("en", 1_200);
checkSuggestionPerformance("en", 3_000);
Reviewer (Member):

Hmm what are these magical constant numbers? And why does this change mean they should increase?

@donnerpeter (author), Mar 8, 2021:

This is the number of words to check in the test, chosen so that its duration is noticeable enough for profiling, but not so long that I get bored when running it locally :) The change increases the speed of suggestions, so the test may now check more words in about the same time.

@donnerpeter (author):

I would love to see this (later) factored out as its own dedicated data structure under oal.util for a nice alternative to FST.

@mikemccand It's certainly possible (especially if we come up with a nice name; so far the best I have is Reverse(d)Trie or PrefixSharingStringMap), but I'm not sure that'd be useful to anyone else, as it's quite specialized to Hunspell scenarios (and might become even more so with further optimizations). Do you have candidates in mind for using this structure?

@dweiss (Contributor) commented Mar 8, 2021

bq. Progress not perfection!

I hate that quote, Mike. :)

@donnerpeter (author):

For greek, if you analyze the distribution of dictionary (I use https://scripts.sil.org/UnicodeCharacterCount ), you can see that smallest character in the whole dictionary is 0x386 (decimal 902) and largest is 0x3CE (decimal 974). So, even simpler, you could exploit that and encode a single int base = 0x386; // smallest char in use for this whole dictionary and all characters would be single-byte encoded.

@rmuir This seems to bring memory usage down a bit, but not as much as I'd hoped (11.1 -> 9.4 MB for Greek, 463 -> 458 MB total). As it also complicates the code, I'd leave this out, at least for now.

@donnerpeter (author):

I've fixed the encountered bugs, added regression tests for them, and applied most of these great suggestions, thanks!

@donnerpeter (author):

Hi, any chance to get this merged before the Great Repository Split?

@dweiss (Contributor) commented Mar 9, 2021

Can I use this PR as an example of how to port a PR to the new repo? ;)

@donnerpeter (author):

OK :)

@mikemccand (Member):

bq. Progress not perfection!

I hate that quote, Mike. :)

Wow, why? :)

@mikemccand (Member):

Do you have candidates in mind for using this structure?

Well, maybe anywhere we use a UTF16 based FST today, and we might want faster lookup and can afford more RAM used ... e.g. maybe SynonymGraphFilter?

Anyway this can come later :)

@mikemccand (Member) left a comment:

This looks great! I think we should push it, unless @dweiss really struggles to find another guinea pig PR for migrating a patch across the upcoming fork :)

@rmuir (Member) commented Mar 9, 2021

Do you have candidates in mind for using this structure?

Well, maybe anywhere we use a UTF16 based FST today, and we might want faster lookup and can afford more RAM used ... e.g. maybe SynonymGraphFilter?

Anyway this can come later :)

Maybe it is interesting for use cases like kuromoji (Japanese) and nori (Korean), which also use FST today (sometimes with hacky root-arc caching to improve performance). They have a use case similar to the decompounder+stemmer as far as what they are doing.

@dweiss (Contributor) commented Mar 10, 2021

bq. Wow, why? :)

Because I'm a perfectionist and it goes against my nature to show something I don't think is ideal? :) I know, it's bad. I'm trying to fight it, but it's not easy. :)

@dweiss (Contributor) commented Mar 10, 2021

Closed, ported PR: apache/lucene#2

@dweiss dweiss closed this Mar 10, 2021
@mikemccand (Member):

bq. Wow, why? :)

Because I'm a perfectionist and it goes against my nature to show something I don't think is ideal? :) I know, it's bad. I'm trying to fight it, but it's not easy. :)

LOL, OK I see, that's fair. It's great to strive to be a perfectionist! You get even more progress this way! Maybe we need dedicated psychiatrists (psychologists?) who know how to help us with the unique struggles of open-source software!

@dweiss (Contributor) commented Mar 10, 2021

All I need is you to remind me about that perfection vs. progress thing... :)

@msokolov (Contributor) commented Mar 11, 2021 via email
