-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9825: Hunspell: reverse the "words" trie for faster word lookup/suggestions #2459
Conversation
/** A Bloom filter over {@link #words} to avoid unnecessary expensive FST traversals */ | ||
FixedBitSet wordHashes; | ||
/** The entries in the .dic file, mapping to their set of flags */ | ||
WordStorage words; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the word information into a dedicated class
if (fst == null) { | ||
return null; | ||
} | ||
private IntsRef lookup(FST<IntsRef> fst, char[] word) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now queries are always on the entire array, and the fst is not-null. This code is called only from a single test class and might deserve moving there.
@@ -1275,76 +1258,6 @@ boolean isDotICaseChangeDisallowed(char[] word) { | |||
return word[0] == 'İ' && !alternateCasing; | |||
} | |||
|
|||
private class EntryGrouper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same-word-grouping logic in this class has been moved to WordStorage.Builder
, and appended with building the reverse trie
@@ -234,6 +234,8 @@ private void tryLongSwap(String word) { | |||
} | |||
|
|||
private void tryRemovingChar(String word) { | |||
if (word.length() == 1) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't accept empty words for lookup anymore
@@ -94,6 +94,10 @@ public Stemmer(Dictionary dictionary) { | |||
} | |||
|
|||
List<CharsRef> list = new ArrayList<>(); | |||
if (length == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't accept empty words for lookup anymore, again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOL empty string.
I like the idea, optimize for the operations hunspell needs to do. do you happen to have any more details on the memory usage increase (e.g. by language)? I imagine things are ok, since there is still prefix compression, but just want to confirm nothing really explodes in a bad way. |
The largest dictionaries in my collection: The most common (probably): Total for 178 dictionaries (some of them duplicate): 274 MB -> 463 MB So it's increasing, but I wouldn't call it an explosion. Greek seems to grow the most, but for me that'd be acceptable. What about you? |
greek dictionary has as far as encoding of actual words ( |
For greek, if you analyze the distribution of dictionary (I use https://scripts.sil.org/UnicodeCharacterCount ), you can see that smallest character in the whole dictionary is |
a lot of these dictionaries are encoded in legacy charsets which explains why they conform so well to their unicode block, e.g. zero stray ASCII or chinese or whatever in the greek one... |
A great idea about the base char! It's much simpler than what I had in mind if Greek becomes a problem :) Should I do this in this PR or in another one? |
I've found a regression not covered by unit tests, will fix it (probably on Monday), please don't merge yet :) |
we can definitely do this stuff in another PR! I'm just throwing out wacky ideas to try to help out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome!
Many of my comments are confused because I did not really understand the change and was commenting too soon ;) Sorry.
And none of my comments should block pushing this -- this looks like major performance improvement, and our tests should catch any regressions here. Progress not perfection!
This looks like a faster / less compact Map<char[],int[]>
than FST, and looks maybe loosely inspired by FST except 1) it looks up entries in reverse order (last char first), 2) it has hash map to directly jump to the tail of each entry, and 3) does not share suffixes of all words, so no RAM hungry nodeHash
(but does share common prefixes).
I would love to see this (later) factored out as its own dedicated data structure under oal.util
for a nice alternative to FST.
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java
Show resolved
Hide resolved
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java
Outdated
Show resolved
Hide resolved
while (true) { | ||
in.setPosition(pos); | ||
char c = (char) in.readVInt(); | ||
int prevPos = pos - in.readVInt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to fail-fast(er) here by checking if (c != lastChar && !collission) { return null; }
? Might give small speedup for words not in the dictionary, though, that is likely the rare case (most lookups exist, or, share a suffix that exists)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but I've tried that and it doesn't seem to make any difference to performance.
skipVInts(in, formLength); | ||
} | ||
|
||
collision = in.readByte() == 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of burning whole byte to represent "next entry still has a collision", you could steal one bit from the delta pos
you read next? It'd make this a bit more compact, and would be consistent with the up front pos < 0
collision check which is single bit encoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(delta
is a great word, I'll use it, thanks!)
Oh, appending the lowest bit to the delta looks nice, I haven't thought of this! However, I've another idea for using that byte: to store the word length, which might speed up lookup and especially enumeration. I plan to test whether this is true, and if not, will append the bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea (storing word length) -- that'd make lookups (that hashed to the same mod location but have different lengths) fail faster.
chars.chars[wordStart] = (char) in.readVInt(); | ||
int prevPos = pos - in.readVInt(); | ||
|
||
int dataLength = in.readVInt(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are willing to say "no more than 256 forms" you could make this a .readByte()
which goes a bit faster than .readVInt
, maybe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately some crazy dictionaries have numerous homonyms, like this:
Alberto po:antropónimo [n-grama: Alberto Arnoldi]
Alberto po:antropónimo [n-grama: Alberto Avendaño Prieto]
Alberto po:antropónimo [n-grama: Alberto Bazzoni]
Alberto po:antropónimo [n-grama: Alberto Benito]
Alberto po:antropónimo [n-grama: Alberto Burri]
...
The funny thing is that we don't even store/support this custom n-grama
thing. But Hunspell stores it, so we might, too, in future. So I'm not sure we should collapse the duplicates just for this case right now. Your opinion?
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
int lastPos = commonPrefixPos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add // write new suffix for this entry
comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added something else :)
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java
Show resolved
Hide resolved
} | ||
|
||
@Test | ||
public void en_suggest() throws Exception { | ||
checkSuggestionPerformance("en", 1_200); | ||
checkSuggestionPerformance("en", 3_000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm what are these magical constant numbers? And why does this change mean they should increase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the number of words to check in the test so that its duration is noticeable enough for profiling, but not too long for me to get bored when running it locally :) The change increases the speed of suggestions, so the test may now check more words in about the same time.
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java
Show resolved
Hide resolved
@mikemccand It's certainly possible (especially if we come up with a nice name; so far the best I have is Reverse(d)Trie or PrefixSharingStringMap), but I'm not sure that'd be useful to anyone else, as it's quite specialized to Hunspell scenarios (and might become even more so with further optimizations). Do you have candidates in mind for using this structure? |
bq. Progress not perfection! I hate that quote, Mike. :) |
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java
Outdated
Show resolved
Hide resolved
@rmuir This seems to bring memory usage a bit down, but not so much as I'd hope (11.1->9.4 MB for Greek, 463->458 MB total). As it also complicates the code, I'd leave this out, at least for now. |
I've fixed the encountered bugs, added regression tests for them, and applied most of these great suggestions, thanks! |
Hi, any chance to get this merged before the Great Repository Split? |
Can I use this PR as an example of how to port a PR to the new repo? ;) |
OK :) |
Wow, why? :) |
Well, maybe anywhere we use a UTF16 based FST today, and we might want faster lookup and can afford more RAM used ... e.g. maybe Anyway this can come later :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I think we should push it, unless @dweiss really struggles to find another guinea pig PR for migrating patch across the upcoming fork :)
Maybe it is interesting for use-cases like kuromoji (japanese) and nori (korean) which also use FST (sometimes with hacky root-arc caching to improve performance) today. They have a similar use case to decompounder+stemmer as far as what they are doing. |
bq. Wow, why? :) Because I'm a perfectionist and it goes against my nature to show something I don't think is ideal? :) I know, it's bad. I'm trying to fight it, but it's not easy. :) |
Closed, ported PR: apache/lucene#2 |
LOL, OK I see, that's fair. It's great to strive to be a perfectionist! You get even more progress this way! Maybe we needed dedicated Psychiatrists (Psychologists?) who know how to help us with the unique struggles in open-source software! |
All I need is you to remind me about that perfection vs. progress thing... :) |
Here's another pithy phrase for you: don't let the perfect be the enemy of
the good
…On Wed, Mar 10, 2021, 7:33 AM Dawid Weiss ***@***.***> wrote:
All I need is you to remind me about that perfection vs. progress thing...
:)
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#2459 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAHHUQJVAXBG7E3OFEDQUTLTC5RKLANCNFSM4YVMF3VA>
.
|
Description
FST lookup and especially traversal are too slow
Solution
Replace FST with a similarly-shaped trie-like data structure, but with nodes referring to their parents instead of their children, plus a hash map into its leaves. I suspect such data structure has a name, but couldn't find it.
Tests
No new tests, but this makes some notable performance-related changes in existing ones:
~25% speedup in
TestAllDictionaries
: no word FST compiling during dictionary loading now~35% speedup in
TestPerformance
: lookup/enumeration have become faster; stemming/spellchecking/suggestions are affected to varying degrees, e.g. 45-50% speedup for EnglishBut this has some cost: ~70% more memory on average as printed by
TestAllDictionaries
on 178 wooorm/libre dictionaries.Checklist
Please review the following and check all that apply:
master
branch../gradlew check
.