
LUCENE-9825: Hunspell: reverse the "words" trie for faster word lookup/suggestions #2459

Closed
wants to merge 8 commits

Conversation

@donnerpeter (Contributor) commented Mar 5, 2021

Description

FST lookup and especially traversal are too slow

Solution

Replace the FST with a similarly shaped trie-like data structure, but with nodes referring to their parents instead of their children, plus a hash map into its leaves. I suspect such a data structure has a name, but I couldn't find it.
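To illustrate the idea, here is a minimal sketch of such a structure. The class and field names are hypothetical (this is not the PR's WordStorage code), and it omits what the real implementation does for compactness: flattening nodes into a byte array and sharing common prefixes between consecutive sorted entries. It shows the two key points: nodes hold only parent links, and a hash map keyed by the word's hash jumps straight to a leaf, from which membership is verified by comparing characters last-to-first.

```java
import java.util.HashMap;
import java.util.Map;

class ReverseTrie {
  static final class Node {
    final Node parent; // null marks the start of a word
    final char ch;     // character on the edge from the parent
    Node nextLeaf;     // chain of leaves whose words hash to the same value
    Node(Node parent, char ch) { this.parent = parent; this.ch = ch; }
  }

  private final Map<Integer, Node> leaves = new HashMap<>();

  /** Adds a non-empty word; each node refers only to its parent. */
  void add(String word) {
    Node node = null;
    for (int i = 0; i < word.length(); i++) {
      node = new Node(node, word.charAt(i)); // parent link only, no child pointers
    }
    int hash = word.hashCode();
    node.nextLeaf = leaves.get(hash); // prepend to the collision chain
    leaves.put(hash, node);
  }

  boolean contains(String word) {
    if (word.isEmpty()) return false;
    for (Node leaf = leaves.get(word.hashCode()); leaf != null; leaf = leaf.nextLeaf) {
      Node n = leaf;
      int i = word.length() - 1;
      // walk parent links, comparing the query last char first
      while (n != null && i >= 0 && n.ch == word.charAt(i)) {
        n = n.parent;
        i--;
      }
      if (n == null && i < 0) return true; // both exhausted: exact match
    }
    return false;
  }
}
```

Unlike an FST, lookup never scans sibling arcs: it is one hash probe plus one backward walk per colliding leaf, which is what makes it fast at the price of extra memory.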

Tests

No new tests, but this makes some notable performance-related changes in existing ones:
  • ~25% speedup in TestAllDictionaries: no word FST compilation during dictionary loading now
  • ~35% speedup in TestPerformance: lookup/enumeration have become faster; stemming/spellchecking/suggestions are affected to varying degrees, e.g. 45-50% speedup for English
But this comes at a cost: ~70% more memory on average, as printed by TestAllDictionaries on 178 wooorm/libre dictionaries.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the master branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Ref Guide (for Solr changes only).

/** A Bloom filter over {@link #words} to avoid unnecessary expensive FST traversals */
FixedBitSet wordHashes;
/** The entries in the .dic file, mapping to their set of flags */
WordStorage words;
@donnerpeter (author):

Moved the word information into a dedicated class

if (fst == null) {
return null;
}
private IntsRef lookup(FST<IntsRef> fst, char[] word) {
@donnerpeter (author):

Now queries are always on the entire array, and the FST is non-null. This code is called only from a single test class and might deserve moving there.

@@ -1275,76 +1258,6 @@ boolean isDotICaseChangeDisallowed(char[] word) {
return word[0] == 'İ' && !alternateCasing;
}

private class EntryGrouper {
@donnerpeter (author):

The same-word-grouping logic in this class has been moved to WordStorage.Builder and extended to also build the reverse trie.

@@ -234,6 +234,8 @@ private void tryLongSwap(String word) {
}

private void tryRemovingChar(String word) {
if (word.length() == 1) return;
@donnerpeter (author):

We don't accept empty words for lookup anymore

@@ -94,6 +94,10 @@ public Stemmer(Dictionary dictionary) {
}

List<CharsRef> list = new ArrayList<>();
if (length == 0) {
@donnerpeter (author):

We don't accept empty words for lookup anymore, again

Reviewer (Member):

LOL empty string.

@rmuir (Member) commented Mar 5, 2021

I like the idea: optimize for the operations Hunspell needs to do. Do you happen to have any more details on the memory usage increase (e.g. by language)? I imagine things are OK, since there is still prefix compression, but I just want to confirm nothing really explodes in a bad way.

@donnerpeter (author) commented Mar 5, 2021

The largest dictionaries in my collection:
el_GR: 1.7 MB -> 11.1 MB
gd: 14.5 MB -> 18.3 MB
fy: 36.7 MB -> 39.3 MB
rw: remains 18.8 MB (because it's all affixes, not words)

The most common (probably):
en_US: 409 KB -> 712 KB
de_DE: 2.5 MB -> 4.7 MB
fr: 830 KB -> 1.4 MB

Total for 178 dictionaries (some of them duplicate): 274 MB -> 463 MB

So it's increasing, but I wouldn't call it an explosion. Greek seems to grow the most, but for me that'd be acceptable. What about you?

@rmuir (Member) commented Mar 5, 2021

The Greek dictionary has 828,806 words, so the hash table alone is 3.2 MB (828,806 entries × 4-byte ints), right?

As far as encoding of the actual words goes (n * VINT: the word form data), maybe it is possible to explore n * ZINT, where you encode deltas from the previously encoded character? That way, writing systems like Greek and Russian would typically use only a single byte per char, because of how they are organized in Unicode blocks. See {{readZInt()}} for more information.
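A sketch of that delta idea (hypothetical class, not Lucene's actual VInt/ZInt code): each character is written as a zigzag-encoded varint delta from the previous character, so scripts clustered in one Unicode block cost mostly one byte per character after the first.

```java
import java.io.ByteArrayOutputStream;

class DeltaCharCodec {
  // zigzag maps small signed deltas to small unsigned ints: 0,-1,1,-2,... -> 0,1,2,3,...
  static int zigzag(int v)   { return (v << 1) ^ (v >> 31); }
  static int unzigzag(int v) { return (v >>> 1) ^ -(v & 1); }

  static byte[] encode(String word) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int i = 0; i < word.length(); i++) {
      int z = zigzag(word.charAt(i) - prev);
      prev = word.charAt(i);
      while ((z & ~0x7F) != 0) { // standard varint: 7 bits per byte, high bit = "more"
        out.write((z & 0x7F) | 0x80);
        z >>>= 7;
      }
      out.write(z);
    }
    return out.toByteArray();
  }

  static String decode(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    int prev = 0, pos = 0;
    while (pos < bytes.length) {
      int z = 0, shift = 0, b;
      do {
        b = bytes[pos++] & 0xFF;
        z |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += unzigzag(z);
      sb.append((char) prev);
    }
    return sb.toString();
  }
}
```

For a Greek word like "αβγ" (U+03B1..U+03B3) the first delta needs two bytes, but every following in-block character needs just one, versus two bytes per character with plain per-char VInts.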

@rmuir (Member) commented Mar 5, 2021

For Greek, if you analyze the distribution of the dictionary (I use https://scripts.sil.org/UnicodeCharacterCount ), you can see that the smallest character in the whole dictionary is 0x386 (decimal 902) and the largest is 0x3CE (decimal 974). So, even simpler, you could exploit that and encode a single int base = 0x386; // smallest char in use for this whole dictionary, and all characters would be single-byte encoded.
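That even simpler per-dictionary base scheme could look like this (hypothetical class; the range check is an assumption about how out-of-range chars would be handled). Any alphabet spanning fewer than 256 code points, as Greek does here (0x386..0x3CE), then encodes to exactly one byte per character.

```java
class BaseCharCodec {
  private final char base; // smallest char in use for the whole dictionary

  BaseCharCodec(char base) { this.base = base; }

  byte[] encode(String word) {
    byte[] out = new byte[word.length()];
    for (int i = 0; i < word.length(); i++) {
      int offset = word.charAt(i) - base;
      if (offset < 0 || offset > 0xFF) {
        throw new IllegalArgumentException("char outside the single-byte range");
      }
      out[i] = (byte) offset; // one byte per character
    }
    return out;
  }

  String decode(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length);
    for (byte b : bytes) sb.append((char) (base + (b & 0xFF)));
    return sb.toString();
  }
}
```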

@rmuir (Member) commented Mar 5, 2021

A lot of these dictionaries are encoded in legacy charsets, which explains why they conform so well to their Unicode block, e.g. zero stray ASCII or Chinese or whatever in the Greek one...

@donnerpeter (author):
A great idea about the base char! It's much simpler than what I had in mind if Greek becomes a problem :) Should I do this in this PR or in another one?

@donnerpeter (author):
I've found a regression not covered by unit tests, will fix it (probably on Monday), please don't merge yet :)

@rmuir (Member) commented Mar 5, 2021

We can definitely do this stuff in another PR! I'm just throwing out wacky ideas to try to help out.

@mikemccand (Member) left a comment:

This looks awesome!

Many of my comments are confused because I did not really understand the change and was commenting too soon ;) Sorry.

And none of my comments should block pushing this -- this looks like major performance improvement, and our tests should catch any regressions here. Progress not perfection!

This looks like a faster / less compact Map<char[],int[]> than FST, and looks maybe loosely inspired by FST, except that 1) it looks up entries in reverse order (last char first), 2) it has a hash map to jump directly to the tail of each entry, and 3) it does not share the suffixes of all words, so there is no RAM-hungry nodeHash (but it does share common prefixes).

I would love to see this (later) factored out as its own dedicated data structure under oal.util for a nice alternative to FST.

while (true) {
in.setPosition(pos);
char c = (char) in.readVInt();
int prevPos = pos - in.readVInt();
Reviewer (Member):

You might want to fail fast(er) here by checking if (c != lastChar && !collision) { return null; }? Might give a small speedup for words not in the dictionary; though that is likely the rare case (most lookups exist, or share a suffix that exists)?

@donnerpeter (author), Mar 8, 2021:

Thanks, but I've tried that and it doesn't seem to make any difference to performance.

skipVInts(in, formLength);
}

collision = in.readByte() == 1;
Reviewer (Member):

Instead of burning a whole byte to represent "next entry still has a collision", you could steal one bit from the delta pos you read next? It'd make this a bit more compact, and would be consistent with the up-front pos < 0 collision check, which is single-bit encoding.
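The bit-stealing suggestion amounts to something like this (hypothetical helper, not the PR's code): fold the one-bit collision flag into the low bit of the delta before VInt-encoding it, instead of spending a separate byte.

```java
class PackedDelta {
  /** Packs a non-negative position delta and a collision flag into one int. */
  static int pack(int delta, boolean collision) {
    return (delta << 1) | (collision ? 1 : 0);
  }

  /** Recovers the position delta (the upper bits). */
  static int delta(int packed) { return packed >>> 1; }

  /** Recovers the collision flag (the stolen low bit). */
  static boolean collision(int packed) { return (packed & 1) != 0; }
}
```

The cost is one extra bit in the VInt: deltas up to 63 (rather than 127) still fit in a single encoded byte.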

@donnerpeter (author):

(delta is a great word, I'll use it, thanks!)

Oh, appending the lowest bit to the delta looks nice, I hadn't thought of that! However, I have another idea for using that byte: storing the word length, which might speed up lookup and especially enumeration. I plan to test whether this is true, and if not, I'll append the bit.

Reviewer (Member):

Good idea (storing word length) -- that'd make lookups (that hashed to the same mod location but have different lengths) fail faster.

chars.chars[wordStart] = (char) in.readVInt();
int prevPos = pos - in.readVInt();

int dataLength = in.readVInt();
Reviewer (Member):

If you are willing to say "no more than 256 forms" you could make this a .readByte() which goes a bit faster than .readVInt, maybe.

@donnerpeter (author):

Unfortunately some crazy dictionaries have numerous homonyms, like this:

Alberto po:antropónimo [n-grama: Alberto Arnoldi]
Alberto po:antropónimo [n-grama: Alberto Avendaño Prieto]
Alberto po:antropónimo [n-grama: Alberto Bazzoni]
Alberto po:antropónimo [n-grama: Alberto Benito]
Alberto po:antropónimo [n-grama: Alberto Burri]
...

The funny thing is that we don't even store/support this custom n-grama thing. But Hunspell stores it, so we might, too, in the future. So I'm not sure we should collapse the duplicates just for this case right now. Your opinion?

}
}

int lastPos = commonPrefixPos;
Reviewer (Member):

Maybe add // write new suffix for this entry comment?

@donnerpeter (author):

Added something else :)

}

@Test
public void en_suggest() throws Exception {
checkSuggestionPerformance("en", 1_200);
checkSuggestionPerformance("en", 3_000);
Reviewer (Member):

Hmm what are these magical constant numbers? And why does this change mean they should increase?

@donnerpeter (author), Mar 8, 2021:

This is the number of words to check in the test, chosen so that its duration is noticeable enough for profiling, but not so long that I get bored when running it locally :) The change increases the speed of suggestions, so the test may now check more words in about the same time.

@donnerpeter (author):

I would love to see this (later) factored out as its own dedicated data structure under oal.util for a nice alternative to FST.

@mikemccand It's certainly possible (especially if we come up with a nice name; so far the best I have is Reverse(d)Trie or PrefixSharingStringMap), but I'm not sure that'd be useful to anyone else, as it's quite specialized to Hunspell scenarios (and might become even more so with further optimizations). Do you have candidates in mind for using this structure?

@dweiss (Contributor) commented Mar 8, 2021

bq. Progress not perfection!

I hate that quote, Mike. :)

@donnerpeter (author):

For greek, if you analyze the distribution of dictionary (I use https://scripts.sil.org/UnicodeCharacterCount ), you can see that smallest character in the whole dictionary is 0x386 (decimal 902) and largest is 0x3CE (decimal 974). So, even simpler, you could exploit that and encode a single int base = 0x386; // smallest char in use for this whole dictionary and all characters would be single-byte encoded.

@rmuir This seems to bring memory usage down a bit, but not as much as I'd hoped (11.1 -> 9.4 MB for Greek, 463 -> 458 MB total). As it also complicates the code, I'd leave this out, at least for now.

@donnerpeter (author):

I've fixed the encountered bugs, added regression tests for them, and applied most of these great suggestions, thanks!

@donnerpeter (author):

Hi, any chance to get this merged before the Great Repository Split?

@dweiss (Contributor) commented Mar 9, 2021

Can I use this PR as an example of how to port a PR to the new repo? ;)

@donnerpeter (author):

OK :)

@mikemccand (Member):

bq. Progress not perfection!

I hate that quote, Mike. :)

Wow, why? :)

@mikemccand (Member):

Do you have candidates in mind for using this structure?

Well, maybe anywhere we use a UTF16 based FST today, and we might want faster lookup and can afford more RAM used ... e.g. maybe SynonymGraphFilter?

Anyway this can come later :)

@mikemccand (Member) left a comment:

This looks great! I think we should push it, unless @dweiss really struggles to find another guinea pig PR for migrating a patch across the upcoming fork :)

@rmuir (Member) commented Mar 9, 2021

Do you have candidates in mind for using this structure?

Well, maybe anywhere we use a UTF16 based FST today, and we might want faster lookup and can afford more RAM used ... e.g. maybe SynonymGraphFilter?

Anyway this can come later :)

Maybe it is interesting for use cases like kuromoji (Japanese) and nori (Korean), which also use FST today (sometimes with hacky root-arc caching to improve performance). They have a use case similar to the decompounder+stemmer as far as what they are doing.

@dweiss (Contributor) commented Mar 10, 2021

bq. Wow, why? :)

Because I'm a perfectionist and it goes against my nature to show something I don't think is ideal? :) I know, it's bad. I'm trying to fight it, but it's not easy. :)

@dweiss (Contributor) commented Mar 10, 2021

Closed, ported PR: apache/lucene#2

@dweiss dweiss closed this Mar 10, 2021
@mikemccand (Member):

bq. Wow, why? :)

Because I'm a perfectionist and it goes against my nature to show something I don't think is ideal? :) I know, it's bad. I'm trying to fight it, but it's not easy. :)

LOL, OK I see, that's fair. It's great to strive to be a perfectionist! You get even more progress this way! Maybe we need dedicated psychiatrists (psychologists?) who know how to help us with the unique struggles of open-source software!

@dweiss (Contributor) commented Mar 10, 2021

All I need is you to remind me about that perfection vs. progress thing... :)

@msokolov (Contributor) commented Mar 11, 2021 via email
