Too much time for loading large dictionary #31
Just a suggestion:
@mzeidhassan Thank you for your suggestion. I did experiment with loading the dictionary in pickle format, but the actual time-consuming process is building the pre-calculated delete combinations for a large dictionary. I'm thinking about how I can save the pre-calculated deletions and reuse them, perhaps by putting them into an in-memory database. See this interesting post. @mammothb and @mzeidhassan, do let me know if you want to add something. Correct me/add information if I'm missing something.
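For context, the expensive step being discussed is roughly the following delete generation (a simplified sketch of the standard SymSpell idea, not the project's actual _edits implementation):

```python
def generate_deletes(word, max_edit_distance):
    """Collect every string obtained by deleting up to
    max_edit_distance characters from `word`."""
    deletes = set()

    def recurse(term, distance):
        if distance >= max_edit_distance:
            return
        for i in range(len(term)):
            shorter = term[:i] + term[i + 1:]
            if shorter not in deletes:
                deletes.add(shorter)
                recurse(shorter, distance + 1)

    recurse(word, 0)
    return deletes

# Every dictionary word spawns many deletes, which is why pre-calculating
# them for millions of words dominates the load time.
print(sorted(generate_deletes("hello", 2)))
```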
@jalajthanaki I think LMDB is suitable for your use case:
I have tried Redis and LevelDB; LMDB is more suitable to work with symspell. If your dictionary is not updated frequently, you can build the symspell delete combinations once.
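For example, a minimal sketch with the lmdb Python package (the key/value layout and file name here are only illustrative):

```python
import json
import lmdb

# Open (or create) an LMDB environment; map_size is the maximum size in bytes.
env = lmdb.open("symspell_deletes.lmdb", map_size=2 * 1024 ** 3)

def store_deletes(deletes):
    """Persist the pre-calculated delete -> [words] mapping once."""
    with env.begin(write=True) as txn:
        for delete, words in deletes.items():
            txn.put(delete.encode("utf-8"), json.dumps(words).encode("utf-8"))

def candidates(delete):
    """Fetch candidate words for one delete without loading everything into RAM."""
    with env.begin() as txn:
        raw = txn.get(delete.encode("utf-8"))
        return json.loads(raw) if raw is not None else []
```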
@dongyuwei I am planning to update my dictionary every 3 months, so I totally agree with your suggestion. I will try LMDB and Redis.
@mammothb I have tried saving the delete combinations as a pickle, and loading the pickle significantly improves the loading time for me. Are there any guidelines for contributors? Soon I will be adding the function(s) which can use an in-memory database.
This is amazing, @jalajthanaki! Thank you so much for helping. This is a great contribution to symspellpy.
@mammothb I see that you merged the code in the latest version. Are you going to update the readme file and instructions accordingly?
I will be doing it by tomorrow EOD. Meanwhile, refer to the following code.
Step 1: The following code is for when we are generating the symspell delete combinations for the first time:
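Something along these lines (a sketch; the constructor arguments, file names, and the load_dictionary/save_pickle signatures may differ between symspellpy versions):

```python
from symspellpy.symspellpy import SymSpell

max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(83000, max_edit_distance_dictionary, prefix_length)

# Build the dictionary and the pre-calculated delete combinations once.
dictionary_path = "frequency_dictionary_en_82_765.txt"
term_index = 0   # column of the term in the dictionary file
count_index = 1  # column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index, count_index)

# Persist the pre-calculated deletes so they never have to be rebuilt.
sym_spell.save_pickle("symspell_82k.pickle")
```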
Step 2: This is the code if we want to load the saved pickle:
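Again as a sketch under the same assumptions:

```python
from symspellpy.symspellpy import SymSpell, Verbosity

# Recreate the object and restore the pre-calculated deletes from disk.
sym_spell = SymSpell(83000, 2, 7)
if not sym_spell.load_pickle("symspell_82k.pickle"):
    raise RuntimeError("could not load the pickled dictionary")

# Lookups work as usual once the pickle is loaded.
for suggestion in sym_spell.lookup("helo", Verbosity.CLOSEST, 2):
    print(suggestion.term, suggestion.distance, suggestion.count)
```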
Thanks a million @jalajthanaki! Appreciate it.
@jalajthanaki JSON is faster than pickle according to https://konstantin.blog/2010/pickle-vs-json-which-is-faster/. I did not test/verify it, but maybe it is helpful for this project.
Or, as @wolfgarbe suggested, use flatbuffers, since it can be read without a full parsing/deserialization step.
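A quick way to check the pickle-vs-json claim on your own machine (nothing symspellpy-specific here; `deletes` just stands in for the real delete-combination mapping):

```python
import json
import pickle
import time

# Stand-in data: delete string -> list of candidate words.
deletes = {"helo": ["hello", "help"], "wrld": ["world"]}

start = time.perf_counter()
with open("deletes.pickle", "wb") as f:
    pickle.dump(deletes, f, protocol=pickle.HIGHEST_PROTOCOL)
print("pickle dump:", time.perf_counter() - start)

start = time.perf_counter()
with open("deletes.json", "w") as f:
    json.dump(deletes, f)
print("json dump:", time.perf_counter() - start)
```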
I just did some tests with the 82k word file:
I didn't try flatbuffers as @dongyuwei suggested because it seemed more complex to integrate than just replacing pickle with json.
@kakaroto I tried implementing SuggestionStage (not pushed), but it turned out to take longer than the current code. I assume it is because preallocating memory doesn't work with Python.
I think there are ways to preallocate memory in Python; it depends on how you're going to use it, though, and it might not work. If you want to share that code in a separate branch, I could have a look tomorrow and see if I can test/figure out a way to speed it up. By the way, I would suggest also storing the max distance in the pickled dictionary, because the deletes/words dictionary is directly related to the distance. I've been generating different dictionaries with various distances (and therefore different file sizes), but they all give the same suggestions because the SymSpell object I create always has a distance of 2, so the pickle file content is irrelevant. I'm not sure though if it's a good idea to change a user-provided setting when loading a dictionary; maybe only do that if the SymSpell object is created without specifying a distance, but keep the distance fixed if that argument is passed to the constructor.
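A sketch of what storing the distance next to the data could look like (the payload layout and names are illustrative, not the actual symspellpy format):

```python
import gzip
import pickle

def save_with_settings(deletes, words, max_edit_distance, path):
    # Keep the distance the deletes were built with next to the data so a
    # mismatched SymSpell configuration can be detected on load.
    payload = {
        "deletes": deletes,
        "words": words,
        "max_edit_distance": max_edit_distance,
    }
    with gzip.open(path, "wb") as f:
        pickle.dump(payload, f)

def load_with_settings(path, requested_distance=None):
    with gzip.open(path, "rb") as f:
        payload = pickle.load(f)
    if requested_distance is not None and requested_distance != payload["max_edit_distance"]:
        raise ValueError("pickle was built with a different max edit distance")
    return payload
```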
@kakaroto I have created a new branch with an initial implementation of SuggestionStage.
Ah, thanks @mammothb! I see the problem, and there's definitely some improvement that can be made here, but SuggestionStage is not the way to go. Looking at what the C# code does, I see this on every delete insertion:
That's indeed very time-consuming: create a string array, copy the old array into the new one, then add the new entry to it. So the suggestion stage makes this go much faster by doing this:
Which is basically: create a linked list and append to it instead, then convert to a string array only at the end. Now in the Python implementation you're not creating arrays; you're already using a list and calling append on it, so the staging would just replace that append with a staging.add() call.
So you are already looping through the same number of edits, calling the same get_str_hash on every one, and whatever staging.add() does, it is not going to beat the already very fast built-in list.append. So there are a couple of things I noticed. First, the C# implementation uses a <int, string[]> dictionary, and since they use an int, that's why they need to hash the delete string, hence the get_str_hash function.
So I made this simple change: remove the calls to get_str_hash and use the delete string itself as the dictionary key:
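Not the actual diff, but the insertion side of the idea looks roughly like this (names are illustrative):

```python
def add_delete(deletes, delete, original_word):
    # Previously the key was an int produced by a custom get_str_hash(delete);
    # Python already hashes str keys natively in C, so keying by the string
    # itself avoids computing that hash in pure Python for every delete.
    if delete in deletes:
        deletes[delete].append(original_word)
    else:
        deletes[delete] = [original_word]
```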
and in lookup, do the same:
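Again as a sketch of the lookup side:

```python
def deletes_for(deletes, candidate):
    # Probe the deletes dict with the candidate string directly instead of
    # get_str_hash(candidate).
    return deletes.get(candidate, [])
```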
And then I tested the speeds again (average of 10 runs):
So this simple change gives us about a 40% speed improvement. I tried a lookup for "helo" and I got the same results and no crash, so yay! But I haven't done any extensive testing on this, and I'm sure there are many other ways to make it more efficient. (Note: this would break the pickle. It would be good to add a 'version' to the pickle to make sure you only load from the same version of the internal data.) This is the diff I have for now (I didn't clone/push because I haven't tested it extensively, but once I look at the rest of the code and see how it can be improved further, I'll make a proper patch and send a PR; I had a quick look at the _edits function and it's going to be hard to optimize that):
By the way, I'm happy to see the ~40% speed improvement from removing get_str_hash. A quick little update as well (using frequency_dictionary_en_82_765.txt and the optimized code):
So it looks like the biggest time consumption in loading/saving the pickle is the compression, which sounds obvious now that I think about it. I also tested JSON, and it takes longer than pickle even in its non-gzipped version (and produces larger files). Note that the optimized code makes loading/saving the pickle faster too (9.5 vs 14.5 s to save, 1.2 vs 1.5 s to load), probably because the length of the hash was generally longer than the length of the average word. For reference, here's another test done with aspell's en_US dictionary, which contains 120,400 words (more words but a smaller file size and faster to load/save):
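A minimal way to confirm that the compression, not the pickling itself, dominates (again with stand-in data):

```python
import gzip
import pickle
import time

def timed_dump(obj, path, opener):
    start = time.perf_counter()
    with opener(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return time.perf_counter() - start

deletes = {"helo": ["hello", "help"], "wrld": ["world"]}  # stand-in for the real data
print("gzipped pickle:", timed_dump(deletes, "deletes.pickle.gz", gzip.open))
print("plain pickle  :", timed_dump(deletes, "deletes.pickle", open))
```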
And for some more timing tests in the dictionary-loading code:
The time to load drops from 3.0 s to 1.8 s, and if I change it to:
It drops to 0.13 s, which is again more or less consistent with the profiling information of about 20% each in the call to self._edits and in the dictionary insertion. So we currently have more than 50% of the new time spent on those two lines:
There isn't a lot we can do to improve those, I think; there's a better chance of improving things elsewhere. By the way, if someone has another dictionary to provide (@jalajthanaki, you have one with 4M words?), I can test with that as well.
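If anyone wants to reproduce the profiling numbers, something like this is enough (standard-library cProfile; the constructor arguments are the same assumptions as above):

```python
import cProfile
import pstats

from symspellpy.symspellpy import SymSpell

sym_spell = SymSpell(83000, 2, 7)

# Profile dictionary loading to see where the time goes (e.g. _edits).
cProfile.run(
    'sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1)',
    "load_profile.out",
)
pstats.Stats("load_profile.out").sort_stats("cumulative").print_stats(15)
```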
@kakaroto any updates on the patch? I could implement the suggested improvement if you're busy with other things.
@mammothb sorry, I've been busy. I've spent some time trying to optimize the rest of it, but unfortunately without much success.
@kakaroto I think it's fine to just reject the old version and have users manually create a new version themselves.
As mentioned in a comment on issue mammothb#31 [1], creating a custom hash is too slow compared to Python's native string hashing. Instead of having a dictionary of ints, we just keep a dictionary of strings; this has no impact on functionality but improves the overall performance of loading a dictionary by about 40%. This also adds a 'data_version' key to the pickle dictionary, which starts at 2, and load_pickle returns False if the data version is incompatible. [1] mammothb#31 (comment)
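The version guard described above amounts to something like this (a sketch; the stored layout is an assumption):

```python
import pickle

DATA_VERSION = 2  # bumped whenever the layout of the pickled internals changes

def is_compatible_pickle(path):
    # Refuse data written with an older, incompatible layout.
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data.get("data_version") == DATA_VERSION
```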
I've sent the pull request; let me know if there's anything more I can do! Thanks.
Hi Symspellpy team,
Amazing tool! Your team did a fabulous job.
I have the following queries.
(Screenshot: load time for the 82,765-word dictionary)
(Screenshot: load time for the 4M-word dictionary)
Any help or suggestion would be appreciated; even a detailed explanation of each method would help me. I look forward to contributing to the symspell code if needed.