Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too much time for loading large dictionary #31

Open
jalajthanaki opened this issue Feb 15, 2019 · 21 comments
Open

Too much time for loading large dictionary #31

jalajthanaki opened this issue Feb 15, 2019 · 21 comments
Labels
enhancement New feature or request

Comments

@jalajthanaki
Copy link

Hi Symspellpy team,

Amazing tool....! Your team did fabulous job.

I have following queries.

  1. Is there any detail documentation for symspellpy?
  2. I have a big dictionary having more than 4M words in it. It's taking too much time to load the dictionary. Is there any ways that I can optimize the loading time. I have attached the screenshots as well which can help you to understand the problem.

This screenshot is for 82,765 words load time
This screenshot is for 82,765 words

This screenshot is for 4M words load time
This  screenshot is for  4M words

Any help or suggestion will be helpful. Even detail explanation of each method help me. I will look forward to collaborating to symspell code if needed.

@mammothb
Copy link
Owner

  1. I have included the documentation in the source code itself. However, I can look into creating a proper documentation for the package.
  2. The original code uses SuggestionStage to cut down load. However, I have yet to figure out how to implement that in python. If you are able to implement something similar, I will be glad to include it in the package.

@mzeidhassan
Copy link

Just a suggestion:
wouldn't pickle make things much faster to load? i.e. saving the frequency dictionary to a pickle, then load it to use?

@jalajthanaki
Copy link
Author

Just a suggestion:
wouldn't pickle make things much faster to load? i.e. saving the frequency dictionary to a pickle, then load it to use?

@mzeidhassan Thank you for your suggestion. I did experiment with loading dictionary in pickle format but the actual time-consuming process is to build the pre-calculated delete combination for large dictionary. I'm thinking how I can save the pre-calculated deletions and reuse them again by putting them into any in-memory database. See this interesting post.

@mammothb and @mzeidhassan Do let me know if you want to add something. Correct me/add information if I'm missing something.

@dongyuwei
Copy link

@jalajthanaki I think LMDB is suitable for your use case:

With memory-mapped files, it has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.

I had tried redis and leveldb, LMDB is more suitable to work with symspell. If your dictionary does not updated frequently, you can build the symspell delete combination once.

@jalajthanaki
Copy link
Author

I had tried redis and leveldb, LMDB is more suitable to work with symspell. If your dictionary does not updated frequently, you can build the symspell delete combination once.

@dongyuwei I am planning to update my dictionary every 3 months. So totally agree with your suggestion. I will try LMDB and redis.

@jalajthanaki
Copy link
Author

@mammothb I have tried to save delete combination as pickle and loading of pickle significantly improves the loading time for me. Is there any guideline for contributes?

Soon I will be adding the function(s) which can use in-memory database.

@jalajthanaki jalajthanaki mentioned this issue Feb 17, 2019
Closed
@mzeidhassan
Copy link

mzeidhassan commented Feb 18, 2019

This is amazing, @jalajthanaki ! Thank you so much for helping. This is a great contribution to symspellpy.

@mzeidhassan
Copy link

@mammothb I see that you merged the code in the latest version. Are you going to update the readme file and instructions accordingly?

@jalajthanaki
Copy link
Author

jalajthanaki commented Feb 18, 2019

@mammothb I see that you merged the code in the latest version. Are you going to update the readme file and instructions accordingly?

I will be doing it by tomorrow EOD.

Meanwhile refer the following code

Step 1: The following code is for when we are generating the symspell delete combination first time

    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return
    else:
        sym_spell.save_pickle('symspell.pkl')

Step 2: This is the code if we wanted to load saved pickle: sym_spell.load_pickle('symspell.pkl')

@mzeidhassan
Copy link

Thanks a million @jalajthanaki ! Appreciate it.

@dongyuwei
Copy link

@jalajthanaki json is faster than pickle, according to this https://konstantin.blog/2010/pickle-vs-json-which-is-faster/, I did not test/verify it, but maybe it is helpful for this project.

JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

Or, as @wolfgarbe suggested, use flatbuffers, since it can access to serialized data without parsing/unpacking.

@kakaroto
Copy link
Contributor

I just did some tests with the 82k word file:

Function Distance Average time (seconds)
load_dictionary 2 5.7
save_picke 2 14.5
load_pickle 2 1.5
save_json 2 7.1
load_json 2 1.4
load_dictionary 5 23.5
save_picke 5 37.8
load_pickle 5 2.35
save_json 5 33.5
load_json 5 3.1

I didn't try flatbuffers as @dongyuwei suggested because it seemed more complex to integrate than just replacing pickle with json.
My conclusion here is that json is not particularly faster than pickle in this use case, and loading is still pretty slow. Hopefully the SuggestionStage can be implemented at some point, but it's irrelevant to loading the pickle/json file. The problem there is that the amount of data is too massive (I tried compact_level=10, it didn't make a lot of difference).

@mammothb
Copy link
Owner

@kakaroto I tried implementing SuggestionStage (not pushed), but it turned out taking longer than the current code. I assume it is because preallocating memory doesn't work with python

@kakaroto
Copy link
Contributor

kakaroto commented Feb 23, 2019

I think there are ways to preallocate memory in python, it depends though on how you're going to use it, it might not work. If you want to share that code in a separate branch, I could have a look tomorrow and see if I can test/figure out a way to speed it up.

By the way, I would suggest to also store the max distance in the pickle dictionary because the deletes/words dictionary is directly related to the distance. I've been generating different dictionaries with various distances (and therefore different file sizes) but they all have the same suggestions because the SymSpell object I create always has a distance of 2, so the pickle file content is irrelevant. I'm not sure though if it's a good idea to change a user-provided setting when loading a dictionary, maybe only do that if the SymSpell object is created without specifying a distance, but keep the distance fixed if that argument is passed to __init__?

@mammothb
Copy link
Owner

@kakaroto i have created a new branch with an initial implementation of SuggestionStage

@kakaroto
Copy link
Contributor

Ah, thanks @mammothb! I see the problem and there's definitely some improvement that can be made here but the SuggestionStage is not the way to go.

Looking at what the CS code does, I see this on every delete insertion :


                if (deletes.TryGetValue(deleteHash, out string[] suggestions))
                {
                    var newSuggestions = new string[suggestions.Length + 1];
                    Array.Copy(suggestions, newSuggestions, suggestions.Length);
                    deletes[deleteHash] = suggestions = newSuggestions;
                }
                else
                {
                    suggestions = new string[1];
                    deletes.Add(deleteHash, suggestions);
                }
                suggestions[suggestions.Length - 1] = key;

That's indeed very time consuming, create a string array then copy the old array into the new one then add the new entry to it. So the suggestion stage makes this go much faster by doing this :


            if (!Deletes.TryGetValue(deleteHash, out Entry entry)) entry = new Entry { count = 0, first = -1 };
            int next = entry.first;
            entry.count++;
            entry.first = Nodes.Count;
            Deletes[deleteHash] = entry;
            Nodes.Add(new Node { suggestion = suggestion, next = next });

Which is basically: Create a linked list and append to the list instead then convert to a string array only at the end.

Now in the Python implementation you're not creating arrays, you're already using a linked list and call .append to it so you were already kind of using a 'suggestion stage' because you're using a linked list which is already very fast for insertions.
What you did instead is replace this code :

            for delete in edits:
                delete_hash = self._get_str_hash(delete)
                self._deletes[delete_hash].append(key)

with this :

            for delete in edits:
                staging.add(self._get_str_hash(delete), key)

So you are already looping through the same amount of edits, and calling the same get_str_hash on every one and whatever that staging.add() does, if it's not faster than the already very fast python list.append function, then it's of course going to be a slower implementation and not be helpful.

So there's a couple of things I noticed. First, the CS implementation uses a <int, string[]> dictionary, and since they use an int, that's why they need to hash the delete string, hence the self._get_str_hash, but you don't actually need to do that because a python dictionary is already a hash table, so you can make it into a dictionary of strings and it will be just as fast as a dictionary of ints.
The second thing is that the custom hash function you wrote cannot be faster (or as good/complete) than python's optimized hash function. To confirm, I ran it 10 million times and timed it and python's hash function is 10 times faster than the custom implementation

  a = time.time()
  for i in range(10000000):
     s._get_str_hash("foobar")
  b = time.time(); print(b - a)

13.685925960540771

  a = time.time()
  for i in range(10000000):
     hash("foobar")
  b = time.time(); print(b - a)
1.3400061130523682

So I made this simple change, remove the calls to _str_get_hash, and make that dictionary just take the string directly :

        for delete in edits:
            self._deletes[delete].append(key)

and in lookup, do the same :

            if candidate in self._deletes:
                dict_suggestions = self._deletes[candidate]

And then I tested the speeds again (average of 10 runs) :

Function Hash Distance Average time (seconds)
load_dictionary _get_str_hash 2 5.7
load_dictionary _get_str_hash 5 23.5
load_dictionary python hash 2 3.4
load_dictionary python hash 5 14.1

So this simple change gives us about 40% speed improvement. I tried a lookup for "helo" and I got the same results and no crash, so yeay! But I haven't done any extensive testing on this, and I'm sure there are many other ways that we can use to make it more efficient. (Note, this would break the pickle. It would be good to add a 'version' in the pickle to make sure you only load from the same version of the internal data).

This is the diff I have for now (I didn't clone/push because I haven't tested it extensively but once I look at the rest of the code and see how it can be improved further, then I'll make a proper patch and send a PR, but I had a quick look at the _edits function, it's gonna be hard to optimize that):

diff -ru /usr/local/lib/python3.6/site-packages/symspellpy/symspellpy.py symspellpy_en/symspellpy.py
--- /usr/local/lib/python3.6/site-packages/symspellpy/symspellpy.py     2019-02-22 20:18:43.693372932 -0500
+++ symspellpy_en/symspellpy.py 2019-02-26 00:35:26.881581236 -0500
@@ -136,8 +136,7 @@
         # create deletes
         edits = self._edits_prefix(key)
         for delete in edits:
-            delete_hash = self._get_str_hash(delete)
-            self._deletes[delete_hash].append(key)
+            self._deletes[delete].append(key)
         return True

     def load_dictionary(self, corpus, term_index, count_index,
@@ -328,8 +327,8 @@
                     continue
                 break

-            if self._get_str_hash(candidate) in self._deletes:
-                dict_suggestions = self._deletes[self._get_str_hash(candidate)]
+            if candidate in self._deletes:
+                dict_suggestions = self._deletes[candidate]
                 for suggestion in dict_suggestions:
                     if suggestion == phrase:
                         continue

@kakaroto
Copy link
Contributor

By the way I'm happy to see that the ~40% speed improvement in removing _get_str_hash is consistent with @jalajthanaki 's profiling timing that showed the function taking 36% and 37.5% of the time, and the builtin hash() isn't adding any time because it was already being called when inserting into the dictionary, (though it was being called on the str_hash itself).
I've also found out that python's list is not actually a linked list but rather an array but with a pre-allocated size and its append happens in O(1).

A quick little update as well (using frequency_dictionary_en_82_765.txt and the optimized code) :

Function gzipped Distance Size (MB) Average time (seconds)
load_dictionary - 2 1.3 MB 3.4
save_picke True 2 8.6 MB 9.5
load_pickle True 2 8.6 MB 1.2
save_picke False 2 26 MB 0.6
load_pickle False 2 26 MB 0.9
load_dictionary - 5 1.3 MB 14.1
save_picke True 5 21 MB 37.8
load_pickle True 5 21 MB 2.35
save_picke False 5 54 MB 1.0
load_pickle False 5 54 MB 1.5

So it looks like the biggest time consumption in loading/saving the pickle is the compression which sounds obvious now that I think about it. I also tested json and it's also taking longer with non-gzipped version (and higher file sizes) when compared to pickle.

Note that the optimized code makes loading/saving the pickle faster too (9.5 vs 14.5s to save, 1.2 vs 1.5s to load) probably because the length of the hash was generally longer than the length of the average word.

For reference, here's another test done with aspell's en_US dictionary which contains 120400 words (more words but smaller file size and faster to load/save) :

Function gzipped Distance Average time (seconds)
load_dictionary - 2 3.0
save_picke True 2 9.3
load_pickle True 2 2.0
save_picke False 2 0.65
load_pickle False 2 1.2
save_json False 2 2.3
load_json False 2 2.2

And for some more timing test, in create_dictionary_entry, if I do :

        edits = self._edits_prefix(key)
        edits = []

The time to load drops from 3.0s to 1.8s, and if I change it to :

        #edits = self._edits_prefix(key)
        edits = []

It drops to 0.13s, which is again more or less consistent with the profiling information of about 20% in both the call to self._edits and in the create_dictionary_entry (where most of the time is spent looping on those edits, inserting them in the dictionary).

So we currently have more than 50% of the new time spent on these 2 lines in create_dictionary_entry

        for delete in edits:
            self._deletes[delete].append(key)

There isn't a lot that we can do to improve those I think. I think there's a better chance at improving the _edits method, but I haven't been able to find the right way to do it.

By the way, if someone has another dictionary to provide (@jalajthanaki you have one with 4M words?) I can test with that as well.

@mammothb
Copy link
Owner

@kakaroto any updates on the patch? I could implement the suggested improvement if you're busy with other things

@kakaroto
Copy link
Contributor

@mammothb sorry, I've been busy. I've spent some time trying to optimize the rest of it, but unfortunately unfruitfully.
I should have time to take care of it this week then send the patch. I want to add the versioning of the pickle as well to avoid loading a pickle made with the old format (hash instead of string in the deletes dict). I think it should just reject the old version, making it throw an error. I wonder if that's an acceptable solution (should be possible to just re-create the deletes table from the words dictionary of the pickle, but with no saves in loading time, so it defeats the purpose, I can do that if you prefer).

@mammothb
Copy link
Owner

@kakaroto I think it's fine to just reject the old version and have the users manually create a new version themselves.

kakaroto added a commit to kakaroto/symspellpy that referenced this issue Mar 21, 2019
As mentionned in a comment on issue mammothb#31 [1], the creating of a custom
hashing algorithm is too slow when compared to the native string hashing
algorithm of python. Instead of having a dictionary of ints, we just keep
a dictionary of strings and this has no impact on functionality but improves
the overall performance of loading a dictionary by about 40%

This also adds a 'data_version' key to the pickle dictionary, which starts
at 2 and load_pickle returns False if the data version is incompatible.

[1] mammothb#31 (comment)
@kakaroto
Copy link
Contributor

I've sent the pull request, let me know if there's anything more I can do! Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants