Too much time for loading large dictionary #31
Just a suggestion:
@mzeidhassan Thank you for your suggestion. I did experiment with loading the dictionary in pickle format, but the actual time-consuming process is building the pre-calculated delete combinations for a large dictionary. I'm thinking about how I can save the pre-calculated deletions and reuse them, perhaps by putting them into an in-memory database. See this interesting post. @mammothb and @mzeidhassan, do let me know if you want to add something. Correct me/add information if I'm missing something.
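For context, the expensive step being discussed is roughly the following delete generation (a simplified sketch of the standard SymSpell idea, not the project's actual _edits implementation):

```python
def generate_deletes(word, max_edit_distance):
    """Collect every string obtained by deleting up to
    max_edit_distance characters from `word`."""
    deletes = set()

    def recurse(term, distance):
        if distance >= max_edit_distance:
            return
        for i in range(len(term)):
            shorter = term[:i] + term[i + 1:]
            if shorter not in deletes:
                deletes.add(shorter)
                recurse(shorter, distance + 1)

    recurse(word, 0)
    return deletes

# Every dictionary word spawns many deletes, which is why pre-calculating
# them for millions of words dominates the load time.
print(sorted(generate_deletes("hello", 2)))
```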
@jalajthanaki I think LMDB is suitable for your use case:
I have tried Redis and LevelDB; LMDB is more suitable to work with symspell. If your dictionary is not updated frequently, you can build the symspell delete combinations once.
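For example, a minimal sketch with the lmdb Python package (the key/value layout and file name here are only illustrative):

```python
import json
import lmdb

# Open (or create) an LMDB environment; map_size is the maximum size in bytes.
env = lmdb.open("symspell_deletes.lmdb", map_size=2 * 1024 ** 3)

def store_deletes(deletes):
    """Persist the pre-calculated delete -> [words] mapping once."""
    with env.begin(write=True) as txn:
        for delete, words in deletes.items():
            txn.put(delete.encode("utf-8"), json.dumps(words).encode("utf-8"))

def candidates(delete):
    """Fetch candidate words for one delete without loading everything into RAM."""
    with env.begin() as txn:
        raw = txn.get(delete.encode("utf-8"))
        return json.loads(raw) if raw is not None else []
```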
@dongyuwei I am planning to update my dictionary every 3 months, so I totally agree with your suggestion. I will try LMDB and Redis.
@mammothb I have tried saving the delete combinations as a pickle, and loading the pickle significantly improves the loading time for me. Are there any guidelines for contributors? Soon I will be adding the function(s) which can use an in-memory database.
This is amazing, @jalajthanaki! Thank you so much for helping. This is a great contribution to symspellpy.
@mammothb I see that you merged the code in the latest version. Are you going to update the readme file and instructions accordingly?
I will be doing it by tomorrow EOD. Meanwhile, refer to the following code.
Step 1: The following code is for when we are generating the symspell delete combinations for the first time:
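Something along these lines (a sketch; the constructor arguments, file names, and the load_dictionary/save_pickle signatures may differ between symspellpy versions):

```python
from symspellpy.symspellpy import SymSpell

max_edit_distance_dictionary = 2
prefix_length = 7
sym_spell = SymSpell(83000, max_edit_distance_dictionary, prefix_length)

# Build the dictionary and the pre-calculated delete combinations once.
dictionary_path = "frequency_dictionary_en_82_765.txt"
term_index = 0   # column of the term in the dictionary file
count_index = 1  # column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index, count_index)

# Persist the pre-calculated deletes so they never have to be rebuilt.
sym_spell.save_pickle("symspell_82k.pickle")
```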
Step 2: This is the code if we want to load the saved pickle:
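Again as a sketch under the same assumptions:

```python
from symspellpy.symspellpy import SymSpell, Verbosity

# Recreate the object and restore the pre-calculated deletes from disk.
sym_spell = SymSpell(83000, 2, 7)
if not sym_spell.load_pickle("symspell_82k.pickle"):
    raise RuntimeError("could not load the pickled dictionary")

# Lookups work as usual once the pickle is loaded.
for suggestion in sym_spell.lookup("helo", Verbosity.CLOSEST, 2):
    print(suggestion.term, suggestion.distance, suggestion.count)
```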
Thanks a million @jalajthanaki! Appreciate it.
@jalajthanaki JSON is faster than pickle according to https://konstantin.blog/2010/pickle-vs-json-which-is-faster/. I did not test/verify it, but maybe it is helpful for this project.
Or, as @wolfgarbe suggested, use flatbuffers, since it can be read without a full parsing/deserialization step.
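A quick way to check the pickle-vs-json claim on your own machine (nothing symspellpy-specific here; `deletes` just stands in for the real delete-combination mapping):

```python
import json
import pickle
import time

# Stand-in data: delete string -> list of candidate words.
deletes = {"helo": ["hello", "help"], "wrld": ["world"]}

start = time.perf_counter()
with open("deletes.pickle", "wb") as f:
    pickle.dump(deletes, f, protocol=pickle.HIGHEST_PROTOCOL)
print("pickle dump:", time.perf_counter() - start)

start = time.perf_counter()
with open("deletes.json", "w") as f:
    json.dump(deletes, f)
print("json dump:", time.perf_counter() - start)
```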
I just did some tests with the 82k word file:
I didn't try flatbuffers as @dongyuwei suggested because it seemed more complex to integrate than just replacing pickle with json.
@kakaroto I tried implementing SuggestionStage (not pushed), but it turned out to take longer than the current code. I assume it is because preallocating memory doesn't work with Python.
I think there are ways to preallocate memory in Python; it depends on how you're going to use it, though, and it might not work. If you want to share that code in a separate branch, I could have a look tomorrow and see if I can test/figure out a way to speed it up. By the way, I would suggest also storing the max distance in the pickled dictionary, because the deletes/words dictionary is directly related to the distance. I've been generating different dictionaries with various distances (and therefore different file sizes), but they all give the same suggestions because the SymSpell object I create always has a distance of 2, so the pickle file content is irrelevant. I'm not sure though if it's a good idea to change a user-provided setting when loading a dictionary; maybe only do that if the SymSpell object is created without specifying a distance, but keep the distance fixed if that argument is passed to the constructor.
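A sketch of what storing the distance next to the data could look like (the payload layout and names are illustrative, not the actual symspellpy format):

```python
import gzip
import pickle

def save_with_settings(deletes, words, max_edit_distance, path):
    # Keep the distance the deletes were built with next to the data so a
    # mismatched SymSpell configuration can be detected on load.
    payload = {
        "deletes": deletes,
        "words": words,
        "max_edit_distance": max_edit_distance,
    }
    with gzip.open(path, "wb") as f:
        pickle.dump(payload, f)

def load_with_settings(path, requested_distance=None):
    with gzip.open(path, "rb") as f:
        payload = pickle.load(f)
    if requested_distance is not None and requested_distance != payload["max_edit_distance"]:
        raise ValueError("pickle was built with a different max edit distance")
    return payload
```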
@kakaroto I have created a new branch with an initial implementation of SuggestionStage.
Ah, thanks @mammothb! I see the problem, and there's definitely some improvement that can be made here, but SuggestionStage is not the way to go. Looking at what the C# code does, I see this on every delete insertion:
That's indeed very time-consuming: create a string array, copy the old array into the new one, then add the new entry to it. So the suggestion stage makes this go much faster by doing this:
Which is basically: create a linked list and append to it instead, then convert to a string array only at the end. Now in the Python implementation you're not creating arrays; you're already using a list and calling append on it, so the staging would just replace that append with a staging.add() call.
So you are already looping through the same number of edits, calling the same get_str_hash on every one, and whatever staging.add() does, it is not going to beat the already very fast built-in list.append. So there are a couple of things I noticed. First, the C# implementation uses a <int, string[]> dictionary, and since they use an int, that's why they need to hash the delete string, hence the get_str_hash function.
So I made this simple change: remove the calls to get_str_hash and use the delete string itself as the dictionary key:
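Not the actual diff, but the insertion side of the idea looks roughly like this (names are illustrative):

```python
def add_delete(deletes, delete, original_word):
    # Previously the key was an int produced by a custom get_str_hash(delete);
    # Python already hashes str keys natively in C, so keying by the string
    # itself avoids computing that hash in pure Python for every delete.
    if delete in deletes:
        deletes[delete].append(original_word)
    else:
        deletes[delete] = [original_word]
```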
and in lookup, do the same:
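Again as a sketch of the lookup side:

```python
def deletes_for(deletes, candidate):
    # Probe the deletes dict with the candidate string directly instead of
    # get_str_hash(candidate).
    return deletes.get(candidate, [])
```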
And then I tested the speeds again (average of 10 runs):
So this simple change gives us about a 40% speed improvement. I tried a lookup for "helo" and I got the same results and no crash, so yay! But I haven't done any extensive testing on this, and I'm sure there are many other ways to make it more efficient. (Note: this would break the pickle. It would be good to add a 'version' to the pickle to make sure you only load from the same version of the internal data.) This is the diff I have for now (I didn't clone/push because I haven't tested it extensively, but once I look at the rest of the code and see how it can be improved further, I'll make a proper patch and send a PR; I had a quick look at the _edits function and it's going to be hard to optimize that):
By the way, I'm happy to see the ~40% speed improvement from removing get_str_hash. A quick little update as well (using frequency_dictionary_en_82_765.txt and the optimized code):
So it looks like the biggest time consumption in loading/saving the pickle is the compression, which sounds obvious now that I think about it. I also tested JSON, and it takes longer than pickle even in its non-gzipped version (and produces larger files). Note that the optimized code makes loading/saving the pickle faster too (9.5 vs 14.5 s to save, 1.2 vs 1.5 s to load), probably because the length of the hash was generally longer than the length of the average word. For reference, here's another test done with aspell's en_US dictionary, which contains 120,400 words (more words but a smaller file size and faster to load/save):
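A minimal way to confirm that the compression, not the pickling itself, dominates (again with stand-in data):

```python
import gzip
import pickle
import time

def timed_dump(obj, path, opener):
    start = time.perf_counter()
    with opener(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return time.perf_counter() - start

deletes = {"helo": ["hello", "help"], "wrld": ["world"]}  # stand-in for the real data
print("gzipped pickle:", timed_dump(deletes, "deletes.pickle.gz", gzip.open))
print("plain pickle  :", timed_dump(deletes, "deletes.pickle", open))
```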
And for some more timing tests in the dictionary-loading code:
The time to load drops from 3.0 s to 1.8 s, and if I change it to:
It drops to 0.13 s, which is again more or less consistent with the profiling information of about 20% each in the call to self._edits and in the dictionary insertion. So we currently have more than 50% of the new time spent on those two lines:
There isn't a lot we can do to improve those, I think; there's a better chance of improving things elsewhere. By the way, if someone has another dictionary to provide (@jalajthanaki, you have one with 4M words?), I can test with that as well.
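If anyone wants to reproduce the profiling numbers, something like this is enough (standard-library cProfile; the constructor arguments are the same assumptions as above):

```python
import cProfile
import pstats

from symspellpy.symspellpy import SymSpell

sym_spell = SymSpell(83000, 2, 7)

# Profile dictionary loading to see where the time goes (e.g. _edits).
cProfile.run(
    'sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1)',
    "load_profile.out",
)
pstats.Stats("load_profile.out").sort_stats("cumulative").print_stats(15)
```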
@kakaroto any updates on the patch? I could implement the suggested improvement if you're busy with other things.
@mammothb sorry, I've been busy. I've spent some time trying to optimize the rest of it, but unfortunately without much success.
@kakaroto I think it's fine to just reject the old version and have users manually create a new version themselves.
As mentioned in a comment on issue mammothb#31 [1], creating a custom hash is too slow compared to Python's native string hashing. Instead of having a dictionary of ints, we just keep a dictionary of strings; this has no impact on functionality but improves the overall performance of loading a dictionary by about 40%. This also adds a 'data_version' key to the pickle dictionary, which starts at 2, and load_pickle returns False if the data version is incompatible. [1] mammothb#31 (comment)
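The version guard described above amounts to something like this (a sketch; the stored layout is an assumption):

```python
import pickle

DATA_VERSION = 2  # bumped whenever the layout of the pickled internals changes

def is_compatible_pickle(path):
    # Refuse data written with an older, incompatible layout.
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data.get("data_version") == DATA_VERSION
```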
I've sent the pull request; let me know if there's anything more I can do! Thanks.
Hi Symspellpy team,
Amazing tool! Your team did a fabulous job.
I have the following queries.
(Screenshot: load time for the 82,765-word dictionary)
(Screenshot: load time for the 4M-word dictionary)
Any help or suggestion would be appreciated; even a detailed explanation of each method would help me. I look forward to contributing to the symspell code if needed.