-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for saving and loading non ASCII chars in corpus and vocab #86
Added support for saving and loading non ASCII chars in corpus and vocab #86
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment, also could you add the new test case from the issue?
Thanks for the 2nd commit, unfortunately seems like there's some issue causing the tests to fail: https://github.com/xhluca/bm25s/actions/runs/11994683583/job/33448614792#step:6:1 Perhaps running the tests locally could help diagnose the issue? |
@xhluca Sorry, I didn't quite know how to test all the test files at once, so only checked the core/test_save_load.py. Now, I've run both core and comparison tests using |
Can you elaborate on the changes to the test file: tests/core/test_tokenizer_misc.py ? |
In my local, I saw that test_tokenizer_misc.py was failing as the ids were not matching the results. In my case, it was coming up to be [2,0,3]. Then I checked with the corpus and query:
Thought that the id 2 |
It seems the tests are failing due to the changes to the test. Does github allow you to see the results? btw can you add the test file from here: #79 (comment) |
Looks good! |
Hi @xhluca, updated this hotfix/save-load-extended-character-corpora . Missed to add ensure-ascii=False while saving corpus. It is causing |
Please create a new one |
Made the changes as per issue #79