Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for saving and loading non ASCII chars in corpus and vocab #86

Merged

Conversation

IssacXid
Copy link
Contributor

Made the changes as per issue #79

Copy link
Owner

@xhluca xhluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment, also could you add the new test case from the issue?

bm25s/__init__.py Outdated Show resolved Hide resolved
@xhluca
Copy link
Owner

xhluca commented Nov 24, 2024

Thanks for the 2nd commit, unfortunately seems like there's some issue causing the tests to fail: https://github.com/xhluca/bm25s/actions/runs/11994683583/job/33448614792#step:6:1

Perhaps running the tests locally could help diagnose the issue?

@IssacXid
Copy link
Contributor Author

IssacXid commented Nov 25, 2024

@xhluca Sorry, I didn't quite know how to test all the test files at once, so only checked the core/test_save_load.py. Now, I've run both core and comparison tests using python -m unittest tests/(core or comparison)/test_*.py cmd, it ran without error. Please review. Thanks

@xhluca
Copy link
Owner

xhluca commented Nov 25, 2024

Can you elaborate on the changes to the test file: tests/core/test_tokenizer_misc.py ?

@IssacXid
Copy link
Contributor Author

Can you elaborate on the changes to the test file: tests/core/test_tokenizer_misc.py ?

In my local, I saw that test_tokenizer_misc.py was failing as the ids were not matching the results. In my case, it was coming up to be [2,0,3]. Then I checked with the corpus and query:

corpus = [
            "a cat is a feline and likes to purr",
            "a dog is the human's best friend and loves to play",
            "a bird is a beautiful animal that can fly",
            "a fish is a creature that lives in water and swims",
        ]
query = "What is a fly?"

Thought that the id 2 "a bird is a beautiful animal that can fly" should be at least ranked first, which was also the result coming in my case. So, I edited the assertion.

@xhluca
Copy link
Owner

xhluca commented Nov 25, 2024

It seems the tests are failing due to the changes to the test. Does github allow you to see the results?

btw can you add the test file from here: #79 (comment)

@IssacXid
Copy link
Contributor Author

IssacXid commented Nov 26, 2024

@xhluca Yes, github is allowing me to see the test results. I've changed back the assertion in test_tokenizer_misc.py. I also added the test as #79, which is passing in my local. In case, this doesn't pass in the github actions, let's please connect over a webcall.

@xhluca
Copy link
Owner

xhluca commented Nov 26, 2024

Looks good!

@xhluca xhluca merged commit db53725 into xhluca:main Nov 26, 2024
2 checks passed
@IssacXid
Copy link
Contributor Author

IssacXid commented Dec 3, 2024

Hi @xhluca, updated this hotfix/save-load-extended-character-corpora . Missed to add ensure-ascii=False while saving corpus. It is causing invalid escape sequence failure while retrieving along with corpus. Should I create another PR or can you re-open this one?

@xhluca
Copy link
Owner

xhluca commented Dec 3, 2024

Please create a new one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants