Added support for saving and loading non ASCII chars in corpus and vocab #86

IssacXid · 2024-11-22T12:22:50Z

Made the changes as per issue #79

xhluca

See comment, also could you add the new test case from the issue?

bm25s/__init__.py

xhluca · 2024-11-24T22:15:20Z

Thanks for the 2nd commit, unfortunately seems like there's some issue causing the tests to fail: https://github.com/xhluca/bm25s/actions/runs/11994683583/job/33448614792#step:6:1

Perhaps running the tests locally could help diagnose the issue?

IssacXid · 2024-11-25T06:18:43Z

@xhluca Sorry, I didn't know how to run all the test files at once, so I only checked core/test_save_load.py. I've now run both the core and comparison tests with python -m unittest tests/(core or comparison)/test_*.py, and they ran without errors. Please review. Thanks

xhluca · 2024-11-25T06:46:32Z

Can you elaborate on the changes to the test file: tests/core/test_tokenizer_misc.py ?

IssacXid · 2024-11-25T07:02:51Z

> Can you elaborate on the changes to the test file: tests/core/test_tokenizer_misc.py ?

Locally, test_tokenizer_misc.py was failing because the ids did not match the expected results; I was getting [2,0,3]. I then checked the corpus and query:

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
query = "What is a fly?"

I thought that id 2, "a bird is a beautiful animal that can fly", should at least be ranked first, which is also the result I got locally. So I edited the assertion.
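A rough sanity check of that reasoning, as a sketch only: it is plain term matching with a hypothetical tokenizer and stopword set, not the library's tokenizer or its BM25 scoring. It shows only that document 2 is the sole document sharing a content word with the query.

```python
import re

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
query = "What is a fly?"

# Assumption: a tiny stopword set chosen for this example only.
STOPWORDS = {"what", "is", "a"}


def simple_tokens(text):
    """Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Content words left in the query after dropping stopwords.
query_terms = simple_tokens(query) - STOPWORDS

# Indices of documents that share at least one content word with the query.
matches = [i for i, doc in enumerate(corpus) if query_terms & simple_tokens(doc)]
print(query_terms, matches)
```

This says nothing about how the remaining ids should be ordered, which is what the test assertion in question depends on.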

xhluca · 2024-11-25T16:50:25Z

It seems the tests are failing due to the changes to the test. Does github allow you to see the results?

btw can you add the test file from here: #79 (comment)

IssacXid · 2024-11-26T04:39:35Z

@xhluca Yes, GitHub lets me see the test results. I've reverted the assertion in test_tokenizer_misc.py. I also added the test from #79, which passes locally. If this doesn't pass in GitHub Actions, could we please connect over a web call?

xhluca · 2024-11-26T06:31:48Z

Looks good!

IssacXid · 2024-12-03T07:53:47Z

Hi @xhluca, I've updated this branch, hotfix/save-load-extended-character-corpora. I missed adding ensure_ascii=False when saving the corpus, which causes an invalid escape sequence failure when retrieving together with the corpus. Should I create another PR, or can you re-open this one?
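For context, a minimal sketch of what ensure_ascii changes in Python's standard json module. The record layout and file name are assumptions made for illustration, not the library's actual save format or code.

```python
import json
import os
import tempfile

# Hypothetical corpus record containing non-ASCII characters.
record = {"id": 0, "text": "café 猫 naïve"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

# When writing unescaped text, open the file with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")  # assumed file name
with open(path, "w", encoding="utf-8") as f:
    f.write(readable + "\n")

with open(path, "r", encoding="utf-8") as f:
    loaded = json.loads(f.readline())

print(escaped)
print(readable)
```

Both forms decode to the same object with json.loads, so the difference only matters to code that reads the saved bytes or text directly instead of parsing them as JSON.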

xhluca · 2024-12-03T13:52:53Z

Please create a new one

Added support for saving and loading non ASCII chars in corpus and vocab

db21387

xhluca requested changes Nov 22, 2024

Added kwargs in json_functions for ensure_ascii parameter

fb20baf

Added ensure_ascii = False in tokenization

0759d3f

IssacXid added 2 commits November 26, 2024 10:05

Added the test in xhluca#79

1191d8d

Remodified the assertion as per github action

3bf3e07

xhluca merged commit db53725 into xhluca:main Nov 26, 2024
2 checks passed

IssacXid mentioned this pull request Dec 22, 2024

Extending to Non-ASCII characters with corpora loading and saving #93

Merged
