
Add stopwords for 10 new languages #33

Merged
merged 7 commits into from
Aug 17, 2024
Conversation

Contributor

@bm777 bm777 commented Jul 21, 2024

Add multi-language stopword support

This pull request addresses issue #32 by implementing support for stopwords in multiple languages.

Changes made:

  • Included stopword lists for the following languages:
    • English
    • German
    • Dutch
    • French
    • Spanish
    • Portuguese
    • Italian
    • Russian
    • Swedish
    • Norwegian
    • Chinese
  • Updated tokenization.py, in particular the _infer_stopwords function, to handle the new languages.

Implementation details:

  • Stopwords are now loaded based on the specified language, e.g.:
    # bm25 definition here
    corpus = [
         "Eine Katze ist eine Katze und schnurrt gerne",
         "Ein Hund ist der beste Freund des Menschen und liebt es zu spielen",
         "Ein Vogel ist ein wunderschönes Tier, das fliegen kann",
         "Ein Fisch ist ein Lebewesen, das im Wasser lebt und schwimmt",
    ]
    
    tids = bm25s.tokenize(corpus, stopwords="de")
  • Users can still easily add custom stopword lists for additional languages
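To make the language-dispatch idea above concrete, here is a minimal sketch of how stopword inference can work. This is not the actual bm25s implementation; the function name, the mapping, and the word lists are all illustrative placeholders (the real lists come from NLTK):

```python
# Illustrative sketch of language-keyed stopword loading (not bm25s code):
# a short code or full language name resolves to the same tuple, and an
# explicit list passes through unchanged as a custom stopword list.

STOPWORDS_EN = ("a", "an", "the", "and", "is", "to")                    # illustrative subset
STOPWORDS_DE = ("eine", "ein", "und", "ist", "der", "des", "im", "zu")  # illustrative subset

_LANG_MAP = {
    "en": STOPWORDS_EN, "english": STOPWORDS_EN,
    "de": STOPWORDS_DE, "german": STOPWORDS_DE,
}

def infer_stopwords(stopwords):
    """Resolve a language code/name to a stopword tuple, or pass a list through."""
    if isinstance(stopwords, str):
        return _LANG_MAP[stopwords.lower()]
    return list(stopwords)  # user-supplied custom list

print(infer_stopwords("de")[:2])  # → ('eine', 'ein')
```

With this shape, adding a language is just one tuple plus two dictionary entries, and custom lists keep working without any special casing.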

Testing:

  • A baseline needs to be defined.

If any change made here needs to be modified, feel free to comment below.

Closes #32

@yewentao256 yewentao256 left a comment

Looks good to me, and we really need the Chinese stopwords. Thanks!

Owner

xhluca commented Jul 26, 2024

Thanks! If the tests pass I will merge.

Contributor Author

bm777 commented Jul 26, 2024

Is it a stopwords issue?

This is the new STOPWORDS.

Owner

xhluca commented Jul 27, 2024

This is the error:

Finding newlines for mmindex:   0%|          | 0.00/8.11M [00:00<?, ?B/s]
Finding newlines for mmindex: 100%|██████████| 8.11M/8.11M [00:00<00:00, 268MB/s]

  0%|          | 0/5183 [00:00<?, ?it/s]
100%|██████████| 5183/5183 [00:00<00:00, 180124.76it/s]
.
======================================================================
FAIL: test_retrieve (tests.quick.test_retrieve.TestBM25SLoadingSaving)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/bm25s/bm25s/tests/quick/test_retrieve.py", line 70, in test_retrieve
    self.assertTrue(np.array_equal(ground_truth, results), f"Expected {ground_truth}, got {results}")
AssertionError: False is not true : Expected [[0]
 [0]], got [[2]
 [2]]

----------------------------------------------------------------------
Ran 33 tests in 71.591s

FAILED (failures=1, skipped=2)

I am away, so I can't really debug this in the next few days; I can look into it when I'm back. In the meantime, feel free to run the tests locally to see what is breaking (GitHub Actions doesn't seem to show everything).

Contributor Author

bm777 commented Jul 30, 2024

I'm debugging it...

Contributor

aflip commented Aug 4, 2024

Is it better to add the stopwords to the library or to supply them locally? I would like to add Hindi and a few other Indian languages.

Contributor Author

bm777 commented Aug 4, 2024

@aflip Yes, you can pass a custom stopword list directly:

tids = bm25s.tokenize(corpus, stopwords=["your", "stopwords", "here"])

Owner

xhluca commented Aug 10, 2024

You can also customize the regex template and use a custom stemmer, which makes it flexible for other languages.
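The two hooks mentioned above, a custom token pattern and a pluggable stemmer, can be sketched without the library. This is a self-contained toy, not bm25s's implementation: the `tokenize` helper and the suffix stripper below are illustrative stand-ins (a real setup would use bm25s with a stemmer such as PyStemmer's Snowball stemmers):

```python
import re

# Toy sketch of a tokenizer with a configurable token pattern, a stopword
# filter, and a pluggable stemmer callable. Not bm25s code; illustrative only.

def tokenize(text, token_pattern=r"(?u)\b\w\w+\b", stemmer=None, stopwords=()):
    tokens = re.findall(token_pattern, text.lower())
    tokens = [t for t in tokens if t not in stopwords]
    if stemmer is not None:
        tokens = [stemmer(t) for t in tokens]
    return tokens

# Toy German suffix stripper standing in for a real stemmer.
def toy_german_stemmer(word):
    for suffix in ("en", "er", "es", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(tokenize("Eine Katze ist eine Katze und schnurrt gerne",
               stemmer=toy_german_stemmer,
               stopwords=("eine", "ist", "und")))
# → ['katz', 'katz', 'schnurrt', 'gern']
```

Because both the pattern and the stemmer are parameters, supporting a new language is a matter of choosing a suitable regex and stemmer rather than changing the tokenizer itself.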

Owner

xhluca commented Aug 14, 2024

Any update on the test failure here?

Owner

xhluca commented Aug 17, 2024

Can you change STOPWORDS_EN to STOPWORDS_EN_PLUS? This will ensure that it is backward compatible. The tests should pass after that.

Contributor Author

bm777 commented Aug 17, 2024

@xhluca Sorry for being absent. I was working on a side project that required attention.
If you believe it will pass, then I will do it now.

Owner

xhluca commented Aug 17, 2024

It seems 522fbdc removed STOPWORDS_EN. We still need STOPWORDS_EN to be like the original (pre-PR), whereas STOPWORDS_EN_PLUS is what one can use if they want the enhanced stopwords you have added.
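The backward-compatibility request above can be sketched like this; the word tuples are illustrative placeholders, not the real NLTK-derived lists:

```python
# Sketch of the requested fix: keep STOPWORDS_EN exactly as it was before
# the PR, and expose the enhanced set under a new, opt-in name so existing
# code that imports STOPWORDS_EN keeps its old behavior.

STOPWORDS_EN = ("a", "an", "the", "and", "is", "will", "with")  # pre-PR list (illustrative)

_EXTRA_EN = ("about", "above", "after")  # new additions (illustrative)

# Enhanced list: a strict superset of the original, used only when
# callers explicitly ask for it.
STOPWORDS_EN_PLUS = STOPWORDS_EN + _EXTRA_EN
```

Keeping the old constant untouched means the default tokenization output, and therefore the existing retrieval tests, are unchanged, while the enhanced list remains available by name.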

Owner

xhluca commented Aug 17, 2024

Btw, the new main has decoupled the core tests from the comparison tests. Feel free to add test_stopwords.py to the core tests with a simple test; here's the template:

Owner

xhluca commented Aug 17, 2024

import unittest
import numpy as np

import bm25s


class TestAddNameHere(unittest.TestCase):
    def setUp(self):
        # Create your corpus here
        self.corpus = [
            "a cat is a feline and likes to purr",
            "a dog is the human's best friend and loves to play",
            "a bird is a beautiful animal that can fly",
            "a fish is a creature that lives in water and swims",
        ]

    def test_add_here(self):
        corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en")
        # continue here

if __name__ == '__main__':
    unittest.main()

Owner

@xhluca xhluca left a comment

Since it's from nltk, we should add reference to the original stopwords

@@ -33,3 +33,2774 @@
"will",
"with",
)

Owner

It would be great to add a notice that the stopwords were taken from NLTK, and link to the exact file/page where they were retrieved!

Suggested change
# The stopwords below are retrieved from NLTK: ...

Contributor Author

done

@@ -35,13 +48,35 @@ def convert_tokenized_to_string_list(tokenized: Tokenized) -> List[List[str]]:


def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
Owner

It would be great to add a notice that the stopwords were taken from NLTK, and link to the exact file/page where they were retrieved!

Suggested change
def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
"""
Source of stopwords: ...
"""

Contributor Author

Makes sense.

Owner

xhluca commented Aug 17, 2024

Thank you for taking this to the finish line! Merging this now.

@xhluca xhluca changed the title Add stopwords Add stopwords for 10 new languages Aug 17, 2024
@xhluca xhluca merged commit 2ca03b5 into xhluca:main Aug 17, 2024
2 checks passed
Successfully merging this pull request may close these issues.

Other language than english for the stopwords list