
Add stopwords for 10 new languages #33

Merged
merged 7 commits into from
Aug 17, 2024
Conversation

Contributor

@bm777 bm777 commented Jul 21, 2024

Add multi-language stopword support

This pull request addresses issue #32 by implementing support for stopwords in multiple languages.

Changes made:

  • Included stopword lists for the following languages:
    • English
    • German
    • Dutch
    • French
    • Spanish
    • Portuguese
    • Italian
    • Russian
    • Swedish
    • Norwegian
    • Chinese
  • Updated tokenization.py, in particular the _infer_stopwords function, to handle the new languages.

Implementation details:

  • Stopwords are now loaded based on the specified language, e.g.:
    # bm25 definition here
    corpus = [
         "Eine Katze ist eine Katze und schnurrt gerne",
         "Ein Hund ist der beste Freund des Menschen und liebt es zu spielen",
         "Ein Vogel ist ein wunderschönes Tier, das fliegen kann",
         "Ein Fisch ist ein Lebewesen, das im Wasser lebt und schwimmt",
    ]
    
    tids = bm25s.tokenize(corpus, stopwords="de")
  • Users can still easily add custom stopword lists for additional languages
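To make the language-dispatch idea above concrete, here is a minimal sketch of how stopword inference can work. This is not the actual bm25s implementation; the function name, the mapping, and the word lists are all illustrative placeholders (the real lists come from NLTK):

```python
# Illustrative sketch of language-keyed stopword loading (not bm25s code):
# a short code or full language name resolves to the same tuple, and an
# explicit list passes through unchanged as a custom stopword list.

STOPWORDS_EN = ("a", "an", "the", "and", "is", "to")                    # illustrative subset
STOPWORDS_DE = ("eine", "ein", "und", "ist", "der", "des", "im", "zu")  # illustrative subset

_LANG_MAP = {
    "en": STOPWORDS_EN, "english": STOPWORDS_EN,
    "de": STOPWORDS_DE, "german": STOPWORDS_DE,
}

def infer_stopwords(stopwords):
    """Resolve a language code/name to a stopword tuple, or pass a list through."""
    if isinstance(stopwords, str):
        return _LANG_MAP[stopwords.lower()]
    return list(stopwords)  # user-supplied custom list

print(infer_stopwords("de")[:2])  # → ('eine', 'ein')
```

With this shape, adding a language is just one tuple plus two dictionary entries, and custom lists keep working without any special casing.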

Testing:

  • A baseline needs to be defined.

If any change made here needs to be modified, feel free to comment below.

Closes #32

@yewentao256 yewentao256 left a comment

Looks good to me, and we really need the Chinese stopwords. Thanks!

Owner

xhluca commented Jul 26, 2024

Thanks! If the tests pass I will merge.

Contributor Author

bm777 commented Jul 26, 2024

Is it a stopwords issue?

This is the new STOPWORDS.

Owner

xhluca commented Jul 27, 2024

This is the error:

Finding newlines for mmindex:   0%|          | 0.00/8.11M [00:00<?, ?B/s]
Finding newlines for mmindex: 100%|██████████| 8.11M/8.11M [00:00<00:00, 268MB/s]

  0%|          | 0/5183 [00:00<?, ?it/s]
100%|██████████| 5183/5183 [00:00<00:00, 180124.76it/s]
.
======================================================================
FAIL: test_retrieve (tests.quick.test_retrieve.TestBM25SLoadingSaving)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/bm25s/bm25s/tests/quick/test_retrieve.py", line 70, in test_retrieve
    self.assertTrue(np.array_equal(ground_truth, results), f"Expected {ground_truth}, got {results}")
AssertionError: False is not true : Expected [[0]
 [0]], got [[2]
 [2]]

----------------------------------------------------------------------
Ran 33 tests in 71.591s

FAILED (failures=1, skipped=2)

I am away, so I can't really debug this in the next few days; I can look into it when I'm back. In the meantime, feel free to run the tests locally to see what is breaking (GitHub Actions doesn't seem to show everything).

Contributor Author

bm777 commented Jul 30, 2024

I'm debugging it...

Contributor

aflip commented Aug 4, 2024

Is it better to add the stopwords to the library or to supply them locally? I would like to add Hindi and a few other Indian languages.

Contributor Author

bm777 commented Aug 4, 2024

@aflip Yes, you can pass a custom stopword list directly:

tids = bm25s.tokenize(corpus, stopwords=["your", "stopwords", "here"])

Owner

xhluca commented Aug 10, 2024

You can also customize the regex template and use a custom stemmer, which makes it flexible for other languages.
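The two hooks mentioned above, a custom token pattern and a pluggable stemmer, can be sketched without the library. This is a self-contained toy, not bm25s's implementation: the `tokenize` helper and the suffix stripper below are illustrative stand-ins (a real setup would use bm25s with a stemmer such as PyStemmer's Snowball stemmers):

```python
import re

# Toy sketch of a tokenizer with a configurable token pattern, a stopword
# filter, and a pluggable stemmer callable. Not bm25s code; illustrative only.

def tokenize(text, token_pattern=r"(?u)\b\w\w+\b", stemmer=None, stopwords=()):
    tokens = re.findall(token_pattern, text.lower())
    tokens = [t for t in tokens if t not in stopwords]
    if stemmer is not None:
        tokens = [stemmer(t) for t in tokens]
    return tokens

# Toy German suffix stripper standing in for a real stemmer.
def toy_german_stemmer(word):
    for suffix in ("en", "er", "es", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(tokenize("Eine Katze ist eine Katze und schnurrt gerne",
               stemmer=toy_german_stemmer,
               stopwords=("eine", "ist", "und")))
# → ['katz', 'katz', 'schnurrt', 'gern']
```

Because both the pattern and the stemmer are parameters, supporting a new language is a matter of choosing a suitable regex and stemmer rather than changing the tokenizer itself.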

Owner

xhluca commented Aug 14, 2024

Any update on the test failure here?

Owner

xhluca commented Aug 17, 2024

Can you change STOPWORDS_EN to STOPWORDS_EN_PLUS? This will ensure that it is backward compatible. The tests should pass after that.

Contributor Author

bm777 commented Aug 17, 2024

@xhluca Sorry for being absent. I was working on a side project that required attention.
If you believe it will pass, then I will do it now.

Owner

xhluca commented Aug 17, 2024

It seems 522fbdc removed STOPWORDS_EN. We still need STOPWORDS_EN to be like the original (pre-PR), whereas STOPWORDS_EN_PLUS is what one can use if they want the enhanced stopwords you have added.
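The backward-compatibility request above can be sketched like this; the word tuples are illustrative placeholders, not the real NLTK-derived lists:

```python
# Sketch of the requested fix: keep STOPWORDS_EN exactly as it was before
# the PR, and expose the enhanced set under a new, opt-in name so existing
# code that imports STOPWORDS_EN keeps its old behavior.

STOPWORDS_EN = ("a", "an", "the", "and", "is", "will", "with")  # pre-PR list (illustrative)

_EXTRA_EN = ("about", "above", "after")  # new additions (illustrative)

# Enhanced list: a strict superset of the original, used only when
# callers explicitly ask for it.
STOPWORDS_EN_PLUS = STOPWORDS_EN + _EXTRA_EN
```

Keeping the old constant untouched means the default tokenization output, and therefore the existing retrieval tests, are unchanged, while the enhanced list remains available by name.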

Owner

xhluca commented Aug 17, 2024

Btw, the new main has decoupled the core tests from the comparison tests. Feel free to add test_stopwords.py to the core tests with a simple test; here's the template:

Owner

xhluca commented Aug 17, 2024

import unittest
import numpy as np

import bm25s


class TestAddNameHere(unittest.TestCase):
    def setUp(self):
        # Create your corpus here
        self.corpus = [
            "a cat is a feline and likes to purr",
            "a dog is the human's best friend and loves to play",
            "a bird is a beautiful animal that can fly",
            "a fish is a creature that lives in water and swims",
        ]

    def test_add_here(self):
        corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en")
        # continue here

if __name__ == '__main__':
    unittest.main()

Owner

@xhluca xhluca left a comment

Since it's from nltk, we should add reference to the original stopwords

@@ -33,3 +33,2774 @@
"will",
"with",
)

Owner

It would be great to add a notice that the stopwords were taken from NLTK, and link to the exact file/page where they were retrieved!

Suggested change
# The stopwords below are retrieved from NLTK: ...

Contributor Author

done

@@ -35,13 +48,35 @@ def convert_tokenized_to_string_list(tokenized: Tokenized) -> List[List[str]]:


def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
Owner

It would be great to add a notice that the stopwords were taken from NLTK, and link to the exact file/page where they were retrieved!

Suggested change
def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
"""
Source of stopwords: ...
"""

Contributor Author

Makes sense.

Owner

xhluca commented Aug 17, 2024

Thank you for taking this to the finish line! Merging this now.

@xhluca xhluca changed the title Add stopwords Add stopwords for 10 new languages Aug 17, 2024
@xhluca xhluca merged commit 2ca03b5 into xhluca:main Aug 17, 2024
2 checks passed
Successfully merging this pull request may close these issues.

Other language than english for the stopwords list