Add stopwords for 10 new languages #33
Conversation
Looks good to me, and we really need the Chinese stopwords, thanks!
Thanks! If the tests pass I will merge.
Is it a stopwords issue? This is the new STOPWORDS.
This is the error:
I am away, so I can't really debug this in the next few days. I can look into this when I'm back. In the meantime, feel free to run the tests locally to see what is breaking (GitHub Actions doesn't seem to show everything).
I'm debugging it...
Is it better to add the stopwords to the library or add them locally? I would like to add Hindi and a few Indian languages.
@aflip Yes, you can: tids = bm25s.tokenize(corpus, stopwords=["your list of stop words here"])
You can also customize the regex template and use a custom stemmer, which makes it flexible for other languages.
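A minimal sketch of that pattern, relying only on the stopwords argument shown above; the Hindi sentences and stopword list below are made-up placeholders for illustration, not lists shipped with the library:

import bm25s

# Hypothetical, locally maintained Hindi stopword list (illustration only)
hindi_stopwords = ["एक", "है", "का", "के", "और"]

corpus = [
    "बिल्ली एक पालतू जानवर है",
    "कुत्ता इंसान का सबसे अच्छा दोस्त है",
]

# Passing an explicit list bypasses the built-in language lists
corpus_tokens = bm25s.tokenize(corpus, stopwords=hindi_stopwords)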
Any update on the test failures here?
Can you change
@xhluca Sorry for being absent. I was working on a side project that required attention. |
It seems 522fbdc removed |
Btw, the new |
import unittest

import numpy as np

import bm25s


class TestAddNameHere(unittest.TestCase):
    def setUp(self):
        # Create your corpus here
        self.corpus = [
            "a cat is a feline and likes to purr",
            "a dog is the human's best friend and loves to play",
            "a bird is a beautiful animal that can fly",
            "a fish is a creature that lives in water and swims",
        ]

    def test_add_here(self):
        corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en")
        # continue here


if __name__ == '__main__':
    unittest.main()
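As a hedged illustration, the test body could be filled in along these lines; the method name is a placeholder, and return_ids=False is assumed to make tokenize return plain lists of token strings (adjust if the actual API differs):

    def test_english_stopwords_removed(self):
        # Assumption: return_ids=False yields List[List[str]] instead of a Tokenized object
        corpus_tokens = bm25s.tokenize(self.corpus, stopwords="en", return_ids=False)
        for doc_tokens in corpus_tokens:
            # "the" and "is" are standard English stopwords and should be filtered out
            self.assertNotIn("the", doc_tokens)
            self.assertNotIn("is", doc_tokens)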
Since it's from NLTK, we should add a reference to the original stopwords.
@@ -33,3 +33,2774 @@
    "will",
    "with",
)
It would be great to add a notice that the stopwords were taken from NLTK, and link to the exact file/page where they were retrieved!
# The stopwords below are retrieved from NLTK: ...
done
@@ -35,13 +48,35 @@ def convert_tokenized_to_string_list(tokenized: Tokenized) -> List[List[str]]:

def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
It would be great to add a notice that the stopwords were taken from NLTK, and link to the exact file/page where they were retrieved!
def _infer_stopwords(stopwords: Union[str, List[str]]) -> List[str]:
    """
    Source of stopwords: ...
    """
Makes sense.
Thank you for taking this to the finish line! Merging this now.
Add multi-language stopword support
This pull request addresses issue #32 by implementing support for stopwords in multiple languages.
Changes made:
Implementation details:
Testing:
If any change made here needs to be modified, feel free to comment below.
Closes #32