Tokenisation exceptions for English pronoun "She" #718
Comments
Thanks – you're absolutely correct. Looks like this was a copy-paste mistake that wasn't spotted. Fixing this now! Also currently working on making the English exceptions a little easier to read – for example, ordering them properly etc.
Thanks ines. Re-ordering / grouping the exceptions would certainly help.
For the pronouns, I'm thinking about doing something like:

for pron in ["you", "You", "we", "We", "they", "They"]:
    EXC[pron + "'re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "'re", LEMMA: "be", NORM: "are"}
    ]
    EXC[pron + "re"] = [
        {ORTH: pron, LEMMA: PRON_LEMMA, TAG: "PRP"},
        {ORTH: "re", LEMMA: "be", NORM: "are"}
    ]

This obviously doesn't work for all exceptions and languages, and not everything follows such a nice pattern. But for the tokens that do, I think it could help with making sure everything's consistent. (Plus, it'd easily cut a couple of hundred lines.)
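To make the idea above concrete, here's a minimal self-contained sketch of the generation loop. It assumes plain string keys in place of spaCy's `ORTH`/`LEMMA`/`TAG`/`NORM` symbol constants (spaCy uses symbol IDs internally), so this is an illustration of the pattern, not the actual module code:

```python
# Sketch of the exception-generation pattern discussed above.
# String keys stand in for spaCy's ORTH/LEMMA/TAG/NORM symbols
# (an assumption made so the snippet runs on its own).
PRON_LEMMA = "-PRON-"  # hypothetical placeholder lemma value

EXC = {}
for pron in ["you", "You", "we", "We", "they", "They"]:
    # With apostrophe: "you're" -> ["you", "'re"]
    EXC[pron + "'re"] = [
        {"ORTH": pron, "LEMMA": PRON_LEMMA, "TAG": "PRP"},
        {"ORTH": "'re", "LEMMA": "be", "NORM": "are"},
    ]
    # Without apostrophe: "youre" -> ["you", "re"]
    EXC[pron + "re"] = [
        {"ORTH": pron, "LEMMA": PRON_LEMMA, "TAG": "PRP"},
        {"ORTH": "re", "LEMMA": "be", "NORM": "are"},
    ]

# Six pronouns x two spellings = twelve generated exceptions.
```

Since the loop runs once at module load time, the generated table behaves exactly like hand-written literal entries afterwards.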
I know that @honnibal doesn't want to overuse regexps in spaCy, but I think that in this case they would not just be more efficient but easier to maintain as well. If we can utilize the return value of …
@oroszgy Hm, so I agree regexes would be more elegant, but even with the new …
Add logic to generate exceptions that follow a consistent pattern (like verbs and pronouns) and allow certain tokens to be excluded explicitly.
@ines I meant that we could simply change the semantics of the … instead of …
I really think it's better to express the exceptions as literal strings. We're hashing these, so it's a (super fast) O(1) operation per chunk to check the exceptions. I do think we'd hit a performance problem from moving this to regular expressions. The expansion logic also gets quite tricky if you're writing an expression. Let's say you write a pattern like this: …
You need to look up the regex capture groups to fill out the …. In order to make the expansion logic sane, you'll be tempted to separate out the matches into different expressions. Once you do that, you're not really any further ahead on generality/compactness than you are with the explicit exceptions.
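The O(1) point above can be illustrated with a toy sketch (not spaCy's actual tokenizer, just the lookup idea): each whitespace-delimited chunk is checked against the exceptions table with a single hash lookup, so the cost per chunk doesn't grow with the number of exceptions.

```python
# Toy illustration of literal-string exception lookup.
# Entries and the tokenize() helper are hypothetical; spaCy's real
# tokenizer and table format are more involved.
EXC = {
    "She's": [{"ORTH": "She"}, {"ORTH": "'s"}],
    "won't": [{"ORTH": "wo"}, {"ORTH": "n't"}],
}

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in EXC:  # one O(1) hash lookup per chunk
            tokens.extend(piece["ORTH"] for piece in EXC[chunk])
        else:
            tokens.append(chunk)
    return tokens

tokenize("She's sure he won't go")
# -> ['She', "'s", 'sure', 'he', 'wo', "n't", 'go']
```

A regex-based version would instead have to try patterns against each chunk and map capture groups back onto the expansion, which is where the bookkeeping described above comes from.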
@honnibal, thanks for the explanation! For some reason I didn't notice that the for loops are executed at loading time, not at running time. Of course, in this case hashing is way more effective than regular expressions.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
tokenizer_exceptions.py for English contains several entries for the pronoun "She" with the ORTH value "i". For example:

"She's": [
    {ORTH: "i", LEMMA: PRON_LEMMA, TAG: "PRP"},
    {ORTH: "'s"}
]
Unless I am missing something, this should be:
"She's": [
    {ORTH: "She", LEMMA: PRON_LEMMA, TAG: "PRP"},
    {ORTH: "'s"}
]
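One way to see why the reported entry is a bug: the ORTH values of an exception's pieces are expected to concatenate back to the key string (otherwise the token texts no longer reproduce the input). A small sketch of that consistency check, with hypothetical `buggy`/`fixed` tables and an `orths_match` helper that is not part of spaCy:

```python
# Consistency check: the ORTH pieces of each exception should join
# back into the key. The reported entry ("i" + "'s" != "She's") fails it.
PRON_LEMMA = "-PRON-"  # placeholder for spaCy's pronoun lemma constant

buggy = {"She's": [{"ORTH": "i", "LEMMA": PRON_LEMMA, "TAG": "PRP"},
                   {"ORTH": "'s"}]}
fixed = {"She's": [{"ORTH": "She", "LEMMA": PRON_LEMMA, "TAG": "PRP"},
                   {"ORTH": "'s"}]}

def orths_match(exc):
    """True if every exception's ORTH pieces concatenate to its key."""
    return all("".join(t["ORTH"] for t in pieces) == key
               for key, pieces in exc.items())

orths_match(buggy)  # -> False
orths_match(fixed)  # -> True
```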