Tokenization with exception patterns #700
Conversation
Thanks! Reactions, in stream-of-consciousness form: interesting approach. I've so far resisted introducing more regex-based logic into the tokenizer, my two main concerns being performance and added complexity.
So, my knee-jerk reaction was "Oh, this isn't how we want to do this". But, then again: it's currently difficult to express the necessary logic for the URL tokenization in the tokenizer. So maybe we do need a mechanism like this. If you have a minute, it would be nice to benchmark this. The toolset I use is in the spacy-benchmarks repository. I expect that the way you're doing this, there shouldn't be much or any additional performance cost: just like with the prefix and suffix expressions, the question is only asked on chunks that can't be tokenized using vocabulary items, and the expression will only match deeply on strings that are actually URLs. So, I think the benchmark will come out to be no problem.
I do have one suggested improvement, though. Let's say we have a URL with punctuation attached, for example wrapped in brackets. We want this to be tokenized so that the surrounding punctuation is split off and the URL itself is kept as a single token. I suggest we rely on the prefix and suffix expressions to strip the attached tokens, so that the new exception pattern only needs to handle the bare URL. I think this will give a clearer division of labour between the different parts of the tokenizer: the prefix and suffix expressions handle the surrounding punctuation, and the exception pattern decides whether what remains should be kept whole.
What do you think?
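To make the suggested division of labour concrete, here is a minimal sketch, assuming the public `token_match` hook that spaCy's tokenizer exposes in later releases (the exact API in the code under discussion may differ); the URL pattern is purely illustrative:

```python
import re
import spacy

# Illustrative URL pattern only; not the pattern spaCy actually ships.
URL_RE = re.compile(r"^https?://[\w./-]+$")

nlp = spacy.blank("en")
# token_match is only consulted for chunks that the vocabulary and the
# prefix/suffix expressions leave behind, so surrounding punctuation is
# still split off by the usual prefix/suffix rules.
nlp.tokenizer.token_match = URL_RE.match

doc = nlp("(https://example.com/page)")
print([t.text for t in doc])
# Expected: the brackets split off, the bare URL kept as one token.
```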
Peformance -> Performance
fixed minor typo
Hi @honnibal, thanks for the feedback! I've tried to use your benchmark repo, but ran into several problems. :( The biggest obstacle was that the Gigaword corpus is not freely accessible. The results are a bit disappointing: tokenizing the corpus with spaCy 1.4 versus with my changes shows a noticeable slowdown for the patched version.
What is strange is that the slowdown is there even when I explicitly disable the matcher. Anyway, I really liked your idea of making this improvement more general. I'll definitely modify the PR accordingly once I've figured out why the tokenization became that slow.
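For context, a rough timing sketch of the kind of measurement meant here, not the spacy-benchmarks setup itself (which needs the licensed Gigaword corpus); the corpus file name is hypothetical:

```python
import time
import spacy

nlp = spacy.blank("en")

# Hypothetical plain-text corpus, one document per line.
with open("corpus.txt", encoding="utf8") as f:
    texts = [line.strip() for line in f if line.strip()]

start = time.perf_counter()
n_tokens = sum(len(doc) for doc in nlp.tokenizer.pipe(texts))
elapsed = time.perf_counter() - start
print(f"{n_tokens} tokens in {elapsed:.2f}s ({n_tokens / elapsed:,.0f} tokens/sec)")
```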
Do you have a vocab file loaded in your version's virtualenv? There's a bit of a footgun in spaCy at the moment: if you start with no vocab, it doesn't cache any tokenization, and it ends up quite slow.
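A quick way to check whether a populated vocabulary is actually loaded, assuming a packaged model such as en_core_web_sm (the model name is just an example):

```python
# python -m spacy download en_core_web_sm
import spacy

blank = spacy.blank("en")           # starts with an essentially empty vocab
nlp = spacy.load("en_core_web_sm")  # ships with a populated vocabulary

print(len(blank.vocab), "vs", len(nlp.vocab), "lexemes in the vocab")
```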
Remove reflexive pronouns as they're part of an open class, fix mistakes and add exceptions
Thanks, downloading the model helped a lot! Now my changes are on par with the original tokenizer. I will update the PR soon with the new mechanism.
…trary functions as token exception matchers.
@honnibal What do you think about the changes now? Do you think spaCy can benefit from this new feature?
Initial support for Hungarian
Added Swedish abbreviations
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
Add PART to tag map
Check that the new token_match function applies after punctuation is split off.
Hey, thanks, looking good! I added some tests for the trickier interactions with the punctuation. My guess is this currently fails, but I haven't had a chance to check yet. I think you'll need to check the token_match again after the prefix and suffix punctuation has been split off. Once we get these extra cases covered, we just need to update the docs and we're good to go! I'm happy to make the docs changes if that's easier for you.
Hey, thanks for the reply and for the exhaustive test cases. The implementation in this PR iteratively checks the substrings with the token_match function while the prefixes and suffixes are being split off, so these cases should be handled.
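A sketch of the kind of punctuation-interaction test being discussed; the `en_tokenizer` fixture name and the exact cases are illustrative, not the literal contents of tokenizer/test_urls.py:

```python
import pytest

URLS = ["http://www.example.com", "https://en.wikipedia.org/wiki/Tokenization"]
PREFIXES = ['(', '"', '>']
SUFFIXES = [')', '"', '!']

@pytest.mark.parametrize("prefix", PREFIXES)
@pytest.mark.parametrize("suffix", SUFFIXES)
@pytest.mark.parametrize("url", URLS)
def test_url_between_punct(en_tokenizer, prefix, url, suffix):
    # The prefix/suffix rules should peel off the punctuation, and the
    # token_match should keep the bare URL as a single token.
    tokens = en_tokenizer(prefix + url + suffix)
    assert len(tokens) == 3
    assert tokens[0].text == prefix
    assert tokens[1].text == url
    assert tokens[2].text == suffix
```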
…m/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns
I need to fix Travis for pull requests, but I think this works — it's green on my local copy. What do you think?
Looks good to me, tests are passing here as well. Thanks! (I misunderstood something while writing my previous comment...) I reran my benchmark scripts, and the results are again on par with the release version.
🎉 Merging!
Using regular expressions for exception handling during tokenization
Description
Modified the tokenizer algorithm so that users can supply regexp patterns (or arbitrary matcher functions) for handling tokenization exceptions such as URLs.
Motivation and Context
This PR fixes #344 and allows the tokenizer to use arbitrary patterns as exceptions.
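As a hedged illustration of what "arbitrary patterns as exceptions" looks like from the user's side, assuming the public Tokenizer constructor with a token_match argument (the ticket-ID matcher below is made up for this example, not part of spaCy):

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (compile_infix_regex, compile_prefix_regex,
                        compile_suffix_regex)

nlp = spacy.blank("en")

def match_ticket_id(text):
    # Made-up exception: keep ticket references like "ABC-1234" as a
    # single token, regardless of how the punctuation rules treat them.
    return re.match(r"^[A-Z]{2,}-\d+$", text)

nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
    infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
    token_match=match_ticket_id,
)

print([t.text for t in nlp("See ticket ABC-1234 for details.")])
```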
How Has This Been Tested?
New tests are added at tokenizer/test_urls.py.
Screenshots (if appropriate):
NA
Types of changes
Checklist: