Tokenizer further tokenizes pretokenized input #6575
Hi,
This is useful for NER or token classification, for instance, but I understand that the wording can be confusing. We will try to make it clearer in the docstring and on the doc page (here). cc @sgugger and @LysandreJik
Adding this to my TODO.
Thanks for making this clear! :)
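To make the comment above concrete, here is a minimal sketch (not part of the original thread) of why keeping the subword split of pretokenized input is convenient for token classification: with a fast tokenizer, `word_ids()` maps every subword back to the word it came from, so per-word labels can be propagated to the subword tokens. It assumes a recent transformers version, where the `is_pretokenized` flag discussed in this issue has been renamed `is_split_into_words`; the model name and labels are placeholders.

```python
# Minimal sketch (assumptions: recent transformers, fast tokenizer,
# illustrative model name and NER labels). `is_split_into_words` is the
# later name of the `is_pretokenized` flag discussed in this issue.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # fast tokenizer by default

words = ["HuggingFace", "is", "based", "in", "NYC"]
word_labels = ["B-ORG", "O", "O", "O", "B-LOC"]  # one hypothetical tag per word

encoding = tokenizer(words, is_split_into_words=True)

# The pretokenized words are still split into subword pieces ...
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'Hugging', '##F', '##ace', 'is', 'based', 'in', 'NYC', '[SEP]']

# ... but word_ids() maps each subword back to its source word, so the
# per-word labels can be carried over to the subword tokens.
aligned = [
    word_labels[word_id] if word_id is not None else "O"  # dummy label for special tokens
    for word_id in encoding.word_ids()
]
print(aligned)
```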
Environment info
`transformers` version: current master
Who can help
@mfuntowicz
Information
It seems that passing pretokenized input to the Tokenizer and setting `is_pretokenized=True` doesn't prevent the Tokenizer from further tokenizing the input. This issue already came up in #6046, and the reason for it seems to be #6573. A workaround is to set `is_pretokenized=False`.
What hasn't been reported yet is that this issue also arises with FastTokenizers, where we see the same behavior. However, there is no workaround for FastTokenizers (or at least I haven't found one...). Setting `is_pretokenized=False` will raise a ValueError.
To reproduce
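The original reproduction snippet is not preserved here. Below is a hedged sketch of the reported behavior, written against a more recent transformers release where the `is_pretokenized` flag has been renamed `is_split_into_words`; the model name is only a placeholder.

```python
# Hypothetical reproduction sketch, not the issue's original snippet.
# `is_split_into_words` is the later name of the `is_pretokenized` flag
# discussed in this issue; "bert-base-cased" is an illustrative model choice.
from transformers import AutoTokenizer

words = ["Passing", "pretokenized", "input"]  # input that is already split into words

# Slow (pure-Python) tokenizer: the words are still broken into subword pieces.
slow = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)
enc = slow(words, is_split_into_words=True)
print(slow.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'Passing', 'pre', '##tok', '##eni', '##zed', 'input', '[SEP]']

# Fast (Rust-backed) tokenizer shows the same subword splitting.
fast = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
enc = fast(words, is_split_into_words=True)
print(fast.convert_ids_to_tokens(enc["input_ids"]))
```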