Tokenizer further tokenizes pretokenized input #6575

Closed
bogdankostic opened this issue Aug 18, 2020 · 3 comments
@bogdankostic (Contributor)

Environment info

  • transformers version: current master
  • Platform: MacOS
  • Python version: 3.7

Who can help

@mfuntowicz

Information

It seems that passing pretokenized input to the tokenizer and setting is_pretokenized=True doesn't prevent the tokenizer from further tokenizing the input. This already came up in #6046, and the cause seems to be #6573. A workaround is to set is_pretokenized=False.
What hasn't been reported yet is that the same behavior also occurs with fast tokenizers. For fast tokenizers, however, there is no workaround (or at least I haven't found one): setting is_pretokenized=False raises a ValueError.

To reproduce

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased", use_fast=True)

text = "Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist"
pretokenized_text = ['Schar', '##tau', 'sagte', 'dem', 'Tages', '##spiegel', ',', 'dass', 'Fischer', 'ein', 'Id', '##iot', 'ist']

tokenized = tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
pretokenized_tok = tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too large
pretokenized_tok_2 = tokenizer.encode(pretokenized_text, is_pretokenized=False)
# returns list of len 15 -> 13 tokens + 2 special tokens

fast_tokenized = fast_tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
fast_pretokenized_tok = fast_tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too large
# fast_pretokenizer_tok2 = fast_tokenizer.encode(pretokenized_text, is_pretokenized=False)
# would raise: 'ValueError: TextInputSequence must be str'


tokenized_decoded = tokenizer.decode(tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
pretokenized_tok_decoded = tokenizer.decode(pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
pretokenized_tok_2_decoded = tokenizer.decode(pretokenized_tok_2)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'


fast_tokenized_decoded = fast_tokenizer.decode(fast_tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
fast_pretokenized_tok_decoded = fast_tokenizer.decode(fast_pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
@thomwolf (Member)

Hi,

is_pretokenized=True actually means that you are providing a list of words (as strings) instead of a full sentence or paragraph, not a list of sub-words. The step that is skipped in this case is the pre-tokenization step, not the tokenization step.

This is useful for NER or token classification, for instance, but I understand that the wording can be confusing. We will try to make it clearer in the docstring and on the documentation page (here). cc @sgugger and @LysandreJik
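
For illustration, a minimal sketch of this intended usage (not from the thread itself; the word list is just the example sentence from the report split into whole words, and the expected length is an assumption based on the counts reported above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# whole words, as they would come from a NER / token-classification dataset
words = ['Schartau', 'sagte', 'dem', 'Tagesspiegel', ',', 'dass', 'Fischer', 'ein', 'Idiot', 'ist']

# is_pretokenized=True only skips the pre-tokenization step (splitting the raw
# string into words); WordPiece is still applied to each word
ids = tokenizer.encode(words, is_pretokenized=True)
# expected: list of len 15, the same as tokenizer.encode(text) above,
# because 'Schartau' is still split into 'Schar', '##tau', etc.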

@sgugger (Collaborator) commented Aug 18, 2020

Adding this to my TODO.

sgugger self-assigned this on Aug 18, 2020
sgugger added a commit that referenced this issue on Aug 19, 2020
sgugger mentioned this issue on Aug 19, 2020
@bogdankostic (Contributor, Author)

Thanks for making this clear! :)
