Tokenizer further tokenizes pretokenized input #6575

Closed
bogdankostic opened this issue Aug 18, 2020 · 3 comments
@bogdankostic (Contributor)

Environment info

  • transformers version: current master
  • Platform: MacOS
  • Python version: 3.7

Who can help

@mfuntowicz

Information

It seems that passing pretokenized input to the tokenizer and setting is_pretokenized=True doesn't prevent the tokenizer from further tokenizing the input. This already came up in #6046, and the cause seems to be #6573. A workaround is to set is_pretokenized=False.
What hasn't been reported yet is that the same behavior also occurs with fast tokenizers. For fast tokenizers, however, there is no workaround (or at least I haven't found one): setting is_pretokenized=False raises a ValueError.

To reproduce

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased", use_fast=True)

text = "Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist"
pretokenized_text = ['Schar', '##tau', 'sagte', 'dem', 'Tages', '##spiegel', ',', 'dass', 'Fischer', 'ein', 'Id', '##iot', 'ist']

tokenized = tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
pretokenized_tok = tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too large
pretokenized_tok_2 = tokenizer.encode(pretokenized_text, is_pretokenized=False)
# returns list of len 15 -> 13 tokens + 2 special tokens

fast_tokenized = fast_tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
fast_pretokenized_tok = fast_tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too large
# fast_pretokenizer_tok2 = fast_tokenizer.encode(pretokenized_text, is_pretokenized=False)
# would raise: 'ValueError: TextInputSequence must be str'


tokenized_decoded = tokenizer.decode(tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
pretokenized_tok_decoded = tokenizer.decode(pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
pretokenized_tok_2_decoded = tokenizer.decode(pretokenized_tok_2)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'


fast_tokenized_decoded = fast_tokenizer.decode(fast_tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
fast_pretokenized_tok_decoded = fast_tokenizer.decode(fast_pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
@thomwolf (Member)

Hi,

is_pretokenized=True actually means that you are providing a list of words (as strings) instead of a full sentence or paragraph, not a list of sub-words. The step that is skipped in this case is the pre-tokenization step, not the tokenization step.

This is useful for NER or token classification, for instance, but I understand that the wording can be confusing. We will try to make it clearer in the docstring and on the documentation page (here). cc @sgugger and @LysandreJik
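
For illustration, a minimal sketch of this intended usage (not from the thread itself; the word list is just the example sentence from the report split into whole words, and the expected length is an assumption based on the counts reported above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# whole words, as they would come from a NER / token-classification dataset
words = ['Schartau', 'sagte', 'dem', 'Tagesspiegel', ',', 'dass', 'Fischer', 'ein', 'Idiot', 'ist']

# is_pretokenized=True only skips the pre-tokenization step (splitting the raw
# string into words); WordPiece is still applied to each word
ids = tokenizer.encode(words, is_pretokenized=True)
# expected: list of len 15, the same as tokenizer.encode(text) above,
# because 'Schartau' is still split into 'Schar', '##tau', etc.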

@sgugger (Collaborator) commented Aug 18, 2020

Adding this to my TODO.

sgugger self-assigned this on Aug 18, 2020
sgugger added a commit that referenced this issue on Aug 19, 2020
sgugger mentioned this issue on Aug 19, 2020
@bogdankostic (Contributor, Author)

Thanks for making this clear! :)
