is_pretokenized seems to work incorrectly #6046

Closed

Zhylkaaa opened this issue Jul 26, 2020 · 5 comments

Zhylkaaa (Contributor) commented Jul 26, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): roberta

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I use RobertaTokenizerFast on pretokenized text, but the problem also arises when I switch to the slow version.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

I am trying to implement a sliding window for RoBERTa

To reproduce

I use the tokenizer.tokenize(text) method to tokenize the whole text (1-3 sentences), then divide the tokens into chunks and call the __call__ method (I also tried encode) with the is_pretokenized=True argument, but this creates additional tokens (about 3 times more than there should be). I worked around this with a tokenize -> convert_tokens_to_ids -> prepare_for_model -> pad pipeline, but I believe the batch methods should be faster and more memory efficient.
Steps to reproduce the behavior:

  1. tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)
  2. ex_text = 'long text'
  3. tokens = tokenizer.tokenize(ex_text)
  4. examples = [tokens[i:i+126] for i in range(0, len(tokens), 100)]
  5. print(len(tokenizer(examples, is_pretokenized=True)['input_ids'][0])) # this prints more than 128

Expected behavior

I would expect to get result similar to result I get when I use

tokens = tokenizer.tokenize(ex_text)
inputs = tokenizer.convert_tokens_to_ids(tokens)
inputs = [inputs[i:i+126] for i in range(0, len(tokens), 100)]
inputs = [tokenizer.prepare_for_model(example) for example in inputs] 
inputs = tokenizer.pad(inputs, padding='longest')

Am I doing something wrong, or is this unexpected behaviour?

Environment info

  • transformers version: 3.0.2
  • Platform: MacOs
  • Python version: 3.8.3
  • PyTorch version (GPU?): 1.5.1 (no GPU)
  • Tensorflow version (GPU?): NO
  • Using GPU in script?: NO
  • Using distributed or parallel set-up in script?: NO

EDIT:
I see that when I use __call__ it actually treats Ġ as 2 tokens:
tokenizer(tokenizer.tokenize('How'), is_pretokenized=True)['input_ids']
out: [0, 4236, 21402, 6179, 2], where 4236 and 21402 are the Ġ
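
A quick way to see which token strings those extra ids correspond to (a minimal sketch, assuming the same roberta-base setup as in the steps above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)

# Tokenize a single word, then feed the resulting tokens back in as "pretokenized" input
tokens = tokenizer.tokenize('How')
ids = tokenizer(tokens, is_pretokenized=True)['input_ids']

# Map the ids back to token strings to see where the extra pieces come from
print(tokenizer.convert_ids_to_tokens(ids))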

Zhylkaaa changed the title from "Is_pretokenized seems to not work" to "is_pretokenized seems to work incorrectly" on Jul 27, 2020
tholor (Contributor) commented Aug 6, 2020

We face a similar issue with the distilbert tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")
tokens = ['1980', 'kam', 'der', 'Crow', '##n', 'von', 'Toy', '##ota']
result = tokenizer.encode_plus(text=tokens,
                               text_pair=None,
                               add_special_tokens=True,
                               truncation=False,
                               return_special_tokens_mask=True,
                               return_token_type_ids=True,
                               is_pretokenized=True
                               )
result["input_ids"]
# returns:
[102, 3827, 1396, 125, 28177, 1634, 1634, 151, 195, 25840, 1634, 1634, 23957, 30887, 103]

tokenizer.decode(result["input_ids"])
# returns:
'[CLS] 1980 kam der Crow # # n von Toy # # ota [SEP]'

It seems that subword tokens (here ##n and ##ota) get split into further tokens even though we set is_pretokenized=True. This seems unexpected to me but maybe I am missing something?
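
If the goal is to feed wordpieces that the tokenizer already produced, the convert_tokens_to_ids / prepare_for_model route mentioned above avoids the re-tokenization (a minimal sketch, not verified against this exact model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")
tokens = ['1980', 'kam', 'der', 'Crow', '##n', 'von', 'Toy', '##ota']

# Map the existing wordpieces straight to ids instead of re-tokenizing them,
# and let prepare_for_model add the special tokens
ids = tokenizer.convert_tokens_to_ids(tokens)
result = tokenizer.prepare_for_model(ids, add_special_tokens=True)

print(tokenizer.decode(result["input_ids"]))
# expected (if the wordpieces are in the vocab): '[CLS] 1980 kam der Crown von Toyota [SEP]'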

Zhylkaaa (Contributor, Author) commented Aug 7, 2020

As I mentioned before, we used is_pretokenized to create a sliding window, but we recently discovered that this can be achieved using:

stride = max_seq_length - 2 - int(max_seq_length*stride)
tokenized_examples = tokenizer(examples, return_overflowing_tokens=True, 
                               max_length=max_seq_length, stride=stride, truncation=True)

This returns a dict with input_ids, attention_mask and overflow_to_sample_mapping (the latter helps to map windows back to their source examples, but you should check for its presence: if you pass one short example it might not be there).
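
For example (a minimal sketch with made-up inputs, assuming roberta-base and a fast tokenizer; the stride fraction is arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', use_fast=True)

examples = ['first long text ...', 'second long text ...']  # made-up inputs
max_seq_length = 128
stride_fraction = 0.5  # arbitrary; plugged into the same formula as above
stride = max_seq_length - 2 - int(max_seq_length * stride_fraction)

tokenized_examples = tokenizer(examples, return_overflowing_tokens=True,
                               max_length=max_seq_length, stride=stride, truncation=True)

# Map each produced window back to its source example; the key can be missing
# when nothing overflowed, so fall back to an identity mapping.
n_windows = len(tokenized_examples['input_ids'])
mapping = tokenized_examples.get('overflow_to_sample_mapping', list(range(n_windows)))
for window_idx, example_idx in enumerate(mapping):
    print(window_idx, '->', examples[example_idx])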

Hope this will help someone 🤗

PhilipMay (Contributor):
I have the same issue as @tholor - there seem to be some nasty differences between slow and fast tokenizer implementations.
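
A quick side-by-side check of the two implementations (a minimal sketch reusing the tokens from the example above; outputs not verified here):

from transformers import AutoTokenizer

tokens = ['1980', 'kam', 'der', 'Crow', '##n', 'von', 'Toy', '##ota']

slow_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased", use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased", use_fast=True)

# Encode the same "pretokenized" input with both implementations and compare
for name, tok in [("slow", slow_tokenizer), ("fast", fast_tokenizer)]:
    ids = tok.encode_plus(tokens, is_pretokenized=True, add_special_tokens=True)["input_ids"]
    print(name, tok.convert_ids_to_tokens(ids))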

chrk623 commented Aug 18, 2020

Just got the same issue with bert-base-uncased. However, when is_pretokenized=False it seems to be OK. Is this expected behaviour?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text  = "huggingface transformers"
tok = tokenizer.tokenize(text)
print(tok)
# ['hugging', '##face', 'transformers']

output = tokenizer.encode_plus(tok, is_pretokenized=True)
tokenizer.convert_ids_to_tokens(output["input_ids"])
# ['[CLS]', 'hugging', '#', '#', 'face', 'transformers', '[SEP]']

When is_pretokenized=False:

output2 = tokenizer.encode_plus(tok, is_pretokenized=False)
tokenizer.convert_ids_to_tokens(output2["input_ids"])
# ['[CLS]', 'hugging', '##face', 'transformers', '[SEP]']

Zhylkaaa (Contributor, Author):
I believe this issue can be closed because of the explanation in #6575, which states that is_pretokenized expects a list of words split on whitespace, not actual tokens. So this is "kind of expected" behaviour :)
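
In other words, something like this is the intended usage (a minimal sketch, assuming roberta-base; the sentence is made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('roberta-base', add_prefix_space=True, use_fast=True)

# Intended input for is_pretokenized: words split on whitespace, not subword tokens
words = 'How are you doing'.split()
print(tokenizer(words, is_pretokenized=True)['input_ids'])

# Not intended: feeding tokenizer.tokenize() output back in, since every subword
# string then gets tokenized again (which is what produced the extra ids above)
subwords = tokenizer.tokenize('How are you doing')
print(tokenizer(subwords, is_pretokenized=True)['input_ids'])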
