
BPE Dropout tokenizer generates unk at the beginning of sequence #1071

Open
@AnnaLebedeva

Description

I run spm.SentencePieceTrainer.train with the following parameters (a minimal sketch of the call follows the list):

save_directory: /tokenizer
input: data.txt
vocab_size: 16000
model_type: bpe
pad_id: 0
eos_id: 1
unk_id: 2
bos_id: -1
input_sentence_size: 10000000
shuffle_input_sentence: true
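
For reference, these options correspond roughly to the training call sketched below. This is an assumption about how my config is applied: the save_directory entry presumably maps to SentencePiece's model_prefix argument, and the exact prefix path is illustrative only.

import sentencepiece as spm

# Minimal sketch of the training call, assuming save_directory maps to
# model_prefix; the prefix path below is illustrative only.
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='tokenizer/tokenizer',  # assumed mapping of save_directory
    vocab_size=16000,
    model_type='bpe',
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1,
    input_sentence_size=10000000,
    shuffle_input_sentence=True,
)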

After that I create a T5Tokenizer from it and save it for later use:

hf_tokenizer = T5Tokenizer('tokenizer/tokenizer.model', extra_ids=0, legacy=False)
hf_tokenizer.save_pretrained('tokenizer_directory')

Then I try to use it with BPE-dropout as follows:

tokenizer_16000 = AutoTokenizer.from_pretrained(
    'tokenizer_directory',
    use_fast=False,
    sp_model_kwargs = {
        'enable_sampling': True,
        'alpha': 0.1
    }
    )

Statistically, about 10% of the time I get an <unk> token at the beginning of the encoded sequence, even though the token right after it is exactly the start of the sentence. I typed the test sentence myself, so there are no hidden characters:

for i in range(10):
    encoded_text = tokenizer_16000('Прапорщик Задов опять здесь.')
    print(tokenizer_16000.convert_ids_to_tokens(encoded_text['input_ids']))

outputs:

16000
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁', 'З', 'ад', 'ов', '▁оп', 'я', 'ть', '▁здесь', '.', '</s>']
['<unk>', '▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['▁Пр', 'ап', 'о', 'р', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁З', 'ад', 'о', 'в', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['<unk>', '▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'о', 'в', '▁опять', '▁здесь', '.', '</s>']
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['ра', 'пор', 'щ', 'ик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']

This never happens with a vocab size of 8000 and otherwise identical parameters. Why does <unk> appear there?
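
One way to narrow this down (a hedged diagnostic sketch, not part of the original setup) is to sample with the underlying SentencePiece model directly, bypassing the T5Tokenizer wrapper, using the same sampling settings:

import sentencepiece as spm

# Diagnostic sketch: sample directly with the SentencePiece model and convert
# ids back to pieces, so that any unk_id shows up as '<unk>'.
sp = spm.SentencePieceProcessor(model_file='tokenizer/tokenizer.model')
for _ in range(10):
    ids = sp.encode('Прапорщик Задов опять здесь.', enable_sampling=True, alpha=0.1)
    print([sp.id_to_piece(i) for i in ids])

If the raw model also produces '<unk>' at the start, the behaviour comes from SentencePiece's BPE-dropout sampling itself rather than from the transformers wrapper.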
