
BPE Dropout tokenizer generates unk at the beginning of sequence #1071

Open
@AnnaLebedeva

Description

I run spm.SentencePieceTrainer.train with the following parameters (a minimal sketch of the call follows the list):

save_directory: /tokenizer
input: data.txt
vocab_size: 16000
model_type: bpe
pad_id: 0
eos_id: 1
unk_id: 2
bos_id: -1
input_sentence_size: 10000000
shuffle_input_sentence: true
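
For reference, these options correspond roughly to the training call sketched below. This is an assumption about how my config is applied: the save_directory entry presumably maps to SentencePiece's model_prefix argument, and the exact prefix path is illustrative only.

import sentencepiece as spm

# Minimal sketch of the training call, assuming save_directory maps to
# model_prefix; the prefix path below is illustrative only.
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='tokenizer/tokenizer',  # assumed mapping of save_directory
    vocab_size=16000,
    model_type='bpe',
    pad_id=0,
    eos_id=1,
    unk_id=2,
    bos_id=-1,
    input_sentence_size=10000000,
    shuffle_input_sentence=True,
)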

After that I create a T5Tokenizer from it and save it for later use:

hf_tokenizer = T5Tokenizer('tokenizer/tokenizer.model', extra_ids=0, legacy=False)
hf_tokenizer.save_pretrained('tokenizer_directory')

Then I try to use it with BPE-dropout as follows:

tokenizer_16000 = AutoTokenizer.from_pretrained(
    'tokenizer_directory',
    use_fast=False,
    sp_model_kwargs = {
        'enable_sampling': True,
        'alpha': 0.1
    }
    )

Statistically, about 10% of the time I get an <unk> token at the beginning of the encoded sequence, even though the token right after it is exactly the start of the sentence. I typed the test sentence myself, so there are no hidden characters:

for i in range(10):
    encoded_text = tokenizer_16000('Прапорщик Задов опять здесь.')
    print(tokenizer_16000.convert_ids_to_tokens(encoded_text['input_ids']))

outputs:

16000
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁', 'З', 'ад', 'ов', '▁оп', 'я', 'ть', '▁здесь', '.', '</s>']
['<unk>', '▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['▁Пр', 'ап', 'о', 'р', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['▁', 'П', 'ра', 'пор', 'щик', '▁З', 'ад', 'о', 'в', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['<unk>', '▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁оп', 'ят', 'ь', '▁здесь', '.', '</s>']
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'о', 'в', '▁опять', '▁здесь', '.', '</s>']
['▁П', 'ра', 'пор', 'щик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']
['ра', 'пор', 'щ', 'ик', '▁З', 'ад', 'ов', '▁опять', '▁здесь', '.', '</s>']

This never happens with a vocab size of 8000 and otherwise identical parameters. Why does <unk> appear there?
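
One way to narrow this down (a hedged diagnostic sketch, not part of the original setup) is to sample with the underlying SentencePiece model directly, bypassing the T5Tokenizer wrapper, using the same sampling settings:

import sentencepiece as spm

# Diagnostic sketch: sample directly with the SentencePiece model and convert
# ids back to pieces, so that any unk_id shows up as '<unk>'.
sp = spm.SentencePieceProcessor(model_file='tokenizer/tokenizer.model')
for _ in range(10):
    ids = sp.encode('Прапорщик Задов опять здесь.', enable_sampling=True, alpha=0.1)
    print([sp.id_to_piece(i) for i in ids])

If the raw model also produces '<unk>' at the start, the behaviour comes from SentencePiece's BPE-dropout sampling itself rather than from the transformers wrapper.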
