
With unigram algorithm, constant piece at end of each sentence does not become a token #1047

Open
@jogardi

Description

Hi, thanks for your great work on this. I noticed a subtle issue when playing with synthetic examples.

I generate synthetic data where each sentence is a random string followed by a constant piece.
The BPE algorithm works as expected, but the unigram algorithm does not make this constant piece a token in the vocabulary.

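import io

import numpy as np
import sentencepiece as spm

constant_piece = 'helloWorld'

def rand_str(n=10):
    # Random string of n letters drawn from a fixed alphabet.
    return ''.join(
        np.random.choice(list('bcegijklmnoqruvwxyz'), n)
    )

# Each sentence is a random string followed by the constant piece.
data = [rand_str() + constant_piece for _ in range(1000)]
model = io.BytesIO()
# model_type is not set, so the default (unigram) is used.
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(data),
    model_writer=model,
    vocab_size=1000,
    minloglevel=5,
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

# Tokenize one training sentence and print its pieces.
ex = data[20]
print([
    sp.IdToPiece(x)
    for x in sp.encode(ex, emit_unk_piece=True)
])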

outputs: ['▁uy', 'vx', 'yf', 'p', 'gmn', 'he', 'llo', 'W', 'or', 'ld']
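To compare the two algorithms more directly than by eyeballing one encoded sentence, a sketch like the following could train with an explicit model_type and then check whether the constant piece is itself in the vocabulary. The helper names (train_in_memory, vocab_pieces, contains_piece) are made up for illustration and are not part of SentencePiece; the sentencepiece import is done inside the training helper so the membership check can be used on its own.

```python
import io


def train_in_memory(sentences, model_type="unigram", vocab_size=1000):
    """Train a SentencePiece model in memory and return a loaded processor.

    Hypothetical helper: it wraps the same train() call as the reproduction,
    but sets model_type explicitly ("unigram" or "bpe").
    """
    import sentencepiece as spm  # imported lazily; needed only for training

    model = io.BytesIO()
    spm.SentencePieceTrainer.train(
        sentence_iterator=iter(sentences),
        model_writer=model,
        vocab_size=vocab_size,
        model_type=model_type,
        minloglevel=5,
    )
    return spm.SentencePieceProcessor(model_proto=model.getvalue())


def vocab_pieces(sp):
    """Return every piece in the vocabulary of a loaded processor."""
    return [sp.id_to_piece(i) for i in range(sp.get_piece_size())]


def contains_piece(pieces, target, space_marker="\u2581"):
    """True if `target` is a piece by itself, with or without the leading marker."""
    return any(p == target or p == space_marker + target for p in pieces)
```

With the data list from the reproduction, contains_piece(vocab_pieces(train_in_memory(data, "bpe")), constant_piece) can be compared against the same call with "unigram". If the behavior reported here reproduces, only the BPE run should return True.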

It mostly just gets random tokens. I think it gets 'he', 'llo', 'or' and 'ld' not because it noticed the repeating pattern but because it coincidentally saw them in the random strings. If I change constant_piece to '123456', then I get no multi-character tokens for the repeating pattern, only tokens for the random string: ['▁', 'gll', 'imq', 'xc', 'df', '1', '2', '3', '4', '5', '6']

This happens specifically because the constant piece is at the end. If I change the data so that constant_piece is at the beginning of each sentence, data = [constant_piece + rand_str() for _ in range(1000)], then I get the expected result ['▁123456', 'uzb', 'ek', 'hoe', 'wr'].
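The prefix and suffix variants can be generated side by side with a small helper, so the only difference between the two training sets is where the constant piece sits. This is a sketch under assumptions: make_dataset and its parameters are hypothetical, and it uses the standard library's random with a fixed seed instead of np.random.choice so that both variants share the same random strings.

```python
import random

ALPHABET = "bcegijklmnoqruvwxyz"  # same letters as in the reproduction


def make_dataset(constant_piece, position, n_sentences=1000, rand_len=10, seed=0):
    """Build sentences with `constant_piece` as a prefix or suffix of random noise."""
    if position not in ("prefix", "suffix"):
        raise ValueError("position must be 'prefix' or 'suffix'")
    rng = random.Random(seed)  # fixed seed: both variants get identical noise
    sentences = []
    for _ in range(n_sentences):
        noise = "".join(rng.choice(ALPHABET) for _ in range(rand_len))
        if position == "prefix":
            sentences.append(constant_piece + noise)
        else:
            sentences.append(noise + constant_piece)
    return sentences


suffix_data = make_dataset("123456", "suffix")  # the failing case in this report
prefix_data = make_dataset("123456", "prefix")  # the case that works as expected
```

Either list can be passed as sentence_iterator=iter(...) to the same spm.SentencePieceTrainer.train call used in the reproduction.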

TL;DR
Unexpected result under the following conditions:

  • same string at end of each sentence in the training data
  • using unigram algorithm
