
Steps forward - Tokenizer #37

Open
@cmp-nct

Description

I'm currently working on the tokenizer; we need a new one.

The llama tokenizer is not suitable: it has problems forming larger tokens and favors smaller ones, and it does not adhere to the merge priority of BPE, using SentencePiece scores instead.
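For reference, here is a minimal sketch of rank-based BPE merging, assuming a merge-rank table loaded from the model's merges list (the names `merge_map` and `bpe_merge` and the example ranks are hypothetical, not the actual code). Unlike score-based greedy matching, BPE always applies the lowest-ranked (earliest-learned) merge first, regardless of token length:

```cpp
// Minimal sketch of rank-based BPE merging (hypothetical names/ranks).
#include <cstdio>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

using merge_map = std::map<std::pair<std::string, std::string>, int>;

std::vector<std::string> bpe_merge(std::vector<std::string> symbols,
                                   const merge_map & merge_rank) {
    for (;;) {
        // find the adjacent pair with the lowest merge rank
        int    best_rank = std::numeric_limits<int>::max();
        size_t best_i    = 0;
        for (size_t i = 0; i + 1 < symbols.size(); ++i) {
            auto it = merge_rank.find({symbols[i], symbols[i + 1]});
            if (it != merge_rank.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i    = i;
            }
        }
        if (best_rank == std::numeric_limits<int>::max()) {
            break; // no applicable merge left
        }
        // apply the merge: fuse the pair into one symbol
        symbols[best_i] += symbols[best_i + 1];
        symbols.erase(symbols.begin() + best_i + 1);
    }
    return symbols;
}

int main() {
    // hypothetical merges: rank = position in the merges file
    merge_map merge_rank = {
        {{" ", " "},   0}, // two spaces merge first
        {{"  ", "  "}, 1}, // then four
        {{"l", "o"},   2},
        {{"lo", "w"},  3},
    };
    // start from single characters, as BPE does
    std::vector<std::string> symbols = {" ", " ", " ", " ", "l", "o", "w"};
    for (const auto & s : bpe_merge(symbols, merge_rank)) {
        printf("[%s]", s.c_str());
    }
    printf("\n"); // prints: [    ][low]
}
```

Note how the four leading spaces collapse into one token purely by merge rank; a score-based tokenizer that favors smaller tokens would leave them fragmented.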

That's why progress on the roadmap has stalled a bit; without good tokenization, Falcon cannot provide good quality results.

A couple of problems need to be solved:

  1. BPE merge logic instead of scores
  2. the current tokenization of whitespace conflicts with BPE whitespace token merging (single and multi-whitespace tokens binding to each other); see the sketch after this list
    2.1) the same problem exists with newlines: these are actual tokens and can be combined and interleaved with spaces, forming pure-whitespace tokens (most likely many of them in code)
  3. the vocabulary in ggml V3 is not fit for this purpose
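To illustrate points 2 and 2.1: if whitespace is pre-split before BPE runs, merges over space and newline runs can never fire. A usage example of the hypothetical `bpe_merge` sketch above (ranks invented for illustration):

```cpp
// Continuing the sketch above: space and newline symbols must stay
// adjacent so BPE can fuse them into pure-whitespace tokens
// (hypothetical ranks, for illustration only).
merge_map ws_rank = {
    {{"\n", "\n"},   0}, // blank line
    {{" ",  " "},    1}, // double space
    {{"\n\n", "  "}, 2}, // blank line followed by indentation
};
std::vector<std::string> ws = {"\n", "\n", " ", " "};
// bpe_merge(ws, ws_rank) yields a single token "\n\n  ".
// If the pre-tokenizer split the input on whitespace first,
// none of these merges could ever apply.
```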

For good quality Falcon output we need the tokenizer to be identical, or almost identical, to the tokenization used in training.
