I'm currently working on the tokenizer; we need a new one.
The llama tokenizer is not suitable: it has trouble forming larger tokens, favors smaller ones, and does not follow the merge priority of BPE, using sentencepiece scores instead.
That's why progress on the roadmap has stalled a bit; without good tokenization, Falcon cannot produce good-quality results.
A couple of problems need to be solved:
- BPE merge-priority logic instead of sentencepiece scores (see the sketch after this list)
- The current tokenization of whitespace conflicts with BPE whitespace-token merging (single and multi-whitespace tokens binding to each other)
  - The same problem applies to newlines: these are actual tokens and can be combined and interleaved with spaces, forming pure-whitespace tokens (most likely common in code); the vocabulary in ggml V3 is not fit for this purpose
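
As an illustration of the first point, here is a minimal sketch of rank-based BPE (hypothetical names, not the actual llama.cpp/ggml code): adjacent pairs are merged strictly in training order, lowest rank first, with no scores involved.

```python
# Minimal merge-rank BPE sketch. `ranks` maps a token pair to its merge
# priority (lower = learned earlier during training = applied first here).
def bpe_tokenize(text: str, ranks: dict[tuple[str, str], int]) -> list[str]:
    parts = list(text)  # a real tokenizer would start from bytes
    while len(parts) > 1:
        best, best_rank = None, None
        # Find the adjacent pair with the lowest (highest-priority) rank.
        for i in range(len(parts) - 1):
            rank = ranks.get((parts[i], parts[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best, best_rank = i, rank
        if best is None:
            break  # no applicable merges left
        # Merge the winning pair into one token and continue.
        parts = parts[:best] + [parts[best] + parts[best + 1]] + parts[best + 2:]
    return parts


# Toy merge table: with whitespace pairs ranked early, a run of spaces
# collapses into a single multi-space token, which is exactly what
# pre-splitting whitespace before merging prevents.
ranks = {(" ", " "): 0, ("  ", "  "): 1, (" ", "\n"): 2}
print(bpe_tokenize("    \n", ranks))  # -> ['    ', '\n']
```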
For good-quality Falcon output we need the tokenizer to be identical, or almost identical, to the training tokenization.
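
One way to check this, sketched below under the assumption that the Hugging Face `transformers` tokenizer for `tiiuae/falcon-7b` is a faithful reference: encode a set of tricky samples (code, multi-space runs, newlines) with both tokenizers and assert the token ids match. `my_tokenize` is a hypothetical placeholder for the new implementation.

```python
from transformers import AutoTokenizer

# Reference tokenization as used in training (assumption: the HF tokenizer
# for tiiuae/falcon-7b reproduces it faithfully).
ref = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

def my_tokenize(s: str) -> list[int]:
    # Hypothetical stand-in for the new ggml-side tokenizer under test.
    raise NotImplementedError

# Whitespace-heavy samples where the current tokenizer diverges most.
samples = ["def f(x):\n    return x", "hello   world\n\n", "  \n \n  "]
for s in samples:
    expected = ref.encode(s)
    actual = my_tokenize(s)
    assert actual == expected, f"mismatch on {s!r}: {actual} != {expected}"
```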