Alternative to bpe #50
Comments
I don't understand this question. Take a closer look at, for example, the GPT-4 tokenizer. It basically splits the text up into words. How does this encoding not make sense?
Well, I did look at the cl100k_base tokens. And sure, it has lots of words in it. And lots of wordpieces that make sense to me. But also, just as an example:
26217 - ancellationToken
Not to mention things like:
28424 - ]])
I hope this clarifies what I mean when I say it doesn't make much sense.
I was thinking of a scheme where the tokeniser doesn't just give a sequence of tokens but also a set of modifiers for the tokens. The integer we would use as normal to look up the token embedding, but the modifiers we would map onto synthetic (non-trained) dimensions of the token embedding. The tokeniser would have to be very different from just doing bpe. Instead we would attempt to build a tokeniser using a-priori knowledge of different languages. Just as an example: " bus", "Bus" and "busses" would all map to the same integer, and the synthetic dimensions would provide the information about the specific variation. Maybe language could be a modifier too, so that the same word in different languages also has the same integer.
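To make the proposal above concrete, here is a minimal sketch of what "one integer plus modifiers" could look like. Everything in it is an assumption for illustration: the `LEXICON` and `LEMMA_IDS` tables, the three chosen modifiers, and the `encode_word` and `embed` helpers are hypothetical and not taken from any existing tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModifiedToken:
    base_id: int          # index into the learned embedding table
    leading_space: bool   # modifier: surface form started with a space
    capitalized: bool     # modifier: first letter was upper case
    plural: bool          # modifier: plural form of the lemma


# Hypothetical hand-built lexicon: lowercased surface form -> (lemma, is_plural)
LEXICON = {"bus": ("bus", False), "busses": ("bus", True), "buses": ("bus", True)}
# Hypothetical lemma -> integer id table
LEMMA_IDS = {"bus": 0}


def encode_word(surface: str) -> ModifiedToken:
    """Map a surface form to a shared base id plus its modifiers."""
    leading_space = surface.startswith(" ")
    word = surface.strip()
    capitalized = word[:1].isupper()
    lemma, plural = LEXICON[word.lower()]
    return ModifiedToken(LEMMA_IDS[lemma], leading_space, capitalized, plural)


def embed(token: ModifiedToken, table: list) -> list:
    """Append fixed (non-trained) modifier dimensions to the learned embedding."""
    synthetic = [float(token.leading_space), float(token.capitalized), float(token.plural)]
    return table[token.base_id] + synthetic


table = [[0.1, 0.2]]  # stand-in for a learned embedding table with one row
for surface in [" bus", "Bus", "busses"]:
    token = encode_word(surface)
    print(repr(surface), token.base_id, embed(token, table))
```

All three surface forms share `base_id` 0 and differ only in the appended synthetic dimensions. Words missing from the lexicon raise a `KeyError` here, which is exactly the out-of-dictionary problem raised later in the thread.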
That's an interesting idea. Normally this mapping happens automatically during training, meaning " bus", "Bus" and "busses" get mapped to a similar point in the embedding space. The unique "variation" is simply learned by the language model. Also, I don't see a simple way to calculate the modifiers or to know which words should map to the same integer. But I don't want to discourage you. Maybe you can take an English dictionary and find the variations of each word by some predefined rules. For example you could say, for each verb, you map every variation to the same integer, and then take unique modifiers for each combination of singular/plural, person, etc. and add them after the embedding. I assume this is what you meant. Indeed, I think this would make it easier for the model to learn, so this is a very nice idea. However, you need an English dictionary for this and you also need to know what to do with words that are not in the dictionary. It's not easy to do this and there are probably many challenges that will arise.
Yes, I do not expect it to be easy. But then again the current approach shows us that our encoding doesn't have to be perfect. A hybrid approach maybe. Partly crafted/designed and using bpe to fill in the blanks.
Yes, I suggest trying it out on the TinyStories dataset as it only contains simple English words. I would first take a modern BPE, maybe the one from this repo, and compare it with yours. But you have to be very careful about how you measure the performance. Maybe compare model parameters with loss?
You can do a lazy evaluation. For example, just lowercase each input, check if it's part of the subsequent word, and if it is, map it to the same id.

"bus" in "Bus".lower() # True
"bus" in "busses" # True
# etc

This would require some form of pre-tokenization though. Keep in mind the more complex it becomes, the more likely you've gone down the wrong path. The idea and goal would be to remain as general as possible.
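The substring check above can be turned into a small id-assignment routine. This is only a sketch under stated assumptions: the input is already split into non-empty words (the pre-tokenization step mentioned above), and the `assign_ids` helper is a hypothetical name, not part of this repo.

```python
def assign_ids(words):
    """Give each word the id of the first earlier base form it contains.

    Assumes `words` is already pre-tokenized into non-empty strings.
    """
    bases = []  # lowercased base forms, in order of first appearance
    ids = {}
    for word in words:
        lowered = word.strip().lower()
        for base_id, base in enumerate(bases):
            if base in lowered:  # same check as: "bus" in "busses"
                ids[word] = base_id
                break
        else:
            # No known base is a substring of this word, so it becomes a new base.
            bases.append(lowered)
            ids[word] = len(bases) - 1
    return ids


print(assign_ids([" bus", "Bus", "busses", "business", "car"]))
```

Note that "business" also lands on the id of "bus", because a plain substring test cannot tell an inflection from an unrelated word. That is one concrete way the lazy approach can go wrong.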
@NiceGuySaysHi Thank you for the suggestion. It was honestly helpful. I think your suggestion of taking a modern bpe is good. If the tokeniser gets trained on the entire training data then it would result in a best-case bpe based encoder. This encoding would be the shortest encoding for the training data given a certain dictionary size. And would be the most efficient encoding in terms of required compute per training cycle.
We looked into what makes good tokenizations from a mathematical perspective and found that Rényi entropy is a good ranker across multiple tokenizations. That is, if you tokenize with BPE, morphologically, or with some other algorithm, it tends to select the tokenization which will lead to the highest model performance. There's a tool for this evaluation. I'd be super interested in what kind of tokenization maximizes this metric (it's not BPE) and whether it's really the best one from a performance perspective. It's a bit tangential to the original question but I still wonder what tokenization could directly optimize this. Or maybe we could come up with different metrics that measure something else (i.e. not model performance) based on the unigram(?) distribution of the tokenization. (apologies for pushing another paper here)
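For readers unfamiliar with the metric: the Rényi entropy of order α of a distribution p is H_α(p) = log(Σ p_i^α) / (1 − α), and it reduces to Shannon entropy as α → 1. The sketch below computes it on the unigram distribution of a token sequence. It is not the tool mentioned above; the function names, the default `alpha`, and the normalization by the number of observed token types are illustrative assumptions.

```python
import math
from collections import Counter


def renyi_entropy(tokens, alpha=2.5):
    """Rényi entropy (in nats) of the unigram distribution of `tokens`.

    The default alpha is an illustrative assumption, not a recommendation.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [count / total for count in counts.values()]
    if math.isclose(alpha, 1.0):
        # Limit alpha -> 1 is Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def renyi_efficiency(tokens, alpha=2.5):
    """Entropy divided by its maximum, log(number of observed token types).

    Normalizing by observed types rather than full vocabulary size is an assumption.
    """
    types = len(set(tokens))
    if types < 2:
        return 0.0
    return renyi_entropy(tokens, alpha) / math.log(types)


balanced = ["a", "b", "c", "d"] * 5
skewed = ["a"] * 17 + ["b", "c", "d"]
print(renyi_efficiency(balanced), renyi_efficiency(skewed))
```

A uniform unigram distribution scores 1.0 for any α, while a distribution dominated by a few very frequent tokens scores lower. Two tokenizations of the same corpus can be compared by feeding each token sequence through the same function.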
Can we develop tiny language models based on characters (and/or character groups)? Not word-based approaches, like tokens? So, content is split into n-chars as geometric algebra vectors, not big chunks as tokens (some words and characters). That way even a dictionary can help the process.
@zouharvi You're fine. I am super interested to learn anything relating to encoding of text and downstream performance. I haven't properly read your paper yet, but I will. Don't even get me started on 24106 - .DataGridViewTextBoxColumn. WTAF.
@marcov-dart I see your point. Morphological tokenization (e.g. using morfessor) might do just that. For example the word However the paper (and many other previous works) show that this is not the best tokenization. This makes me wonder what really makes good tokenization. [1] Please correct me if this is wrong.
I feel certain that these problems are related to how human languages work and how the transformer works. In language there is lots of meaning in the separate words and lots of meaning in the structure of the sentence. The transformer learns the meaning of the tokens and learns what to pay attention to in terms of structure. This seems to fit well together which is nice and it explains why transformers work as well as they do.
@marcov-dart that's why I mentioned a character-based or character-grouping (vowels, plosives, fricatives, etc.) approach!
I have a question regarding the discussion that follows. Has anyone ever tried doing something like backoff merging, i.e. first merge N tokens instead of pairs, then reduce N, and finally come down to pairs?
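One possible reading of the backoff idea, written out as a sketch. The question above does not specify details, so the function names, the `merges_per_level` and `min_count` parameters, and the choice to merge the single most frequent n-gram at each step are all assumptions.

```python
from collections import Counter


def most_common_ngram(seq, n):
    """Return (ngram, count) for the most frequent n-gram, or (None, 0)."""
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    if not counts:
        return None, 0
    return counts.most_common(1)[0]


def merge_ngram(seq, ngram, new_symbol):
    """Replace non-overlapping occurrences of `ngram`, scanning left to right."""
    n = len(ngram)
    out, i = [], 0
    while i < len(seq):
        if tuple(seq[i:i + n]) == ngram:
            out.append(new_symbol)
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out


def backoff_merge(seq, max_n=3, merges_per_level=1, min_count=2):
    """Merge frequent n-grams for n = max_n down to 2 (ordinary BPE pairs)."""
    merges = []
    for n in range(max_n, 1, -1):
        for _ in range(merges_per_level):
            ngram, count = most_common_ngram(seq, n)
            if ngram is None or count < min_count:
                break
            seq = merge_ngram(seq, ngram, "".join(ngram))
            merges.append(ngram)
    return seq, merges


print(backoff_merge(list("abcabcabc")))
```

With `max_n=2` this reduces to a plain pair-merging loop, so the only new ingredient is the outer loop over n.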
Library Is All You Need

It will bring efficiency, consistency, scalability, adaptability, and "interoperability" to language models globally. The idea is to have a universal linguistic unit library/dictionary/model for representation and processing. BPE doesn't have a dictionary; it depends on the dictionary (vocab pool) that the LLM builds during training. I say let's have a separate (external) standard dictionary of universally indexed linguistic units that is in everyone's hands. Current LLMs, RAGs, and vector DBs are based on processing "individual characters and words", which is inefficient. There is a need for a universal linguistic unit library/dictionary/model for representation and processing to improve LLMs and reduce costs and hallucinations. We have Unicode characters
Maybe I am completely wrong, but to me using something like bpe to build an encoding for text feels stupid. Sure, it is a fairly easy way and it will build an encoding that is efficient in terms of sequence length, but is that the only requirement for such an encoding? Would using an encoding that makes sense not make training and inference easier?
Should we not engineer an encoding instead? Using a-priori knowledge of languages from dictionaries for instance?