counting pairs is inaccurate for repeating tokens? #51
I just noticed that counting pairs might be slightly inaccurate when there are a lot of repeating tokens. For example, in the sequence 1, 1, 1, 1 the pair (1, 1) gets counted 3 times, but we can only replace it 2 times, because the occurrences overlap. This can matter: in the sequence 1, 1, 1, 1, 2, 1, 2, 1, 2, for example, we might choose the pair (1, 1) over (1, 2) even though (1, 2) allows more substitutions.
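For illustration, a minimal Python sketch of the discrepancy (the two helpers below are written for this example and are not code from this repo, though they mirror the usual counting and merging helpers in spirit):

```python
def count_pairs(ids):
    """Naive counting: every adjacent pair, including overlapping ones."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Greedy left-to-right replacement of non-overlapping occurrences."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # consume both tokens, so occurrences cannot overlap
        else:
            out.append(ids[i])
            i += 1
    return out

ids = [1, 1, 1, 1]
print(count_pairs(ids)[(1, 1)])               # 3 pairs counted...
print(len(ids) - len(merge(ids, (1, 1), 3)))  # ...but only 2 merges happen
```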
That's an interesting observation! I think it's not a problem in a real scenario.
Yes, I think it is not that bad in a real-life scenario. However, look at what happens in our example: 1, 1, 1, 1, 2, 1, 2, 1, 2 -> 3, 3, 2, 1, 2, 1, 2 after merging (1, 1) into 3. If we pick (1, 2) now, we can only substitute it 2 times, while in the original sequence we could have substituted it 3 times. I'm not sure how bad it is, but I noticed (after writing this issue) that this exact problem is mentioned in the paper from the first issue. They also use the same counting function as this repo and basically say it is not that important. But I think you should keep it in the back of your mind.
We mention it in the paper (page 3, top left) and give an algorithm with the adjustment on the last page (screenshot here). We were particular about it in the paper because it does have formal implications for the proofs: there we want the quantity to be how much is compressed in the next step, and pair frequency is not that. Libraries usually fix it, but IMO it does not have a big practical impact on natural language (it could, if there are super long repeated strings in the data, such as …)
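To sketch the idea of the adjustment (this is an illustrative reconstruction, not the paper's exact algorithm from its last page): count an occurrence of a pair only if it does not overlap the previously counted occurrence of the same pair, which can only happen for pairs of the form (a, a).

```python
def count_pairs_adjusted(ids):
    """Count pairs so each count equals the merges one greedy pass can do."""
    counts, last = {}, {}  # last: pair -> index of its last counted occurrence
    for i in range(len(ids) - 1):
        pair = (ids[i], ids[i + 1])
        # overlap is only possible for pairs like (a, a): skip an occurrence
        # that shares ids[i] with the same pair counted at position i - 1
        if last.get(pair) == i - 1:
            continue
        counts[pair] = counts.get(pair, 0) + 1
        last[pair] = i
    return counts

print(count_pairs_adjusted([1, 1, 1, 1]))
# {(1, 1): 2}
print(count_pairs_adjusted([1, 1, 1, 1, 2, 1, 2, 1, 2]))
# {(1, 1): 2, (1, 2): 3, (2, 1): 2} -> now (1, 2) correctly beats (1, 1)
```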
Yes, thanks for your reply. I had already looked at your paper, which you linked in the first issue. Personally I think the algorithm is quite nice, and I am currently implementing it in C++ and calling it from Python using ctypes; I hope to get a much faster implementation that way. If you want, I can link it to you when it is finished. Another thing I noticed is that the encoding function could also be much faster by using a linked list: when iterating through each pair from the tokenizer, the linked list can be adjusted similarly to your algorithm. I'm not sure what the time complexity of this is, but it should be much faster; I'm guessing something like O(n + m), where n is the sequence length and m the number of tokens, but I haven't thought much about it.
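For what it's worth, here is one way that linked-list encoder could look, as a hedged Python sketch rather than C++ (`Node`, `encode`, and the `index` map are invented for illustration; this is not code from this repo or the paper). It keeps the sequence in a doubly linked list plus an index from each pair to the nodes where it starts, and applies the learned merges in rank order; that order is safe because a merge can only create pairs containing the new, higher-ranked token, so it can never resurrect an earlier rule's pair. The per-rule sort keeps overlapping occurrences merging left to right, which puts this sketch at roughly O(n log n + m) rather than the hoped-for O(n + m).

```python
from collections import defaultdict

class Node:
    __slots__ = ("val", "pos", "prev", "next")
    def __init__(self, val, pos):
        self.val, self.pos = val, pos
        self.prev = self.next = None

def encode(ids, merges):
    """merges: dict mapping pair -> new token id, in the order the rules
    were learned (dicts preserve insertion order in Python 3.7+)."""
    nodes = [Node(v, i) for i, v in enumerate(ids)]
    index = defaultdict(set)  # pair -> left nodes of its current occurrences
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
        index[(a.val, b.val)].add(a)
    for pair, new_id in merges.items():
        # left-to-right order keeps overlapping occurrences (e.g. (1, 1)
        # in 1, 1, 1, 1) merging greedily, as a full rescan would
        for left in sorted(index.pop(pair, ()), key=lambda n: n.pos):
            right = left.next
            # skip stale entries: a neighboring merge may already have
            # changed or removed one of the two nodes
            if left.val is None or right is None or (left.val, right.val) != pair:
                continue
            if left.prev is not None:   # unregister the old neighbor pairs
                index[(left.prev.val, left.val)].discard(left.prev)
            if right.next is not None:
                index[(right.val, right.next.val)].discard(right)
            left.val = new_id           # left absorbs the pair...
            left.next = right.next
            if right.next is not None:
                right.next.prev = left
            right.val = None            # ...and right is tombstoned
            if left.prev is not None:   # register the new neighbor pairs
                index[(left.prev.val, left.val)].add(left.prev)
            if left.next is not None:
                index[(left.val, left.next.val)].add(left)
    out, node = [], nodes[0] if nodes else None
    while node is not None:             # the head node is never spliced out
        out.append(node.val)
        node = node.next
    return out

# example: the two merges discussed above, applied in learned order
print(encode([1, 1, 1, 1, 2, 1, 2, 1, 2], {(1, 1): 3, (1, 2): 4}))
# [3, 3, 2, 4, 4]
```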
I submitted PR #51 for it. |