Alternative to "Vector Representation Pre-training" possible? #30

Open
Heavy02011 opened this issue Feb 21, 2024 · 1 comment

Heavy02011 commented Feb 21, 2024

In the paper Driving with LLMs a "Vector Representation Pre-training" is used to interact with llama-7b as follows: "In our framework, we aim to convert vector representations into language using a structured language generator to facilitate the grounding the vector representation into LLMs. Since our object-level vectors contain semantically significant attributes, such as the number of cars and pedestrians, their respective locations, orientations, speeds, bounding boxes and other attributes, we employ a structured language generator (lanGen) function to craft pseudo-language labels derived from the vector space…"
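
To make the idea concrete, here is a minimal sketch of what such a structured language generator could look like. The attribute names and sentence templates are my own assumptions for illustration, not the actual lanGen implementation from the paper:

```python
# Hypothetical lanGen-style function: turns object-level vectors
# (cars, pedestrians, etc.) into pseudo-language labels.
# Field names and templates are illustrative assumptions only.

def lan_gen(objects):
    """objects: list of dicts like
    {"type": "car", "x": 12.3, "y": -4.1, "speed": 8.2, "heading": 90.0}"""
    counts = {}
    for obj in objects:
        counts[obj["type"]] = counts.get(obj["type"], 0) + 1

    lines = [f"There are {n} {t}(s) in the scene." for t, n in counts.items()]
    for obj in objects:
        lines.append(
            f"A {obj['type']} is at ({obj['x']:.1f} m, {obj['y']:.1f} m), "
            f"moving at {obj['speed']:.1f} m/s with heading {obj['heading']:.0f} degrees."
        )
    return " ".join(lines)


if __name__ == "__main__":
    scene = [
        {"type": "car", "x": 12.3, "y": -4.1, "speed": 8.2, "heading": 90.0},
        {"type": "pedestrian", "x": 3.0, "y": 1.5, "speed": 1.2, "heading": 180.0},
    ]
    print(lan_gen(scene))
```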

Question: can we apply or modify minbpe to achieve the same? Or is this a silly question…

marcov-dart commented

Sounds interesting. I should read the paper first, but just going off the text you provided I am unsure how you are relating it to BPE. They are going from vector representations to pseudo-language. BPE allows text to be encoded as a sequence of integers, where the integers are then used to look up the vector representations. So basically the other way around. But maybe you are thinking about reversing the principle?
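
To illustrate the usual direction: with this repo's BasicTokenizer it is text → integer ids → embedding lookup, roughly like this sketch (the embedding table here is just a random stand-in for what an LLM would learn):

```python
# Sketch of the usual BPE direction: text -> integer ids -> vectors.
import numpy as np
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
tokenizer.train("the quick brown fox jumps over the lazy dog " * 50, vocab_size=300)

ids = tokenizer.encode("the lazy dog")       # text -> sequence of integers
embedding_table = np.random.randn(300, 16)   # stand-in: vocab_size x embedding_dim
vectors = embedding_table[ids]               # integers -> vector representations

print(ids)
print(vectors.shape)                         # (len(ids), 16)
print(tokenizer.decode(ids))                 # and back: integers -> text
```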

It is something I have been thinking about, and I tried to get some discussion started in another post. Not much response yet, unfortunately.
I was thinking of a scheme where the tokeniser doesn't just give a sequence of tokens but also a set of modifiers for the tokens. The integer we would use as normal to look up the token embedding, but the modifiers we would map onto synthetic (non-trained) dimensions of the token embedding.

The tokeniser would have to be very different from just doing BPE. Instead we would make an attempt at building a tokeniser using a-priori knowledge of different languages. Just as an example:

" bus", "Bus", "busses" for instance would all map to the same integer and the synthetic dimensions would provide the information of the specific variation. Maybe language could be a modifier too. So that the same word in different languages also have the same integer.
