how to deal with special tokens for multiple files #44

Open
IamExperimenting opened this issue Feb 24, 2024 · 0 comments

Comments

@IamExperimenting

Hi,

I have a question about Byte-Pair Encoding, specifically about special tokens. My domain dataset consists of 1780 files. Do I need to:

  1. put <|startoftext|> at the beginning and <|endoftext|> at the end of the text in each file?
  2. or combine all 1780 files into one, with <|endoftext|> at the end of each file's text? As Andrej mentioned, this lets the model treat it as a delimiter.
  3. Is minbpe capable of handling this on its own?
  4. Is there a specific format in which I should prepare my data before passing it to minbpe, such as a dataframe with one text file per row?

Can you please help me understand this, @karpathy?
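To make option 2 concrete, here is a minimal sketch of what "combine the files and mark the end of each one" could look like in plain Python. It is an illustration of the idea, not a confirmed minbpe workflow: the helper names `load_corpus` and `join_documents`, the `*.txt` pattern, and the directory layout are all assumptions. The minbpe calls appear only in comments and follow what the project's README appears to show (train on text, then register special tokens separately), which should be checked against the repository.

```python
from pathlib import Path

# Delimiter string from the question; the token id it maps to is chosen
# by whoever registers it with the tokenizer, not by this script.
END_OF_TEXT = "<|endoftext|>"


def load_corpus(directory, pattern="*.txt"):
    """Read every matching file in `directory` (hypothetical layout: one document per file)."""
    paths = sorted(Path(directory).glob(pattern))
    return [path.read_text(encoding="utf-8") for path in paths]


def join_documents(texts, delimiter=END_OF_TEXT):
    """Concatenate documents, appending the delimiter after each one."""
    return "".join(text + delimiter for text in texts)


# Small in-memory example so the sketch runs without any files on disk.
docs = ["first document", "second document"]
combined = join_documents(docs)
print(combined)

# Assumed usage with minbpe, based on its README (unverified here):
#   tokenizer = RegexTokenizer()
#   tokenizer.train("".join(docs), vocab_size=...)   # train on the raw text
#   tokenizer.register_special_tokens({END_OF_TEXT: ...})
#   ids = tokenizer.encode(combined, allowed_special="all")
```

With this layout, a dataframe is not needed: a list of strings, one per file, is enough to build either the raw training text or the delimited text.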

@IamExperimenting changed the title from "how deal special tokens with encoding multiple files" to "how to deal with special tokens for multiple files" on Feb 24, 2024