I have a question about Byte-Pair Encoding, especially special tokens. My domain dataset consists of 1780 files. Do I need to put
<|startoftext|> at the beginning of the text in each file and <|endoftext|> at the end of the text in each file?
Or do I need to combine all 1780 files into one and put <|endoftext|> at the end of each file's text? As Andrej mentioned, this lets the model treat it as a delimiter.
Is minbpe capable of handling those on its own?
Is there a specific format in which I should prepare my data before passing it to minbpe, such as a dataframe with one text file per row?
IamExperimenting changed the title from "how deal special tokens with encoding multiple files" to "how to deal with special tokens for multiple files" on Feb 24, 2024
Hi,
Can you please help me understand this, @karpathy?
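
To make the question concrete, here is a minimal sketch of the "combine everything and delimit with <|endoftext|>" option. It is one possible workflow under stated assumptions, not a confirmed recommendation: the helper names (`load_documents`, `join_documents`, `train_and_encode`), the `*.txt` pattern, and the `vocab_size` value are hypothetical. The minbpe calls (`RegexTokenizer`, `train`, `register_special_tokens`, `encode(..., allowed_special="all")`) follow the usage shown in the minbpe README. The sketch assumes the special token is registered after training, so training runs on plain text only, and it uses only <|endoftext|>, not <|startoftext|>.

```python
from pathlib import Path

EOT = "<|endoftext|>"  # delimiter string; the token id is assigned after training


def load_documents(data_dir, pattern="*.txt"):
    """Read every matching file into a list of strings, in sorted path order."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(data_dir).glob(pattern))]


def join_documents(texts, delimiter=EOT):
    """Concatenate documents, appending the delimiter after each one."""
    return "".join(text + delimiter for text in texts)


def train_and_encode(texts, vocab_size=512):
    """Train on plain text, then register the delimiter and encode the joined corpus.

    Assumes the minbpe repository is importable from the current environment.
    """
    from minbpe import RegexTokenizer

    tokenizer = RegexTokenizer()
    # Train on the raw documents only, so the characters of the delimiter
    # string never take part in BPE merges.
    tokenizer.train("\n".join(texts), vocab_size)
    # Ids 0..vocab_size-1 are used by bytes and merges, so the first free id
    # is vocab_size (an assumption mirroring the README example).
    tokenizer.register_special_tokens({EOT: vocab_size})
    ids = tokenizer.encode(join_documents(texts), allowed_special="all")
    return tokenizer, ids
```

With this layout no dataframe is needed: a plain list of strings, one per file, is enough, for example `tokenizer, ids = train_and_encode(load_documents("my_dataset"))`, where `my_dataset` is a placeholder directory name.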