how to deal with special tokens for multiple files #44

IamExperimenting · 2024-02-24T12:01:29Z

Hi,

I have a question about Byte-Pair Encoding, specifically special tokens. My domain dataset consists of 1780 text files.

1. Do I need to add <|startoftext|> at the beginning and <|endoftext|> at the end of the text in each file?
2. Or should I combine all 1780 files into one, with <|endoftext|> at the end of each file's text? As Andrej mentioned, this lets the model treat it as a delimiter.
3. Can minbpe handle this on its own?
4. Is there a specific format I should prepare my data in before passing it to minbpe, e.g. a dataframe with one text file per row?
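
To make the "combine everything into one text" option concrete, here is a minimal sketch of what I mean. The helper name `join_documents` and the constant `END_OF_TEXT` are my own hypothetical names, not part of minbpe, and the sketch only builds the combined string. Whether minbpe should see the delimiter during training, or have it registered separately, is exactly what I am unsure about.

```python
from pathlib import Path
import tempfile

# Delimiter string from the question; treated here as plain text.
END_OF_TEXT = "<|endoftext|>"


def join_documents(paths, delimiter=END_OF_TEXT):
    """Read each file as UTF-8 and append the delimiter after its text.

    Hypothetical helper for illustration only; not a minbpe function.
    """
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        parts.append(text + delimiter)
    return "".join(parts)


if __name__ == "__main__":
    # Tiny stand-in for the 1780 domain files.
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "doc1.txt"
        second = Path(tmp) / "doc2.txt"
        first.write_text("first document", encoding="utf-8")
        second.write_text("second document", encoding="utf-8")
        combined = join_documents(sorted(Path(tmp).glob("*.txt")))
        print(combined)
```

For the real dataset, the list of paths would come from something like `sorted(Path("my_data_dir").glob("*.txt"))`, where `my_data_dir` is a placeholder directory name.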

Could you please help me understand this, @karpathy?

IamExperimenting changed the title ~~how deal special tokens with encoding multiple files~~ how to deal with special tokens for multiple files Feb 24, 2024
