Open Source AI Train Codes (Microsoft Samba)
- Статус: Closed
- Винагорода: $300
- Надіслані роботи: 10
Стисле описання конкурсу
We need some work on the code on https://github.com/microsoft/Samba repo in order to make improvements and easy/friendless overall model training process:
- preprocess (tokenize) data script from files on a specific local folder or Hugginface repo. The --source_path flag have to be:
> the Hugginface path - like "--source_path HuggingFaceFW/fineweb-edu/tree/main/data/CC-MAIN-2013-20", where before "/tree/" we have the repo name and after "/tree/" we have {branch}/path_to_folder or
> a local path, like "--source_path /var/datasets/fineweb-edu/data/CC-MAIN-2013-20".
> The accepted formats data files is parquet or JSONL;
- pretrain script have to generate and save a 421M parameters model in a single Runpod 8xH100 node with data (train and validation) preprocessed in the previous step;
- For instruction training and align, have to use the following formats:
> https://huggingface.co/datasets/yahma/alpaca-cleaned (instruction, input, output)
> https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered (conversations)
> https://huggingface.co/datasets/TIGER-Lab/WebInstructSub (question/answer)
Preprocess for instruction training/align have to use the same script used for pretraining.
- Inference script using cpu and cuda;
- Explain on README how to use https://github.com/EleutherAI/lm-evaluation-harness for model eval (optional).
All training codes have to show the progress of loss eval in the terminal. The winner have to fork and publish in your own github profile and give me the link; or have to send me direct the code if prefer.
We will evaluate the codes in an 8xH100 PCIe node (if loss decrease and converge) and make it public under the same license as the original (MIT license). The acknowledgment will be give to the winner unless they ask not be cited.
We will make some experiments and publish the checkpoints eventually.
We will not give any resource to develop and test the codes, the developer/engineer have to test the codes using your own before send as finished work.
Рекомендовані навички
Найкращі роботи цього конкурсу
-
fizzaadeel Pakistan
-
Babar9890 Pakistan
-
Guna75 India
-
komla100 Bangladesh
-
websolution98 Bangladesh
-
Lucifer0o0 Turkey
-
mdsaiful963bd Bangladesh
-
Muzafarbaloch Pakistan
-
choubeyanubhav India
-
desnerbdnz Bangladesh
Публічна роз'яснювальна дошка
Як організувати конкурс
-
Опублікуйте свій конкурс Швидко та просто
-
Отримайте безліч конкурсних робіт З усього світу
-
Оберіть найкращу роботу Завантажуйте файли - це просто!