Hi,
I ran the fine-tune script on the Mistral base model but got rather poor results on ARC Challenge (<50% with retrieval). Any ideas why? I will repeat with Mistral Instruct to see if it helps, but I am not optimistic, as I have seen similarly poor results when fine-tuning that model with the self-rag dataset and script. The exact command I ran:
```bash
MODEL_SIZE=7B
NUM_GPUS=8
BATCH_SIZE_PER_GPU=1
TOTAL_BATCH_SIZE=128
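# Effective batch size = NUM_GPUS * BATCH_SIZE_PER_GPU * GRADIENT_ACC_STEPS = 8 * 1 * 16 = 128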
GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_GPUS/$BATCH_SIZE_PER_GPU))
echo "Training llama model ${MODEL_SIZE} using $NUM_GPUS GPUs, $BATCH_SIZE_PER_GPU batch size per GPU, $GRADIENT_ACC_STEPS gradient accumulation steps"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--mixed_precision bf16 \
--num_machines 1 \
--num_processes $NUM_GPUS \
--use_deepspeed \
--deepspeed_config_file stage3_no_offloading_accelerate.conf \
finetune.py \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--use_flash_attn \
--tokenizer_name mistralai/Mistral-7B-v0.1 \
--use_slow_tokenizer \
--train_file full_output_1005.jsonl \
--max_seq_length 2048 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size $BATCH_SIZE_PER_GPU \
--gradient_accumulation_steps $GRADIENT_ACC_STEPS \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0. \
--num_train_epochs 5 \
--output_dir output/mistral_root_${MODEL_SIZE}/ \
--with_tracking \
--report_to tensorboard \
--logging_steps 1 \
--use_special_tokens
```
EDIT: I had a chance to look into this today. I am fairly confident the issue is that this script will NOT work for a model whose tokenizer has not been independently prepared with the Self-RAG special tokens. I will confirm and then close the issue. It might be nice to add some documentation on how to replicate the fine-tuning result from scratch.
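For anyone else hitting this, here is a minimal sketch of what I believe the missing preparation step looks like: add the Self-RAG reflection tokens to the tokenizer and resize the model's embedding matrix before running finetune.py. The token strings are taken from the Self-RAG paper and the save path is hypothetical; double-check both against the repo.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"

# Reflection tokens as listed in the Self-RAG paper; verify the exact strings
# against the repo's finetune.py, since a mismatch would silently break training.
SPECIAL_TOKENS = [
    "[No Retrieval]", "[Retrieval]", "[Continue to Use Evidence]",
    "[Irrelevant]", "[Relevant]", "<paragraph>", "</paragraph>",
    "[Utility:1]", "[Utility:2]", "[Utility:3]", "[Utility:4]", "[Utility:5]",
    "[Fully supported]", "[Partially supported]", "[No support / Contradictory]",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=False)
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": SPECIAL_TOKENS}
)

model = AutoModelForCausalLM.from_pretrained(MODEL)
if num_added > 0:
    # New embedding rows are randomly initialized and must be learned during
    # fine-tuning; skipping this step leaves the reflection tokens untrained.
    model.resize_token_embeddings(len(tokenizer))

# Hypothetical output path; point --model_name_or_path / --tokenizer_name here.
tokenizer.save_pretrained("mistral-7b-selfrag-prepped")
model.save_pretrained("mistral-7b-selfrag-prepped")
```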