
Commit 120b42e

Fixed GPT-125m
Hugging Face doesn't have 125M; it has 125m instead.
aksj98 authored Jul 6, 2023
1 parent 3940306 commit 120b42e
Showing 1 changed file with 30 additions and 30 deletions.
biencoder/nli_msmarco/README.md
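
The diff below swaps the capital-M hub id for the lowercase one throughout the README. As a quick sanity check of the renamed id, it can be loaded with the `transformers` auto classes (a minimal sketch, assuming `transformers` is installed; this snippet is not part of the repository):

```python
# Confirm the lowercase Hugging Face hub id resolves (illustrative check only).
from transformers import AutoModel, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125m"  # lowercase "m", matching the updated commands
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
print(model.config.model_type)  # expected: "gpt_neo"
```
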
@@ -70,18 +70,18 @@ accelerate config
accelerate launch examples/training/nli/training_nli_v2.py --model_name bert-large-uncased
```

-Training of `SGPT-125M-mean-nli` on 1 24GiB GPU:
+Training of `SGPT-125m-mean-nli` on 1 24GiB GPU:

```bash
accelerate config
-accelerate launch --main_process_port 1469 examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125M --pooling mean
+accelerate launch --main_process_port 1469 examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125m --pooling mean
```

-Training of `SGPT-125M-weightedmean-nli` on 1 24GiB GPU:
+Training of `SGPT-125m-weightedmean-nli` on 1 24GiB GPU:

```bash
accelerate config
-accelerate launch --main_process_port 1469 examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125M --pooling weightedmean
+accelerate launch --main_process_port 1469 examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125m --pooling weightedmean
```

Training of `SGPT-1.3B-mean-nli` on 8 40GiB GPUs:
@@ -100,25 +100,25 @@ accelerate config
accelerate launch --main_process_port 1469 examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-1.3B --train_batch_size 6 --lr 1e-5 --pooling weightedmean
```

-Training of `SGPT-125M-mean-nli-linear5` on 4 40GiB GPUs:
+Training of `SGPT-125m-mean-nli-linear5` on 4 40GiB GPUs:

```bash
accelerate config
-accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125M --freeze --addxlinear 5 --wandb --useact
+accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125m --freeze --addxlinear 5 --wandb --useact
```

-Training of `SGPT-125M-mean-nli-linearthenpool5` on 4 40GiB GPUs:
+Training of `SGPT-125m-mean-nli-linearthenpool5` on 4 40GiB GPUs:

```bash
accelerate config
-accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125M --freeze --addxlinear 5 --linearthenpool --wandb --useact
+accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125m --freeze --addxlinear 5 --linearthenpool --wandb --useact
```

-Training of `SGPT-125M-weightedmean-nli-linearthenpool5` on 4 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-nli-linearthenpool5` on 4 40GiB GPUs:

```bash
accelerate config
-accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125M --freeze --addxlinear 5 --linearthenpool --wandb --pooling weightedmean --useact
+accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125m --freeze --addxlinear 5 --linearthenpool --wandb --pooling weightedmean --useact
```

Training of `SGPT-1.3B-mean-nli-linearthenpool5` on 4 40GiB GPUs:
@@ -155,13 +155,13 @@ accelerate launch examples/training/nli/training_nli_v2.py --model_name Eleuther
Models with larger batch size (for the most part, these are the ones used in the paper). These models use GradCache, a technique for gradient accumulation with contrastive learning.
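
For readers unfamiliar with GradCache, the sketch below illustrates the idea in plain PyTorch: embeddings for the whole batch are first computed chunk by chunk without gradients, the contrastive loss and its gradient with respect to those cached embeddings are computed once, and each chunk is then re-encoded with gradients enabled so activation memory stays bounded by the chunk size. The `encoder`, `queries`, and `docs` names are illustrative assumptions, not the repository's API; the actual training commands enable the technique via the `--gradcache` and `--chunksize` flags shown below.

```python
# Minimal GradCache-style training step (illustrative sketch, not the repository's code).
import torch
import torch.nn.functional as F

def gradcache_step(encoder, optimizer, queries, docs, chunk_size):
    # queries/docs are assumed to be tensors the encoder maps to pooled embeddings.
    # 1) Forward pass over all chunks without gradients to cache the full batch of embeddings.
    with torch.no_grad():
        q_emb = torch.cat([encoder(c) for c in queries.split(chunk_size)])
        d_emb = torch.cat([encoder(c) for c in docs.split(chunk_size)])

    # 2) In-batch-negatives contrastive loss on the cached embeddings and its gradient
    #    with respect to them (temperature/scaling omitted for brevity).
    q_emb.requires_grad_(True)
    d_emb.requires_grad_(True)
    scores = q_emb @ d_emb.T
    labels = torch.arange(scores.size(0), device=scores.device)
    loss = F.cross_entropy(scores, labels)
    loss.backward()  # populates q_emb.grad / d_emb.grad only; encoder params are untouched here

    # 3) Re-encode chunk by chunk with gradients enabled, back-propagating the cached
    #    embedding gradients; peak memory now scales with chunk_size, not batch size.
    optimizer.zero_grad()
    for chunk, grad in zip(queries.split(chunk_size), q_emb.grad.split(chunk_size)):
        encoder(chunk).backward(gradient=grad)
    for chunk, grad in zip(docs.split(chunk_size), d_emb.grad.split(chunk_size)):
        encoder(chunk).backward(gradient=grad)
    optimizer.step()
    return loss.detach()
```
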


-Training of `SGPT-125M-weightedmean-nli-bitfit` on 8 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-nli-bitfit` on 8 40GiB GPUs:

```bash
accelerate config
cd sentence-transformers
accelerate config
-CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125M --freezenonbias --train_batch_size 128 --lr 32e-5 --pooling weightedmean --wandb --wandbwatchlog gradients --gradcache --chunksize 128
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch examples/training/nli/training_nli_v2.py --model_name EleutherAI/gpt-neo-125m --freezenonbias --train_batch_size 128 --lr 32e-5 --pooling weightedmean --wandb --wandbwatchlog gradients --gradcache --chunksize 128
```


@@ -224,57 +224,57 @@ accelerate config
CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2223 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name bert-base-uncased --train_batch_size 32 --freezenonbias --asym --wandb --wandbwatchlog gradients
```

-Training of `SGPT-125M-weightedmean-msmarco` on 2 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco` on 2 40GiB GPUs:

```bash
accelerate config
-CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --wandb --wandbwatchlog gradients
+CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --wandb --wandbwatchlog gradients
```

-Training of `SGPT-125M-weightedmean-msmarco-asym` on 2 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco-asym` on 2 40GiB GPUs:

```bash
accelerate config
-CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --asym --wandb --wandbwatchlog gradients
+CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --asym --wandb --wandbwatchlog gradients
```

-Training of `SGPT-125M-weightedmean-msmarco-bitfit` on 2 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco-bitfit` on 2 40GiB GPUs:

```bash
accelerate config
-CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --freezenonbias --lr 2e-4 --wandb --wandbwatchlog gradients
+CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --freezenonbias --lr 2e-4 --wandb --wandbwatchlog gradients
```


-Training of `SGPT-125M-weightedmean-msmarco-bitfit` on 2 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco-bitfit` on 2 40GiB GPUs:

```bash
accelerate config
-CUDA_VISIBLE_DEVICES=0,1 accelerate launch --main_process_port 2225 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --freezenonbias --lr 8e-4 --wandb --wandbwatchlog gradients
+CUDA_VISIBLE_DEVICES=0,1 accelerate launch --main_process_port 2225 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --freezenonbias --lr 8e-4 --wandb --wandbwatchlog gradients
```

-Training of `SGPT-125M-weightedmean-msmarco-speca-bitfit` on 2 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco-speca-bitfit` on 2 40GiB GPUs:

```bash
-CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --wandb --wandbwatchlog gradients --speca --pooling weightedmean
+CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --wandb --wandbwatchlog gradients --speca --pooling weightedmean
```

-Training of `SGPT-125M-lasttoken-msmarco-speca-bitfit` on 2 40GiB GPUs:
+Training of `SGPT-125m-lasttoken-msmarco-speca-bitfit` on 2 40GiB GPUs:

```bash
-CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 2225 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --wandb --wandbwatchlog gradients --speca --pooling lasttoken
+CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 2225 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --wandb --wandbwatchlog gradients --speca --pooling lasttoken
```

-Training of `SGPT-125M-weightedmean-msmarco-specb-bitfit` on 2 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco-specb-bitfit` on 2 40GiB GPUs:

```bash
-CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --wandb --wandbwatchlog gradients --specb --pooling weightedmean
+CUDA_VISIBLE_DEVICES=4,5 accelerate launch --main_process_port 2224 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --wandb --wandbwatchlog gradients --specb --pooling weightedmean
```

-Training of `SGPT-125M-lasttoken-msmarco-specb-bitfit` on 2 40GiB GPUs:
+Training of `SGPT-125m-lasttoken-msmarco-specb-bitfit` on 2 40GiB GPUs:

```bash
-CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 2225 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --wandb --wandbwatchlog gradients --specb --pooling lasttoken
+CUDA_VISIBLE_DEVICES=6,7 accelerate launch --main_process_port 2225 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --wandb --wandbwatchlog gradients --specb --pooling lasttoken
```

Training of `SGPT-1.3B-weightedmean-msmarco-specb-bitfit` on 8 40GiB GPUs:
@@ -313,12 +313,12 @@ If unspecified in the arguments, batch size is always 64 & lr is 2e-5 (argparse

Models with larger batch size (for the most part, these are the ones used in the paper). These models use GradCache, a technique for gradient accumulation with contrastive learning.

-Training of `SGPT-125M-weightedmean-msmarco-specb-bitfit` on 8 40GiB GPUs:
+Training of `SGPT-125m-weightedmean-msmarco-specb-bitfit` on 8 40GiB GPUs:

```bash
cd sentence-transformers
accelerate config
-CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --main_process_port 2222 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125M --train_batch_size 32 --eval_batch_size 16 --freezenonbias --specb --lr 4e-4 --wandb --wandbwatchlog gradients --pooling weightedmean --gradcache --chunksize 32
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --main_process_port 2222 examples/training/ms_marco/train_bi-encoder_mnrl.py --model_name EleutherAI/gpt-neo-125m --train_batch_size 32 --eval_batch_size 16 --freezenonbias --specb --lr 4e-4 --wandb --wandbwatchlog gradients --pooling weightedmean --gradcache --chunksize 32
```

Training of `SGPT-1.3B-weightedmean-msmarco-specb-bitfit` on 8 40GiB GPUs:
