This is a benchmark for TRL. Below we show the commands to run it on a Slurm cluster, but they can easily be adapted to run locally.
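For a local run, skip the `benchmark/benchmark.py` wrapper and invoke the underlying script directly (one seed, one GPU; this is simply the inner `--command` string from the first launch below):

```bash
# Run a single seed of the baseline sentiment-tuning benchmark without Slurm.
python examples/scripts/sentiment_tuning.py --ppo_config.log_with wandb
```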
There are several benchmark axes we want to explore:
- w/ different models (gpt2, gpt2-xl, falcon, llama2)
  - key research engineering questions:
    - how do different model sizes scale?
    - given that the preference labels come from a source model $M_s$ (e.g., gpt2), how does that affect the performance of a target model $M_t$ (e.g., falcon, gptj, llama2)?
      - This is actually an important assumption we have been operating under.
- w/ and w/o gradient accumulation / multi-GPU
  - key research engineering question: do we need to whiten the advantages across the entire batch? (See the sketch after this list.)
- w/ and w/o PEFT
  - key research engineering question: how well does PEFT work with RL training?
- w/ and w/o quantization (e.g., 4-bit)
  - key research engineering question: how well does quantization work with RL training?
- w/ and w/o DeepSpeed
  - sanity check to make sure it works.
- w/ different datasets
  - TRL’s typical imdb sentiment dataset
  - OAI’s sentiment dataset (https://github.com/openai/lm-human-preferences)
  - summarize from feedback (https://github.com/openai/summarize-from-feedback)
  - helpfulness vs harmlessness (https://huggingface.co/datasets/Anthropic/hh-rlhf)
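On the whitening question above: whitening normalizes the advantages to zero mean and unit variance, and under gradient accumulation each mini-batch only sees a slice of the full batch, so per-mini-batch statistics can differ from full-batch statistics. A minimal sketch of full-batch whitening (TRL ships a similar `whiten` helper in `trl/core.py`; the tensor names here are illustrative):

```python
import torch

def whiten(advantages: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Whiten advantages across the entire batch: unit variance, optionally zero mean."""
    mean, var = advantages.mean(), advantages.var()
    whitened = (advantages - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened += mean  # keep the original mean, only rescale the variance
    return whitened

# With mini_batch_size 1 and 128 accumulation steps, each optimizer step only
# sees one sample at a time, so whitening per mini-batch vs. across the full
# batch of 128 is exactly what the gradient-accumulation runs below probe.
batch_advantages = torch.randn(128)
print(whiten(batch_advantages).mean(), whiten(batch_advantages).std())
```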
```bash
# Baseline: PPO sentiment tuning with the default gpt2 model, 5 seeds.
export WANDB_ENTITY=huggingface
python benchmark/benchmark.py \
    --command "python examples/scripts/sentiment_tuning.py --ppo_config.log_with wandb" \
    --num-seeds 5 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```
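The seed flags control a simple fan-out: with `--num-seeds 5` and `--start-seed 1`, the same `--command` is launched once per seed, 1 through 5. A purely illustrative sketch of that expansion, not `benchmark.py`'s actual implementation (in particular, passing the seed via `--ppo_config.seed` is an assumption):

```python
# Hypothetical sketch of the seed fan-out implied by --num-seeds/--start-seed.
command = "python examples/scripts/sentiment_tuning.py --ppo_config.log_with wandb"
start_seed, num_seeds = 1, 5
for seed in range(start_seed, start_seed + num_seeds):
    # Each expanded command becomes one Slurm task (or one local process).
    print(f"{command} --ppo_config.seed {seed}")
```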
```bash
# Gradient accumulation: mini_batch_size 1 with 128 accumulation steps
# (1 x 128 = 128 samples per optimizer update).
python benchmark/benchmark.py \
    --command "python examples/scripts/sentiment_tuning.py --ppo_config.exp_name sentiment_tuning_step_grad_accu --ppo_config.mini_batch_size 1 --ppo_config.gradient_accumulation_steps 128 --ppo_config.log_with wandb" \
    --num-seeds 5 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```
```bash
# gpt2 run with an explicit experiment name, for the model-scaling comparison.
python benchmark/benchmark.py \
    --command "python examples/scripts/sentiment_tuning.py --ppo_config.exp_name sentiment_tuning_gpt2 --ppo_config.log_with wandb" \
    --num-seeds 5 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```
```bash
# gpt2-xl with gradient accumulation (16 x 8 = 128 samples per optimizer
# update), so the larger model still fits on a single GPU.
python benchmark/benchmark.py \
    --command "python examples/scripts/sentiment_tuning.py --ppo_config.exp_name sentiment_tuning_gpt2xl_grad_accu --ppo_config.model_name gpt2-xl --ppo_config.mini_batch_size 16 --ppo_config.gradient_accumulation_steps 8 --ppo_config.log_with wandb" \
    --num-seeds 5 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```
```bash
# tiiuae/falcon-rw-1b run, for the model-scaling comparison.
python benchmark/benchmark.py \
    --command "python examples/scripts/sentiment_tuning.py --ppo_config.exp_name sentiment_tuning_falcon_rw_1b --ppo_config.model_name tiiuae/falcon-rw-1b --ppo_config.log_with wandb" \
    --num-seeds 5 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```
```bash
# PEFT run, enabled via --use_peft, for the "how well does PEFT work with RL
# training" question.
python benchmark/benchmark.py \
    --command "python examples/scripts/sentiment_tuning.py --ppo_config.exp_name sentiment_tuning_peft --use_peft --ppo_config.log_with wandb" \
    --num-seeds 5 \
    --start-seed 1 \
    --workers 10 \
    --slurm-nodes 1 \
    --slurm-gpus-per-task 1 \
    --slurm-ntasks 1 \
    --slurm-total-cpus 12 \
    --slurm-template-path benchmark/trl.slurm_template
```