
Refactor and benchmark #662

Merged · 34 commits merged into huggingface:main on Sep 13, 2023
Conversation

@vwxyzjn (Contributor) commented Aug 18, 2023

This PR does a few refactors. Some raw thoughts:

  • For better experiment tracking, we should also create a few more variables (see the sketch after this list):
    • query_dataset="imdb"
    • reward_model="sentiment-analysis:lvwerra/distilbert-imdb"; it could work with either a pipeline or a trained reward model
  • By default, we probably should use a vanilla model like gpt2 in place of lvwerra/gpt2-imdb
  • By default, we should demonstrate end-to-end training (reward model training followed by policy training).
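As a rough illustration, the proposed tracking variables could look like the config fields below. This is a hypothetical sketch only; the field names follow the bullets above and are not the final PPOConfig API.

from dataclasses import dataclass


@dataclass
class TrackingArgs:
    # dataset the queries are sampled from
    query_dataset: str = "imdb"
    # "<pipeline-task>:<model-id>", so it can point at either a transformers
    # pipeline or a trained reward model
    reward_model: str = "sentiment-analysis:lvwerra/distilbert-imdb"
    # base model to fine-tune; a vanilla model such as gpt2 by default
    model_name: str = "gpt2"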

We have multiple benchmark axes:

  • w/ different models (gpt2, gpt2-xl, falcon, llama2)
    • key research engineering questions
      • how do different model sizes scale?
      • given that the preference labels come from a source model M_s (e.g., gpt2), how does that affect the performance of a target model M_t (e.g., falcon, gptj, llama2)?
        • This is actually an important assumption we have been operating under.
  • w/ and w/o gradient accumulation / multi-GPU
    • key research engineering question: do we need to whiten advantages across the entire batch? (see the sketch after this list)
  • w/ and w/o peft
    • key research engineering question: how well does PEFT work with RL?
  • w/ and w/o quantization (4-bit)
    • key research engineering question: how well does quantization work with RL training?
  • w/ and w/o deepspeed
    • sanity check to make sure it works.
  • w/ different datasets
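To make the whitening question above concrete, here is a minimal sketch, assuming PyTorch and accelerate; the helper below is illustrative and not the exact TRL implementation.

import torch


def whiten(values: torch.Tensor) -> torch.Tensor:
    # normalize to zero mean / unit variance
    mean, var = values.mean(), values.var()
    return (values - mean) * torch.rsqrt(var + 1e-8)


# Per-process whitening uses only the local mini-batch statistics:
#   advantages = whiten(advantages)
# Whitening "across the entire batch" would first gather the advantages from
# all processes (e.g. gathered = accelerator.gather(advantages)) so that every
# worker normalizes with the same global statistics. Whether that difference
# matters in practice is exactly the benchmark question above.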

We can probably have a train.py that can do something like:

accelerate launch train.py --config deepspeed.yaml \
    --model_name falcon-40b \
    --query_dataset book \
    --label_dataset openai-sentiment-label \
    --gradient_accumulation_steps=12 \

Uses tyro to eliminate duplicate code / comments.

Help text also works:

[image: generated --help output]

tyro:
from dataclasses import dataclass, field

import tyro
from trl import PPOConfig


@dataclass
class ScriptArguments:
    ppo: PPOConfig = field(
        default_factory=lambda: PPOConfig(
            model_name="lvwerra/gpt2-imdb",
            learning_rate=1.41e-5,
            log_with=None,
            mini_batch_size=128,
            batch_size=128,
            gradient_accumulation_steps=1,
            early_stopping=False,
            target_kl=6,
            kl_penalty="kl",
            seed=0,
        )
    )
args = tyro.cli(ScriptArguments)

print(args.ppo.seed)

existing (HfArgumentParser):
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser
from trl import PPOConfig


@dataclass
class ScriptArguments:
    """
    The name of the Causal LM model we wish to fine-tune with PPO
    """

    # NOTE: gpt2 models use Conv1D instead of Linear layers which are not yet supported in 8 bit mode
    # models like gpt-neo* models are more suitable.
    model_name: Optional[str] = field(default="lvwerra/gpt2-imdb", metadata={"help": "the model name"})
    log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
    learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
    mini_batch_size: Optional[int] = field(default=128, metadata={"help": "the PPO minibatch size"})
    batch_size: Optional[int] = field(default=128, metadata={"help": "the batch size"})
    gradient_accumulation_steps: Optional[int] = field(
        default=1, metadata={"help": "the number of gradient accumulation steps"}
    )
    early_stopping: Optional[bool] = field(default=False, metadata={"help": "whether to early stop"})
    use_peft: Optional[bool] = field(default=False, metadata={"help": "whether to use peft"})
    use_seq2seq: Optional[bool] = field(default=False, metadata={"help": "whether to use seq2seq models"})
    kl_penalty: Optional[str] = field(
        default="kl",
        metadata={
            "help": "kl penalty options: 'kl': model_logp - ref_logp,  'abs': abs(kl),  'mse': mean squared error mse(kl) and 'full': the actual kl for all tokens in the distribution"
        },
    )
    target_kl: Optional[float] = field(default=0.1, metadata={"help": "kl target for early stopping"})
    seed: Optional[int] = field(default=0, metadata={"help": "the random seed"})
    use_score_scaling: Optional[bool] = field(default=False, metadata={"help": "Use score scaling"})
    use_score_norm: Optional[bool] = field(
        default=False, metadata={"help": "Use score normalization. Only applicable if use_score_scaling is True"}
    )
    score_clip: Optional[float] = field(default=None, metadata={"help": "Score clipping"})


parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]

config = PPOConfig(
    model_name=script_args.model_name,
    learning_rate=script_args.learning_rate,
    log_with=script_args.log_with,
    mini_batch_size=script_args.mini_batch_size,
    batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
    early_stopping=script_args.early_stopping,
    target_kl=script_args.target_kl,
    kl_penalty=script_args.kl_penalty,
    seed=script_args.seed,
    use_score_scaling=script_args.use_score_scaling,
    use_score_norm=script_args.use_score_norm,
    score_clip=script_args.score_clip,
)

Other changes:

  • more controlled terminology and tracking config
  • add accelerate logging
  • log global_backward_batch_size, global_batch_size, world_size (a sketch of how these could be derived follows)
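A minimal sketch of how those logged quantities could be derived from the per-process settings; the exact attribute names in the final config are assumptions here, and the example values simply match the defaults above.

from accelerate import Accelerator

accelerator = Accelerator()
world_size = accelerator.num_processes

# example per-process settings (matching the defaults above)
batch_size = 128
mini_batch_size = 128
gradient_accumulation_steps = 1

# effective sizes across all processes
backward_batch_size = mini_batch_size * gradient_accumulation_steps
global_backward_batch_size = backward_batch_size * world_size
global_batch_size = batch_size * world_size

accelerator.print(
    f"world_size={world_size} "
    f"global_backward_batch_size={global_backward_batch_size} "
    f"global_batch_size={global_batch_size}"
)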

@HuggingFaceDocBuilderDev commented Aug 18, 2023

The documentation is not available anymore as the PR was closed or merged.

vwxyzjn marked this pull request as ready for review on September 6, 2023 14:08
@vwxyzjn (Contributor, Author) commented Sep 6, 2023

Benchmark and documentation are ready at https://github.com/vwxyzjn/trl/blob/refactor-benchmark/benchmark/README.md. We can probably tune some of these models further for better performance in follow-up PRs, including testing the deepspeed integration. cc @lewtun

[benchmark charts]
  • main
  • with gradient accumulation
  • with different models
  • with peft

@lvwerra (Member) left a comment
Hi @vwxyzjn, this looks really good to me. The openrlbenchmark code itself is a bit obscure to me, so it would be great if we could document well how it works. I left some comments here and there, but in general I am happy to add it. Since this is not strictly a part of the library and more a part of our test suite, we can also be a bit more experimental here :)

Regarding tyro: this looks good to me in general. I would also love to hear @younesbelkada's feedback, as he may also know about the design decisions around the CLI in transformers.

Review threads (resolved):
  • benchmark/plot.sh (outdated)
  • benchmark/upload_benchmark.py (outdated)
  • trl/trainer/ppo_config.py
  • benchmark/plot.sh
  • examples/scripts/sentiment_tuning.py
  • benchmark/README.md (outdated)
  • benchmark/plot.sh
@vwxyzjn (Contributor, Author) commented Sep 8, 2023

Thanks @lvwerra! I have addressed the comments :)

@younesbelkada (Contributor) left a comment

Thanks a lot for this great work, as discussed offline :D!
Feel free to merge once the CI is green.

vwxyzjn merged commit e4f9a48 into huggingface:main on Sep 13, 2023
kushal-tri pushed a commit to kushalarora/trl that referenced this pull request Sep 19, 2023
* refactor and benchmark

* update code

* Add accelerate logging

* logs

* quick fix

* update config

* precommit

* modify training example

* fix multi-gpu all_reduce error `Tensors must be CUDA and dense`

* support more models and benchmark

* update

* add changes

* upload benchmark

* precommit

* add tyro as a dependency

* add tyro

* pre-commit

* precommit

* weird...

* lol typo

* precommit

* sigh

* push changes

* Update benchmark/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* Add experiments

* upload image to tag specific folder

* add openrlbenchmark documentation

* rename

* remove unused field

* precommit

* push changes

---------

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
Labels: none
Projects: none
4 participants