
[GRPO] initial GRPO trainer #1954

Closed
wants to merge 19 commits into from

Conversation


@saisurbehera saisurbehera commented Aug 21, 2024

Implementation of GRPO from DeepSeekMath: https://arxiv.org/pdf/2402.03300

Still a work in progress:

  • Iterative reward-model training will be added later
  • Only outcome supervision is enabled (see the sketch below); process supervision will be implemented later

closes #2103
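
To make the objective concrete, here is a minimal sketch of the per-token GRPO loss described in the paper (clipped importance ratio on group-normalized advantages plus a per-token KL penalty against the reference model). Function and tensor names, shapes, and hyperparameter values are illustrative assumptions, not this PR's actual code:

```python
import torch

def grpo_token_loss(logprobs, old_logprobs, ref_logprobs, advantages,
                    completion_mask, clip_eps=0.2, kl_beta=0.04):
    # logprobs / old_logprobs / ref_logprobs: (batch, seq_len) per-token log-probs
    # advantages: (batch,) one group-normalized advantage per sampled completion
    # completion_mask: (batch, seq_len), 1 for completion tokens, 0 for prompt/padding
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                                   # broadcast over tokens
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(ratio * adv, clipped)
    # per-token KL estimator from the paper: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = ref_logprobs - logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token_loss = policy_loss + kl_beta * kl
    # average over completion tokens only, then over the batch
    return ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1)).mean()
```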

@saisurbehera saisurbehera changed the title initial grpo files [GRPO] initial GRPO trainer Aug 21, 2024
@saisurbehera saisurbehera marked this pull request as draft August 21, 2024 02:24
@lewtun
Member

lewtun commented Aug 21, 2024

Thank you for working on this nifty algorithm @saisurbehera ! I see you're basing your implementation on PPOTrainer but we've recently overhauled our RL implementations to be more aligned with the rest of the library, e.g. here's the new PPO version: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py

Would you mind adapting your implementation to this new API? Since GRPO is somewhat similar to RLOO, you might find it is possible to copy-paste a large part of that code: https://github.com/huggingface/trl/blob/main/trl/trainer/rloo_trainer.py

@saisurbehera
Author

Sure, I can make the changes to follow the new PPOv2 trainer.

@saisurbehera
Author

Hello @lewtun,

I ported the implementation to the new methodology; it was much simpler than the first version. I still have to do some validation and testing.

@Namco0816

Thanks for your contribution!

I've also implemented a version of the GRPO trainer. Instead of using a for loop as in https://github.com/saisurbehera/trl/blob/grpo/trl/trainer/grpo_trainer.py#L380, I view the rewards as (-1, sampling_group_size), compute the normalized_group_scores in a tensor-friendly way, and then view them back to the original shape. I am not sure whether this helps performance.
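
For readers following along, a rough sketch of the vectorized normalization described above, assuming completions for the same prompt are stored contiguously (the function name and sampling_group_size follow the comment; this is not the PR's actual code):

```python
import torch

def normalized_group_scores(rewards: torch.Tensor, sampling_group_size: int) -> torch.Tensor:
    # rewards: (num_prompts * sampling_group_size,), with the completions of each
    # prompt stored contiguously, so a simple view recovers the per-prompt groups
    grouped = rewards.view(-1, sampling_group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    normalized = (grouped - mean) / (std + 1e-8)   # guard against zero std within a group
    return normalized.view(-1)                     # back to the original flat shape
```

Compared with a Python loop over prompts, this is a single set of tensor operations; the speedup is probably minor next to the cost of generation, but the code is simpler.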

@saisurbehera
Author

I think we should test it out. Thanks a lot for the change.

Overall, I think most of my part is done within the current limits of TRL; we would need more extensive changes to add PRM and reward-model training.

Sorry for not spending more time on it, I was busy with work and family.

@fzyzcjy
Contributor

fzyzcjy commented Nov 30, 2024

Hi, thanks for the PR! It would be great to have GRPO, and I'm looking forward to it!

@rawsh

rawsh commented Dec 17, 2024

@saisurbehera Curious if it's ready to test? I want to try my hand at hacking in PRM rewards.

@saisurbehera
Author

Go ahead

@fzyzcjy
Contributor

fzyzcjy commented Dec 22, 2024

Hi, are there any updates? I would appreciate it if this could be merged!

@saisurbehera
Author

Let me work on this over the weekend to verify. Sorry for the delay.

@fzyzcjy
Contributor

fzyzcjy commented Dec 25, 2024

Looking forward to it!

@hijkzzz

hijkzzz commented Dec 27, 2024

@fzyzcjy
Contributor

fzyzcjy commented Dec 31, 2024

Hi, are there any updates?

@saisurbehera
Author

Sorry for the late response. My code works now, but the new model has very high KL divergence from the reference model, and the scores don't look right compared to RLOO. I have to debug why. Sorry about that.
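
As an aside for the KL issue, one simple thing to log is the mean per-token KL estimate between the policy and the reference model over completion tokens. This is only a hedged debugging sketch with illustrative names, not part of the PR:

```python
import torch

def mean_kl_to_reference(logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         completion_mask: torch.Tensor) -> torch.Tensor:
    # Same unbiased estimator as the paper's KL penalty:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, which is always >= 0.
    # A value that keeps climbing during training suggests the KL coefficient
    # is too weak or the advantages/importance ratios are mis-scaled.
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1
    return (kl * completion_mask).sum() / completion_mask.sum()
```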

@fzyzcjy
Contributor

fzyzcjy commented Jan 2, 2025

It's OK, and I'm looking forward to the fix!

@ehartford

Do you have an example training dataset?

Successfully merging this pull request may close these issues.

GRPO as part of HF TRL?
7 participants