
[GRPO] initial GRPO trainer #1954

Closed
wants to merge 19 commits into from

Conversation


@saisurbehera saisurbehera commented Aug 21, 2024

Implementation of GRPO from DeepSeekMath: https://arxiv.org/pdf/2402.03300

Still a work in progress:

  • Iterative reward-model training will be added later
  • Only outcome supervision is enabled (see the sketch below); process supervision will be implemented later

closes #2103
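
To make the objective concrete, here is a minimal sketch of the per-token GRPO loss described in the paper (clipped importance ratio on group-normalized advantages plus a per-token KL penalty against the reference model). Function and tensor names, shapes, and hyperparameter values are illustrative assumptions, not this PR's actual code:

```python
import torch

def grpo_token_loss(logprobs, old_logprobs, ref_logprobs, advantages,
                    completion_mask, clip_eps=0.2, kl_beta=0.04):
    # logprobs / old_logprobs / ref_logprobs: (batch, seq_len) per-token log-probs
    # advantages: (batch,) one group-normalized advantage per sampled completion
    # completion_mask: (batch, seq_len), 1 for completion tokens, 0 for prompt/padding
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                                   # broadcast over tokens
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(ratio * adv, clipped)
    # per-token KL estimator from the paper: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    log_ratio_ref = ref_logprobs - logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1
    per_token_loss = policy_loss + kl_beta * kl
    # average over completion tokens only, then over the batch
    return ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1)).mean()
```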

@saisurbehera saisurbehera changed the title initial grpo files [GRPO] initial GRPO trainer Aug 21, 2024
@saisurbehera saisurbehera marked this pull request as draft August 21, 2024 02:24
@lewtun
Member

lewtun commented Aug 21, 2024

Thank you for working on this nifty algorithm @saisurbehera ! I see you're basing your implementation on PPOTrainer but we've recently overhauled our RL implementations to be more aligned with the rest of the library, e.g. here's the new PPO version: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py

Would you mind adapting your implementation to this new API? Since GRPO is somewhat similar to RLOO, you might find it is possible to copy-paste a large part of that code: https://github.com/huggingface/trl/blob/main/trl/trainer/rloo_trainer.py

@saisurbehera
Author

Sure, I can make the changes to follow the new PPOv2 trainer.

@saisurbehera
Author

Hello @lewtun,

I ported the implementation to the new methodology; it was much simpler than the first version. I still have to do some validation and testing.

@Namco0816

Thanks for your contribution!

I've also implemented a version of the GRPO trainer. Instead of using a for loop as in https://github.com/saisurbehera/trl/blob/grpo/trl/trainer/grpo_trainer.py#L380, I view the rewards as (-1, sampling_group_size), compute the normalized_group_scores in a tensor-friendly way, and then view them back to the original shape. I am not sure whether this helps performance.
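
For readers following along, a rough sketch of the vectorized normalization described above, assuming completions for the same prompt are stored contiguously (the function name and sampling_group_size follow the comment; this is not the PR's actual code):

```python
import torch

def normalized_group_scores(rewards: torch.Tensor, sampling_group_size: int) -> torch.Tensor:
    # rewards: (num_prompts * sampling_group_size,), with the completions of each
    # prompt stored contiguously, so a simple view recovers the per-prompt groups
    grouped = rewards.view(-1, sampling_group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    normalized = (grouped - mean) / (std + 1e-8)   # guard against zero std within a group
    return normalized.view(-1)                     # back to the original flat shape
```

Compared with a Python loop over prompts, this is a single set of tensor operations; the speedup is probably minor next to the cost of generation, but the code is simpler.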

@saisurbehera
Author

I think we should test it out. Thanks a lot for the change.

Overall, I think most of my part is done within the current limits of TRL; we would need more extensive changes to add PRM and reward-model training.

Sorry for not spending more time on it, I was busy with work and family.

@fzyzcjy
Contributor

fzyzcjy commented Nov 30, 2024

Hi, thanks for the PR! It would be great to have GRPO, and I'm looking forward to it!

@rawsh

rawsh commented Dec 17, 2024

@saisurbehera Curious if it's ready to test? I want to try my hand at hacking in PRM rewards.

@saisurbehera
Author

Go ahead

@fzyzcjy
Contributor

fzyzcjy commented Dec 22, 2024

Hi, are there any updates? I would appreciate it if this could be merged!

@saisurbehera
Author

Let me work on this over the weekend to verify. Sorry for the delay.

@fzyzcjy
Contributor

fzyzcjy commented Dec 25, 2024

Looking forward to it!

@hijkzzz

hijkzzz commented Dec 27, 2024

@fzyzcjy
Contributor

fzyzcjy commented Dec 31, 2024

Hi, are there any updates?

@saisurbehera
Author

Sorry for the late response. My code works now, but the new model has very high KL divergence from the reference model, and the scores don't look right compared to RLOO. I have to debug why. Sorry about that.
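
As an aside for the KL issue, one simple thing to log is the mean per-token KL estimate between the policy and the reference model over completion tokens. This is only a hedged debugging sketch with illustrative names, not part of the PR:

```python
import torch

def mean_kl_to_reference(logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         completion_mask: torch.Tensor) -> torch.Tensor:
    # Same unbiased estimator as the paper's KL penalty:
    # pi_ref/pi_theta - log(pi_ref/pi_theta) - 1, which is always >= 0.
    # A value that keeps climbing during training suggests the KL coefficient
    # is too weak or the advantages/importance ratios are mis-scaled.
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1
    return (kl * completion_mask).sum() / completion_mask.sum()
```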

@fzyzcjy
Contributor

fzyzcjy commented Jan 2, 2025

It's OK, and I'm looking forward to the fix!

@ehartford

Do you have an example training dataset?

Successfully merging this pull request may close these issues.

GRPO as part of HF TRL?
7 participants