
Process reward modeling support #362

Open · fabrahman wants to merge 19 commits into main


fabrahman (Contributor) commented Sep 21, 2024

This commit adds support for process reward modeling (PRM), including:

  1. processing PRM-formatted data,
  2. computing a process reward at the end of each step, and
  3. computing a cross-entropy loss between the predicted logits at end-of-step tokens and the corresponding step labels.
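The three pieces above can be sketched roughly as follows. This is a minimal, dependency-free illustration under stated assumptions, not the code in this PR: the function names (`build_prm_example`, `step_rewards`, `prm_cross_entropy`), the `step_end_id` separator token, the `-100` ignore label, and the two-class head (0 = incorrect step, 1 = correct step) are all hypothetical. Labels are set only at end-of-step positions, so the loss is computed exactly there and every other position is ignored. In a real trainer these loops would be vectorized, e.g. with `torch.nn.functional.cross_entropy(..., ignore_index=-100)`.

```python
import math
from typing import List, Sequence, Tuple

# Assumed convention: positions that must not contribute to the loss get this label
# (it matches the default ignore_index of torch.nn.functional.cross_entropy).
IGNORE_INDEX = -100


def build_prm_example(
    prompt_ids: List[int],
    steps: Sequence[List[int]],
    step_labels: Sequence[int],
    step_end_id: int,
) -> Tuple[List[int], List[int]]:
    """Hypothetical PRM formatter: prompt + steps, each step followed by a step-end token.

    Only step-end positions carry a label (0 = incorrect, 1 = correct);
    every other position is labeled IGNORE_INDEX.
    """
    if len(steps) != len(step_labels):
        raise ValueError("expected exactly one label per step")
    input_ids = list(prompt_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids)
    for step, label in zip(steps, step_labels):
        input_ids.extend(step)
        labels.extend([IGNORE_INDEX] * len(step))
        input_ids.append(step_end_id)  # marks the end of this step
        labels.append(label)           # the step's correctness label sits here
    return input_ids, labels


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's class logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def step_rewards(
    logits: Sequence[Sequence[float]],
    input_ids: Sequence[int],
    step_end_id: int,
    correct_class: int = 1,
) -> List[float]:
    """Process reward per step: P(correct) read at every step-end token.

    `logits` has one row of class scores per input position, assumed to come from a
    classification head aligned with input_ids (not shifted like next-token logits).
    """
    return [
        math.exp(log_softmax(row)[correct_class])
        for row, tok in zip(logits, input_ids)
        if tok == step_end_id
    ]


def prm_cross_entropy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean cross-entropy over labeled (step-end) positions; all others are ignored."""
    losses = [-log_softmax(row)[y] for row, y in zip(logits, labels) if y != IGNORE_INDEX]
    if not losses:
        return 0.0
    return sum(losses) / len(losses)
```

For example, with prompt `[1, 2]`, steps `[[5, 6], [7]]`, step labels `[1, 0]` and `step_end_id=99`, `build_prm_example` returns input ids `[1, 2, 5, 6, 99, 7, 99]` and labels `[-100, -100, -100, -100, 1, -100, 0]`, so only the two step-end positions contribute to the loss.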

Currently testing it here: https://beaker.org/ex/01J8AK58TPQ48SDH5AKSHVTZMK/tasks/01J8AK58TV4X7Y2VPQ9PG297WP/job/01J8AK5902D3935VH6SQMEMFV0

TODO: refactor the evaluation code for the PRM task.
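One possible shape for the PRM evaluation mentioned in the TODO is sketched below. It is an assumption, not the refactor planned in this PR: it measures step-level accuracy at labeled end-of-step positions and collapses per-step rewards into a single solution-level score. The names `step_accuracy` and `solution_score`, the `-100` ignore label, and the use of `min` as the aggregation (rather than, say, the product of step rewards) are all hypothetical choices.

```python
from typing import Sequence

IGNORE_INDEX = -100  # assumed label for positions that are not step ends


def argmax(row: Sequence[float]) -> int:
    """Index of the largest score (the first one on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def step_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of labeled steps whose predicted class matches the gold step label."""
    pairs = [(argmax(row), y) for row, y in zip(logits, labels) if y != IGNORE_INDEX]
    if not pairs:
        raise ValueError("no labeled step-end positions to evaluate")
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def solution_score(step_rewards: Sequence[float]) -> float:
    """Collapse per-step rewards into one solution score using the minimum.

    min is only one common aggregation; the product of step rewards or the
    last step's reward are alternatives.
    """
    if not step_rewards:
        raise ValueError("a solution needs at least one step")
    return min(step_rewards)
```

Step-level accuracy checks the per-step classifier directly, while a solution-level score is what would be used to rerank or select among several sampled solutions.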

fabrahman requested a review from vwxyzjn on September 21, 2024 at 00:32