Description
Method description
The paper "Offline Reinforcement Learning for LLM Multi-Step Reasoning" introduces OREO, an offline reinforcement-learning algorithm for improving multi-step reasoning in large language models (LLMs). By learning step-level values, it assigns credit across reasoning steps more effectively than trajectory-level methods and removes the need for pairwise preference data.
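To make the credit-assignment idea concrete, here is a rough, hypothetical sketch (not the authors' implementation; the function name, shapes, and `beta` coefficient are my assumptions) of the kind of per-step soft-Bellman consistency residual that methods in this family optimize, where the scaled log-probability of each action is pushed toward the one-step value difference:

```python
def soft_bellman_residuals(logps, rewards, values, beta=0.1):
    """Per-step residuals of a soft Bellman consistency condition:
    beta * log pi(a_t | s_t)  ~=  r_t + V(s_{t+1}) - V(s_t).

    logps:   log-probabilities of the T actions taken (length T)
    rewards: per-step rewards (length T; often 0 except at the final step)
    values:  value estimates V(s_0) ... V(s_T) (length T + 1)
    """
    return [
        beta * lp - (r + v_next - v)
        for lp, r, v, v_next in zip(logps, rewards, values[:-1], values[1:])
    ]


# Toy 3-step trajectory with a single terminal reward:
residuals = soft_bellman_residuals(
    logps=[-1.0, -0.5, -0.2],
    rewards=[0.0, 0.0, 1.0],
    values=[0.2, 0.3, 0.6, 0.0],
)
loss = sum(r * r for r in residuals) / len(residuals)
```

Minimizing the squared residuals jointly over the policy and value network is what lets each intermediate step receive its own learning signal, rather than the whole trajectory sharing one scalar.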
I’d like to integrate OREO into the Hugging Face ecosystem via TRL, so it works out of the box with tools like PEFT and quantization. The goal is to make the method, along with its test-time compute technique, easier for people to use.
@jwhj has already shared the code here: OREO repo.
The current implementation is a great start, but bringing it into Hugging Face’s ecosystem would make it more user-friendly and widely usable.
Would love to hear your thoughts on whether this would be a helpful addition!
Open source status
- The method implementation is available
- The model weights are available
- The training datasets are available