Description
Method description
The paper "Offline Reinforcement Learning for LLM Multi-Step Reasoning" introduces OREO, an offline reinforcement-learning algorithm for improving multi-step reasoning in large language models (LLMs). By learning step-level values, it assigns credit across reasoning steps more effectively than trajectory-level methods and removes the need for pairwise preference data.
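To make the credit-assignment idea concrete, here is a rough, hypothetical sketch (not the authors' implementation; the function name, shapes, and `beta` coefficient are my assumptions) of the kind of per-step soft-Bellman consistency residual that methods in this family optimize, where the scaled log-probability of each action is pushed toward the one-step value difference:

```python
def soft_bellman_residuals(logps, rewards, values, beta=0.1):
    """Per-step residuals of a soft Bellman consistency condition:
    beta * log pi(a_t | s_t)  ~=  r_t + V(s_{t+1}) - V(s_t).

    logps:   log-probabilities of the T actions taken (length T)
    rewards: per-step rewards (length T; often 0 except at the final step)
    values:  value estimates V(s_0) ... V(s_T) (length T + 1)
    """
    return [
        beta * lp - (r + v_next - v)
        for lp, r, v, v_next in zip(logps, rewards, values[:-1], values[1:])
    ]


# Toy 3-step trajectory with a single terminal reward:
residuals = soft_bellman_residuals(
    logps=[-1.0, -0.5, -0.2],
    rewards=[0.0, 0.0, 1.0],
    values=[0.2, 0.3, 0.6, 0.0],
)
loss = sum(r * r for r in residuals) / len(residuals)
```

Minimizing the squared residuals jointly over the policy and value network is what lets each intermediate step receive its own learning signal, rather than the whole trajectory sharing one scalar.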
I’d like to integrate OREO into the Hugging Face ecosystem via TRL, so it works out of the box with tools like PEFT and quantization. The goal is to make the method, along with its test-time compute technique, easier for people to use.
@jwhj has already shared the code here: OREO repo.
The current implementation is a great start, but bringing it into Hugging Face’s ecosystem would make it more user-friendly and widely usable.
Would love to hear your thoughts on whether this would be a helpful addition!
Open source status
- The method implementation is available
- The model weights are available
- The training datasets are available