Please refer to the website for more details.
TODO, paste paper content here
Check out the docs for more details.
- Diverse task roles:
  - Software Engineer
  - Product Manager
  - Data Scientist
  - Human Resources
  - Financial Staff
  - Administrator
- Diverse data types:
  - Coding tasks
  - Conversational tasks
  - Mathematical reasoning
  - Image processing
  - Text comprehension
- Multi-agent interaction
- Comprehensive scoring system:
  - Result-based evaluation (primary)
  - Subcheckpoint checking (secondary)
- Multiple evaluation methods:
  - Deterministic evaluators
  - LLM-based evaluators
- Simple one-command operations:
  - Complete environment setup in minutes
  - Quick system reset in minutes when needed
- Extensible benchmark framework:
  - Add new tasks/evaluators/subcheckpoints in minutes
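
To make the scoring idea above concrete, here is a minimal sketch of how a result-based outcome and subcheckpoint credit could be combined. Everything in it is hypothetical: the `Checkpoint` class, the `score_task` function, and the default 50/50 `result_weight` are illustrative assumptions, not the benchmark's actual API or formula. See the docs for the real evaluator interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    """Hypothetical subcheckpoint: a name, a point value, and a check."""
    name: str
    points: int
    # The check may wrap a deterministic test or an LLM-based judgment;
    # either way it returns True on pass and False on fail.
    check: Callable[[], bool]


def score_task(checkpoints: List[Checkpoint], result_ok: bool,
               result_weight: float = 0.5) -> float:
    """Blend the final-result outcome with partial checkpoint credit.

    result_weight is an assumed value chosen for illustration only.
    """
    total = sum(cp.points for cp in checkpoints)
    earned = sum(cp.points for cp in checkpoints if cp.check())
    partial = earned / total if total else 0.0
    return result_weight * float(result_ok) + (1.0 - result_weight) * partial


# Example: one of two checkpoints passes and the final result is not achieved.
example_checkpoints = [
    Checkpoint("repository cloned", 1, lambda: True),
    Checkpoint("unit tests pass", 2, lambda: False),
]
print(score_task(example_checkpoints, result_ok=False))
```

Keeping each check behind a plain callable means a deterministic evaluator and an LLM-based evaluator can be swapped without changing the scoring code.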
We are not currently accepting task contributions for the first version of the benchmark, but we welcome bug fixes, documentation updates, and other improvements. Questions? Please create an issue. You can also contact Frank F. Xu, Yufan Song, or Boxuan Li (Email: fangzhex@cs.cmu.edu, yufans@alumni.cmu.edu, boxuanli@alumni.cmu.edu).
TODO
Distributed under the MIT License. See LICENSE for more information.