Website • Paper • Leaderboard
We interact with computers every day, in both life and work, and many kinds of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), AI agents that interact with and effect change in their surrounding environments have developed rapidly. But how well do AI agents help accelerate, or even autonomously perform, work-related tasks? The answer has important implications both for industry looking to adopt AI into its workflows and for economic policy that seeks to understand the effects AI adoption may have on the labor market. TheAgentCompany measures the progress of LLM agents on real-world professional tasks by providing an extensible benchmark for evaluating AI agents that interact with the world in ways similar to a digital worker: by browsing the Web, writing code, running programs, and communicating with coworkers.
Servers can be hosted locally or in the cloud in a few minutes.
Instructions for Mac and Linux users
# you should have docker and docker compose installed, and 30+ GB of free disk space
# Mac users must have host networking enabled
sudo chmod 666 /var/run/docker.sock
curl -fsSL https://github.com/TheAgentCompany/the-agent-company-backup-data/releases/download/setup-script-20241208/setup.sh | sh
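Before running the setup script, it can help to confirm the stated prerequisites. The following is a minimal sketch, not part of the official setup: the function names (`free_gb`, `have`, `preflight`) are hypothetical, and it assumes Docker stores its data on the filesystem you pass in (the root filesystem by default), which may not hold on your machine.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; not part of the official setup script.
REQUIRED_GB=30

# Print free space, in whole GB, on the filesystem that holds the given path.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Return 0 if the named command is on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Check for docker, docker compose, and enough free disk space.
# Optional argument: path on the filesystem to measure (assumed default: /).
preflight() {
    status=0
    if ! have docker; then
        echo "docker not found" >&2
        status=1
    elif ! docker compose version >/dev/null 2>&1; then
        echo "docker compose not available" >&2
        status=1
    fi
    avail=$(free_gb "${1:-/}")
    if [ -z "$avail" ] || [ "$avail" -lt "$REQUIRED_GB" ]; then
        echo "need ${REQUIRED_GB}+ GB free, found ${avail:-unknown}" >&2
        status=1
    fi
    return "$status"
}
```

Calling `preflight` (or `preflight /path/on/docker/disk`) returns a non-zero status and prints the reason when a prerequisite is missing. It cannot verify that host networking is enabled, so that still needs to be checked by hand.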
Instructions for Windows users
# you should have docker and docker compose installed, and 30+ GB of free disk space
# you must have host networking enabled
curl -fsSL -o setup.bat https://github.com/TheAgentCompany/the-agent-company-backup-data/releases/download/setup-script-20241208/setup.bat && setup.bat
After a few minutes, all services should be running, including GitLab, Plane, ownCloud, and RocketChat, all with pre-baked data. Please check out the SERVER SETUP DOC for more details and a troubleshooting guide, especially if you are using Mac or Windows.
Every task is a Docker image with the following structure:
/utils
├── evaluator.py.enc
├── init.sh
├── config.py
├── common.py
├── eval.py
├── npc
├── ...
/instruction
├── task.md
├── ...
/workspace
├── ...
where /utils/init.sh is the script you must run to initialize the task environment, /utils/eval.py is the entrypoint to run the grading functions, and /instruction/task.md is the task instruction for the examinee, i.e. your agent.
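As a quick sanity check inside a task container, the three documented entry points can be verified before any work starts. This is an illustrative sketch only: `check_task_layout` is a hypothetical helper, and its optional root argument exists purely so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: confirm the documented entry points are present.
# With no argument the paths are checked as-is (i.e. inside a task container).
check_task_layout() {
    root="${1:-}"
    missing=0
    for path in /utils/init.sh /utils/eval.py /instruction/task.md; do
        if [ ! -f "${root}${path}" ]; then
            echo "missing: ${path}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Inside a container, `check_task_layout && echo "layout ok"` is enough. Other files in the tree (for example under /workspace) vary by task and are deliberately not checked.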
If you want to run the benchmark using the OpenHands platform, it's as simple as:
cd evaluation
# set up agent and environment LLM configs in config.toml, omitted
bash run_eval.sh \
--agent-llm-config <group1> \
--env-llm-config <group2> \
--outputs-path <outputs> \
--server-hostname <hostname> \
--version 1.0.0
Please check out this doc for more details.
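The config.toml step is omitted above. As a rough sketch, the helper below writes a skeleton with two LLM groups whose names would then be passed to --agent-llm-config and --env-llm-config. The `[llm.<group>]` section layout and the `model`, `base_url`, and `api_key` keys are assumptions based on the usual OpenHands configuration format, and `write_config_skeleton` is a hypothetical name; check the linked doc for the authoritative format.

```shell
#!/bin/sh
# Hypothetical helper: write a skeleton config.toml with two LLM groups.
# Section and key names are assumptions; all values are unfilled placeholders.
write_config_skeleton() {
    target="${1:-config.toml}"
    # Never clobber an existing configuration.
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    cat > "$target" <<'EOF'
[llm.group1]
model = "<agent_llm_model_name>"
base_url = "<agent_llm_base_url>"
api_key = "<agent_llm_api_key>"

[llm.group2]
model = "<environment_llm_model_name>"
base_url = "<environment_llm_base_url>"
api_key = "<environment_llm_api_key>"
EOF
}
```

After filling in the placeholders, `group1` and `group2` would stand in for `<group1>` and `<group2>` in the run_eval.sh command.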
This applies if you are using an agent that is not from OpenHands, or if you want human testers to run the benchmark manually.
docker run --name <container_name> -it <image_name> /bin/bash
A complete list of 175 task images can be found here.
SERVER_HOSTNAME=<hostname, default value is localhost> \
LITELLM_API_KEY=<environment_llm_api_key> \
LITELLM_BASE_URL=<environment_llm_base_url> \
LITELLM_MODEL=<environment_llm_model_name> \
bash /utils/init.sh
Now you can prompt the agent to work on the task. The task instruction is in /instruction/task.md.
Complete the task in /instruction/task.md
LITELLM_API_KEY=<environment_llm_api_key> \
LITELLM_BASE_URL=<environment_llm_base_url> \
LITELLM_MODEL=<environment_llm_model_name> \
DECRYPTION_KEY='theagentcompany is all you need' \
python_default /utils/eval.py --trajectory_path TRAJECTORY_PATH --output_path OUTPUT_PATH
Please check out the EVALUATION DOC for more details.
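The initialization and grading commands share the same environment-LLM settings, so a small wrapper can avoid repeating them. The sketch below is one possible arrangement, meant to be sourced inside the task container: `require_llm_env`, `run`, `init_task`, `eval_task`, and the `DRY_RUN` switch are all hypothetical names, and it assumes `python_default` resolves as an executable on PATH rather than as a shell alias.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented init and eval commands.

# Fail early if any environment-LLM setting is missing, then export them.
require_llm_env() {
    for name in LITELLM_API_KEY LITELLM_BASE_URL LITELLM_MODEL; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing $name" >&2
            return 1
        fi
    done
    export LITELLM_API_KEY LITELLM_BASE_URL LITELLM_MODEL
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Initialize the task environment (SERVER_HOSTNAME defaults to localhost).
init_task() {
    require_llm_env || return 1
    export SERVER_HOSTNAME="${SERVER_HOSTNAME:-localhost}"
    run bash /utils/init.sh
}

# Grade a finished run: $1 is the trajectory path, $2 the output path.
eval_task() {
    require_llm_env || return 1
    export DECRYPTION_KEY='theagentcompany is all you need'
    run python_default /utils/eval.py --trajectory_path "$1" --output_path "$2"
}
```

With the three LITELLM_* variables set, `init_task` is run before prompting the agent and `eval_task TRAJECTORY_PATH OUTPUT_PATH` afterwards. Setting `DRY_RUN=1` prints each command instead of executing it, which is useful for checking the arguments first.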
- Diverse task roles:
  - Software Engineer
  - Product Manager
  - Data Scientist
  - Human Resources
  - Financial Staff
  - Administrator
- Diverse data types:
  - Coding tasks
  - Conversational tasks
  - Mathematical reasoning
  - Image processing
  - Text comprehension
- Multiple Agent Interaction
- Comprehensive scoring system
  - Result-based evaluation (primary)
  - Subcheckpoint checking (secondary)
- Multiple evaluation methods:
  - Deterministic evaluators
  - LLM-based evaluators
- Simple one-command operations:
  - Complete environment setup in minutes
  - Quick system reset in minutes when needed
- Extensible benchmark framework
  - Add new tasks/evaluators/subcheckpoints in minutes
@misc{xu2024theagentcompanybenchmarkingllmagents,
title={TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks},
author={Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Z. Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Maben and Raj Mehta and Wayne Chi and Lawrence Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig},
year={2024},
eprint={2412.14161},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.14161},
}
We welcome contributions of bug fixes, documentation, and other improvements. Questions? Please create an issue. You can also contact Frank F. Xu, Yufan Song, and Boxuan Li (email: fangzhex@cs.cmu.edu, yufans@alumni.cmu.edu, boxuanli@alumni.cmu.edu).
Distributed under the MIT License. See LICENSE for more information.