The Agent Company: Benchmarking LLM Agents on Consequential Real World Tasks


Website | Paper | Leaderboard

Overview

We interact with computers every day, both in our personal lives and at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), AI agents that interact with and effect change in their surrounding environments have developed rapidly. But how well do AI agents perform at accelerating, or even autonomously completing, work-related tasks? The answer has important implications both for industry looking to adopt AI into its workflows and for economic policy seeking to understand the effects that AI adoption may have on the labor market. TheAgentCompany measures the progress of LLM agents on real-world professional tasks by providing an extensible benchmark for evaluating AI agents that interact with the world in ways similar to a digital worker: browsing the Web, writing code, running programs, and communicating with coworkers.

Quick Start

Step 1: Set Up the Servers

Servers can be hosted locally or in the cloud in a few minutes.

Instructions for Mac and Linux users
# you should have docker and docker compose installed, and 30+ GB of free disk space
# Mac users must have host networking enabled
sudo chmod 666 /var/run/docker.sock
curl -fsSL https://github.com/TheAgentCompany/the-agent-company-backup-data/releases/download/setup-script-20241208/setup.sh | sh
Instructions for Windows users
# you should have docker and docker compose installed, and 30+ GB of free disk space
# you must have host networking enabled
curl -fsSL -o setup.bat https://github.com/TheAgentCompany/the-agent-company-backup-data/releases/download/setup-script-20241208/setup.bat && setup.bat

After a few minutes, you should have all services running, including GitLab, Plane, ownCloud, and RocketChat, all with pre-baked data. Please check out the SERVER SETUP DOC for more details and a troubleshooting guide, especially if you are using Mac or Windows.
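To sanity-check the setup, it is usually enough to confirm that the service containers are up. A minimal check might look like the following; the service names in the grep pattern are assumptions, and the actual container names may differ depending on the setup script version.

# list running containers and look for the core services
# (name patterns below are assumptions; adjust if your container names differ)
docker ps --format '{{.Names}}\t{{.Status}}' | grep -iE 'gitlab|owncloud|rocketchat|plane'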

Step 2: Run the Benchmark

Every task is a Docker image with the following structure:

/utils
├── evaluator.py.enc
├── init.sh
├── config.py
├── common.py
├── eval.py
├── npc
├── ...
/instruction
├── task.md
├── ...
/workspace
├── ...

where /utils/init.sh is the script you must run to initialize the task environment, /utils/eval.py is the entrypoint for running the grading functions, and /instruction/task.md is the task instruction for the examinee, i.e., your agent.
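If you want to inspect a task before wiring up an agent, you can open a throwaway container and look at these files directly. The command below is only a sketch; <image_name> stands for any task image from the list in Step 2.1.

# <image_name> is a placeholder for any task image from the list in Step 2.1
docker run --rm <image_name> /bin/bash -c "ls /utils /instruction /workspace && cat /instruction/task.md"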

Benchmark with OpenHands

If you want to run the benchmark using the OpenHands platform, it's as simple as:

cd evaluation
# set up agent and environment LLM configs in config.toml (omitted here)
bash run_eval.sh \
  --agent-llm-config <group1> \
  --env-llm-config <group2> \
  --outputs-path <outputs> \
  --server-hostname <hostname> \
  --version 1.0.0
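For illustration, a concrete (hypothetical) invocation could look like the following; agent and env are just example group names and must match the LLM config groups you define in config.toml.

# example values only: 'agent' and 'env' must match groups defined in config.toml
bash run_eval.sh \
  --agent-llm-config agent \
  --env-llm-config env \
  --outputs-path ./outputs \
  --server-hostname localhost \
  --version 1.0.0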

Please check out this doc for more details.

Benchmark with other Platforms

This applies if you are using agents other than OpenHands, or if you want the benchmark to be run manually by human testers.

Step 2.1: Start Task Container
docker run --name <container_name> -it <image_name> /bin/bash

A complete list of 175 task images can be found here.

Step 2.2: Initialize the Task Environment
SERVER_HOSTNAME=<hostname, default value is localhost> \
LITELLM_API_KEY=<environment_llm_api_key> \
LITELLM_BASE_URL=<environment_llm_base_url> \
LITELLM_MODEL=<environment_llm_model_name> \
bash /utils/init.sh
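As a concrete (hypothetical) example, with the servers running locally and an OpenAI-compatible LiteLLM endpoint serving the environment LLM, initialization might look like this; the key, URL, and model name below are placeholders, not real values.

# placeholder credentials and model name; replace with your own LiteLLM settings
SERVER_HOSTNAME=localhost \
LITELLM_API_KEY=sk-your-key \
LITELLM_BASE_URL=https://your-litellm-proxy.example.com \
LITELLM_MODEL=gpt-4o \
bash /utils/init.sh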
Step 2.3: Conduct the Task

Now you can prompt the agent to work on the task. The task instruction is in /instruction/task.md; for example, you can prompt the agent with:

Complete the task in /instruction/task.md

Step 2.4: Grade the Result
LITELLM_API_KEY=<environment_llm_api_key> \
LITELLM_BASE_URL=<environment_llm_base_url> \
LITELLM_MODEL=<environment_llm_model_name> \
DECRYPTION_KEY='theagentcompany is all you need' \
python_default /utils/eval.py --trajectory_path TRAJECTORY_PATH --output_path OUTPUT_PATH
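For example, assuming the agent's trajectory was saved to /workspace/trajectory.json (a hypothetical path; the trajectory format depends on how your agent logs its run), grading could be invoked as follows, reusing the same placeholder LiteLLM settings as in Step 2.2.

# hypothetical paths; replace with wherever your agent's trajectory and desired output live
LITELLM_API_KEY=sk-your-key \
LITELLM_BASE_URL=https://your-litellm-proxy.example.com \
LITELLM_MODEL=gpt-4o \
DECRYPTION_KEY='theagentcompany is all you need' \
python_default /utils/eval.py --trajectory_path /workspace/trajectory.json --output_path /workspace/eval_result.json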

Please check out the EVALUATION DOC for more details.

Exciting Features

  • Diverse task roles:
    • Software Engineer
    • Product Manager
    • Data Scientist
    • Human Resources
    • Financial Staff
    • Administrator
  • Diverse data types:
    • Coding tasks
    • Conversational tasks
    • Mathematical reasoning
    • Image processing
    • Text comprehension
  • Multiple Agent Interaction
  • Comprehensive scoring system
    • Result-based evaluation (primary)
    • Subcheckpoint checking (secondary)
  • Multiple evaluation methods:
    • Deterministic evaluators
    • LLM-based evaluators
  • Simple one-command operations:
    • Complete environment setup in minutes
    • Quick system reset in minutes when needed
  • Extensible benchmark framework
    • Add new tasks/evaluators/subcheckpoints in minutes

Cite

@misc{xu2024theagentcompanybenchmarkingllmagents,
      title={TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks}, 
      author={Frank F. Xu and Yufan Song and Boxuan Li and Yuxuan Tang and Kritanjali Jain and Mengxue Bao and Zora Z. Wang and Xuhui Zhou and Zhitong Guo and Murong Cao and Mingyang Yang and Hao Yang Lu and Amaad Martin and Zhe Su and Leander Maben and Raj Mehta and Wayne Chi and Lawrence Jang and Yiqing Xie and Shuyan Zhou and Graham Neubig},
      year={2024},
      eprint={2412.14161},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.14161}, 
}

Contribution

We welcome contributions, including bug fixes, documentation, and other improvements. Questions? Please create an issue. Alternatively, you can contact Frank F. Xu, Yufan Song, or Boxuan Li (Email: fangzhex@cs.cmu.edu, yufans@alumni.cmu.edu, boxuanli@alumni.cmu.edu).

License

Distributed under the MIT License. See LICENSE for more information.