This directory contains the baseline for The Agent Company. The baseline is run with OpenHands, an open-source platform for AI-powered software development agents.
To run the evaluation with OpenHands, you also need:

- Python 3.12 or above
- Poetry
- The project dependencies, installed by running `poetry install` under the project root directory (see the setup sketch below)
- Docker with buildx
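A minimal setup sketch for these prerequisites is shown below. How you install Python, Poetry, and Docker depends on your system, so treat this as one possible sequence rather than the required one:

```bash
# From the project root directory: install the Python dependencies.
poetry install

# Verify that Docker and the buildx plugin are available.
docker --version
docker buildx version
```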
Please create a `config.toml` file under the `evaluation` directory. It should look like this:
```toml
[llm.group1]
model="<model_name>"
base_url="<base_url>"
api_key="<api_key>"

[llm.group2]
model="<model_name>"
base_url="<base_url>"
api_key="<api_key>"
```
You can add more groups as needed.
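As a purely hypothetical illustration (the model names, base URLs, and keys below are made up and are not values from this project; substitute whatever your provider requires), a filled-in config might look like:

```toml
[llm.group1]
model="gpt-4o"                        # hypothetical agent model
base_url="https://api.openai.com/v1"  # hypothetical OpenAI-compatible endpoint
api_key="sk-your-key-here"            # your own API key

[llm.group2]
model="gpt-4o-mini"                   # hypothetical environment/NPC model
base_url="https://api.openai.com/v1"
api_key="sk-your-key-here"
```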
Under the `evaluation` directory, run the following command:
```bash
bash run_eval.sh \
    --agent-llm-config group1 \
    --env-llm-config group2 \
    --outputs-path outputs \
    --server-hostname localhost \
    --version 1.0.0
```
where `--outputs-path`, `--server-hostname`, and `--version` are optional.
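Since those three flags are optional, a minimal invocation needs only the two LLM config names; the omitted flags fall back to whatever defaults `run_eval.sh` defines:

```bash
bash run_eval.sh \
    --agent-llm-config group1 \
    --env-llm-config group2
```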
Here's a brief explanation of each argument:

- `--agent-llm-config`: the config name for the agent LLM. This should match the config name in `config.toml`. This is the LLM used by the agent (i.e. CodeActAgent).
- `--env-llm-config`: the config name for the environment LLM. This should match the config name in `config.toml`. This is used by the chat bots (NPCs) and LLM-based evaluators.
- `--outputs-path`: the path to save trajectories and evaluation results.
- `--server-hostname`: the hostname of the server that hosts all the web services. It could be `localhost` if you are running the evaluation and the services on the same machine. If the services are hosted on a remote machine, you must use the hostname of the remote machine rather than its IP address (see the example after this list).
- `--version`: the version of the task images to use. Currently, the only supported version is `1.0.0`.
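As a hypothetical illustration of the remote case (the hostname below is made up; use your actual server's hostname):

```bash
bash run_eval.sh \
    --agent-llm-config group1 \
    --env-llm-config group2 \
    --server-hostname eval-server.internal  # hypothetical remote hostname
```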
The script is idempotent. If you run it again, it will resume from the last checkpoint. A full run usually takes a few days to finish.
Note: the script will automatically skip a task if it encounters an error. This usually happens when the OpenHands runtime dies due to an unexpected error. This means that even if the script finishes, it might not have evaluated all tasks. You can manually resume the evaluation by running the script again.
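To check progress before resuming, you can count the per-task result files under the outputs path. The file naming below is an assumption (adjust the glob to match what your outputs directory actually contains); resuming itself is just re-running the same command:

```bash
# Assumed naming: one eval_*.json per completed task under the outputs path.
# Adjust the pattern to whatever your outputs directory actually contains.
ls outputs/eval_*.json 2>/dev/null | wc -l

# Resume the evaluation: re-run the same command; finished tasks are skipped.
bash run_eval.sh --agent-llm-config group1 --env-llm-config group2
```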
You can find the evaluation results in the `outputs` directory, including trajectories, evaluation scores, final agent states, and screenshots for all browsing steps.
You can run the following command to generate a summary of the evaluation results:
```bash
poetry run python summarise_results.py <outputs_path>
```
An example of the summary report can be found here.