forked from sotopia-lab/sotopia
Improvements on benchmark display and usage (sotopia-lab#135)
* add stdev for stats, and also provide a benchmark script
* fix mypy issue
* add 95% CI
* use numpy instead of scipy in CI
* clean up code & add doc for benchmark
* [autofix.ci] apply automated fixes
* code cleanup
* minor bug fix and code cleanup
* refactor benchmark code
* add more models
* fix mypy
* remove binary option
* remove extra binary
* change the stop criteria to be more accurate
* add test cases for benchmark, also fix small t value bug
* [autofix.ci] apply automated fixes
* fix issue of episode not found
* fix mypy error
* [autofix.ci] apply automated fixes
* add test cases for benchmark, and minor code changes
* remove unnecessary print code
* add coverage by actually running one time and adding more mock data
* fix autofix error
* fix patching issues
* add unit tests for more arguments
* fix test_get_agent_by_name issue
* make the test stricter

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: XuhuiZhou <zhouxuhui2018@gmail.com>
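The message mentions adding a standard deviation and a 95% CI to the reported stats, but the implementation is not visible in this view. As a rough illustration only, the sketch below computes a mean, a sample standard deviation, and a normal-approximation 95% interval. The function name `mean_stdev_ci95`, the use of the standard library rather than numpy, and the fixed critical value 1.96 are all assumptions; the commit's own code may use t values, which give wider intervals for small samples.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_stdev_ci95(
    scores: Sequence[float],
) -> Tuple[float, float, Tuple[float, float]]:
    """Hypothetical helper: mean, sample stdev, and an approximate 95% CI."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a stdev")
    mean = statistics.fmean(scores)
    # Sample standard deviation (divides by n - 1).
    stdev = statistics.stdev(scores)
    # Assumption: normal approximation with z = 1.96, not a t value.
    half_width = 1.96 * stdev / math.sqrt(len(scores))
    return mean, stdev, (mean - half_width, mean + half_width)


# Made-up scores, used only to exercise the function.
example_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, stdev, (low, high) = mean_stdev_ci95(example_scores)
print(f"mean={mean:.2f} stdev={stdev:.2f} 95% CI=({low:.2f}, {high:.2f})")
```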
1 parent 77a97f8, commit 850d441
Showing 8 changed files with 737 additions and 254 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,14 @@ | ||
# Benchmark your model as a social agent in Sotopia | ||
|
||
``` | ||
sotopia_benchmark --model=<your_model_name> | ||
sotopia benchmark --models <model1> --models <model2> [--only-show-performance] | ||
``` | ||
or | ||
|
||
``` | ||
python sotopia/benchmark/cli.py --model=<your_model_name> | ||
python sotopia/cli/benchmark/benchmark.py --models <model1> --models <model2> [--only-show-performance] | ||
``` | ||
When `--only-show-performance` is specified, only model results with available episodes are displayed. If this option is not used, the benchmark is run. | |
Currently, this script runs over 100 simulations on the Sotopia Hard tasks, and the partner model is fixed to `meta-llama/Llama-3-70b-chat-hf`. | |
|
||
An example script is provided in `scripts/display_benchmark_results.sh` |
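For callers who want to assemble the command from Python, the usage above can be sketched as a small argv builder. The helper name `build_benchmark_command` is hypothetical and not part of Sotopia; only the `sotopia benchmark` command and the `--models` and `--only-show-performance` flags are taken from the README text. The sketch only builds and prints the command and does not run it.

```python
import shlex
from typing import List, Sequence


def build_benchmark_command(
    models: Sequence[str], only_show_performance: bool = False
) -> List[str]:
    """Hypothetical helper: build the argv list for `sotopia benchmark`."""
    argv = ["sotopia", "benchmark"]
    for model in models:
        # The README repeats --models once per model, so do the same here.
        argv += ["--models", model]
    if only_show_performance:
        # Display-only mode, as described in the README.
        argv.append("--only-show-performance")
    return argv


command = build_benchmark_command(
    ["gpt-4o", "together_ai/meta-llama/Llama-3-8b-chat-hf"],
    only_show_performance=True,
)
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run` would avoid shell quoting issues, assuming the `sotopia` entry point is installed and on `PATH`.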
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
sotopia benchmark --only-show-performance \ | ||
--models gpt-4o \ | ||
--models together_ai/mistralai/Mixtral-8x22B-Instruct-v0.1 \ | ||
--models gpt-3.5-turbo \ | ||
--models together_ai/meta-llama/Llama-3-70b-chat-hf \ | ||
--models together_ai/meta-llama/Llama-3-8b-chat-hf |