update docs about how to submit to the leaderboard
hanhainebula committed Sep 28, 2024
1 parent 5b3589a commit f068e12
Showing 2 changed files with 61 additions and 7 deletions.
56 changes: 53 additions & 3 deletions docs/submit_to_leaderboard.md
@@ -1,9 +1,14 @@
# Submit to Leaderboard

The AIR-Bench is designed to be a closed-book benchmark. The golden truth is kept private. We provide a leaderboard for participants to submit
the top-k search results of their models and compare their performance with others. The leaderboard is hosted on the HuggingFace Hub.
The initial version, `AIR-Bench_24.04`, is designed as a closed-book benchmark: the golden truth is kept private from users. We provide a leaderboard where participants can submit the top-k search results of their models and compare their performance with others. The leaderboard is hosted on the HuggingFace Hub.

To submit your model to the leaderboard, please follow the steps below.
However, according to feedback from the community (see [issue #26](https://github.com/AIR-Bench/AIR-Bench/issues/26)), users could not see any evaluation results until they submitted their models’ search results to the leaderboard, which made it hard to iterate on model performance.

Therefore, for the latest version, `AIR-Bench_24.05`, we **split the queries into a test set and a dev set**. The golden truth of the dev set is provided, so users can evaluate their models’ performance on the dev set by themselves. The golden truth of the test set remains private; to be evaluated on it, users submit their models’ search results to the leaderboard.

To **evaluate your models on the dev set**, please refer to [these instructions](https://github.com/AIR-Bench/AIR-Bench/blob/main/scripts#5-compute-metrics-for-dev-set-optional). *We will contribute the dev set to [MTEB](https://huggingface.co/spaces/mteb/leaderboard) so that users can compare their models with others.*
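
As a quick illustration, a dev-split run might look like the sketch below. It reuses the `evaluate_hf_transformers.py` options documented in `scripts/README.md`; the `--splits dev` value is an assumption based on the dev/test split described above, and the metrics are then computed against the released dev-set golden truth as described in the linked instructions.

```bash
# Minimal sketch of a dev-split run (assumes --splits accepts "dev");
# see scripts/README.md for the full set of options.
python evaluate_hf_transformers.py \
    --task_types qa \
    --domains finance \
    --languages en \
    --splits dev \
    --output_dir ./search_results \
    --encoder BAAI/bge-m3 \
    --search_top_k 1000 \
    --overwrite False
```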

To **submit your model to the leaderboard** for evaluation on the test set, please follow the steps below.

## Installation

@@ -21,6 +26,49 @@ After running the evaluation, you will get the search results in the `output_dir`

For example, if you run the evaluation script with the `bge-m3` retrieval model and the `bge-reranker-v2-m3` reranking model, the search results will have a file structure like this:

<details><summary>air-benchmark>=0.1.0 (pypi version)</summary>

```shell
search_results/
├── bge-m3/
│ ├── NoReranker/
│ │ ├── qa
│ │ │ ├── arxiv
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ ├── finance
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ │ ├── zh_default_dev.json
│ │ │ │ ├── zh_default_test.json
│ │ │ │ ...
│ │ ├── long-doc
│ │ │ ├── book
│ │ │ │ ├── en_a-brief-history-of-time_stephen-hawking_dev.json
│ │ │ │ ├── en_origin-of-species_darwin_test.json
│ │ │ │ ...
│ ├── bge-reranker-v2-m3/
│ │ ├── qa
│ │ │ ├── arxiv
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ ├── finance
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ │ ├── zh_default_dev.json
│ │ │ │ ├── zh_default_test.json
│ │ │ │ ...
│ │ ├── long-doc
│ │ │ ├── book
│ │ │ │ ├── en_a-brief-history-of-time_stephen-hawking_dev.json
│ │ │ │ ├── en_origin-of-species_darwin_test.json
│ │ │ │ ...
```

</details>

<details><summary>air-benchmark<=0.0.4 (pypi version)</summary>

```shell
search_results/
├── bge-m3/
@@ -52,6 +100,8 @@ search_results/
│ │ │ │ ...
```

</details>
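
For reference, a tree like the ones above would typically be produced by a run along the following lines (a sketch only; the `--reranker` option name is an assumption and may differ from the actual reranking flags in `scripts/README.md`):

```bash
# Hypothetical sketch: writes search_results/bge-m3/NoReranker/... and, when a
# reranker is supplied, search_results/bge-m3/bge-reranker-v2-m3/... as shown above.
# The --reranker flag name is an assumption; check scripts/README.md for the
# actual reranking options.
python evaluate_hf_transformers.py \
    --output_dir ./search_results \
    --encoder BAAI/bge-m3 \
    --reranker BAAI/bge-reranker-v2-m3 \
    --search_top_k 1000 \
    --overwrite False
```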

## Submit search results

### Package the output files
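
As a generic illustration only (an assumption, not the documented procedure), the search results directory can be bundled into a single archive before uploading, for example:

```bash
# Generic illustration only -- the exact archive format and layout expected by
# the leaderboard are defined by the documented packaging steps, not by this command.
zip -r search_results.zip search_results/
```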
12 changes: 8 additions & 4 deletions scripts/README.md
@@ -170,13 +170,14 @@ python evaluate_hf_transformers.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_hf_transformers.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--encoder BAAI/bge-m3 \
--search_top_k 1000 \
@@ -226,13 +227,14 @@ python evaluate_sentence_transformers.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_sentence_transformers.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--encoder sentence-transformers/all-MiniLM-L6-v2 \
--search_top_k 1000 \
@@ -274,13 +276,14 @@ python evaluate_bm25.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_bm25.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--bm25_tmp_dir ./bm25_tmp_dir \
--remove_bm25_tmp_dir True \
@@ -316,13 +319,14 @@ python evaluate_bm25.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_bm25.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--bm25_tmp_dir ./bm25_tmp_dir \
--remove_bm25_tmp_dir True \
