update docs about how to submit to the leaderboard
hanhainebula committed Sep 28, 2024
1 parent 5b3589a commit f068e12
Showing 2 changed files with 61 additions and 7 deletions.
56 changes: 53 additions & 3 deletions docs/submit_to_leaderboard.md
@@ -1,9 +1,14 @@
# Submit to Leaderboard

The AIR-Bench is designed to be a closed-book benchmark. The golden truth is kept private. We provide a leaderboard for participants to submit
the top-k search results of their models and compare their performance with others. The leaderboard is hosted on the HuggingFace Hub.
The initial version, `AIR-Bench_24.04`, is designed as a closed-book benchmark: the golden truth is kept private from users. We provide a leaderboard where participants can submit the top-k search results of their models and compare their performance with others. The leaderboard is hosted on the HuggingFace Hub.

To submit your model to the leaderboard, please follow the steps below.
However, according to feedback from the community (see [issue #26](https://github.com/AIR-Bench/AIR-Bench/issues/26)), users could not see any evaluation results until they submitted their models’ search results to the leaderboard, which made it hard to iterate on model performance.

Therefore, for the latest version, `AIR-Bench_24.05`, we **split the queries into a test set and a dev set**. The golden truth of the dev set is provided, so users can evaluate their models’ performance on the dev set by themselves. The golden truth of the test set remains private; to be evaluated on it, users submit their models’ search results to the leaderboard.

To **evaluate your models on the dev set**, please refer to [these instructions](https://github.com/AIR-Bench/AIR-Bench/blob/main/scripts#5-compute-metrics-for-dev-set-optional). *We will contribute the dev set to [MTEB](https://huggingface.co/spaces/mteb/leaderboard) so that users can compare their models with others.*
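
As a quick illustration, a dev-split run might look like the sketch below. It reuses the `evaluate_hf_transformers.py` options documented in `scripts/README.md`; the `--splits dev` value is an assumption based on the dev/test split described above, and the metrics are then computed against the released dev-set golden truth as described in the linked instructions.

```bash
# Minimal sketch of a dev-split run (assumes --splits accepts "dev");
# see scripts/README.md for the full set of options.
python evaluate_hf_transformers.py \
    --task_types qa \
    --domains finance \
    --languages en \
    --splits dev \
    --output_dir ./search_results \
    --encoder BAAI/bge-m3 \
    --search_top_k 1000 \
    --overwrite False
```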

To **submit your model to the leaderboard** for evaluation on the test set, please follow the steps below.

## Installation

@@ -21,6 +26,49 @@ After running the evaluation, you will get the search results in the `output_dir`

For example, if you run the evaluation script with the `bge-m3` retrieval model and the `bge-reranker-v2-m3` reranking model, the search results will have a file structure like this:

<details><summary>air-benchmark>=0.1.0 (pypi version)</summary>

```shell
search_results/
├── bge-m3/
│ ├── NoReranker/
│ │ ├── qa
│ │ │ ├── arxiv
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ ├── finance
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ │ ├── zh_default_dev.json
│ │ │ │ ├── zh_default_test.json
│ │ │ │ ...
│ │ ├── long-doc
│ │ │ ├── book
│ │ │ │ ├── en_a-brief-history-of-time_stephen-hawking_dev.json
│ │ │ │ ├── en_origin-of-species_darwin_test.json
│ │ │ │ ...
│ ├── bge-reranker-v2-m3/
│ │ ├── qa
│ │ │ ├── arxiv
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ ├── finance
│ │ │ │ ├── en_default_dev.json
│ │ │ │ ├── en_default_test.json
│ │ │ │ ├── zh_default_dev.json
│ │ │ │ ├── zh_default_test.json
│ │ │ │ ...
│ │ ├── long-doc
│ │ │ ├── book
│ │ │ │ ├── en_a-brief-history-of-time_stephen-hawking_dev.json
│ │ │ │ ├── en_origin-of-species_darwin_test.json
│ │ │ │ ...
```

</details>

<details><summary>air-benchmark<=0.0.4 (pypi version)</summary>

```shell
search_results/
├── bge-m3/
@@ -52,6 +100,8 @@ search_results/
│ │ │ │ ...
```

</details>
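
For reference, a tree like the ones above would typically be produced by a run along the following lines (a sketch only; the `--reranker` option name is an assumption and may differ from the actual reranking flags in `scripts/README.md`):

```bash
# Hypothetical sketch: writes search_results/bge-m3/NoReranker/... and, when a
# reranker is supplied, search_results/bge-m3/bge-reranker-v2-m3/... as shown above.
# The --reranker flag name is an assumption; check scripts/README.md for the
# actual reranking options.
python evaluate_hf_transformers.py \
    --output_dir ./search_results \
    --encoder BAAI/bge-m3 \
    --reranker BAAI/bge-reranker-v2-m3 \
    --search_top_k 1000 \
    --overwrite False
```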

## Submit search results

### Package the output files
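
As a generic illustration only (an assumption, not the documented procedure), the search results directory can be bundled into a single archive before uploading, for example:

```bash
# Generic illustration only -- the exact archive format and layout expected by
# the leaderboard are defined by the documented packaging steps, not by this command.
zip -r search_results.zip search_results/
```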
12 changes: 8 additions & 4 deletions scripts/README.md
@@ -170,13 +170,14 @@ python evaluate_hf_transformers.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_hf_transformers.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--encoder BAAI/bge-m3 \
--search_top_k 1000 \
@@ -226,13 +227,14 @@ python evaluate_sentence_transformers.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_sentence_transformers.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--encoder sentence-transformers/all-MiniLM-L6-v2 \
--search_top_k 1000 \
@@ -274,13 +276,14 @@ python evaluate_bm25.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_bm25.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--bm25_tmp_dir ./bm25_tmp_dir \
--remove_bm25_tmp_dir True \
@@ -316,13 +319,14 @@ python evaluate_bm25.py \
--overwrite False
```

- Run the tasks in the specified task type, domains, and languages:
- Run the tasks in the specified task types, domains, languages, and splits:

```bash
python evaluate_bm25.py \
--task_types qa \
--domains finance law \
--languages en \
--splits test \
--output_dir ./search_results \
--bm25_tmp_dir ./bm25_tmp_dir \
--remove_bm25_tmp_dir True \
