Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone works have been introduced to the community, such as MSMARCO, Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are limited in the following ways.
- Incapability of dealing with new domains. All of the existing benchmarks are static: they are built for pre-defined domains based on human-labeled data. As a result, they cannot cover new domains that users are interested in.
- Potential risk of over-fitting and data leakage. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks like BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain evaluation, in-domain training data is widely used during fine-tuning. Worse still, since the evaluation datasets are publicly available, test data can accidentally end up in the retrievers' training sets.

To address these limitations, AIR-Bench is designed with the following features.
- 🤖 Automated. The test data is automatically generated by large language models without human intervention. Therefore, it can instantly support the evaluation of new domains at very low cost. Moreover, the new test data is highly unlikely to be covered by the training sets of any existing retrievers.
- 🔍 Retrieval and RAG-oriented. The new benchmark is dedicated to evaluating retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering and paraphrase retrieval, it also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the relevant chunks of a very long document that contain the critical information for answering the input question (see the illustrative sketch after this list).
- 🔄 Heterogeneous and Dynamic. The test data is generated for diverse and continually expanding domains and languages (i.e., multi-domain and multi-lingual). As a result, it provides an increasingly comprehensive evaluation benchmark for the community.
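As a rough illustration of the inner-document retrieval setting (not part of the AIR-Bench tooling), the sketch below splits a long document into chunks, embeds them, and ranks the chunks against a question by cosine similarity. The embedding model, chunk size, and stride are arbitrary choices for illustration only.

```python
# Illustrative sketch of inner-document (long-doc) retrieval: split a long
# document into overlapping chunks, embed them, and rank the chunks against a
# question. Not part of the air-benchmark package; model name and chunking
# parameters are arbitrary.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers


def chunk_document(text: str, size: int = 200, stride: int = 150) -> list[str]:
    """Split a document into overlapping word-level chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), stride)] or [""]


long_document = "..."  # a very long document, e.g. a full report or book chapter
question = "What were the key findings of the study?"

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # any embedding model works here
chunks = chunk_document(long_document)
chunk_emb = model.encode(chunks, normalize_embeddings=True)
query_emb = model.encode([question], normalize_embeddings=True)

# Embeddings are L2-normalized, so a dot product gives cosine similarity.
scores = chunk_emb @ query_emb[0]
for rank, idx in enumerate(np.argsort(-scores)[:5], start=1):
    print(f"{rank}. score={scores[idx]:.3f}  chunk={chunks[idx][:80]}...")
```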
We plan to release new test datasets on a regular basis. The latest version is 24.04. You can check out the results at the AIR-Bench Leaderboard. Detailed results are available here.
This repo maintains the codebase for running the AIR-Bench evaluation. To run the evaluation, please install air-benchmark:

```bash
pip install air-benchmark
```
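Before running the evaluation scripts described below, it can help to confirm that the package is installed in the active environment. The snippet below is an optional, standard-library check; it only inspects the installed distribution metadata and does not touch the evaluation API.

```python
# Optional sanity check: verify that the air-benchmark distribution is
# installed in the current environment.
from importlib.metadata import PackageNotFoundError, version

try:
    print("air-benchmark version:", version("air-benchmark"))
except PackageNotFoundError:
    print("air-benchmark is not installed; run `pip install air-benchmark` first.")
```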
Follow the steps below to run evaluations and submit the results to the leaderboard (see here for more detailed information).
1. Run evaluations

   - See the scripts to run evaluations on AIR-Bench for your models.

2. Submit search results

   - Package the output files.

     - For results without a reranking model:

       ```bash
       cd scripts
       python zip_results.py \
       --results_dir search_results \
       --retriever_name [YOUR_RETRIEVAL_MODEL] \
       --save_dir search_results
       ```

     - For results with a reranking model:

       ```bash
       cd scripts
       python zip_results.py \
       --results_dir search_results \
       --retriever_name [YOUR_RETRIEVAL_MODEL] \
       --reranker_name [YOUR_RERANKING_MODEL] \
       --save_dir search_results
       ```

   - Upload the output `.zip` and fill in the model information at the AIR-Bench Leaderboard (an optional sanity check of the archive contents is sketched below).
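Before uploading, it can be worth double-checking that the packaged archive actually contains your search-result files. The snippet below is a small optional helper (not part of the AIR-Bench tooling); the archive path is a placeholder for whatever `zip_results.py` wrote to your `--save_dir`.

```python
# Optional sanity check before uploading: list the contents of the archive
# produced by zip_results.py. The path below is a placeholder; point it at
# the actual .zip file in your --save_dir.
import zipfile

archive_path = "search_results/YOUR_RETRIEVAL_MODEL.zip"  # placeholder path
with zipfile.ZipFile(archive_path) as zf:
    for info in zf.infolist():
        print(f"{info.filename}  ({info.file_size} bytes)")
```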
| Documentation | |
|---|---|
| 🏭 Pipeline | The data generation pipeline of AIR-Bench |
| 📋 Tasks | Overview of available tasks in AIR-Bench |
| 📈 Leaderboard | The interactive leaderboard of AIR-Bench |
| 🚀 Submit | Information on how to submit a model to AIR-Bench |
| 🤝 Contributing | How to contribute to AIR-Bench |
This work is inspired by MTEB and BEIR. Many thanks for the early feedback from @tomaarsen, @Muennighoff, @takatost, @chtlp.
TBD