Commit 0dd9567 (1 parent: e4d2f8c)
Showing 1 changed file with 34 additions and 0 deletions.
# NovelEval
*A new test set of novel queries and passages that the latest LLMs have not been trained on*

The questions in current benchmark datasets were typically gathered years ago, so existing LLMs may already have knowledge of them.
Furthermore, since many LLMs do not disclose their training data, existing benchmark test sets may be contaminated.
Re-ranking models, however, are expected to comprehend, reason about, and rank information that is unknown to them.

Therefore, we suggest constructing **continuously updated IR test sets**, so that the questions, the passages to be ranked, and the relevance annotations have not been learned by the latest LLMs and the evaluation remains fair.

## Data Collection
As an initial effort, we built **NovelEval-2306**, a test set of 21 novel questions collected in June 2023.
The test set was constructed by gathering, from 4 domains, questions and passages that were published after the release of GPT-4.
To check that GPT-4 had no prior knowledge of these questions, we presented them to both gpt-4-0314 and gpt-4-0613.
For instance, the question *"Which film was the 2023 Palme d'Or winner?"* pertains to the Cannes Film Festival that concluded on May 27, 2023, making its answer inaccessible to most existing LLMs.
Next, we retrieved 20 candidate passages for each question using Google search.
The relevance of each passage was manually labeled as 0 (not relevant), 1 (partially relevant), or 2 (relevant).

## Files
| Type | Filename | Format |
| ---- | ---- | ---- |
| Corpus | [corpus.tsv](https://github.com/sunnweiwei/RankGPT/blob/main/NovelEval/corpus.tsv) | tsv: docid, content |
| Queries | [queries.tsv](https://github.com/sunnweiwei/RankGPT/blob/main/NovelEval/queries.tsv) | tsv: qid, query |
| Qrels | [qrels.txt](https://github.com/sunnweiwei/RankGPT/blob/main/NovelEval/qrels.txt) | TREC qrels format: qid, Q0, docid, relevance-score |
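
The formats listed above can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper names `load_tsv_pairs` and `load_qrels` are hypothetical, the demo rows are invented, and it assumes the TSV files have no header row and use one record per line (skip the first line if a header is present).

```python
import io
from collections import defaultdict


def load_tsv_pairs(lines):
    """Read two-column TSV lines (id, text) into a dict keyed by id."""
    pairs = {}
    for line in lines:
        line = line.rstrip("\r\n")
        key, sep, text = line.partition("\t")
        if not sep:
            continue  # skip blank lines and lines without a tab
        pairs[key] = text
    return pairs


def load_qrels(lines):
    """Read TREC qrels lines (qid, Q0, docid, relevance) into {qid: {docid: rel}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _q0, docid, rel = parts
        qrels[qid][docid] = int(rel)
    return dict(qrels)


# Invented demo rows (assumption: the real ids may look different).
demo_queries = load_tsv_pairs(io.StringIO("q1\tWhich film was the 2023 Palme d'Or winner?\n"))
demo_qrels = load_qrels(io.StringIO("q1 Q0 d1 2\nq1 Q0 d2 0\n"))
print(demo_queries)
print(demo_qrels)
```

With the actual files, an open file handle can be passed instead of the in-memory demo data, for example `load_tsv_pairs(open("queries.tsv", encoding="utf-8"))`.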

## Results

| Method | nDCG@1 | nDCG@5 | nDCG@10 |
| ---- | ----- | ----- | ----- |
| BM25 | 33.33 | 45.96 | 55.77 |
| monoBERT (340M) | 78.57 | 70.65 | 77.27 |
| monoT5 (220M) | 83.33 | 77.46 | 81.27 |
| monoT5 (3B) | 83.33 | 78.38 | 84.62 |
| gpt-3.5-turbo | 76.19 | 74.15 | 75.71 |
| **gpt-4** | **85.71** | **87.49** | **90.45** |
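
For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@k can be computed for one query from graded labels (0, 1, 2). It is an illustration under stated assumptions, not the evaluation script used for the table: it uses linear gain (`rel / log2(rank + 1)`), whereas some tools use exponential gain (`2**rel - 1`), and the function names and demo data are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_docids, doc_rels, k):
    """nDCG@k for one query.

    ranked_docids: docids in the order produced by the ranker.
    doc_rels: {docid: graded relevance}; unjudged docids count as 0.
    """
    gains = [doc_rels.get(docid, 0) for docid in ranked_docids]
    ideal = sorted(doc_rels.values(), reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Invented labels and ranking for a single query.
doc_rels = {"d1": 2, "d2": 0, "d3": 1}
ranking = ["d2", "d1", "d3"]
for k in (1, 3):
    print(f"nDCG@{k} = {ndcg_at_k(ranking, doc_rels, k):.4f}")
```

A dataset-level score is the mean of the per-query values; the numbers in the table appear to be such means scaled by 100.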