The NQ-open task, introduced by Lee et al. (2019), is an open-domain question answering benchmark derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.
The NQ-open task format was also used as part of the EfficientQA competition at NeurIPS 2020. Results from the EfficientQA competition are reported in Min et al. (2021).
The EfficientQA competition used different dev and test splits from the original NQ-open task. This repository contains both the original NQ-open data and the EfficientQA data. Users should take care to ensure they are reporting metrics on the correct splits. All work preceding the EfficientQA competition (December 2020) reports results on the original NQ-open dev split.
The different splits have all been created from Natural Questions data using this conversion script. Split statistics are given below. More details on the data format, including the various answer fields present in the EfficientQA test set, are given in the Data Format section of this page.
| Split | Size | Filename |
|---|---|---|
| Train | 87,925 | `NQ-open.train.jsonl` |
| Original Dev | 3,610 | `NQ-open.dev.jsonl` |
| EfficientQA Dev | 1,800 | `NQ-open.efficientqa.dev.1.1.jsonl` |
| EfficientQA Test | 1,769 | `NQ-open.efficientqa.test.1.1.jsonl` |
All of the Natural Questions data is released under the CC BY-SA 3.0 license.
All of the data splits, apart from `NQ-open.efficientqa.test.1.1.jsonl`, contain the following fields:

- `question`: 'who signed the sugauli treaty on behalf of nepal'
- `answer`: ['Raj Guru Gajaraj Mishra']

Predictions should be compared to the contents of the `answer` field using the NQ-open evaluation script.
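For illustration, here is a minimal Python sketch of reading one of these JSONL splits; the path assumes a local copy of the original dev file, and the field access simply mirrors the example above. Reported numbers should still be computed with the NQ-open evaluation script.

```python
import json

def load_split(path):
    """Load an NQ-open split: one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Assumes a local copy of the original dev split.
examples = load_split("NQ-open.dev.jsonl")
print(len(examples))            # 3,610 for the original dev split
print(examples[0]["question"])  # question string
print(examples[0]["answer"])    # list of reference answer strings
```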
As part of the EfficientQA competition, predictions from the top-performing submissions were sent for further evaluation. The details of this evaluation are provided in Min et al. (2021).
Instead of a single `answer` field, the EfficientQA test set has the following fields containing reference answer strings:

- `answer`: contains the answers from the original NQ annotations,
- `def_correct_predictions`: contains predictions from top-performing submissions that were determined to be definitely correct by annotators,
- `poss_correct_predictions`: contains predictions from top-performing submissions that were determined to be possibly correct given some interpretation of the question,
- `answer_and_def_correct_predictions`: contains the union of `answer` and `def_correct_predictions`.
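As a sketch of what these fields look like, the snippet below reads the first example of the test file (assuming a local copy) and prints each reference-answer field; the comments restate the field descriptions above.

```python
import json

# Assumes a local copy of the EfficientQA test split.
with open("NQ-open.efficientqa.test.1.1.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))

print(first["question"])
print(first["answer"])                              # original NQ annotations
print(first["def_correct_predictions"])             # judged definitely correct
print(first["poss_correct_predictions"])            # judged possibly correct
print(first["answer_and_def_correct_predictions"])  # union of the first two
```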
We include `poss_correct_predictions` in this release for completeness. However, we do not recommend using these for evaluation of any systems, since they rarely provide a fully satisfactory answer to the question. You can evaluate your predictions against the standard `answer` references, or against the expanded `answer_and_def_correct_predictions`, using the evaluation code with the appropriate `answer_field` flag.
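As an illustration only, the sketch below computes a SQuAD-style normalized exact-match score against a selectable reference field. The normalization here (lower-casing, stripping punctuation and articles) is an assumption and may differ from the official evaluation code, which should be used for any reported results.

```python
import re
import string

def normalize(text):
    """Lower-case, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_accuracy(examples, predictions, answer_field="answer"):
    """Fraction of predictions matching any reference string in `answer_field`."""
    correct = 0
    for example, prediction in zip(examples, predictions):
        references = {normalize(a) for a in example[answer_field]}
        correct += normalize(prediction) in references
    return correct / len(examples)

# Toy usage with a single (hypothetical) prediction:
examples = [{"question": "who signed the sugauli treaty on behalf of nepal",
             "answer": ["Raj Guru Gajaraj Mishra"]}]
predictions = ["raj guru gajaraj mishra"]
print(exact_match_accuracy(examples, predictions, answer_field="answer"))  # 1.0
```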
For further discussion of the data, as well as our recommendations for robust evaluation, please see Min et al. (2021). Also, please remember that almost all work to date has reported accuracy on the original NQ-open dev set described above. Any work that uses the EfficientQA test set should state this choice explicitly.
Due to a bug in the post-competition labeling process, there may be small discrepancies between results calculated on the 1,769 examples released as part of the EfficientQA rated test data and the 1,800 examples used in the original EfficientQA leaderboard. Refer to Min et al. (2021) for official results.
Baseline results on each split are shown below.

| Method | Original Dev | EfficientQA Dev | EfficientQA Test |
|---|---|---|---|
| TF-IDF Nearest Question | 22% | 17% | 16% |
| REALM | 40% | 36% | 35% |
| T5-XXL | 37% | 32% | 32% |
| DPR | 41% | 37% | 36% |
| DPR subset | 35% | 30% | 30% |
If you use this data, please cite Kwiatkowski et al. (2019) and Lee et al. (2019).