[👑 NeurIPS 2022 Outstanding Paper] A Neural Corpus Indexer for Document Retrieval -- NCI (Paper)
NCI is an end-to-end, differentiable sequence-to-sequence document retrieval model that generates relevant document identifiers directly for a given query. In our evaluation on the Google NQ and TriviaQA datasets, NCI outperforms all baselines and model-based indexers:
| Model | Recall@1 | Recall@10 | Recall@100 | MRR@100 |
|---|---|---|---|---|
| NCI w/ qg-ft (Ensemble) | 72.78 | 91.76 | 96.22 | 80.12 |
| NCI (Ensemble) | 70.46 | 89.35 | 94.75 | 77.82 |
| NCI w/ qg-ft (Large) | 68.65 | 88.45 | 94.53 | 76.10 |
| NCI w/ qg-ft (Base) | 68.91 | 88.48 | 94.48 | 76.17 |
| NCI (Large) | 66.23 | 85.27 | 92.49 | 73.37 |
| NCI (Base) | 65.86 | 85.20 | 92.42 | 73.12 |
| DSI (T5-Base) | 27.40 | 56.60 | -- | -- |
| DSI (T5-Large) | 35.60 | 62.60 | -- | -- |
| SEAL (Large) | 59.93 | 81.24 | 90.93 | 67.70 |
| ANCE (MaxP) | 52.63 | 80.38 | 91.31 | 62.84 |
| BM25 + DocT5Query | 35.43 | 61.83 | 76.92 | 44.47 |
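For reference, the metrics above follow the standard definitions. The sketch below is illustrative only (it is not taken from the repo's evaluation code): Recall@k is the fraction of queries whose gold document appears in the top-k results, and MRR@k averages the reciprocal rank of the gold document, counting 0 when it is absent from the top k.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of queries whose gold doc appears in the top-k results."""
    return sum(g in r[:k] for r, g in zip(ranked, gold)) / len(gold)

def mrr_at_k(ranked, gold, k=100):
    """Mean reciprocal rank of the gold doc; 0 when absent from the top k."""
    total = 0.0
    for r, g in zip(ranked, gold):
        if g in r[:k]:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)
```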
For more information, check out our paper: https://arxiv.org/abs/2206.02743
[1] Install Anaconda.
[2] Clone the repository:
```shell
git clone https://github.com/solidsea98/Neural-Corpus-Indexer-NCI.git
cd Neural-Corpus-Indexer-NCI
```
[3] Create the conda environment:
```shell
conda env create -f environment.yml
conda activate NCI
```
[4] Docker:
If needed, you can instead use the NCI Docker image: mzmssg/corpus_env:latest.
You can process data with NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.
[1] Dataset Download.
Currently NCI is evaluated on the Google NQ and TriviaQA datasets. Please download them before re-training.
[2] Semantic Identifier
NCI uses content-based document identifiers: a pre-trained BERT model generates document embeddings, the documents are clustered with hierarchical k-means, and a semantic identifier is assigned to each document. You can generate several sets of embeddings and semantic identifiers and train multiple NCI models for ensembling.
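The identifier-assignment idea can be sketched as follows. This is a minimal, illustrative implementation, not the repo's actual script: the function names are hypothetical, a toy NumPy k-means stands in for the real clustering, and the embeddings would in practice come from the pre-trained BERT model. Each level of the recursion contributes one digit, so semantically similar documents share identifier prefixes.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Toy Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def assign_ids(embeddings, doc_ids, k=2, leaf_size=2, prefix=()):
    """Recursively cluster documents; each doc gets a tuple identifier."""
    if len(doc_ids) <= leaf_size:
        return {d: prefix + (i,) for i, d in enumerate(doc_ids)}
    labels = kmeans(embeddings, k)
    if len(np.unique(labels)) < 2:  # degenerate split: stop here
        return {d: prefix + (i,) for i, d in enumerate(doc_ids)}
    out = {}
    for j in range(k):
        mask = labels == j
        if mask.any():
            out.update(assign_ids(embeddings[mask],
                                  [d for d, m in zip(doc_ids, mask) if m],
                                  k, leaf_size, prefix + (int(j),)))
    return out
```

In the actual pipeline the clustering is done by the data-processing notebooks; this sketch only illustrates why documents with similar embeddings end up with shared identifier prefixes, which is what the seq2seq decoder exploits.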
[3] Query Generation
In our study, query generation significantly improves retrieval performance, especially for long-tail queries.
NCI uses the docTTTTTquery checkpoint to generate synthetic queries. Fine-tuning this checkpoint produces query-generation files that improve retrieval results further. Here is how to fine-tune the model: the following command fine-tunes it for 4k iterations to predict queries. We assume the tsv training file is at gs://your_bucket/qcontent_train_512.csv (download from above); change your_tpu_name, your_tpu_zone, your_project_id, and your_bucket accordingly.
```shell
t5_mesh_transformer \
  --tpu="your_tpu_name" \
  --gcp_project="your_project_id" \
  --tpu_zone="your_tpu_zone" \
  --model_dir="gs://your_bucket/models/" \
  --gin_param="init_checkpoint = 'gs://your_bucket/model.ckpt-1004000'" \
  --gin_file="dataset.gin" \
  --gin_file="models/bi_v1.gin" \
  --gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = 'gs://your_bucket/qcontent_train_512.csv'" \
  --gin_file="learning_rate_schedules/constant_0_001.gin" \
  --gin_param="run.train_steps = 1008000" \
  --gin_param="tokens_per_batch = 131072" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = 'v2-8'"
```
For more details, please refer to the docTTTTTquery documentation.
Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.
Once data pre-processing is complete, you can launch training via train.sh. You can also train with our processed NQ data (download it to './Data_process/NQ_dataset/') or TriviaQA data (download it to './Data_process/trivia_dataset/').
Please run infer.sh with our NQ checkpoint or TriviaQA checkpoint (download it to './NCI_model/logs/'). You can also run inference with your own checkpoint to evaluate model performance.
To ensemble on the NQ or TriviaQA dataset, combine our released results (download them to './NCI_model/logs/') or your own results.
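As one illustration of how ranked lists from multiple runs can be combined (the repo's ensemble script may use a different scheme, e.g. fusing model scores directly), here is reciprocal rank fusion, a common result-level fusion method; the function name and parameters are hypothetical:

```python
def rrf_ensemble(runs, k=60, topn=100):
    """Fuse ranked doc-id lists from several runs via reciprocal rank fusion.

    Each document scores 1 / (k + rank + 1) per run; documents ranked
    highly by several runs accumulate the largest fused scores.
    """
    scores = {}
    for run in runs:
        for rank, doc in enumerate(run):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:topn]
```

The constant k dampens the influence of any single run's top ranks, which makes the fusion robust when the individual runs disagree.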
If you find this work useful for your research, please cite:
```
@article{wang2022neural,
  title={A neural corpus indexer for document retrieval},
  author={Wang, Yujing and Hou, Yingyan and Wang, Haonan and Miao, Ziming and Wu, Shibin and Chen, Qi and Xia, Yuqing and Chi, Chengmin and Zhao, Guoshuai and Liu, Zheng and others},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={25600--25614},
  year={2022}
}
```
We learned a lot and borrowed some code from the following projects when building NCI.