Commit

Upload code
sunnweiwei committed Dec 13, 2023
1 parent 3df8fcf commit c051e5c
Showing 10 changed files with 2,441 additions and 3 deletions.
40 changes: 37 additions & 3 deletions README.md
@@ -1,3 +1,37 @@
# Learning to Tokenize for Generative Retrieval

Code of the paper [Learning to Tokenize for Generative Retrieval](https://arxiv.org/abs/2304.04171).

![Model](assets/model.png)

## Environment
Required packages: `pytorch`, `transformers`, `accelerate`, `faiss`, `k_means_constrained`
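
Before training, it can help to confirm that these packages are importable. The sketch below is not part of the repository. It assumes the import names `torch`, `transformers`, `accelerate`, `faiss`, and `k_means_constrained`, which may differ from the names used to install them; `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above (not taken from the repository).
REQUIRED_MODULES = ["torch", "transformers", "accelerate", "faiss", "k_means_constrained"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks up each module without importing it, so it stays fast even for large packages.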

## Dataset
NQ320K: unzip `dataset/nq320k.zip`

Other datasets coming soon.
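
The unzip step can also be done from Python, which may be convenient on systems without an `unzip` command. This is a minimal sketch using only the standard library; `extract_archive` is a hypothetical helper, and the layout of the archive (a top-level `nq320k/` folder) is an assumption inferred from the paths used in the training command, not something verified against the zip file.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    archive_path = Path("dataset/nq320k.zip")
    if archive_path.exists():
        # Assumption: the archive contains a top-level nq320k/ folder, so extracting
        # into dataset/ yields dataset/nq320k/train.json and the other files.
        names = extract_archive(archive_path, "dataset")
        print(f"Extracted {len(names)} entries")
    else:
        print(f"{archive_path} not found; run this from the repository root.")
```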

## Training and Evaluation
Run GenRet on NQ320K:
```bash
python run.py \
  --model_name t5-base \
  --code_num 512 \
  --max_length 3 \
  --train_data dataset/nq320k/train.json \
  --dev_data dataset/nq320k/dev.json \
  --corpus_data dataset/nq320k/corpus_lite.json \
  --save_path out/model
```
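
To get a feel for the two tokenizer settings in the command, the sketch below computes how many distinct docids they allow. It assumes, based on the flag names rather than on `run.py` itself, that `--code_num` is the number of codes available at each docid position and `--max_length` is the number of positions. `docid_space` is a hypothetical helper, not part of the repository.

```python
def docid_space(code_num: int, max_length: int) -> int:
    """Count distinct docids when each of max_length positions takes one of code_num codes."""
    if code_num < 1 or max_length < 1:
        raise ValueError("code_num and max_length must be positive")
    return code_num ** max_length


if __name__ == "__main__":
    # Values taken from the NQ320K command above.
    print(docid_space(512, 3))  # 134217728
```

Under that reading, the settings above give roughly 134 million possible docids.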


Code for generative retrieval baselines: `baseline.py`

Code for dense retrieval baselines: `dpr.py`

## Cite
```bibtex
@article{Sun2023LearningTT,
  title={Learning to Tokenize for Generative Retrieval},
  author={Weiwei Sun and Lingyong Yan and Zheng Chen and Shuaiqiang Wang and Haichao Zhu and Pengjie Ren and Zhumin Chen and Dawei Yin and M. de Rijke and Zhaochun Ren},
  journal={ArXiv},
  year={2023},
  volume={abs/2304.04171}
}
```


Binary file added assets/model.png
