start to write READEME
EC2 Default User committed Nov 15, 2023
1 parent 2f80814 commit 7137315
Showing 13 changed files with 419 additions and 116 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
# customized
experiments-full-t5seq-aq/
wandb/
data/

# Byte-compiled / optimized / DLL files
__pycache__/
75 changes: 71 additions & 4 deletions READEME.md
@@ -1,4 +1,71 @@
# Package installation
pip install -r requirement.txt
pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge faiss-cpu
# Scalable and Effective Generative Information Retrieval
This repo provides the source code and checkpoints for our paper [Scalable and Effective Generative Information Retrieval]() (RIPOR).

## Package installation
- pip install -r requirement.txt
- pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
- conda install -c conda-forge faiss-gpu
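
A quick sanity check (a minimal sketch; it only assumes the environment set up above is active) confirms that the CUDA build of PyTorch and FAISS import correctly:
```
# minimal environment check (assumes the installation steps above succeeded)
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import faiss; print('faiss imported OK')"
```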

## Inference
We use 4 A100 GPUs to run the model. Preprocessing takes roughly 20 minutes and the full evaluation about 90 minutes. You can use other GPU types, such as V100, but it may take longer.
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```

## Training
Our framework contains multiple training phases (see Figure 2 in the paper for details). You can either train sequentially from the first phase, or use the checkpoint we provide for each phase directly in the subsequent phases.

### Phase 1: Relevance-Based DocID Initialization ($M^0$)
You will start from `t5-base` and obtain the model $M^0$ after this phase. This phase treats the T5 model as a dense encoder, and we use a two-stage strategy to train it. In the first stage, we use BM25 negatives. Run the following script to train the model:
```
bash full_scripts/full_train_t5seq_encoder_0.sh
```
Run the following script for the second stage training:
```
bash full_scripts/full_train_t5seq_encoder_1.sh
```
Now you have obtained the model $M^0$. Congrats! Let's use $M^0$ to generate the DocID for each document. Before running the script `full_scripts/full_evaluate_t5seq_aq_encoder.sh`, change the `task` variable in line 3 to `task=all_aq_pipline` (see the `sed` sketch below for one way to do this). Then run the script:
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
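If you prefer not to edit the script by hand, a one-liner like the following also works (a sketch; it assumes the `task=` assignment is still on line 3 of the script):
```
# hypothetical convenience: rewrite line 3 of the evaluation script, then run it
sed -i '3s/^task=.*/task=all_aq_pipline/' full_scripts/full_evaluate_t5seq_aq_encoder.sh
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```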
### Phase 2: Seq2Seq Pretraining + Initial Fine-tuning ($M^2$)
You will start from $M^0$ and obtain $M^2$ after this phase.
#### If you skip phase 1
Download all files from the folder `experiments-full-t5seq-aq/t5_docid_gen_encoder_1`, which contains the training files and the checkpoint you need for this phase.
Run the script:
```
bash full_scripts/full_train_t5seq_seq2seq_0_1_pipeline.sh
```
#### If you train $M^0$ yourself in phase 1
You should create your own training set with the following procedure (a consolidated sketch of all four steps follows this list):
- Change the `task` variable in line 3 to `task=retrieve_train_queries` in `full_scripts/full_evaluate_t5seq_aq_encoder.sh`, then run the script:
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
- Use the teacher model (cross-encoder) to rerank the obtained run.json file
```
bash full_scripts/rerank_for_create_trainset.sh
```
- Add the qrels (relevant DocIDs) to the training set:
```
python t5_pretrainer/aq_preprocess/add_qrel_to_rerank_run.py
```
- Then run the script:
```
bash full_scripts/full_train_t5seq_seq2seq_0_1_pipeline.sh
```
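Putting the four steps above together, the whole procedure looks roughly like this (a consolidated sketch; the `sed` edit assumes the `task=` assignment sits on line 3 of the evaluation script):
```
# 1. retrieve the top-100 documents for all training queries with M^0
sed -i '3s/^task=.*/task=retrieve_train_queries/' full_scripts/full_evaluate_t5seq_aq_encoder.sh
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh

# 2. rerank the retrieved run with the cross-encoder teacher
bash full_scripts/rerank_for_create_trainset.sh

# 3. add the qrels (relevant DocIDs) to the reranked training data
python t5_pretrainer/aq_preprocess/add_qrel_to_rerank_run.py

# 4. run seq2seq pretraining + initial fine-tuning to obtain M^2
bash full_scripts/full_train_t5seq_seq2seq_0_1_pipeline.sh
```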
### Phase 3: Prefix-Oriented Ranking Optimization ($M^3$)
#### If you skip phase 1 and phase 2
Download all files from `experiments-full-t5seq-aq/t5_docid_gen_encoder_1` and `experiments-full-t5seq-aq/t5seq_aq_encoder_seq2seq_1`; they provide the checkpoints, training data, and initialized DocIDs you need. You start from $M^2$ and obtain the checkpoint $M^3$ after this phase. Run the script:
```
bash full_scripts/full_lng_knp_train_pipline.sh
```
#### If you do not skip phase 1 and phase 2
You are a hard-working person who trains all the models yourself. You are only one step away from success! But be patient, it might take some time. Since we built the DocIDs ourselves, we need to generate our own training data. Follow the procedure below for data generation.
- Apply constrained beam search on $M^2$ to generate data for different prefix lengths:
Change the `task` variable in line 3 to `t5seq_aq_get_qid_to_smtid_rankdata` in `full_scripts/full_evaluate_t5seq_aq_encoder.sh`, then run the script:
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
-
100 changes: 68 additions & 32 deletions full_scripts/full_evaluate_t5seq_aq_encoder.sh
@@ -1,16 +1,16 @@
#!/bin/bash

task=t5seq_aq_retrieve_docids_use_sub_smtid
data_root_dir=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full
data_root_dir=./data/msmarco-full
collection_path=$data_root_dir/full_collection/
q_collection_paths='["/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/TREC_DL_2019/queries_2019/","/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/TREC_DL_2020/queries_2020/","/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/dev_queries/"]'
eval_qrel_path='["/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/dev_qrel.json","/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/TREC_DL_2019/qrel.json","/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/TREC_DL_2019/qrel_binary.json","/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/TREC_DL_2020/qrel.json","/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/TREC_DL_2020/qrel_binary.json"]'
q_collection_paths='["./data/msmarco-full/TREC_DL_2019/queries_2019/","./data/msmarco-full/TREC_DL_2020/queries_2020/","./data/msmarco-full/dev_queries/"]'
eval_qrel_path='["./data/msmarco-full/dev_qrel.json","./data/msmarco-full/TREC_DL_2019/qrel.json","./data/msmarco-full/TREC_DL_2019/qrel_binary.json","./data/msmarco-full/TREC_DL_2020/qrel.json","./data/msmarco-full/TREC_DL_2020/qrel_binary.json"]'
experiment_dir=experiments-full-t5seq-aq

if [ $task = all_aq_pipline ]; then
echo "task: $task"

model_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/sentence-t5"
model_dir="./$experiment_dir/t5_docid_gen_encoder_1"
pretrained_path=$model_dir/checkpoint
index_dir=$model_dir/aq_index
mmap_dir=$model_dir/mmap
@@ -47,12 +47,16 @@ if [ $task = all_aq_pipline ]; then
--model_dir=$model_dir \
--M=32 \
--bits=8

python -m t5_pretrainer.aq_preprocess.change_embed_layer \
--model_dir=$model_dir

elif [ $task = aq_to_flat_index_search_evaluate ]; then
echo "task: $task"
data_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
data_dir="./$experiment_dir/t5_docid_gen_encoder_1"
docid_to_smtid_path=$data_dir/aq_smtid/docid_to_smtid.json

model_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
model_dir="./$experiment_dir/t5_docid_gen_encoder_1"
pretrained_path=$model_dir/no_share_checkpoint
index_dir=$model_dir/aq_flat_index
out_dir=$model_dir/aq_flat_out
@@ -69,7 +73,7 @@ elif [ $task == "retrieve_train_queries" ]; then

# the model_dir should be changed every time
experiment_dir=experiments-full-t5seq-aq
model_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
model_dir="./$experiment_dir/t5_docid_gen_encoder_1"
index_dir=$model_dir/index
out_dir=$model_dir/out/
pretrained_path=$model_dir/checkpoint
@@ -79,13 +83,13 @@ elif [ $task == "retrieve_train_queries" ]; then
--pretrained_path=$pretrained_path \
--index_dir=$index_dir \
--out_dir=$out_dir \
--q_collection_paths='["/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/all_train_queries/train_queries"]' \
--q_collection_paths='["./data/msmarco-full/all_train_queries/train_queries"]' \
--topk=100 \
--encoder_type=t5seq_pretrain_encoder
elif [ $task = all_pipline ]; then
echo "task: $task"

model_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
model_dir="./$experiment_dir/t5_docid_gen_encoder_1"
pretrained_path=$model_dir/checkpoint
index_dir=$model_dir/index
out_dir=$model_dir/out
@@ -113,39 +117,71 @@ elif [ $task = all_pipline ]; then
elif [ $task = "t5seq_aq_get_qid_to_smtid_rankdata" ]; then
export CUDA_VISIBLE_DEVICES=0,1,2,3
echo "task: $task"
data_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
data_dir="./$experiment_dir/t5_docid_gen_encoder_1"
docid_to_smtid_path=$data_dir/aq_smtid/docid_to_smtid.json

model_dir=/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5seq_aq_encoder_seq2seq_1_lng_knp_mnt_32_dcy_2
model_dir=./$experiment_dir/t5seq_aq_encoder_seq2seq_1
pretrained_path=$model_dir/checkpoint
train_query_dir="/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/all_train_queries/train_queries/"

# need to remove later
max_new_token=16

out_dir=$model_dir/sub_smtid_"${max_new_token}"_out/
python -m torch.distributed.launch --nproc_per_node=4 -m t5_pretrainer.evaluate \
--pretrained_path=$pretrained_path \
--out_dir=$out_dir \
--task=$task \
--docid_to_smtid_path=$docid_to_smtid_path \
--topk=100 \
--batch_size=4 \
--train_query_dir=$train_query_dir \
--max_new_token=$max_new_token

python -m t5_pretrainer.evaluate \
--task="$task"_2 \
--out_dir=$out_dir
train_query_dir="./data/msmarco-full/all_train_queries/train_queries/"

# Apply beam search to generate prefix with length 4, 8, 16
for max_new_token in 4 8 16
do
out_dir=$model_dir/sub_smtid_"${max_new_token}"_out/
python -m torch.distributed.launch --nproc_per_node=4 -m t5_pretrainer.evaluate \
--pretrained_path=$pretrained_path \
--out_dir=$out_dir \
--task=$task \
--docid_to_smtid_path=$docid_to_smtid_path \
--topk=100 \
--batch_size=4 \
--train_query_dir=$train_query_dir \
--max_new_token=$max_new_token

python -m t5_pretrainer.evaluate \
--task="$task"_2 \
--out_dir=$out_dir

python t5_pretrainer/aq_preprocess/argparse_from_qid_smtid_rank_to_qid_smtid_docids.py \
--root_dir=$out_dir
done

# Since prefix=32 and prefix=16 rank almost the same documents, we directly expand from 16 to 32 to save time.
python t5_pretrainer/aq_preprocess/expand_smtid_for_qid_smtid_docids.py \
--data_dir=$data_dir \
--src_qid_smtid_rankdata_path=$model_dir/sub_smtid_16_out/qid_smtid_rankdata.json \
--out_dir=$model_dir/sub_smtid_32_out

python t5_pretrainer/aq_preprocess/argparse_from_qid_smtid_rank_to_qid_smtid_docids.py \
--root_dir=$model_dir/sub_smtid_32_out

# let's rerank the data
for max_new_token in 4 8 16 32
do
qid_smtid_docids_path=$model_dir/sub_smtid_"$max_new_token"_out/qid_smtid_docids.train.json

python -m torch.distributed.launch --nproc_per_node=8 -m t5_pretrainer.rerank \
--train_queries_path=$train_queries_path \
--collection_path=$collection_path \
--model_name_or_path=cross-encoder/ms-marco-MiniLM-L-6-v2 \
--max_length=256 \
--batch_size=256 \
--qid_smtid_docids_path=$qid_smtid_docids_path \
--task=cross_encoder_rerank_for_qid_smtid_docids

python -m t5_pretrainer.rerank \
--out_dir=$model_dir/sub_smtid_"$max_new_token"_out \
--task=cross_encoder_rerank_for_qid_smtid_docids_2
done
elif [ $task = "t5seq_aq_retrieve_docids_use_sub_smtid" ]; then
export CUDA_VISIBLE_DEVICES=0,1,2,3
echo "task: $task"
data_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
data_dir="./$experiment_dir/t5_docid_gen_encoder_1"
docid_to_smtid_path=$data_dir/aq_smtid/docid_to_smtid.json

# need to modify for a new experiment
max_new_token=32
model_dir=/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5seq_aq_encoder_seq2seq_1_lng_knp_self_mnt_32_dcy_2/
model_dir=./$experiment_dir/t5seq_aq_encoder_seq2seq_1_lng_knp_self_mnt_32_dcy_2/
pretrained_path=$model_dir/checkpoint
out_dir=$model_dir/out_docid_from_sub_"$max_new_token"_top1000/

28 changes: 13 additions & 15 deletions full_scripts/full_lng_knp_train_pipline.sh
@@ -4,22 +4,22 @@ experiment_dir=experiments-full-t5seq-aq

# when max_new_token=4
task=t5seq_aq_encoder_margin_mse_sub_smtid
data_root_dir=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full
data_root_dir=./data/msmarco-full
collection_path=$data_root_dir/full_collection/
queries_path=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/all_train_queries/train_queries
queries_path=./data/msmarco-full/all_train_queries/train_queries

data_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
data_dir="./$experiment_dir/t5_docid_gen_encoder_1"
docid_to_smtid_path=$data_dir/aq_smtid/docid_to_smtid.json
output_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/"
output_dir="./$experiment_dir/"

# need to change for every experiment
model_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5seq_aq_encoder_seq2seq_1"
model_dir="./$experiment_dir/t5seq_aq_encoder_seq2seq_1"
pretrained_path=$model_dir/checkpoint

# also need to be changed by condition
decay=2
max_new_token=4
teacher_score_dir=/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5seq_aq_encoder_seq2seq_1/
teacher_score_dir=./$experiment_dir/t5seq_aq_encoder_seq2seq_1/
teacher_score_path=$teacher_score_dir/sub_smtid_train_decay"$decay"/qid_smtids_scores_"$max_new_token".train.json
run_name=t5seq_aq_encoder_seq2seq_1_lng_knp_mnt_"$max_new_token"_dcy_"$decay"

@@ -44,8 +44,6 @@ python -m torch.distributed.launch --nproc_per_node=8 -m t5_pretrainer.main \
--pretrained_path=$pretrained_path \
--smtid_as_docid

exit 1

for max_new_token in 8 16 32
do
if [ $max_new_token -eq 8 ]; then
@@ -63,26 +61,26 @@ do
fi
echo $teacher_score_path

data_root_dir=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full
data_root_dir=./data/msmarco-full
collection_path=$data_root_dir/full_collection/
queries_path=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/all_train_queries/train_queries
queries_path=./data/msmarco-full/all_train_queries/train_queries

data_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5_docid_gen_encoder_1"
data_dir="./$experiment_dir/t5_docid_gen_encoder_1"
docid_to_smtid_path=$data_dir/aq_smtid/docid_to_smtid.json
output_dir="/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/"
output_dir="./$experiment_dir/"

# need to change for every experiment
model_dir=/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5seq_aq_encoder_seq2seq_1_lng_knp_mnt_"$prev_token"_dcy_2
model_dir=./$experiment_dir/t5seq_aq_encoder_seq2seq_1_lng_knp_mnt_"$prev_token"_dcy_2
pretrained_path=$model_dir/checkpoint/

# also need to be changed by condition
decay=2
teacher_score_dir=/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/$experiment_dir/t5seq_aq_encoder_seq2seq_1/
teacher_score_dir=./$experiment_dir/t5seq_aq_encoder_seq2seq_1/
teacher_score_path=$teacher_score_dir/lng_knp_sub_smtid_train_decay"$decay"/lng_knp_qid_smtids_scores_"$max_new_token".train.json
run_name=t5seq_aq_encoder_seq2seq_1_lng_knp_mnt_"$max_new_token"_dcy_"$decay"

python -m torch.distributed.launch --nproc_per_node=7 -m t5_pretrainer.main \
--epochs=100 \
--epochs=120 \
--run_name=$run_name \
--learning_rate=1e-4 \
--loss_type=t5seq_aq_encoder_lng_knp_margin_mse \
32 changes: 32 additions & 0 deletions full_scripts/full_train_t5seq_encoder_0.sh
@@ -0,0 +1,32 @@
#!/bin/bash

data_root_dir=./data/msmarco-full
collection_path=$data_root_dir/full_collection/
queries_path=./data/msmarco-full/all_train_queries/train_queries

# model dir
experiment_dir=experiments-full-t5seq-aq
pretrained_path=t5-base

# train_examples path
teacher_score_path=./data/msmarco-full/bm25_run/qrel_added_qid_docids_teacher_scores.train.json
run_name=t5_docid_gen_encoder_0
output_dir="./$experiment_dir/"

python -m torch.distributed.launch --nproc_per_node=8 -m t5_pretrainer.main \
--epochs=50 \
--run_name=$run_name \
--learning_rate=1e-4 \
--loss_type=t5seq_pretrain_margin_mse \
--model_name_or_path=t5-base \
--model_type=t5_docid_gen_encoder \
--teacher_score_path=$teacher_score_path \
--output_dir=$output_dir \
--task_names='["rank"]' \
--wandb_project_name=full_t5seq_encoder \
--use_fp16 \
--collection_path=$collection_path \
--max_length=128 \
--per_device_train_batch_size=64 \
--queries_path=$queries_path \
--pretrained_path=$pretrained_path
full_scripts/full_train_t5seq_encoder_1.sh
@@ -1,18 +1,18 @@
#!/bin/bash

data_root_dir=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full
data_root_dir=./data/msmarco-full
collection_path=$data_root_dir/full_collection/
queries_path=/home/ec2-user/quic-efs/user/hansizeng/work/data/msmarco-full/all_train_queries/train_queries
queries_path=./data/msmarco-full/all_train_queries/train_queries

# model dir
experiment_dir=experiments-full-t5seq-aq
model_dir="/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/$experiment_dir/t5_docid_gen_encoder_0"
model_dir="./$experiment_dir/t5_docid_gen_encoder_0"
pretrained_path=$model_dir/checkpoint/

# train_examples path
teacher_score_path=$model_dir/all_train/MSMARCO_TRAIN/qrel_added_qid_docids_teacher_scores.train.json
run_name=t5_docid_gen_encoder_1
output_dir="/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/$experiment_dir/"
output_dir="./$experiment_dir/"

python -m torch.distributed.launch --nproc_per_node=8 -m t5_pretrainer.main \
--epochs=50 \
Expand Down