README writing
EC2 Default User committed Nov 15, 2023
1 parent 7137315 commit 5052d08
Showing 5 changed files with 57 additions and 14 deletions.
34 changes: 26 additions & 8 deletions READEME.md → README.md
@@ -7,16 +7,16 @@ This repo provides the source code and checkpoints for our paper [Scalable and E
- conda install -c conda-forge faiss-gpu
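If you are setting up the environment from scratch, a minimal sketch might look like the following (the environment name and Python version are placeholders, not taken from the repo; install the remaining listed dependencies the same way):
```
# Illustrative setup only; "ripor" and the Python version are placeholders.
conda create -n ripor python=3.9 -y
conda activate ripor
conda install -c conda-forge faiss-gpu -y
```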

## Inference
We use 4 A100 GPUs to run the model. It takes roughly 20 minutes for preprocessing and 90 minutes for the full evaluation. You can use other types of GPUs, such as V100, but it might take longer.
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
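If your machine has a different GPU configuration, one option (assuming the evaluation script simply uses whatever GPUs are visible) is to restrict the visible devices before launching, for example:
```
# Illustrative only: pin the run to specific GPUs (assumes the script uses all visible GPUs).
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```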

The results you obtain should be the same as reported in our paper.
## Training
Our framework contains multiple training phases (see Figure 2 in the paper for details). You can either train sequentially from the first phase, or use the checkpoint we provide for each phase directly for the subsequent phases.

### Phase 1: Relevance-Based DocID initialization ( $M^0$ )
You will start from `t5-base` and obtain the model $M^0$ after this phase. This phase treats the T5 model as a dense encoder, and we use a two-stage training strategy to train it. In the first stage, we use the BM25 negatives. Run the following script to train the model:
```
bash full_scripts/full_train_t5seq_encoder_0.sh
```
@@ -62,10 +62,28 @@ Download all files from `experiments-full-t5seq-aq/t5_docid_gen_encoder_1` and `
bash full_scripts/full_lng_knp_train_pipline.sh
```
#### If you do not skip phase 1 and phase 2
You are a hard-working person who trains all the models yourself, and you are only one step away from success! But be patient, it might take some time. Since we build the DocIDs ourselves, we need to generate our own training data. Follow the procedures below for data generation.
- Apply constrained beam search on $M^2$ to generate training data for different prefix lengths:
Change the `task` variable in line 3 of `full_scripts/full_evaluate_t5seq_aq_encoder.sh` to `task=t5seq_aq_get_qid_to_smtid_rankdata`, then run the script:
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
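If you prefer not to edit the file by hand, a one-liner along these lines should do the same thing (a sketch; it assumes line 3 of the script is the `task=...` assignment, as noted above):
```
# Sketch: replace the task assignment on line 3, then run the script.
sed -i '3s/^task=.*/task=t5seq_aq_get_qid_to_smtid_rankdata/' full_scripts/full_evaluate_t5seq_aq_encoder.sh
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```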
- Note that in our paper (Sec 3.3.3) we call this training data $\mathcal{D}^B$.
- We combine $\mathcal{D}^B$ with the training data $\mathcal{D}^R$ obtained from the dense encoder $M^0$. To give $\mathcal{D}^R$ the same format as $\mathcal{D}^B$, we run the following scripts:
```
python t5_pretrainer/aq_preprocess/get_qid_smtid_docids_from_teacher_rerank_data.py
```
```
bash full_scripts/rerank_qid_smtid_docids_0.sh
```
- We combine $\mathcal{D}^B$ and $\mathcal{D}^R$ and create the training examples for this phase with the following scripts:
```
python t5_pretrainer/aq_preprocess/get_qid_smtids_scores_jsonl_examples.py
```
```
python t5_pretrainer/aq_preprocess/fully_create_lng_knp_examples_from_original_examples.py
```
Awesome! You now have all the files needed for training. Run the training script:
```
bash full_scripts/full_lng_knp_train_pipline.sh
```
25 changes: 25 additions & 0 deletions full_scripts/rerank_qid_smtid_docids_0.sh
@@ -0,0 +1,25 @@
#!/bin/bash
train_queries_path="./data/msmarco-full/all_train_queries/train_queries/raw.tsv"
collection_path=./data/msmarco-full/full_collection

# need to change every time
for max_new_token in 4 8 16 32
do
data_dir=./experiments-full-lexical-ripor/t5_docid_gen_encoder_1/

qid_smtid_docids_path=$data_dir/sub_smtid_"$max_new_token"_out/qid_smtid_docids.train.json

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 -m t5_pretrainer.rerank \
--train_queries_path=$train_queries_path \
--collection_path=$collection_path \
--model_name_or_path=cross-encoder/ms-marco-MiniLM-L-6-v2 \
--max_length=256 \
--batch_size=256 \
--qid_smtid_docids_path=$qid_smtid_docids_path \
--task=cross_encoder_rerank_for_qid_smtid_docids

python -m t5_pretrainer.rerank \
--out_dir=$data_dir/sub_smtid_"$max_new_token"_out \
--task=cross_encoder_rerank_for_qid_smtid_docids_2
done
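A note on adapting this script: if you have fewer than 8 GPUs, the same rerank step should work by shrinking the visible devices and the process count together, roughly as follows (a sketch; it assumes the per-GPU batch size still fits in memory):
```
# Sketch: the same distributed rerank step on 4 GPUs instead of 8.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 -m t5_pretrainer.rerank \
    --train_queries_path=$train_queries_path \
    --collection_path=$collection_path \
    --model_name_or_path=cross-encoder/ms-marco-MiniLM-L-6-v2 \
    --max_length=256 \
    --batch_size=256 \
    --qid_smtid_docids_path=$qid_smtid_docids_path \
    --task=cross_encoder_rerank_for_qid_smtid_docids
```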
@@ -14,7 +14,7 @@
for smtid, factor in smitd_to_factor.items():
    print("smtid: ", smtid, "factor: ", factor)

root_dir = "/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/experiments-full-t5seq-aq/t5seq_aq_encoder_seq2seq_1/"
root_dir = "./experiments-full-t5seq-aq/t5seq_aq_encoder_seq2seq_1/"
source_example_path = os.path.join(root_dir, f"sub_smtid_train_decay2/qid_smtids_scores_{max_new_token}.train.json")
out_dir = os.path.join(root_dir, "lng_knp_sub_smtid_train_decay2")

@@ -3,9 +3,9 @@
import numpy as np

for max_new_token in [4, 8, 16, 32]:
    docid_to_smtid_path = "/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/t5_docid_gen_encoder_1/aq_smtid/docid_to_smtid.json"
    teacher_score_path = "/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/t5_docid_gen_encoder_1/out/MSMARCO_TRAIN/qrel_added_qid_docids_teacher_scores.train.json"
    out_dir=f"/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/t5_docid_gen_encoder_1/sub_smtid_{max_new_token}_out/"
    docid_to_smtid_path = "./t5_docid_gen_encoder_1/aq_smtid/docid_to_smtid.json"
    teacher_score_path = "./t5_docid_gen_encoder_1/out/MSMARCO_TRAIN/qrel_added_qid_docids_teacher_scores.train.json"
    out_dir=f"./t5_docid_gen_encoder_1/sub_smtid_{max_new_token}_out/"

    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
@@ -3,7 +3,7 @@
import numpy as np

# max_new_token = 8
for max_new_token in [16, 32]:
for max_new_token in [4, 8, 16, 32]:
    decay = 2
    keep_top100 = True
    decay_to_factor = {
@@ -15,7 +15,7 @@
    factor = decay_to_factor[decay][max_new_token]
    print("factor: ", factor, "max_new_token: ", max_new_token, "keep_top100: ", keep_top100)

    root_dir = "/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/"
    root_dir = "./experiments-full-t5seq-aq/"
    original_qid_smtid_rerank_path = os.path.join(root_dir, f"t5_docid_gen_encoder_1/sub_smtid_{max_new_token}_out/qid_smtid_docids_teacher_score.train.json")
    self_qid_smtid_rerank_path = os.path.join(root_dir, f"t5seq_aq_encoder_seq2seq_1/sub_smtid_{max_new_token}_out/qid_smtid_docids_teacher_score.train.json")
    out_dir = os.path.join(root_dir, f"t5seq_aq_encoder_seq2seq_1/sub_smtid_train_decay{decay}/")
