README writing
EC2 Default User committed Nov 15, 2023
1 parent 7137315 commit 5052d08
Showing 5 changed files with 57 additions and 14 deletions.
34 changes: 26 additions & 8 deletions READEME.md → README.md
@@ -7,16 +7,16 @@ This repo provides the source code and checkpoints for our paper [Scalable and E
- conda install -c conda-forge faiss-gpu
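If you are setting up the environment from scratch, a minimal sketch might look like the following (the environment name and Python version are placeholders, not taken from the repo; install the remaining listed dependencies the same way):
```
# Illustrative setup only; "ripor" and the Python version are placeholders.
conda create -n ripor python=3.9 -y
conda activate ripor
conda install -c conda-forge faiss-gpu -y
```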

## Inference
We use 4 A100 GPUs to run the model. It takes roughly 20 minutes for preprocessing and 90 minutes for the full evaluation. You can use other types of GPUs, such as V100, but it might take longer.
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
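If your machine has a different GPU configuration, one option (assuming the evaluation script simply uses whatever GPUs are visible) is to restrict the visible devices before launching, for example:
```
# Illustrative only: pin the run to specific GPUs (assumes the script uses all visible GPUs).
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```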

The results you obtain should be the same as reported in our paper.
## Training
Our framework contains multiple training phases (see Figure 2 in the paper for details). You can either train sequentially from the first phase, or use the checkpoint we provide for each phase directly for the subsequent phases.

### Phase 1: Relevance-Based DocID initialization ( $M^0$ )
You will start from `t5-base` and obtain the model $M^0$ after this phase. This phase treats the T5 model as a dense encoder, and we use a two-stage training strategy to train it. In the first stage, we use the BM25 negatives. Run the following script to train the model:
```
bash full_scripts/full_train_t5seq_encoder_0.sh
```
@@ -62,10 +62,28 @@ Download all files from `experiments-full-t5seq-aq/t5_docid_gen_encoder_1` and `
bash full_scripts/full_lng_knp_train_pipline.sh
```
#### If you do not skip phase 1 and phase 2
You are a hard-working person who trains all the models yourself, and you are only one step away from success! But be patient, it might take some time. Since we build the DocIDs ourselves, we need to generate our own training data. Follow the procedures below for data generation.
- Apply constrained beam search on $M^2$ to generate training data for different prefix lengths:
Change the `task` variable in line 3 of `full_scripts/full_evaluate_t5seq_aq_encoder.sh` to `task=t5seq_aq_get_qid_to_smtid_rankdata`, then run the script:
```
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```
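If you prefer not to edit the file by hand, a one-liner along these lines should do the same thing (a sketch; it assumes line 3 of the script is the `task=...` assignment, as noted above):
```
# Sketch: replace the task assignment on line 3, then run the script.
sed -i '3s/^task=.*/task=t5seq_aq_get_qid_to_smtid_rankdata/' full_scripts/full_evaluate_t5seq_aq_encoder.sh
bash full_scripts/full_evaluate_t5seq_aq_encoder.sh
```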
- Note that in our paper (Sec 3.3.3) we call this training data $\mathcal{D}^B$.
- We combine $\mathcal{D}^B$ with the training data $\mathcal{D}^R$ obtained from the dense encoder $M^0$. To give $\mathcal{D}^R$ the same format as $\mathcal{D}^B$, we run the following scripts:
```
python t5_pretrainer/aq_preprocess/get_qid_smtid_docids_from_teacher_rerank_data.py
```
```
bash full_scripts/rerank_qid_smtid_docids_0.sh
```
- We combine $\mathcal{D}^B$ and $\mathcal{D}^R$ and create the training examples for this phase with the following scripts:
```
python t5_pretrainer/aq_preprocess/get_qid_smtids_scores_jsonl_examples.py
```
```
python t5_pretrainer/aq_preprocess/fully_create_lng_knp_examples_from_original_examples.py
```
Awesome! You now have all the files needed for training. Run the training script:
```
bash full_scripts/full_lng_knp_train_pipline.sh
```
25 changes: 25 additions & 0 deletions full_scripts/rerank_qid_smtid_docids_0.sh
@@ -0,0 +1,25 @@
#!/bin/bash
train_queries_path="./data/msmarco-full/all_train_queries/train_queries/raw.tsv"
collection_path=./data/msmarco-full/full_collection

# need to change every time
for max_new_token in 4 8 16 32
do
data_dir=./experiments-full-lexical-ripor/t5_docid_gen_encoder_1/

qid_smtid_docids_path=$data_dir/sub_smtid_"$max_new_token"_out/qid_smtid_docids.train.json

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m torch.distributed.launch --nproc_per_node=8 -m t5_pretrainer.rerank \
--train_queries_path=$train_queries_path \
--collection_path=$collection_path \
--model_name_or_path=cross-encoder/ms-marco-MiniLM-L-6-v2 \
--max_length=256 \
--batch_size=256 \
--qid_smtid_docids_path=$qid_smtid_docids_path \
--task=cross_encoder_rerank_for_qid_smtid_docids

python -m t5_pretrainer.rerank \
--out_dir=$data_dir/sub_smtid_"$max_new_token"_out \
--task=cross_encoder_rerank_for_qid_smtid_docids_2
done
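A note on adapting this script: if you have fewer than 8 GPUs, the same rerank step should work by shrinking the visible devices and the process count together, roughly as follows (a sketch; it assumes the per-GPU batch size still fits in memory):
```
# Sketch: the same distributed rerank step on 4 GPUs instead of 8.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 -m t5_pretrainer.rerank \
    --train_queries_path=$train_queries_path \
    --collection_path=$collection_path \
    --model_name_or_path=cross-encoder/ms-marco-MiniLM-L-6-v2 \
    --max_length=256 \
    --batch_size=256 \
    --qid_smtid_docids_path=$qid_smtid_docids_path \
    --task=cross_encoder_rerank_for_qid_smtid_docids
```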
@@ -14,7 +14,7 @@
for smtid, factor in smitd_to_factor.items():
    print("smtid: ", smtid, "factor: ", factor)

root_dir = "/home/ec2-user/quic-efs/user/hansizeng/work/RIPOR/experiments-full-t5seq-aq/t5seq_aq_encoder_seq2seq_1/"
root_dir = "./experiments-full-t5seq-aq/t5seq_aq_encoder_seq2seq_1/"
source_example_path = os.path.join(root_dir, f"sub_smtid_train_decay2/qid_smtids_scores_{max_new_token}.train.json")
out_dir = os.path.join(root_dir, "lng_knp_sub_smtid_train_decay2")

@@ -3,9 +3,9 @@
import numpy as np

for max_new_token in [4, 8, 16, 32]:
    docid_to_smtid_path = "/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/t5_docid_gen_encoder_1/aq_smtid/docid_to_smtid.json"
    teacher_score_path = "/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/t5_docid_gen_encoder_1/out/MSMARCO_TRAIN/qrel_added_qid_docids_teacher_scores.train.json"
    out_dir=f"/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/t5_docid_gen_encoder_1/sub_smtid_{max_new_token}_out/"
    docid_to_smtid_path = "./t5_docid_gen_encoder_1/aq_smtid/docid_to_smtid.json"
    teacher_score_path = "./t5_docid_gen_encoder_1/out/MSMARCO_TRAIN/qrel_added_qid_docids_teacher_scores.train.json"
    out_dir=f"./t5_docid_gen_encoder_1/sub_smtid_{max_new_token}_out/"

    if not os.path.exists(out_dir):
        os.mkdir(out_dir)
@@ -3,7 +3,7 @@
import numpy as np

# max_new_token = 8
for max_new_token in [16, 32]:
for max_new_token in [4, 8, 16, 32]:
    decay = 2
    keep_top100 = True
    decay_to_factor = {
@@ -15,7 +15,7 @@
    factor = decay_to_factor[decay][max_new_token]
    print("factor: ", factor, "max_new_token: ", max_new_token, "keep_top100: ", keep_top100)

    root_dir = "/home/ec2-user/quic-efs/user/hansizeng/work/t5_pretrainer/t5_pretrainer/experiments-full-4-4096-28-256-t5seq-aq/"
    root_dir = "./experiments-full-t5seq-aq/"
    original_qid_smtid_rerank_path = os.path.join(root_dir, f"t5_docid_gen_encoder_1/sub_smtid_{max_new_token}_out/qid_smtid_docids_teacher_score.train.json")
    self_qid_smtid_rerank_path = os.path.join(root_dir, f"t5seq_aq_encoder_seq2seq_1/sub_smtid_{max_new_token}_out/qid_smtid_docids_teacher_score.train.json")
    out_dir = os.path.join(root_dir, f"t5seq_aq_encoder_seq2seq_1/sub_smtid_train_decay{decay}/")
