This is the reproduce code for the paper "Balanced Topic Aware Sampling for Effective Dense Retriever: A Reproducibility Study" by Shuai Wang and Guido Zuccon.
In our project, there are two datasets that's been used: MS_MARCO_passage_ranking and Amazon_shopping_queries
- Download the dataset from MS_MARCO_passage_ranking and put it in the folder
dataset/msmarco/
. (You need: collection.tsv, all queries.tsv, all qrels.tsv, train_triples_qid_pid) - Download the pre-computed training scores from ensemble scores and put it in the folder
dataset/msmarco/
.
- Set-up pyserini
- Follow the instructions on get bm25 score of MS marco passage ranking in pyserini
- Put the output of top-100 BM25 results in the folder
dataset/msmarco/
and name it asbm25-top-100-49k.tsv
. - Put the output of top-1000 BM25 results in the folder
dataset/msmarco/
and name it asbm25-top-1000-49k.tsv
To train baseline DR model, do the following:
cd matchmaker
python3 matchmaker/train.py \
--config-file config/train/defaults.yaml config/train/data/example-minial-dataset.yaml config/train/models/bert_dot.yaml \
--run-name msmarco-baseline-model
To inference the baseline DR model for msmarco_dev 49k, do the following:
python3 matchmaker/dense_retrieval.py encode+index+search \
--run-name baseline_dr_49k \
--config config/dense_retrieval/base_setting.yaml config/dense_retrieval/dataset/msmarco_dev_49k.yaml config/dense_retrieval/model/base.yaml
Remember to change model path in config/dense_retrieval/model/base.yaml to the trained model from above step.
Then generate evaluation using the following:
cd ..
python3 pre_processing/all_result_to_trec_n_evaluate.py \
--input_folder TAS-query-original/baseline_dr_49k #Please Specify the inferenced folder adress from the last step \
--qrel dataset/msmarco/qrel.dev.tsv \
--trec_eval trec_eval/trec_eval #Please Do specify the location of your trec_eval in your system
To seperate training file, run the following
cd matchmaker
python3 generate_smart_earlystopping_retrieval.py
--output-file ../dataset/msmarco/sampled_queries.tsv \
--candidate-metric ../TAS-query-original/baseline_dr_49k/ndcg.trec \
--candidate-file ../dataset/msmarco/bm25-top-100-49k.tsv\
--qrel ../dataset/msmarco/qrels.dev.small.tsv \
--collection-file ../dataset/msmarco/collection.tsv \
--query-file ../dataset/msmarco/queries.dev.small.tsv
Run the following code for query clustering preprocessing
python3 matchmaker/distillation/query_clusterer.py
--run-name msmarco_query_clustered \
--config-file config/train/defaults.yaml config/train/cluster_model.yaml
The expected outcome includes two files: cluster-assignment-ids.tsv and cluster-assignment-text.tsv.
For model_choice: please see all available yaml files in config/train/modes/exp/
cd matchmaker
python3 matchmaker/train.py \
--config-file config/train/defaults.yaml config/train/data/msmarco.yaml config/train/models/bert_dot.yaml config/train/modes/exp/{model_choice}.yaml \
--run-name msmarco-${model_choice}
Note: please specify the path of trained models in config/dense_retrieval/model/example.yaml On the other hand, creating your own yaml file is also possible. Please refer to the example.yaml for the format.
python3 matchmaker/dense_retrieval.py encode+index+search --run-name {cmodel_choice} \
--config config/dense_rertrieval/base_setting.yaml config/dense_rertrieval/dataset/msmarco_dev.yaml config/dense_retrieval/model/example.yaml
Note, you can also change the dataset by changing the msmarco_dev.yaml to trec_dl_2019.yaml or trec_dl_2020.yaml
Download data from Amazon_shopping_queries
git clone https://github.com/amazon-science/esci-data.git
mv esci-data/shopping_queries_dataset/* dataset/amazon/
First pre-process the data by running the following code:
python3 pre_processing/amazon_data/get_amazon_task1.py --folder dataset/amazon/
Then for documents, further process using the following code
python3 pre_processing/amazon_data/amazon_collection_convert_to_json.py \
--input dataset/amazon/collection_amazon.tsv \
--output dataset/amazon/collection_amazon.json
Run the following code for query clustering preprocessing
python3 matchmaker/distillation/query_clusterer.py
--run-name amazon_query_clustered \
--config-file config/train/defaults.yaml config/train/cluster_model_amazon.yaml
Start to train the baseline DR model by running the following code:
cd matchmaker
python3 matchmaker/train.py \
--config-file config/train/defaults.yaml config/train/data/amazon_data.yaml config/train/models/bert_dot.yaml \
--run-name amazon_baseline_model
python3 matchmaker/dense_retrieval.py encode+index+search \
--run-name baseline_dr_amazon \
--config config/dense_retrieval/base_setting.yaml config/dense_retrieval/dataset/amazon_train.yaml config/dense_retrieval/model/base.yaml
To get teacher emsembled scores, first, you need to get scores of training queries using BERT, ALBERT, and BERT-LARGE
First, preprocess the data by running the following code:
python3 pre_processing/amazon_data/create_qid_pid_triples.py --folder dataset/amazon/
python3 pre_processing/amazon_data/generate_scoring_file.py
Then, run the following code to get scores of training queries using BERT, ALBERT, and BERT-LARGE
python3 pre_processing/amazon_data/run_teacher_scores.py
python3 fusing_teacher_scorings.py
For model_choice: please see all available yaml files in config/train/modes/exp/
cd matchmaker
python3 matchmaker/train.py \
--config-file config/train/defaults.yaml config/train/data/amazon_data.yaml config/train/models/bert_dot.yaml config/train/modes/exp/{model_choice}.yaml \
--run-name amazon-${model_choice}
Note: please specify the path of trained models in config/dense_retrieval/model/example.yaml On the other hand, creating your own yaml file is also possible. Please refer to the example.yaml for the format.
python3 matchmaker/dense_retrieval.py encode+index+search --run-name {model_choice} \
--config config/dense_rertrieval/base_setting.yaml config/dense_rertrieval/dataset/amazon_data.yaml config/dense_retrieval/model/example.yaml
This guide builds up on: dense_retrieval_train.md
We again use a BERT_DOT with shared encoder: bert_dot
For the dataset config, see the config/train/data/example-minimal-tasb-dataset.yaml.
The config for TAS-Balanced input files:
dynamic_query_file: "/path/to/train/queries.train.tsv"
dynamic_collection_file: "/path/to/train/collection.tsv"
dynamic_pairs_with_teacher_scores: "/path/to/train/T2-train-ids.tsv" # output of matchmaker/distillation/teacher_textscore_to_ids.py (from single model pairwise scores or the ensemble)
dynamic_query_cluster_file: "/path/to/train/train_clusters.tsv" # generate it with matchmaker/distillation/query_clusterer.py (and a baseline dense retrieval model)
Example train command (cd in ir-project-matchmaker):
python matchmaker/train.py --config-file config/train/defaults.yaml config/train/data/<your dataset here>.yaml config/train/models/bert_dot.yaml config/train/modes/tas_balanced.yaml --run-name your_experiment_name
The dynamic_teacher will run on the same GPU as the main training (if you only have 1 GPU available) if you have more than 1 GPU available, the dynamic teacher will exclusively take the last GPU and the training the rest.
If you want to know more about the needed workflow to set up teacher scores & models: distillation_workflow_dual-supervision.md
If you want to know more about the evaluation options see: dense_retrieval_evaluate.md