
Commit

a
wshuai190 committed May 7, 2023
1 parent f9c80a3 commit 0e47b5c
Showing 2 changed files with 57 additions and 2 deletions.
38 changes: 36 additions & 2 deletions README.md
@@ -82,24 +82,58 @@ python3 matchmaker/train.py \
### To run inference

Note: specify the path of the trained model in config/dense_retrieval/model/example.yaml.
Alternatively, you can create your own yaml file; refer to example.yaml for the format.
````
python3 matchmaker/dense_retrieval.py encode+index+search --run-name {model_choice} \
--config config/dense_retrieval/base_setting.yaml config/dense_retrieval/dataset/msmarco_dev.yaml config/dense_retrieval/model/example.yaml
````


## To reproduce results for Amazon_shopping_queries

### Data preparation

Download the data from [Amazon_shopping_queries](https://github.com/amazon-science/esci-data):

````
git clone https://github.com/amazon-science/esci-data.git
mv esci-data/shopping_queries_dataset/ dataset/amazon/
````

You can prepare the product collection file by running the command below:

````
python3 pre_processing/amazon_collection_convert_to_json.py \
--input dataset/amazon/collection_amazon.tsv \
--output dataset/amazon/collection_amazon.json
````
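
Each line of the resulting file is a standalone JSON object with "id" and "contents" fields, which matches the JSON collection format Pyserini expects. As a quick sanity check, a small snippet (assuming the output path from the command above):

````
import json

# Inspect the first converted document and confirm it has the expected fields.
with open("dataset/amazon/collection_amazon.json") as f:
    first_doc = json.loads(next(f))

print(sorted(first_doc))  # expected: ['contents', 'id']
````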

Then, prepare the queries:
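
A minimal sketch of one possible conversion is shown below; it is an assumption rather than a script shipped with this repository. It reads shopping_queries_dataset_examples.parquet (which in the ESCI release contains query_id and query columns) and writes a hypothetical tab-separated queries_amazon.tsv; adjust the paths and the output format to whatever your training configuration expects.

````
# Hypothetical helper (not provided in the repository): build a query file
# from the ESCI examples parquet. Requires pandas and pyarrow.
import pandas as pd

# Path assumes the shopping_queries_dataset files were moved under dataset/amazon/.
examples = pd.read_parquet("dataset/amazon/shopping_queries_dataset_examples.parquet")

# Keep one row per query and write "query_id<TAB>query text" lines.
queries = examples[["query_id", "query"]].drop_duplicates(subset="query_id")
queries.to_csv("dataset/amazon/queries_amazon.tsv", sep="\t", header=False, index=False)
````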



### Train the baseline DR model


### Run inference with the baseline DR model


### Get teacher ensemble scores


### Train the distilled model


### Run inference with the distilled model








# Train a Dense Retriever (BERT_DOT) with TAS-Balanced & Dual-Supervision
Below are the instructions from the original paper, which you may find helpful.

This guide builds on: [dense_retrieval_train.md](dense_retrieval_train.md)

21 changes: 21 additions & 0 deletions pre_processing/amazon_collection_convert_to_json.py
@@ -0,0 +1,21 @@
import argparse
import json

from tqdm import tqdm

# Convert a TSV collection ("<product_id>\t<text>" per line) into a JSONL file
# using the {"id": ..., "contents": ...} document format expected by Pyserini.
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, help='input collection location', default="collection_amazon.tsv")
parser.add_argument('--output', type=str, help='output collection location', default="../../pyserini/collections/amazon/collection_amazon.jsonl")
args = parser.parse_args()

with open(args.input) as f_in, open(args.output, 'w') as f_out:
    for line in tqdm(f_in):
        # Split on the first tab only so tabs inside the document text are preserved.
        doc_id, content = line.rstrip('\n').split('\t', 1)
        f_out.write(json.dumps({"id": doc_id, "contents": content}) + '\n')
