
Commit

a
wshuai190 committed May 7, 2023
1 parent f9c80a3 commit 0e47b5c
Showing 2 changed files with 57 additions and 2 deletions.
38 changes: 36 additions & 2 deletions README.md
@@ -82,24 +82,58 @@ python3 matchmaker/train.py \
### To run inference

Note: specify the path of the trained model in config/dense_retrieval/model/example.yaml.
Alternatively, you can create your own yaml file; refer to example.yaml for the format.
````
python3 matchmaker/dense_retrieval.py encode+index+search --run-name {model_choice} \
--config config/dense_retrieval/base_setting.yaml config/dense_retrieval/dataset/msmarco_dev.yaml config/dense_retrieval/model/example.yaml
````


## To reproduce results for Amazon_shopping_queries

### Data preparation

Download the data from [Amazon_shopping_queries](https://github.com/amazon-science/esci-data):

````
git clone https://github.com/amazon-science/esci-data.git
mv esci-data/shopping_queries_dataset/ dataset/amazon/
````

You can prepare the product collection file by running the command below:

````
python3 pre_processing/amazon_collection_convert_to_json.py \
--input dataset/amazon/collection_amazon.tsv \
--output dataset/amazon/collection_amazon.json
````
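
Each line of the resulting file is a standalone JSON object with "id" and "contents" fields, which matches the JSON collection format Pyserini expects. As a quick sanity check, a small snippet (assuming the output path from the command above):

````
import json

# Inspect the first converted document and confirm it has the expected fields.
with open("dataset/amazon/collection_amazon.json") as f:
    first_doc = json.loads(next(f))

print(sorted(first_doc))  # expected: ['contents', 'id']
````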

Then, prepare the queries:
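
A minimal sketch of one possible conversion is shown below; it is an assumption rather than a script shipped with this repository. It reads shopping_queries_dataset_examples.parquet (which in the ESCI release contains query_id and query columns) and writes a hypothetical tab-separated queries_amazon.tsv; adjust the paths and the output format to whatever your training configuration expects.

````
# Hypothetical helper (not provided in the repository): build a query file
# from the ESCI examples parquet. Requires pandas and pyarrow.
import pandas as pd

# Path assumes the shopping_queries_dataset files were moved under dataset/amazon/.
examples = pd.read_parquet("dataset/amazon/shopping_queries_dataset_examples.parquet")

# Keep one row per query and write "query_id<TAB>query text" lines.
queries = examples[["query_id", "query"]].drop_duplicates(subset="query_id")
queries.to_csv("dataset/amazon/queries_amazon.tsv", sep="\t", header=False, index=False)
````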



### Train the baseline DR model


### Run inference with the baseline DR model


### Get teacher ensemble scores


### Train the distilled model


### Run inference with the distilled model








# Train a Dense Retriever (BERT_DOT) with TAS-Balanced & Dual-Supervision
Below are the instructions from the original paper, which you may find helpful.

This guide builds on: [dense_retrieval_train.md](dense_retrieval_train.md)

21 changes: 21 additions & 0 deletions pre_processing/amazon_collection_convert_to_json.py
@@ -0,0 +1,21 @@
import argparse
import json

from tqdm import tqdm

# Convert a TSV collection ("<product_id>\t<text>" per line) into a JSONL file
# using the {"id": ..., "contents": ...} document format expected by Pyserini.
parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, help='input collection location', default="collection_amazon.tsv")
parser.add_argument('--output', type=str, help='output collection location', default="../../pyserini/collections/amazon/collection_amazon.jsonl")
args = parser.parse_args()

with open(args.input) as f_in, open(args.output, 'w') as f_out:
    for line in tqdm(f_in):
        # Split on the first tab only so tabs inside the document text are preserved.
        doc_id, content = line.rstrip('\n').split('\t', 1)
        f_out.write(json.dumps({"id": doc_id, "contents": content}) + '\n')
