castorini · ronakice · May 18, 2021
diff --git a/docs/experiments-monot5-tpu.md b/docs/experiments-monot5-tpu.md
@@ -42,7 +42,7 @@ cd ../../
 
 Then convert the train triples file to the monoT5 input format:
 ```
-python pygaggle/data/create_msmarco_t5_training_pairs --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
+python pygaggle/data/create_msmarco_t5_training_pairs.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
 ```
 
 Next, copy the monoT5 input file to Google Storage. TPU training will read data directly from `gs`.
@@ -59,8 +59,8 @@ We download the query, qrels, run and corpus files corresponding to the MS MARCO
 The run file is generated by following the Anserini's [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).
 
 In short, the files are:
-- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
-- `qrels.dev.small.tsv`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
+- `topics.msmarco-passage.dev-subset.txt`: 6,980 queries from the MS MARCO dev set.
+- `qrels.msmarco-passage.dev-subset.txt`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
 - `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and retrieved passages using Anserini's BM25.
 - `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
 
@@ -70,8 +70,8 @@ Let's start.
 ```
 cd ${DATA_DIR}
 wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv
-wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv
-wget https://www.dropbox.com/s/5t6e2225rt6ikym/qrels.dev.small.tsv
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt
 wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz
 tar -xvf collection.tar.gz
 rm collection.tar.gz
@@ -81,7 +81,7 @@ cd ../../
 
 As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
 ```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.dev.small.tsv
 ```
 
 The output should be:
@@ -94,7 +94,7 @@ QueriesRanked: 6980
 
 Then, we prepare the query-doc pairs in the monoT5 input format.
 ```
-python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/queries.dev.small.tsv \
+python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/topics.msmarco-passage.dev-subset.txt \
                                       --run ${DATA_DIR}/run.dev.small.tsv \
                                       --corpus ${DATA_DIR}/collection.tsv \
                                       --t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \
@@ -225,7 +225,7 @@ python pygaggle/data/convert_monot5_output_to_msmarco_run.py --t5_output ${DATA_
 
 Now we can evaluate the reranked results using the official MS MARCO evaluation script.
 ```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
 ```
 
 In the case of monoT5-3B, the output should be: