Update experiments-monot5-tpu.md (#191)
larryli1999 authored May 18, 2021
1 parent 169e971 commit d50c6b0
16 changes: 8 additions & 8 deletions docs/experiments-monot5-tpu.md
@@ -42,7 +42,7 @@ cd ../../

Then convert the train triples file to the monoT5 input format:
```
-python pygaggle/data/create_msmarco_t5_training_pairs --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
+python pygaggle/data/create_msmarco_t5_training_pairs.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
```
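The conversion step above can be sketched in a few lines. This is a hedged, hypothetical rendering of what the script presumably emits, not the actual pygaggle implementation: the assumption is that each (query, positive passage, negative passage) triple yields two T5 text-to-text examples in the "Query: ... Document: ... Relevant:" template, with targets `true` and `false`.

```python
# Hypothetical sketch of the triples -> monoT5 training-pair conversion.
# Template and tab-separated input/target layout are assumptions.
def triple_to_t5_pairs(query, positive, negative):
    return [
        f"Query: {query} Document: {positive} Relevant:\ttrue",
        f"Query: {query} Document: {negative} Relevant:\tfalse",
    ]

# Toy triple for illustration; not real MS MARCO data.
pairs = triple_to_t5_pairs(
    "what is a cat",
    "A cat is a small domesticated mammal.",
    "Stock markets closed higher on Friday.",
)
```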

Next, copy the monoT5 input file to Google Cloud Storage. TPU training will read data directly from `gs://`.
@@ -59,8 +59,8 @@ We download the query, qrels, run and corpus files corresponding to the MS MARCO
The run file is generated by following Anserini's [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).

In short, the files are:
-- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
-- `qrels.dev.small.tsv`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
+- `topics.msmarco-passage.dev-subset.txt`: 6,980 queries from the MS MARCO dev set.
+- `qrels.msmarco-passage.dev-subset.txt`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
- `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and retrieved passages using Anserini's BM25.
- `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
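Once extracted, the two-column `collection.tsv` can be loaded into an id-to-text lookup in a few lines. A minimal sketch over a made-up two-passage sample (the real file has 8,841,823 rows; the sample text here is invented):

```python
import csv
import io

# Made-up sample mimicking collection.tsv: passage id, tab, passage text.
sample = (
    "0\tThe presence of communication amid scientific minds was important.\n"
    "1\tOne of the main reasons for the project was manpower.\n"
)

# Build an id -> passage text map. QUOTE_NONE avoids csv treating stray
# quotation marks inside passages as field delimiters.
corpus = {}
for pid, text in csv.reader(io.StringIO(sample), delimiter="\t",
                            quoting=csv.QUOTE_NONE):
    corpus[pid] = text
```

For the real corpus, `io.StringIO(sample)` would be replaced by an open file handle on `${DATA_DIR}/collection.tsv`.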

@@ -70,8 +70,8 @@ Let's start.
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv
-wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv
-wget https://www.dropbox.com/s/5t6e2225rt6ikym/qrels.dev.small.tsv
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt
wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz
tar -xvf collection.tar.gz
rm collection.tar.gz
@@ -81,7 +81,7 @@ cd ../../

As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.dev.small.tsv
```
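The headline metric this script reports is MRR@10. As a rough illustration of the metric (not the official implementation), assume a run maps each query id to its ranked passage ids and qrels maps each query id to its set of relevant passage ids:

```python
# Illustrative MRR@10: for each query, take the reciprocal rank of the
# first relevant passage within the top 10 (0 if none), then average
# over all ranked queries. Data shapes here are assumptions.
def mrr_at_10(run, qrels):
    total = 0.0
    for qid, ranked in run.items():
        for rank, pid in enumerate(ranked[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank
                break
    return total / len(run)

# Toy example: the one relevant passage sits at rank 2.
score = mrr_at_10({"q1": ["p9", "p3", "p7"]}, {"q1": {"p3"}})  # → 0.5
```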

The output should be:
@@ -94,7 +94,7 @@ QueriesRanked: 6980

Then, we prepare the query-doc pairs in the monoT5 input format.
```
-python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/queries.dev.small.tsv \
+python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/topics.msmarco-passage.dev-subset.txt \
--run ${DATA_DIR}/run.dev.small.tsv \
--corpus ${DATA_DIR}/collection.tsv \
--t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \
@@ -225,7 +225,7 @@ python pygaggle/data/convert_monot5_output_to_msmarco_run.py --t5_output ${DATA_

Now we can evaluate the reranked results using the official MS MARCO evaluation script.
```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
```
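The reranked run file evaluated here comes from `convert_monot5_output_to_msmarco_run.py`. A hedged sketch of what such a conversion presumably does (the function name and input shapes below are assumptions, not the script's actual interface): pair each monoT5 relevance score with its (query id, doc id), sort each query's documents by descending score, and emit MS MARCO run lines of `qid\tdocid\trank`.

```python
from collections import defaultdict

# Hypothetical score-to-run conversion; ids is a list of (qid, docid)
# pairs aligned with scores, one score per scored query-doc pair.
def scores_to_run(ids, scores):
    by_query = defaultdict(list)
    for (qid, docid), score in zip(ids, scores):
        by_query[qid].append((score, docid))
    lines = []
    for qid, scored in by_query.items():
        # Higher score first; rank is 1-based.
        for rank, (_, docid) in enumerate(sorted(scored, reverse=True), start=1):
            lines.append(f"{qid}\t{docid}\t{rank}")
    return lines

# Toy scores: d2 outranks d1 because -0.01 > -0.2.
run = scores_to_run([("1", "d1"), ("1", "d2")], [-0.2, -0.01])
```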

In the case of monoT5-3B, the output should be:
