Fix broken link for qrel.dev.small.tsv #191

Merged 4 commits on May 18, 2021

16 changes: 8 additions & 8 deletions docs/experiments-monot5-tpu.md
@@ -42,7 +42,7 @@ cd ../../

Then convert the train triples file to the monoT5 input format:
```
-python pygaggle/data/create_msmarco_t5_training_pairs --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
+python pygaggle/data/create_msmarco_t5_training_pairs.py --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_doc_pairs.train.tsv
```

Next, copy the monoT5 input file to Google Storage. TPU training will read data directly from `gs`.
@@ -59,8 +59,8 @@ We download the query, qrels, run and corpus files corresponding to the MS MARCO
The run file is generated by following the Anserini's [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).

In short, the files are:
-- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
-- `qrels.dev.small.tsv`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
+- `topics.msmarco-passage.dev-subset.txt`: 6,980 queries from the MS MARCO dev set.
+- `qrels.msmarco-passage.dev-subset.txt`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
- `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and retrieved passages using Anserini's BM25.
- `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
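
The collection layout described above (passage id in the first column, passage text in the second) can be read with a short loader. This is a minimal sketch for illustration, not part of the pull request or of pygaggle: the function name `load_collection` is hypothetical, and it assumes the extracted file is UTF-8 text with one passage per line.

```python
def load_collection(path):
    """Read a two-column TSV file into a dict of passage id -> passage text.

    Hypothetical helper: assumes UTF-8 text, one passage per line,
    with the id and the text separated by the first tab on the line.
    """
    passages = {}
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so tabs inside the text survive.
            passage_id, sep, text = line.partition("\t")
            if not sep:
                raise ValueError(
                    f"line {line_number}: expected two tab-separated columns"
                )
            passages[passage_id] = text
    return passages
```

Calling `load_collection` on the extracted `collection.tsv` would hold all 8,841,823 passages in memory at once, so a streaming variant may be preferable on small machines.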

@@ -70,8 +70,8 @@ Let's start.
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv
-wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv
-wget https://www.dropbox.com/s/5t6e2225rt6ikym/qrels.dev.small.tsv
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt
wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz
tar -xvf collection.tar.gz
rm collection.tar.gz
@@ -81,7 +81,7 @@ cd ../../

As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.dev.small.tsv
```

The output should be:
@@ -94,7 +94,7 @@ QueriesRanked: 6980

Then, we prepare the query-doc pairs in the monoT5 input format.
```
-python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/queries.dev.small.tsv \
+python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/topics.msmarco-passage.dev-subset.txt \
--run ${DATA_DIR}/run.dev.small.tsv \
--corpus ${DATA_DIR}/collection.tsv \
--t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \
@@ -225,7 +225,7 @@ python pygaggle/data/convert_monot5_output_to_msmarco_run.py --t5_output ${DATA_

Now we can evaluate the reranked results using the official MS MARCO evaluation script.
```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
```

In the case of monoT5-3B, the output should be: