Fix broken link for qrel.dev.small.tsv #191

Merged 4 commits on May 18, 2021

Changes from 1 commit
Update experiments-monot5-tpu.md
larryli1999 authored May 18, 2021
commit e20e09c0e9db93ab6d730fc242e45f99a52b7aeb
12 changes: 6 additions & 6 deletions docs/experiments-monot5-tpu.md
@@ -59,8 +59,8 @@ We download the query, qrels, run and corpus files corresponding to the MS MARCO
The run file is generated by following Anserini's [BM25 ranking instructions](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md).

In short, the files are:
-- `queries.dev.small.tsv`: 6,980 queries from the MS MARCO dev set.
-- `qrels.dev.small.tsv`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
+- `topics.msmarco-passage.dev-subset.txt`: 6,980 queries from the MS MARCO dev set.
+- `qrels.msmarco-passage.dev-subset.txt`: 7,437 pairs of query relevant passage ids from the MS MARCO dev set.
- `run.dev.small.tsv`: Approximately 6,980,000 pairs of dev set queries and retrieved passages using Anserini's BM25.
- `collection.tar.gz`: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
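For reference, both the old and new qrels files follow the standard TREC qrels layout of whitespace-separated columns (query id, iteration, passage id, relevance label). A minimal parser sketch under that assumption (`load_qrels` is a hypothetical helper, not part of the repo):

```python
from collections import defaultdict

def load_qrels(path):
    """Map each query id to its set of relevant passage ids.

    Assumes TREC-style rows: qid, iteration, pid, relevance label.
    """
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            qid, _, pid, rel = line.split()
            if int(rel) > 0:  # keep only positively judged passages
                qrels[qid].add(pid)
    return qrels
```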

@@ -70,8 +70,8 @@ Let's start.
```
cd ${DATA_DIR}
wget https://storage.googleapis.com/duobert_git/run.bm25.dev.small.tsv
-wget https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv
-wget https://www.dropbox.com/s/ie27l0mzcjb5fbc/qrels.dev.small.tsv
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
+wget https://raw.githubusercontent.com/castorini/anserini/master/src/main/resources/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt
wget https://www.dropbox.com/s/m1n2wf80l1lb9j1/collection.tar.gz
tar -xvf collection.tar.gz
rm collection.tar.gz
@@ -81,7 +81,7 @@ cd ../../

As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
```
-python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.dev.small.tsv ${DATA_DIR}/run.dev.small.tsv
+python tools/scripts/msmarco/msmarco_passage_eval.py ${DATA_DIR}/qrels.msmarco-passage.dev-subset.txt ${DATA_DIR}/run.dev.small.tsv
```

The output should be:
@@ -94,7 +94,7 @@ QueriesRanked: 6980
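The official script's headline number is MRR@10 over the 6,980 dev queries. A minimal sketch of that metric, assuming `qrels` maps a query id to its relevant passage ids and `run` maps a query id to a ranked list of retrieved passage ids (both hypothetical in-memory structures, not the repo's own code):

```python
def mrr_at_10(qrels, run):
    """Mean reciprocal rank, cut off at depth 10.

    qrels: qid -> set of relevant pids
    run:   qid -> list of pids in ranked order
    """
    total = 0.0
    for qid, ranking in run.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in qrels.get(qid, set()):
                total += 1.0 / rank  # credit only the first hit
                break
    return total / len(run)
```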

Then, we prepare the query-doc pairs in the monoT5 input format.
```
-python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/queries.dev.small.tsv \
+python pygaggle/data/create_msmarco_monot5_input.py --queries ${DATA_DIR}/topics.msmarco-passage.dev-subset.txt \
--run ${DATA_DIR}/run.dev.small.tsv \
--corpus ${DATA_DIR}/collection.tsv \
--t5_input ${DATA_DIR}/query_doc_pairs.dev.small.txt \
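The script above emits one text line per query-passage pair for the reranker. Assuming the monoT5 prompt shape used by pygaggle (`Query: … Document: … Relevant:`, with `true`/`false` as the decoder targets), a hypothetical sketch of the per-pair formatting:

```python
def to_monot5_input(query: str, passage: str) -> str:
    # monoT5 receives this prompt and scores the pair by the probability
    # it assigns to "true" versus "false" as the next decoded token.
    return f"Query: {query} Document: {passage} Relevant:"
```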