Update commands for MS MARCO V2 reproduction - uniCOIL noexp and uniCOIL + TILDE (#891)
lintool authored Nov 29, 2021
1 parent 00a7d66 commit d101ffa
Showing 2 changed files with 49 additions and 29 deletions.
24 changes: 15 additions & 9 deletions docs/experiments-msmarco-v2-unicoil-tilde-expansion.md
@@ -26,31 +26,35 @@ First, we need to download and extract the MS MARCO V2 passage dataset with uniC

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-v2-passage-unicoil-tilde-expansion-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/tb3m3J45HFJNAbq/download -O collections/msmarco-v2-passage-unicoil-tilde-expansion-b8.tar
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/tb3m3J45HFJNAbq/download -O collections/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar

tar -xvf collections/msmarco-v2-passage-unicoil-tilde-expansion-b8.tar -C collections/
tar -xvf collections/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar -C collections/
```

To confirm, `msmarco-v2-passage-unicoil-tilde-expansion-b8.tar` is around 58 GB and should have an MD5 checksum of `acc4c9bc3506c3a496bf3e009fa6e50b`.
To confirm, `msmarco-passage-v2-unicoil-tilde-expansion-b8.tar` is around 58 GB and should have an MD5 checksum of `acc4c9bc3506c3a496bf3e009fa6e50b`.
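The checksum comparison described above can be scripted. The following is a minimal sketch, not part of the original instructions: the helper name `md5_of_file` and the chunk size are illustrative assumptions, while the tarball path and expected digest are copied from the text above. The file is read in chunks so that a 58 GB archive never has to fit in memory.

```python
import hashlib
from pathlib import Path

# Expected MD5 as stated in the documentation above.
EXPECTED_MD5 = "acc4c9bc3506c3a496bf3e009fa6e50b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks.

    `md5_of_file` is a hypothetical helper written for this sketch.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    tarball = Path("collections/msmarco-passage-v2-unicoil-tilde-expansion-b8.tar")
    if tarball.exists():
        actual = md5_of_file(tarball)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{tarball} not found; download it first")
```

If the script prints a mismatch, the download is most likely incomplete or corrupted, and fetching the archive again from the other mirror is a reasonable next step.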

## Indexing

We can now index these docs:

```
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-v2-passage-unicoil-tilde-expansion-b8/ \
-index indexes/lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
python -m pyserini.index --collection JsonVectorCollection \
--input collections/msmarco-passage-v2-unicoil-tilde-expansion-b8/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8 \
--generator DefaultLuceneDocumentGenerator \
--threads 12 \
--impact \
--pretokenized
```

The important indexing options to note here are `--impact` and `--pretokenized`: the first tells Pyserini not to encode BM25 document lengths into Lucene's norms (encoding them is the default), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
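To illustrate why these two options matter, here is a conceptual sketch of impact scoring over pretokenized input. It is not Pyserini's or Lucene's code: the function `impact_score`, the tokens, and the weights are all invented for illustration, and the sketch assumes the score is the sum, over tokens shared by query and document, of the query weight times the stored document weight.

```python
def impact_score(query_weights, doc_weights):
    """Sum query_weight * doc_weight over the tokens the two vectors share.

    Hypothetical helper for illustration only. No length normalization is
    applied, mirroring the idea that BM25 document lengths are not used.
    """
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Invented, already-quantized term weights. Tokens such as "##ment" are
# wordpieces that must be matched exactly as given, which is why no further
# tokenization should be applied to them.
doc = {"treat": 7, "##ment": 3, "cancer": 12}
query = {"cancer": 5, "treat": 4, "cost": 2}

print(impact_score(query, doc))  # 5*12 + 4*7 = 88
```

If a tokenizer split or normalized `##ment` at indexing time, the query-side wordpiece would no longer match the indexed term, and its contribution to the score would be lost.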

Upon completion, we should have an index with 138,364,198 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 5 hours.

<!-- This is deprecated because we have pre-built indexes. Retaining for historic reasons.
If you want to save time and skip the indexing step, download the prebuilt index directly:
```bash
@@ -64,6 +68,8 @@ tar -xzvf indexes/lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8.tar
To confirm, `lucene-index.msmarco-v2-passage-unicoil-tilde-expansion-b8.tar.gz` is around 30 GB and should have an MD5 checksum of `0f9b1f90751d49dd3a66be54dd0b4f82`.
This pre-built index was created with the above command, but with the addition of the `-optimize` option to merge index segments.
-->

## Retrieval

> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-tilde` in the command below.
54 changes: 34 additions & 20 deletions docs/experiments-msmarco-v2-unicoil.md
@@ -14,7 +14,7 @@ We are working on figuring out ways to distribute the indexes.
## Zero-Shot uniCOIL

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model and we did not have time to finish doc2query-T5 expansions.
Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO (V1) passage corpus, described [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil.md).
Thus, we applied uniCOIL without expansions in a zero-shot manner using the model trained on the MS MARCO (V1) passage corpus, described [here](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-unicoil.md).

Specifically, we applied inference over the MS MARCO V2 [passage corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#passage-collection) and [segmented document corpus](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-v2.md#document-collection-segmented) to obtain the term weights.

@@ -28,18 +28,25 @@ As an alternative, we also make available pre-built indexes (in which case the i
Download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-v2-passage-unicoil-noexp-0shot-b8.tar
tar -xvf collections/msmarco-v2-passage-unicoil-noexp-0shot-b8.tar -C collections/
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar

tar -xvf collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -C collections/
```

To confirm, `msmarco-passage-v2-unicoil-noexp-0shot-b8.tar` is 24 GB and has an MD5 checksum of `fcf21991103197a7e8823b0e2045aca1`.

Index the sparse vectors:

```bash
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-v2-passage-unicoil-noexp-0shot-b8 \
-index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 32
python -m pyserini.index --collection JsonVectorCollection \
--input collections/msmarco-passage-v2-unicoil-noexp-0shot-b8 \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact \
--pretokenized
```

> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-passage-unicoil-noexp-0shot` in the command below.
@@ -49,7 +56,7 @@ Sparse retrieval with uniCOIL:
```bash
python -m pyserini.search --topics msmarco-v2-passage-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
--output runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt \
--impact \
--hits 1000 \
@@ -85,18 +92,25 @@ As an alternative, we also make available pre-built indexes (in which case the i
Download the sparse representation of the corpus generated by uniCOIL:

```bash
wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-v2-doc-seg-unicoil-noexp-0shot-b8.tar
tar -xvf collections/msmarco-v2-doc-seg-unicoil-noexp-0shot-b8.tar -C collections/
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar

tar -xvf collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -C collections/
```

To confirm, `msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar` is 54 GB and has an MD5 checksum of `af54061ab5c2ce6cf90a1e60fd92924c`.

Index the sparse vectors:

```bash
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-v2-doc-seg-unicoil-noexp-0shot-b8 \
-index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 32
python -m pyserini.index --collection JsonVectorCollection \
--input collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8 \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact \
--pretokenized
```

> If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use `--index msmarco-v2-doc-per-passage-unicoil-noexp-0shot` in the command below.
@@ -106,8 +120,8 @@ Sparse retrieval with uniCOIL:
```bash
python -m pyserini.search --topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
--output runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
--output runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt \
--impact \
--hits 10000 \
--batch 144 \
@@ -122,12 +136,12 @@ For the document corpus, since we are searching the segmented version, we retrie
To evaluate, using `trec_eval`:

```bash
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
Results:
map all 0.2012
recip_rank all 0.2032

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev runs/run.msmarco-document-v2-segmented.unicoil-noexp.0shot.txt
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
Results:
recall_100 all 0.7190
recall_1000 all 0.8813