Add more Rocchio conditions for MS MARCO v1 and V2 (#1921)
Additional changes:
+ Tweaks to experiments-msmarco-passage.md and experiments-msmarco-doc.md
+ Fixed (some) incorrect dates on when tuning was performed for MS MARCO v1/v2 doc/passage (and d2q-T5)
+ Added missing tuned2 conditions to dl19-doc
+ Added missing ax/bm25prf conditions to dl20-doc and msmarco-doc
+ Fixed bug in neg Rocchio condition on passage d2q (-rerankCutoff 1000)
lintool authored Jun 28, 2022
1 parent e90bb3c commit 8010d5c
Showing 43 changed files with 1,269 additions and 232 deletions.
125 changes: 75 additions & 50 deletions docs/experiments-msmarco-doc.md
@@ -3,7 +3,7 @@
This page contains instructions for running BM25 baselines on the [MS MARCO *document* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *passage* ranking task](experiments-msmarco-passage.md).

**Setup Note:** If you're instantiating an Ubuntu VM on your system or on cloud (AWS and GCP), try to provision enough resources as the tasks such as building the index could take some time to finish such as RAM > 8GB and storage > 100 GB (SSD). This will prevent going back and fixing machine configuration again and again.
This exercise will require a machine with >8 GB RAM and at least 40 GB free disk space.

If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) of joining my research group, make sure you do the [passage ranking exercise](experiments-msmarco-passage.md) first.
Similarly, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell).
@@ -13,7 +13,7 @@ Similarly, try to understand what you're actually doing, instead of simply [carg
We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO document dataset:

```
```bash
mkdir collections/msmarco-doc

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
@@ -30,10 +30,14 @@ To confirm, `msmarco-docs.trec.gz` should have MD5 checksum of `d4863e4f342982b5
There's no need to uncompress the file, as Anserini can directly index gzipped files.
Build the index with the following command:

```
sh target/appassembler/bin/IndexCollection -threads 1 -collection CleanTrecCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-doc \
-index indexes/msmarco-doc/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw
```bash
target/appassembler/bin/IndexCollection \
-collection CleanTrecCollection \
-input collections/msmarco-doc \
-index indexes/msmarco-doc/lucene-index-msmarco \
-generator DefaultLuceneDocumentGenerator \
-threads 1 \
-storePositions -storeDocvectors -storeRaw
```

On a modern desktop with an SSD, indexing takes around 40 minutes.
@@ -45,11 +49,14 @@ There should be a total of 3,213,835 documents indexed.
After indexing finishes, we can do a retrieval run.
The dev queries are already stored in our repo:

```
target/appassembler/bin/SearchCollection -hits 1000 -parallelism 4 \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.dev.bm25.txt -bm25
```bash
target/appassembler/bin/SearchCollection \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-topicreader TsvInt \
-output runs/run.msmarco-doc.dev.bm25.txt \
-parallelism 4 \
-bm25 -hits 1000
```

Retrieval speed will vary by machine:
@@ -58,28 +65,31 @@ Adjust the parallelism by changing the `-parallelism` argument.

After the run completes, we can evaluate with `trec_eval`:

```
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map all 0.2310
recall_1000 all 0.8856
```

Let's compare to the baselines provided by Microsoft.
First, download:

```
```bash
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz
```

Then, run `trec_eval` to compare.
Note that to be fair, we restrict evaluation to top 100 hits per topic (which is what Microsoft provides):

```
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 \
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map all 0.2219

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -M 100 \
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.bm25.txt
map all 0.2303
```

@@ -91,18 +101,22 @@ Let's try to reproduce runs on there!
A few minor details to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, and the submission files to the leaderboard have a slightly different format.
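
For intuition about the metric, here is a minimal Python sketch of MRR@100. It assumes an in-memory run (query id to ranked doc ids) and qrels (query id to set of relevant doc ids); the function name `mrr_at_k`, the toy ids, and the choice to average over all judged queries are assumptions of this sketch. The numbers reported on this page come from `msmarco_doc_eval.py`, not from this code.

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank of the first relevant hit within the top k.

    run:   dict mapping query id -> list of doc ids, best first
    qrels: dict mapping query id -> set of relevant doc ids
    Queries with no relevant hit in the top k contribute 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels) if qrels else 0.0


# Toy data (hypothetical ids, not taken from MS MARCO).
run = {"q1": ["D3", "D1", "D7"], "q2": ["D9", "D2"], "q3": ["D5"]}
qrels = {"q1": {"D1"}, "q2": {"D9"}, "q3": {"D4"}}
print(mrr_at_k(run, qrels))  # (1/2 + 1 + 0) / 3 = 0.5
```

Because only the first relevant hit within the cutoff matters, returning more than 100 hits per query cannot change MRR@100, which is why `-hits 100` suffices.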

```bash
target/appassembler/bin/SearchCollection -hits 100 -parallelism 4 \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.leaderboard-dev.bm25base.txt -format msmarco \
-bm25 -bm25.k1 0.9 -bm25.b 0.4
target/appassembler/bin/SearchCollection \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-topicreader TsvInt \
-output runs/run.msmarco-doc.leaderboard-dev.bm25base.txt -format msmarco \
-parallelism 4 \
-bm25 -bm25.k1 0.9 -bm25.b 0.4 -hits 100
```

The command above uses the default BM25 parameters (`k1=0.9`, `b=0.4`), and note we set `-hits 100`.
Command for evaluation:

```bash
$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.bm25base.txt
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc.leaderboard-dev.bm25base.txt
#####################
MRR @100: 0.23005723505603573
QueriesRanked: 5193
@@ -114,17 +128,21 @@ The above run corresponds to "Anserini's BM25, default parameters (k1=0.9, b=0.4
Here's the invocation for BM25 with parameters optimized for recall@100 (`k1=4.46`, `b=0.82`):

```bash
target/appassembler/bin/SearchCollection -hits 100 -parallelism 4 \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt -format msmarco \
-bm25 -bm25.k1 4.46 -bm25.b 0.82
target/appassembler/bin/SearchCollection \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-topicreader TsvInt \
-output runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt -format msmarco \
-parallelism 4 \
-bm25 -bm25.k1 4.46 -bm25.b 0.82 -hits 100
```

Command for evaluation:

```bash
$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc.leaderboard-dev.bm25tuned.txt
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
@@ -139,7 +157,7 @@ It is well known that BM25 parameter tuning is important.
The setting of `k1=0.9`, `b=0.4` is often used as a default.
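
To see what `k1` and `b` control, here is a sketch of a textbook BM25 term weight in Python. This is an illustration only: the function name and the statistics passed to it (`avg_doc_len`, `df`) are hypothetical, and Lucene's implementation, which Anserini uses, may differ in details such as constant factors.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, df, num_docs, k1=0.9, b=0.4):
    """Textbook BM25 weight of one query term in one document.

    k1 controls how quickly repeated occurrences of a term saturate;
    b controls how strongly long documents are penalized (b=0 disables
    length normalization entirely).
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


# Same term statistics in a short and a long document, under the default
# setting and one tuned setting from this page. The document lengths and
# df are made-up values for illustration.
for k1, b in [(0.9, 0.4), (3.8, 0.87)]:
    short = bm25_term_score(3, 500, 1000, 1000, 3213835, k1=k1, b=b)
    long_ = bm25_term_score(3, 2000, 1000, 1000, 3213835, k1=k1, b=b)
    print(f"k1={k1}, b={b}: short={short:.3f}, long={long_:.3f}")
```

A larger `b` widens the gap between the short and the long document, which is one plausible reason tuning matters on a collection of full-length web documents.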

Let's try to do better!
We tuned BM25 using the queries found [here](https://github.com/castorini/Anserini-data/tree/master/MSMARCO): these are five different sets of 10k samples from the training queries (using the `shuf` command).
We tuned BM25 using the queries found [here](https://github.com/castorini/anserini-data/tree/master/MSMARCO): these are five different sets of 10k samples from the training queries (using the `shuf` command).
The basic approach is grid search of parameter values in tenth increments.
We tuned on each individual set and then averaged parameter values across all five sets (this has the effect of regularization).
In separate trials, we optimized for:
@@ -151,35 +169,42 @@ It turns out that optimizing for MRR@10 and MAP yields the same settings.
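
The tuning procedure described above (grid search in tenth increments on each query set, then averaging the winning settings) can be sketched in Python as follows. Everything here is illustrative: the grid ranges, the helper names, and the toy metric that stands in for a real retrieval-plus-evaluation run are assumptions, not the scripts actually used.

```python
def grid(start, stop, step=0.1):
    """Values from start to stop inclusive, in tenth increments."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 1) for i in range(n + 1)]


def tune(evaluate, k1_values, b_values):
    """Return the (k1, b) pair that maximizes evaluate(k1, b)."""
    candidates = [(k1, b) for k1 in k1_values for b in b_values]
    return max(candidates, key=lambda pair: evaluate(*pair))


def average_settings(settings):
    """Average per-set optima; this acts as a simple regularizer."""
    k1s, bs = zip(*settings)
    return round(sum(k1s) / len(k1s), 2), round(sum(bs) / len(bs), 2)


def make_toy_metric(best_k1, best_b):
    """Stand-in for 'run retrieval on one query set and score it'."""
    return lambda k1, b: -((k1 - best_k1) ** 2 + (b - best_b) ** 2)


# Five toy "query sets", each with a slightly different optimum.
toy_peaks = [(1.0, 0.5), (1.2, 0.7), (1.1, 0.6), (1.3, 0.6), (0.9, 0.6)]
per_set = [
    tune(make_toy_metric(k1, b), grid(0.1, 5.0), grid(0.1, 1.0))
    for k1, b in toy_peaks
]
print(average_settings(per_set))  # (1.1, 0.6)
```

Averaging is also consistent with the reported settings (for example `k1=4.46`) not being multiples of 0.1 even though the grid is.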

Here's the comparison between different parameter settings:

Setting | MRR@100 | MAP | Recall@1000 |
:----------------------------------------------------------------------|--------:|-------:|------------:|
Default (`k1=0.9`, `b=0.4`) | 0.2301 | 0.2310 | 0.8856 |
Optimized for MRR@100/MAP (`k1=3.8`, `b=0.87`) | 0.2784 | 0.2789 | 0.9326 |
Optimized for recall@100 (`k1=4.46`, `b=0.82`) | 0.2770 | 0.2775 | 0.9357 |
| Setting | MRR@100 | MAP | Recall@1000 |
|:-----------------------------------------------|--------:|-------:|------------:|
| Default (`k1=0.9`, `b=0.4`) | 0.2301 | 0.2310 | 0.8856 |
| Optimized for MRR@100/MAP (`k1=3.8`, `b=0.87`) | 0.2784 | 0.2789 | 0.9326 |
| Optimized for recall@100 (`k1=4.46`, `b=0.82`) | 0.2770 | 0.2775 | 0.9357 |

As expected, BM25 tuning makes a big difference!

Note that MRR@100 is computed with the leaderboard eval script (with 100 hits per query), while the other two metrics are computed with `trec_eval` (with 1000 hits per query).
So, we need to use different search programs, for example:

```
$ target/appassembler/bin/SearchCollection -hits 1000 -parallelism 4 \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.dev.opt-mrr.txt \
-bm25 -bm25.k1 3.8 -bm25.b 0.87
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.opt-mrr.txt
```bash
$ target/appassembler/bin/SearchCollection \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-topicreader TsvInt \
-output runs/run.msmarco-doc.dev.opt-mrr.txt \
-parallelism 4 \
-bm25 -bm25.k1 3.8 -bm25.b 0.87 -hits 1000

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mmap -mrecall.1000 \
src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.dev.opt-mrr.txt
map all 0.2789
recall_1000 all 0.9326

$ target/appassembler/bin/SearchCollection -hits 100 -parallelism 4 \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topicreader TsvInt -topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-output runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt -format msmarco \
-bm25 -bm25.k1 3.8 -bm25.b 0.87
$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt --run runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt
$ target/appassembler/bin/SearchCollection \
-index indexes/msmarco-doc/lucene-index-msmarco \
-topics src/main/resources/topics-and-qrels/topics.msmarco-doc.dev.txt \
-topicreader TsvInt \
-output runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt -format msmarco \
-parallelism 4 \
-bm25 -bm25.k1 3.8 -bm25.b 0.87 -hits 100

$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments src/main/resources/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc.leaderboard-dev.opt-mrr.txt
#####################
MRR @100: 0.27836767424339787
QueriesRanked: 5193
56 changes: 29 additions & 27 deletions docs/experiments-msmarco-passage.md
@@ -2,9 +2,8 @@

This page contains instructions for running BM25 baselines on the [MS MARCO *passage* ranking task](https://microsoft.github.io/msmarco/).
Note that there is a separate [MS MARCO *document* ranking task](experiments-msmarco-doc.md).
We also have a [separate page](experiments-doc2query.md) describing document expansion experiments (doc2query) for this task.

**Setup Note:** If you're instantiating an Ubuntu VM on your system or on cloud (AWS and GCP) for this particular task, try to provision enough resources as the tasks could take some time to finish such as RAM > 6GB and storage ~ 100 GB (SSD). This will prevent going back and fixing machine configuration again and again.
This exercise will require a machine with >8 GB RAM and at least 15 GB free disk space.

If you're a Waterloo undergraduate going through this guide as the [screening exercise](https://github.com/lintool/guide/blob/master/ura.md) of joining my research group, try to understand what you're actually doing, instead of simply [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell).
In particular, you'll want to pay attention to the "What's going on here?" sections.
@@ -58,8 +57,8 @@ Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files

```bash
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
```

The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
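
As a rough picture of what such a conversion involves, here is a Python sketch that turns a two-column TSV (id, text) into jsonl files with `id` and `contents` fields, starting a new file every `max_docs_per_file` documents. The function name, the output file naming, and the toy input are assumptions of this sketch; it is not the code of `convert_collection_to_jsonl.py`.

```python
import json
import os
import tempfile


def convert_tsv_to_jsonl(collection_path, output_folder, max_docs_per_file=1_000_000):
    """Write one JSON object per line, rolling over to a new file as needed."""
    os.makedirs(output_folder, exist_ok=True)
    out, file_index, total = None, 0, 0
    with open(collection_path, encoding="utf-8") as tsv:
        for i, line in enumerate(tsv):
            doc_id, text = line.rstrip("\n").split("\t", 1)
            if i % max_docs_per_file == 0:
                if out:
                    out.close()
                # Hypothetical naming scheme: docs00.json, docs01.json, ...
                path = os.path.join(output_folder, f"docs{file_index:02d}.json")
                out = open(path, "w", encoding="utf-8")
                file_index += 1
            out.write(json.dumps({"id": doc_id, "contents": text}) + "\n")
            total += 1
    if out:
        out.close()
    return file_index, total


# Toy run: 5 passages, 2 per file, so 3 files are expected.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "collection.tsv")
    with open(src, "w", encoding="utf-8") as f:
        for pid in range(5):
            f.write(f"{pid}\ttoy passage {pid}\n")
    num_files, num_docs = convert_tsv_to_jsonl(
        src, os.path.join(tmp, "jsonl"), max_docs_per_file=2
    )
    print(num_files, num_docs)  # 3 5
```

The same arithmetic explains the counts above: 8 full files of 1,000,000 lines plus a last file of 841,823 lines gives 8,841,823 passages.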
@@ -70,9 +69,12 @@ The above script should generate 9 jsonl files in `collections/msmarco-passage/c
We can now index these docs as a `JsonCollection` using Anserini:

```bash
sh target/appassembler/bin/IndexCollection -threads 9 -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator -input collections/msmarco-passage/collection_jsonl \
-index indexes/msmarco-passage/lucene-index-msmarco -storePositions -storeDocvectors -storeRaw
target/appassembler/bin/IndexCollection \
-collection JsonCollection \
-input collections/msmarco-passage/collection_jsonl \
-index indexes/msmarco-passage/lucene-index-msmarco \
-generator DefaultLuceneDocumentGenerator \
-threads 9 -storePositions -storeDocvectors -storeRaw
```

Upon completion, we should have an index with 8,841,823 documents.
@@ -85,9 +87,9 @@ Since queries of the set are too many (+100k), it would take a long time to retr

```bash
python tools/scripts/msmarco/filter_queries.py \
--qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries collections/msmarco-passage/queries.dev.tsv \
--output collections/msmarco-passage/queries.dev.small.tsv
--qrels collections/msmarco-passage/qrels.dev.small.tsv \
--queries collections/msmarco-passage/queries.dev.tsv \
--output collections/msmarco-passage/queries.dev.small.tsv
```

The output queries file should contain 6980 lines.
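
The filtering step amounts to keeping only the queries that have at least one relevance judgment. A minimal Python sketch, assuming both files are tab-separated with the query id in the first column (the function name and the toy lines are hypothetical, and this is not the code of `filter_queries.py`):

```python
def filter_queries(qrels_lines, query_lines):
    """Keep only query lines whose id appears in the qrels."""
    judged = {line.split("\t")[0] for line in qrels_lines if line.strip()}
    return [line for line in query_lines if line.split("\t")[0] in judged]


# Toy data: three queries, two of which have judgments.
toy_qrels = ["101\t0\t7001\t1", "103\t0\t7002\t1", "103\t0\t7003\t1"]
toy_queries = ["101\tfirst toy query", "102\tsecond toy query", "103\tthird toy query"]
for line in filter_queries(toy_qrels, toy_queries):
    print(line)
```

A query with several judged passages (like `103` here) still appears once in the output, so the output line count equals the number of distinct judged queries.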
@@ -119,11 +121,13 @@ These queries are taken from Bing search logs, so they're "realistic" web querie
We can now perform a retrieval run using this smaller set of queries:
```bash
sh target/appassembler/bin/SearchCollection -hits 1000 -parallelism 4 \
-index indexes/msmarco-passage/lucene-index-msmarco \
-topicreader TsvInt -topics collections/msmarco-passage/queries.dev.small.tsv \
-output runs/run.msmarco-passage.dev.small.tsv -format msmarco \
-bm25 -bm25.k1 0.82 -bm25.b 0.68
target/appassembler/bin/SearchCollection \
-index indexes/msmarco-passage/lucene-index-msmarco \
-topics collections/msmarco-passage/queries.dev.small.tsv \
-topicreader TsvInt \
-output runs/run.msmarco-passage.dev.small.tsv -format msmarco \
-parallelism 4 \
-bm25 -bm25.k1 0.82 -bm25.b 0.68 -hits 1000
```
The above command uses BM25 with tuned parameters `k1=0.82`, `b=0.68`.
@@ -244,19 +248,19 @@ For that we first need to convert runs and qrels files to the TREC format:

```bash
python tools/scripts/msmarco/convert_msmarco_to_trec_run.py \
--input runs/run.msmarco-passage.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.trec
--input runs/run.msmarco-passage.dev.small.tsv \
--output runs/run.msmarco-passage.dev.small.trec

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
--input collections/msmarco-passage/qrels.dev.small.tsv \
--output collections/msmarco-passage/qrels.dev.small.trec
--input collections/msmarco-passage/qrels.dev.small.tsv \
--output collections/msmarco-passage/qrels.dev.small.trec
```

And run the `trec_eval` tool:

```bash
tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec
collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.dev.small.trec
```

The output should be:
@@ -296,13 +300,11 @@ It turns out that optimizing for MRR@10 and MAP yields the same settings.

Here's the comparison between the Anserini default and optimized parameters:

Setting | MRR@10 | MAP | Recall@1000 |
:---------------------------|-------:|-------:|------------:|
Default (`k1=0.9`, `b=0.4`) | 0.1840 | 0.1926 | 0.8526
Optimized for recall@1000 (`k1=0.82`, `b=0.68`) | 0.1874 | 0.1957 | 0.8573
Optimized for MRR@10/MAP (`k1=0.60`, `b=0.62`) | 0.1892 | 0.1972 | 0.8555

To reproduce these results, the `SearchMsmarco` class above takes `k1` and `b` parameters as command-line arguments, e.g., `-k1 0.60 -b 0.62` (note that the default setting is `k1=0.82` and `b=0.68`).
| Setting | MRR@10 | MAP | Recall@1000 |
|:------------------------------------------------|-------:|-------:|------------:|
| Default (`k1=0.9`, `b=0.4`) | 0.1840 | 0.1926 | 0.8526 |
| Optimized for recall@1000 (`k1=0.82`, `b=0.68`) | 0.1874 | 0.1957 | 0.8573 |
| Optimized for MRR@10/MAP (`k1=0.60`, `b=0.62`) | 0.1892 | 0.1972 | 0.8555 |

As mentioned above, the BM25 run with `k1=0.82`, `b=0.68` corresponds to the entry "BM25 (Lucene8, tuned)" dated 2019/06/26 on the [MS MARCO Passage Ranking Leaderboard](https://microsoft.github.io/msmarco/).
The BM25 run with default parameters `k1=0.9`, `b=0.4` roughly corresponds to the entry "BM25 (Anserini)" dated 2019/04/10 (but Anserini was using Lucene 7.6 at the time).