From 863ff361fd671bb79b07f8f89a4b8121b7b46e8e Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Fri, 21 Jul 2023 16:33:33 -0400 Subject: [PATCH] Refactor + augment onboarding docs for MS MARCO passage (#1574) Add interactive retrieval section. --- docs/experiments-msmarco-doc.md | 2 +- docs/experiments-msmarco-passage.md | 105 +++++++++++++++++++++++----- 2 files changed, 89 insertions(+), 18 deletions(-) diff --git a/docs/experiments-msmarco-doc.md b/docs/experiments-msmarco-doc.md index 16ab2be02..83e2784b0 100644 --- a/docs/experiments-msmarco-doc.md +++ b/docs/experiments-msmarco-doc.md @@ -3,7 +3,7 @@ This guide contains instructions for running BM25 baselines on the [MS MARCO *document* ranking task](https://microsoft.github.io/msmarco/), which is nearly identical to a [similar guide in Anserini](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc.md), except that everything is in Python here (no Java). Note that there is a separate guide for the [MS MARCO *passage* ranking task](experiments-msmarco-passage.md). -As of July 2023, this exercise has been removed from the Waterloo students [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), which [starts here](start-here.md). +As of July 2023, this exercise has been removed from the Waterloo students [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), which [starts here](https://github.com/castorini/anserini/blob/master/docs/start-here.md). ## Data Prep diff --git a/docs/experiments-msmarco-passage.md b/docs/experiments-msmarco-passage.md index 7701b36f4..78d61dc87 100644 --- a/docs/experiments-msmarco-passage.md +++ b/docs/experiments-msmarco-passage.md @@ -4,7 +4,7 @@ This guide contains instructions for running BM25 baselines on the [MS MARCO *pa Note that there is a separate guide for the [MS MARCO *document* ranking task](experiments-msmarco-doc.md). This exercise will require a machine with >8 GB RAM and >15 GB free disk space. -If you're a Waterloo student traversing the [onboarding path](https://github.com/lintool/guide/blob/master/ura.md), +If you're a Waterloo student traversing the [onboarding path](https://github.com/lintool/guide/blob/master/ura.md) (which [starts here](https://github.com/castorini/anserini/blob/master/docs/start-here.md)), make sure you've first done the [BM25 Baselines for MS MARCO Passage Ranking **in Anserini**](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md). In general, if you don't understand what it is that you're doing when following this guide, i.e., you're just [cargo culting](https://en.wikipedia.org/wiki/Cargo_cult_programming) (i.e., blindly copying and pasting commands into a shell), then you should back up to the previous guide in the onboarding path. @@ -110,7 +110,8 @@ python -m pyserini.search.lucene \ --output runs/run.msmarco-passage.bm25tuned.txt \ --output-format msmarco \ --hits 1000 \ - --bm25 --k1 0.82 --b 0.68 + --bm25 --k1 0.82 --b 0.68 \ + --threads 4 --batch-size 16 ``` Here, we set the BM25 parameters to `k1=0.82`, `b=0.68` (tuned by grid search). @@ -127,10 +128,10 @@ For example, setting `--threads 16 --batch-size 64` on a CPU with sufficient cor ## Evaluation -After the run finishes, we can evaluate the results using the official MS MARCO evaluation script: +After the run finishes, we can evaluate the results using the official MS MARCO evaluation script, which has been incorporated into Pyserini: ```bash -$ python tools/scripts/msmarco/msmarco_passage_eval.py \ +$ python -m pyserini.eval.msmarco_passage_eval \ tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ runs/run.msmarco-passage.bm25tuned.txt @@ -141,22 +142,24 @@ QueriesRanked: 6980 ``` We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10. -For that we first need to convert the run file into TREC format: +The tool needs a different run format, so it's easier to just run retrieval again: ```bash -python -m pyserini.eval.convert_msmarco_run_to_trec_run \ - --input runs/run.msmarco-passage.bm25tuned.txt \ - --output runs/run.msmarco-passage.bm25tuned.trec - -python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \ - --input tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \ - --output collections/msmarco-passage/qrels.dev.small.trec +python -m pyserini.search.lucene \ + --index indexes/lucene-index-msmarco-passage \ + --topics msmarco-passage-dev-subset \ + --output runs/run.msmarco-passage.bm25tuned.trec \ + --hits 1000 \ + --bm25 --k1 0.82 --b 0.68 \ + --threads 4 --batch-size 16 ``` -And then run the `trec_eval` tool: +The only difference here is that we've removed `--output-format msmarco`. + +Let's then run the `trec_eval` tool, which has been incorporated into Pyserini: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \ +$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap \ collections/msmarco-passage/qrels.dev.small.trec \ runs/run.msmarco-passage.bm25tuned.trec @@ -167,18 +170,86 @@ recall_1000 all 0.8573 If you want to examine the MRR@10 for `qid` 1048585: ```bash -$ tools/eval/trec_eval.9.0.4/trec_eval -q -c -M 10 -m recip_rank \ +$ python -m pyserini.eval.trec_eval -q -c -M 10 -m recip_rank \ collections/msmarco-passage/qrels.dev.small.trec \ - runs/run.msmarco-passage.dev.small.trec | grep 1048585 + runs/run.msmarco-passage.bm25tuned.trec | grep 1048585 recip_rank 1048585 1.0000 ``` Once again, if you can't make sense of what's going on here, back up and make sure you've first done the [BM25 Baselines for MS MARCO Passage Ranking **in Anserini**](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md). -Otherwise, that's it! +Otherwise, congratulations! You've done everything that you did in Anserini (in Java), but now in Pyserini (in Python). +## Interactive Retrieval + +There's one final thing we should go over. +Because we're in Python now, we get the benefit of having an interactive shell. +Thus, we can run Pyserini interactively. + +Try the following: + +```python +from pyserini.search.lucene import LuceneSearcher + +searcher = LuceneSearcher('indexes/lucene-index-msmarco-passage') +searcher.set_bm25(0.82, 0.68) +hits = searcher.search('what is paula deen\'s brother') + +for i in range(0, 10): + print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}') +``` + +The `LuceneSearcher` class provides search capabilities for BM25. +In the code snippet above, we're issuing the query about Paula Deen's brother (from above). +Note that we're explicitly setting the BM25 parameters, which are not the default parameters. +We get back a list of results (`hits`), which we then iterate through and print out: + +``` + 1 7187158 18.81160 + 2 7187157 18.33340 + 3 7187163 17.87880 + 4 7546327 16.96210 + 5 7187160 16.56470 + 6 8227279 16.43250 + 7 7617404 16.23990 + 8 7187156 16.02490 + 9 2298838 15.70150 +10 7187155 15.51330 +``` + +You can confirm that the output is the same as `pyserini.search.lucene` from above. + +```bash +$ grep 1048585 runs/run.msmarco-passage.bm25tuned.trec | head -10 +1048585 Q0 7187158 1 18.811600 Anserini +1048585 Q0 7187157 2 18.333401 Anserini +1048585 Q0 7187163 3 17.878799 Anserini +1048585 Q0 7546327 4 16.962099 Anserini +1048585 Q0 7187160 5 16.564699 Anserini +1048585 Q0 8227279 6 16.432501 Anserini +1048585 Q0 7617404 7 16.239901 Anserini +1048585 Q0 7187156 8 16.024900 Anserini +1048585 Q0 2298838 9 15.701500 Anserini +1048585 Q0 7187155 10 15.513300 Anserini +``` + +To pull up the actual contents of a hit: + +```python +hits[0].raw +``` + +And you should get: + +``` +'{\n "id" : "7187158",\n "contents" : "Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦ Paula Deen and her brother Earl W. Bubba Hiers are being sued by a former general manager at Uncle Bubba\'sâ\x80¦"\n}' +``` + +Everything make sense? +If so, now you're truly done with this guide! + Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use `yyyy-mm-dd`, make sure you're using a commit id that's on the main trunk of Anserini, and use its 7-hexadecimal prefix for the link anchor text. ## Reproduction Log[*](reproducibility.md)