From 11ce24176f927b9ddafec161465fe04c35c02f00 Mon Sep 17 00:00:00 2001 From: Jimmy Lin Date: Sun, 12 Dec 2021 14:30:47 -0500 Subject: [PATCH] Updating MS MARCO Elasticsearch document, minor tweaks in commands (#902) --- README.md | 30 ++++++++++++++------------- docs/experiments-elastic.md | 27 +++++++++++++++++++------ docs/experiments-msmarco-doc.md | 36 ++++++++++++++++++++------------- 3 files changed, 59 insertions(+), 34 deletions(-) diff --git a/README.md b/README.md index 8c88398ac..48c51af2a 100644 --- a/README.md +++ b/README.md @@ -306,19 +306,20 @@ So, the quickest way to get started is to write a script that converts your docu Then, you can invoke the indexer (here, we're indexing JSONL, but any of the other formats work as well): ```bash -python -m pyserini.index -collection JsonCollection \ - -generator DefaultLuceneDocumentGenerator \ - -threads 1 \ - -input integrations/resources/sample_collection_jsonl \ - -index indexes/sample_collection_jsonl \ - -storePositions -storeDocvectors -storeRaw +python -m pyserini.index \ + --input integrations/resources/sample_collection_jsonl \ + --collection JsonCollection \ + --generator DefaultLuceneDocumentGenerator \ + --index indexes/sample_collection_jsonl \ + --threads 1 \ + --storePositions --storeDocvectors --storeRaw ``` Three options control the type of index that is built: -+ `-storePositions`: builds a standard positional index -+ `-storeDocvectors`: stores doc vectors (required for relevance feedback) -+ `-storeRaw`: stores raw documents ++ `--storePositions`: builds a standard positional index ++ `--storeDocvectors`: stores doc vectors (required for relevance feedback) ++ `--storeRaw`: stores raw documents If you don't specify any of the three options above, Pyserini builds an index that only stores term frequencies. This is sufficient for simple "bag of words" querying (and yields the smallest index size). 
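The README hunk above tells readers to "write a script that converts your documents into this format" before invoking the indexer. As a point of reference, here is a minimal sketch of such a conversion script; the input layout (a directory of `.txt` files, one document per file) and the use of the filename stem as the document id are illustrative assumptions, but the output format — one JSON object per line with `id` and `contents` fields — is what `JsonCollection` expects.

```python
import json
import os

def convert_to_jsonl(input_dir: str, output_path: str) -> int:
    """Convert a directory of plain-text files into the JSONL format that
    JsonCollection expects: one JSON object per line, with an "id" field
    and a "contents" field holding the raw document text.

    Assumption (for illustration): each document is a .txt file, and the
    filename stem is used as the document id. Returns the document count."""
    count = 0
    with open(output_path, "w") as out:
        for name in sorted(os.listdir(input_dir)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(input_dir, name)) as f:
                doc = {"id": os.path.splitext(name)[0], "contents": f.read()}
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count
```

The resulting file can be placed in a directory and passed to `python -m pyserini.index --input <dir> --collection JsonCollection ...` as shown in the hunk above.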
@@ -349,12 +350,13 @@ Note that the file extension _must_ end in `.tsv` so that Pyserini knows what fo Then, you can run: ```bash -$ python -m pyserini.search --topics integrations/resources/sample_queries.tsv \ - --index indexes/sample_collection_jsonl \ - --output run.sample.txt \ - --bm25 +$ python -m pyserini.search \ + --topics integrations/resources/sample_queries.tsv \ + --index indexes/sample_collection_jsonl \ + --output run.sample.txt \ + --bm25 -$ cat run.sample.txt +$ cat run.sample.txt 1 Q0 doc2 1 0.256200 Anserini 1 Q0 doc3 2 0.231400 Anserini 2 Q0 doc1 1 0.534600 Anserini diff --git a/docs/experiments-elastic.md b/docs/experiments-elastic.md index f99b4496d..2472fb209 100644 --- a/docs/experiments-elastic.md +++ b/docs/experiments-elastic.md @@ -1,5 +1,7 @@ # Pyserini: Multi-field Baseline for MS MARCO Document Ranking + + This page contains instructions for reproducing the "Elasticsearch optimized multi_match best_fields" entry (2020/11/25) on the [MS MARCO Document Ranking Leaderboard](https://microsoft.github.io/MSMARCO-Document-Ranking-Submissions/leaderboard/) using Pyserini. 
Details behind this run are described in this [blog post](https://www.elastic.co/blog/improving-search-relevance-with-data-driven-query-optimization); @@ -23,6 +25,10 @@ First, we need to download and extract the MS MARCO document dataset: ``` mkdir collections/msmarco-doc wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz -P collections/msmarco-doc + +# Alternative mirror: +# wget https://www.dropbox.com/s/zly8cbyvt18l3u0/msmarco-docs.tsv.gz -P collections/msmarco-doc + gunzip collections/msmarco-doc/msmarco-docs.tsv.gz ``` @@ -40,10 +46,14 @@ python tools/scripts/msmarco/convert_doc_collection_to_jsonl.py \ We then build the index with the following command: ```bash -python -m pyserini.index -threads 4 -collection JsonCollection \ - -generator DefaultLuceneDocumentGenerator -input collections/msmarco-doc-json/ \ - -index indexes/msmarco-doc/lucene-index-msmarco -storeRaw \ - -stopwords docs/elastic-msmarco-stopwords.txt +python -m pyserini.index \ + --input collections/msmarco-doc-json/ \ + --collection JsonCollection \ + --generator DefaultLuceneDocumentGenerator \ + --index indexes/msmarco-doc/lucene-index-msmarco \ + --threads 4 \ + --storeRaw \ + --stopwords docs/elastic-msmarco-stopwords.txt ``` On a modern desktop with an SSD, indexing takes around 15 minutes. @@ -57,10 +67,12 @@ attention to: the official metric is MRR@100, so we want to only return the top format. 
```bash -python -m pyserini.search --output-format msmarco --hits 100 \ +python -m pyserini.search \ --topics msmarco-doc-dev \ --index indexes/msmarco-doc/lucene-index-msmarco/ \ --output runs/run.msmarco-doc.leaderboard-dev.elastic.txt \ + --output-format msmarco \ + --hits 100 \ --bm25 --k1 1.2 --b 0.75 \ --fields contents=10.0 title=8.63280262513067 url=0.0 \ --dismax --dismax.tiebreaker 0.3936135232328522 \ @@ -70,7 +82,10 @@ python -m pyserini.search --output-format msmarco --hits 100 \ After the run completes, we can evaluate the results: ```bash -$ python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev --run runs/run.msmarco-doc.leaderboard-dev.elastic.txt +$ python -m pyserini.eval.msmarco_doc_eval \ + --judgments msmarco-doc-dev \ + --run runs/run.msmarco-doc.leaderboard-dev.elastic.txt + ##################### MRR @100: 0.3071421845448626 QueriesRanked: 5193 diff --git a/docs/experiments-msmarco-doc.md b/docs/experiments-msmarco-doc.md index 1cd3d05bf..265806903 100644 --- a/docs/experiments-msmarco-doc.md +++ b/docs/experiments-msmarco-doc.md @@ -14,7 +14,6 @@ First, we need to download and extract the MS MARCO document dataset: ``` mkdir collections/msmarco-doc - wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc # Alternative mirror: @@ -27,13 +26,15 @@ There's no need to uncompress the file, as Anserini can directly index gzipped f Build the index with the following command: ``` -python -m pyserini.index -collection CleanTrecCollection \ - -generator DefaultLuceneDocumentGenerator -threads 1 -input collections/msmarco-doc \ - -index indexes/lucene-index-msmarco-doc -storePositions -storeDocvectors -storeRaw +python -m pyserini.index \ + --input collections/msmarco-doc \ + --collection CleanTrecCollection \ + --generator DefaultLuceneDocumentGenerator \ + --index indexes/lucene-index-msmarco-doc \ + --threads 1 \ + --storePositions --storeDocvectors --storeRaw ``` -Note that the 
indexing program simply dispatches command-line arguments to an underlying Java program, and so we use the Java single dash convention, e.g., `-index` and not `--index`. - On a modern desktop with an SSD, indexing takes around 40 minutes. There should be a total of 3,213,835 documents indexed. @@ -42,7 +43,7 @@ There should be a total of 3,213,835 documents indexed. The 5193 queries in the development set are already stored in the repo. Let's take a peek: -```bash +``` $ head tools/topics-and-qrels/topics.msmarco-doc.dev.txt 174249 does xpress bet charge to deposit money in your account 320792 how much is a cost to run disneyland @@ -54,6 +55,7 @@ $ head tools/topics-and-qrels/topics.msmarco-doc.dev.txt 178627 effects of detox juice cleanse 1101278 do prince harry and william have last names 68095 can hives be a sign of pregnancy + $ wc tools/topics-and-qrels/topics.msmarco-doc.dev.txt 5193 35787 220304 tools/topics-and-qrels/topics.msmarco-doc.dev.txt ``` @@ -63,10 +65,13 @@ Conveniently, Pyserini already knows how to load and iterate through these pairs We can now perform retrieval using these queries: ```bash -python -m pyserini.search --topics msmarco-doc-dev \ - --index indexes/lucene-index-msmarco-doc \ - --output runs/run.msmarco-doc.bm25tuned.txt \ - --bm25 --output-format msmarco --hits 100 --k1 4.46 --b 0.82 +python -m pyserini.search \ + --topics msmarco-doc-dev \ + --index indexes/lucene-index-msmarco-doc \ + --output runs/run.msmarco-doc.bm25tuned.txt \ + --output-format msmarco \ + --hits 100 \ + --bm25 --k1 4.46 --b 0.82 ``` Here, we set the BM25 parameters to `k1=4.46`, `b=0.82` (tuned by grid search). 
@@ -82,8 +87,10 @@ For example, setting `--threads 16 --batch-size 64` on a CPU with sufficient cor After the run finishes, we can evaluate the results using the official MS MARCO evaluation script: ```bash -$ python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \ - --run runs/run.msmarco-doc.bm25tuned.txt +$ python tools/scripts/msmarco/msmarco_doc_eval.py \ + --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \ + --run runs/run.msmarco-doc.bm25tuned.txt + ##################### MRR @100: 0.2770296928568702 QueriesRanked: 5193 @@ -95,7 +102,8 @@ For that we first need to convert the run file into TREC format: ```bash $ python -m pyserini.eval.convert_msmarco_run_to_trec_run \ - --input runs/run.msmarco-doc.bm25tuned.txt --output runs/run.msmarco-doc.bm25tuned.trec + --input runs/run.msmarco-doc.bm25tuned.txt \ + --output runs/run.msmarco-doc.bm25tuned.trec ``` And then run the `trec_eval` tool:
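As background to the `convert_msmarco_run_to_trec_run` step in the final hunk: the MS MARCO run format is a three-column TSV (`qid`, `docid`, `rank`), while `trec_eval` consumes the six-column TREC run format (`qid Q0 docid rank score tag`). The following is a minimal sketch of that conversion, not the official converter; the synthesized `1/rank` score and the `Anserini` run tag are illustrative assumptions.

```python
def msmarco_run_to_trec(msmarco_lines, tag="Anserini"):
    """Convert MS MARCO run lines (qid<TAB>docid<TAB>rank) into
    six-column TREC run lines (qid Q0 docid rank score tag).

    The MS MARCO format carries no scores, so a monotone surrogate
    score of 1/rank is synthesized here (an illustrative choice)."""
    trec_lines = []
    for line in msmarco_lines:
        qid, docid, rank = line.strip().split("\t")
        score = 1.0 / int(rank)
        trec_lines.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return trec_lines
```

Because the surrogate scores decrease with rank, rank-based metrics such as MRR@100 computed by `trec_eval` over the converted run agree with the original ordering.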