Commit

Updating MS MARCO Elasticsearch document, minor tweaks in commands (#902)

lintool authored Dec 12, 2021
1 parent 0475144 commit 11ce241
Showing 3 changed files with 59 additions and 34 deletions.
30 changes: 16 additions & 14 deletions README.md
So, the quickest way to get started is to write a script that converts your documents into the above format.
Then, you can invoke the indexer (here, we're indexing JSONL, but any of the other formats work as well):

```bash
python -m pyserini.index \
  --input integrations/resources/sample_collection_jsonl \
  --collection JsonCollection \
  --generator DefaultLuceneDocumentGenerator \
  --index indexes/sample_collection_jsonl \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
```

Three options control the type of index that is built:

+ `--storePositions`: builds a standard positional index
+ `--storeDocvectors`: stores doc vectors (required for relevance feedback)
+ `--storeRaw`: stores raw documents

If you don't specify any of the three options above, Pyserini builds an index that only stores term frequencies.
This is sufficient for simple "bag of words" querying (and yields the smallest index size).
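As a sketch of what such a conversion script might produce (the file names and document contents below are made up), each line of a JSONL collection file is one JSON object with `id` and `contents` fields:

```python
import json
import os
import tempfile

# Hypothetical example documents; Pyserini's JsonCollection expects one
# JSON object per line, each with "id" and "contents" fields.
docs = [
    {"id": "doc1", "contents": "contents of document one"},
    {"id": "doc2", "contents": "contents of document two"},
]

# Write them into a fresh collection directory, one object per line.
collection_dir = os.path.join(tempfile.mkdtemp(), "my_collection_jsonl")
os.makedirs(collection_dir)
path = os.path.join(collection_dir, "documents.jsonl")

with open(path, "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

with open(path) as f:
    print(len(f.readlines()))  # one line per document
```

Pointing `--input` at a directory like this is all the indexer needs; it picks up every file in the directory.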
Note that the file extension _must_ end in `.tsv` so that Pyserini knows what format the queries are in.
Then, you can run:

```bash
$ python -m pyserini.search \
    --topics integrations/resources/sample_queries.tsv \
    --index indexes/sample_collection_jsonl \
    --output run.sample.txt \
    --bm25

$ cat run.sample.txt
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
```

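Each line of the run file follows the standard TREC run format: query id, the literal `Q0`, document id, rank, score, and a run tag. A small sketch of parsing such a line (the parser itself is illustrative, not part of Pyserini):

```python
from typing import NamedTuple

class RunEntry(NamedTuple):
    qid: str
    docid: str
    rank: int
    score: float
    tag: str

def parse_trec_run_line(line: str) -> RunEntry:
    # TREC run format: <qid> Q0 <docid> <rank> <score> <tag>
    qid, _q0, docid, rank, score, tag = line.split()
    return RunEntry(qid, docid, int(rank), float(score), tag)

entry = parse_trec_run_line("1 Q0 doc2 1 0.256200 Anserini")
print(entry.docid, entry.score)
```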
27 changes: 21 additions & 6 deletions docs/experiments-elastic.md
# Pyserini: Multi-field Baseline for MS MARCO Document Ranking

<!-- NOTE, don't rename this page, because the URL is embedded in the WSDM demo -->

This page contains instructions for reproducing the "Elasticsearch optimized
multi_match best_fields" entry (2020/11/25) on the [MS MARCO Document Ranking Leaderboard](https://microsoft.github.io/MSMARCO-Document-Ranking-Submissions/leaderboard/) using Pyserini.
Details behind this run are described in this [blog post](https://www.elastic.co/blog/improving-search-relevance-with-data-driven-query-optimization);
First, we need to download and extract the MS MARCO document dataset:
```
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz -P collections/msmarco-doc
# Alternative mirror:
# wget https://www.dropbox.com/s/zly8cbyvt18l3u0/msmarco-docs.tsv.gz -P collections/msmarco-doc
gunzip collections/msmarco-doc/msmarco-docs.tsv.gz
```

Next, we convert the collection into JSONL format:

```bash
python tools/scripts/msmarco/convert_doc_collection_to_jsonl.py \
  --collection-path collections/msmarco-doc/msmarco-docs.tsv \
  --output-folder collections/msmarco-doc-json
```

We then build the index with the following command:

```bash
python -m pyserini.index \
  --input collections/msmarco-doc-json/ \
  --collection JsonCollection \
  --generator DefaultLuceneDocumentGenerator \
  --index indexes/msmarco-doc/lucene-index-msmarco \
  --threads 4 \
  --storeRaw \
  --stopwords docs/elastic-msmarco-stopwords.txt
```

On a modern desktop with an SSD, indexing takes around 15 minutes.
There are a few things to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, in MS MARCO output
format.

```bash
python -m pyserini.search \
--topics msmarco-doc-dev \
--index indexes/msmarco-doc/lucene-index-msmarco/ \
--output runs/run.msmarco-doc.leaderboard-dev.elastic.txt \
--output-format msmarco \
--hits 100 \
--bm25 --k1 1.2 --b 0.75 \
--fields contents=10.0 title=8.63280262513067 url=0.0 \
--dismax --dismax.tiebreaker 0.3936135232328522 \
  --stopwords docs/elastic-msmarco-stopwords.txt
```

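The `--dismax` option combines the per-field scores the way Lucene's disjunction-max query does: take the best-scoring field and add the remaining field scores scaled by the tiebreaker. A sketch with made-up field scores:

```python
def dismax(field_scores, tiebreaker):
    # Disjunction-max: the best field score, plus the other field scores
    # scaled by the tiebreaker (0.0 = pure max, 1.0 = plain sum).
    best = max(field_scores)
    rest = sum(field_scores) - best
    return best + tiebreaker * rest

# Hypothetical per-field scores for one document, already weighted
# (e.g. contents, title, url as in the command above).
scores = [4.2, 3.1, 0.0]
print(dismax(scores, 0.3936135232328522))
```

With a tiebreaker of 0, only the strongest field counts; values between 0 and 1 let the weaker fields break ties between documents whose best fields score the same.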
After the run completes, we can evaluate the results:

```bash
$ python -m pyserini.eval.msmarco_doc_eval \
--judgments msmarco-doc-dev \
--run runs/run.msmarco-doc.leaderboard-dev.elastic.txt

#####################
MRR @100: 0.3071421845448626
QueriesRanked: 5193
```

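The reported metric, MRR@100, is the mean over queries of the reciprocal rank of the first relevant document within the top 100 hits. A sketch of the computation on made-up ranks:

```python
def mrr_at_k(first_relevant_ranks, k=100):
    # first_relevant_ranks: for each query, the 1-based rank of the first
    # relevant document, or None if none appears in the results.
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Hypothetical ranks for three queries: hit at rank 1, hit at rank 4, miss.
print(mrr_at_k([1, 4, None]))  # (1/1 + 1/4 + 0) / 3
```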
36 changes: 22 additions & 14 deletions docs/experiments-msmarco-doc.md
First, we need to download and extract the MS MARCO document dataset:

```
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
# Alternative mirror:
```

There's no need to uncompress the file, as Anserini can directly index gzipped files.

Build the index with the following command:

```
python -m pyserini.index \
  --input collections/msmarco-doc \
  --collection CleanTrecCollection \
  --generator DefaultLuceneDocumentGenerator \
  --index indexes/lucene-index-msmarco-doc \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
```

On a modern desktop with an SSD, indexing takes around 40 minutes.
There should be a total of 3,213,835 documents indexed.

The 5193 queries in the development set are already stored in the repo.
Let's take a peek:

```
$ head tools/topics-and-qrels/topics.msmarco-doc.dev.txt
174249 does xpress bet charge to deposit money in your account
320792 how much is a cost to run disneyland
178627 effects of detox juice cleanse
1101278 do prince harry and william have last names
68095 can hives be a sign of pregnancy
$ wc tools/topics-and-qrels/topics.msmarco-doc.dev.txt
5193 35787 220304 tools/topics-and-qrels/topics.msmarco-doc.dev.txt
```
Conveniently, Pyserini already knows how to load and iterate through these pairs.
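Each line pairs a query id with the query text, separated by a tab. A sketch of parsing this format by hand (using a couple of the queries above as sample data):

```python
import io

# Two lines in the same format as topics.msmarco-doc.dev.txt:
# a query id and query text, separated by a tab.
sample = io.StringIO(
    "174249\tdoes xpress bet charge to deposit money in your account\n"
    "320792\thow much is a cost to run disneyland\n"
)

topics = {}
for line in sample:
    qid, query = line.rstrip("\n").split("\t", 1)
    topics[qid] = query

print(len(topics), topics["320792"])
```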
We can now perform retrieval using these queries:

```bash
python -m pyserini.search \
  --topics msmarco-doc-dev \
  --index indexes/lucene-index-msmarco-doc \
  --output runs/run.msmarco-doc.bm25tuned.txt \
  --output-format msmarco \
  --hits 100 \
  --bm25 --k1 4.46 --b 0.82
```

Here, we set the BM25 parameters to `k1=4.46`, `b=0.82` (tuned by grid search).
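For reference, the standard Okapi BM25 term weight shows what these two parameters control: `k1` governs term-frequency saturation and `b` governs document-length normalization. A sketch with toy statistics (not taken from the actual index):

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=4.46, b=0.82):
    # Okapi BM25 weight for a single query term in a document.
    # k1 controls how quickly repeated occurrences saturate;
    # b controls how strongly long documents are penalized.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# Toy statistics for illustration only.
print(bm25_term_score(tf=3, df=100, num_docs=3213835,
                      doc_len=1200, avg_doc_len=1500))
```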
For example, setting `--threads 16 --batch-size 64` on a CPU with sufficient cores will substantially speed up retrieval.
After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:

```bash
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc.bm25tuned.txt

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
```

For that we first need to convert the run file into TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.msmarco-doc.bm25tuned.txt \
--output runs/run.msmarco-doc.bm25tuned.trec
```
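The MS MARCO run format carries only a query id, document id, and rank per line; the TREC format adds the literal `Q0`, a score, and a run tag. A sketch of the mapping (the helper and the synthesized score below are illustrative; the actual script may choose differently):

```python
def msmarco_to_trec(lines, tag="Anserini"):
    # Each MS MARCO line: "<qid>\t<docid>\t<rank>".  TREC format:
    # "<qid> Q0 <docid> <rank> <score> <tag>".  The MS MARCO file has no
    # scores, so we synthesize a rank-decreasing pseudo-score; any value
    # that decreases with rank preserves the ordering.
    out = []
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        score = 1.0 / int(rank)
        out.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return out

print(msmarco_to_trec(["174249\tD1234\t1", "174249\tD5678\t2"]))
```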

And then run the `trec_eval` tool:
