Commit

Updating MS MARCO Elasticsearch document, minor tweaks in commands (#902)

lintool authored Dec 12, 2021
1 parent 0475144 commit 11ce241
Showing 3 changed files with 59 additions and 34 deletions.
30 changes: 16 additions & 14 deletions README.md
So, the quickest way to get started is to write a script that converts your documents into the above format.
Then, you can invoke the indexer (here, we're indexing JSONL, but any of the other formats work as well):

```bash
python -m pyserini.index \
  --input integrations/resources/sample_collection_jsonl \
  --collection JsonCollection \
  --generator DefaultLuceneDocumentGenerator \
  --index indexes/sample_collection_jsonl \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
```

Three options control the type of index that is built:

+ `--storePositions`: builds a standard positional index
+ `--storeDocvectors`: stores doc vectors (required for relevance feedback)
+ `--storeRaw`: stores raw documents

If you don't specify any of the three options above, Pyserini builds an index that only stores term frequencies.
This is sufficient for simple "bag of words" querying (and yields the smallest index size).
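As a sketch of what such a conversion script might produce (the file names and document contents below are made up), each line of a JSONL collection file is one JSON object with `id` and `contents` fields:

```python
import json
import os
import tempfile

# Hypothetical example documents; Pyserini's JsonCollection expects one
# JSON object per line, each with "id" and "contents" fields.
docs = [
    {"id": "doc1", "contents": "contents of document one"},
    {"id": "doc2", "contents": "contents of document two"},
]

# Write them into a fresh collection directory, one object per line.
collection_dir = os.path.join(tempfile.mkdtemp(), "my_collection_jsonl")
os.makedirs(collection_dir)
path = os.path.join(collection_dir, "documents.jsonl")

with open(path, "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

with open(path) as f:
    print(len(f.readlines()))  # one line per document
```

Pointing `--input` at a directory like this is all the indexer needs; it picks up every file in the directory.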
Note that the file extension _must_ end in `.tsv` so that Pyserini knows what format the queries are in.
Then, you can run:

```bash
$ python -m pyserini.search \
    --topics integrations/resources/sample_queries.tsv \
    --index indexes/sample_collection_jsonl \
    --output run.sample.txt \
    --bm25

$ cat run.sample.txt
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
```

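Each line of the run file follows the standard TREC run format: query id, the literal `Q0`, document id, rank, score, and a run tag. A small sketch of parsing such a line (the parser itself is illustrative, not part of Pyserini):

```python
from typing import NamedTuple

class RunEntry(NamedTuple):
    qid: str
    docid: str
    rank: int
    score: float
    tag: str

def parse_trec_run_line(line: str) -> RunEntry:
    # TREC run format: <qid> Q0 <docid> <rank> <score> <tag>
    qid, _q0, docid, rank, score, tag = line.split()
    return RunEntry(qid, docid, int(rank), float(score), tag)

entry = parse_trec_run_line("1 Q0 doc2 1 0.256200 Anserini")
print(entry.docid, entry.score)
```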
27 changes: 21 additions & 6 deletions docs/experiments-elastic.md
# Pyserini: Multi-field Baseline for MS MARCO Document Ranking

<!-- NOTE, don't rename this page, because the URL is embedded in the WSDM demo -->

This page contains instructions for reproducing the "Elasticsearch optimized
multi_match best_fields" entry (2020/11/25) on the [MS MARCO Document Ranking Leaderboard](https://microsoft.github.io/MSMARCO-Document-Ranking-Submissions/leaderboard/) using Pyserini.
Details behind this run are described in this [blog post](https://www.elastic.co/blog/improving-search-relevance-with-data-driven-query-optimization);
First, we need to download and extract the MS MARCO document dataset:
```
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz -P collections/msmarco-doc
# Alternative mirror:
# wget https://www.dropbox.com/s/zly8cbyvt18l3u0/msmarco-docs.tsv.gz -P collections/msmarco-doc
gunzip collections/msmarco-doc/msmarco-docs.tsv.gz
```

Next, we convert the collection into JSONL format:

```bash
python tools/scripts/msmarco/convert_doc_collection_to_jsonl.py \
  --collection-path collections/msmarco-doc/msmarco-docs.tsv \
  --output-folder collections/msmarco-doc-json
```

We then build the index with the following command:

```bash
python -m pyserini.index \
  --input collections/msmarco-doc-json/ \
  --collection JsonCollection \
  --generator DefaultLuceneDocumentGenerator \
  --index indexes/msmarco-doc/lucene-index-msmarco \
  --threads 4 \
  --storeRaw \
  --stopwords docs/elastic-msmarco-stopwords.txt
```

On a modern desktop with an SSD, indexing takes around 15 minutes.
There are a few things to pay attention to: the official metric is MRR@100, so we want to only return the top 100 hits, in MS MARCO output
format.

```bash
python -m pyserini.search \
--topics msmarco-doc-dev \
--index indexes/msmarco-doc/lucene-index-msmarco/ \
--output runs/run.msmarco-doc.leaderboard-dev.elastic.txt \
--output-format msmarco \
--hits 100 \
--bm25 --k1 1.2 --b 0.75 \
--fields contents=10.0 title=8.63280262513067 url=0.0 \
--dismax --dismax.tiebreaker 0.3936135232328522 \
  --stopwords docs/elastic-msmarco-stopwords.txt
```

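The `--dismax` option combines the per-field scores the way Lucene's disjunction-max query does: take the best-scoring field and add the remaining field scores scaled by the tiebreaker. A sketch with made-up field scores:

```python
def dismax(field_scores, tiebreaker):
    # Disjunction-max: the best field score, plus the other field scores
    # scaled by the tiebreaker (0.0 = pure max, 1.0 = plain sum).
    best = max(field_scores)
    rest = sum(field_scores) - best
    return best + tiebreaker * rest

# Hypothetical per-field scores for one document, already weighted
# (e.g. contents, title, url as in the command above).
scores = [4.2, 3.1, 0.0]
print(dismax(scores, 0.3936135232328522))
```

With a tiebreaker of 0, only the strongest field counts; values between 0 and 1 let the weaker fields break ties between documents whose best fields score the same.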
After the run completes, we can evaluate the results:

```bash
$ python -m pyserini.eval.msmarco_doc_eval \
--judgments msmarco-doc-dev \
--run runs/run.msmarco-doc.leaderboard-dev.elastic.txt

#####################
MRR @100: 0.3071421845448626
QueriesRanked: 5193
```

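The reported metric, MRR@100, is the mean over queries of the reciprocal rank of the first relevant document within the top 100 hits. A sketch of the computation on made-up ranks:

```python
def mrr_at_k(first_relevant_ranks, k=100):
    # first_relevant_ranks: for each query, the 1-based rank of the first
    # relevant document, or None if none appears in the results.
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Hypothetical ranks for three queries: hit at rank 1, hit at rank 4, miss.
print(mrr_at_k([1, 4, None]))  # (1/1 + 1/4 + 0) / 3
```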
36 changes: 22 additions & 14 deletions docs/experiments-msmarco-doc.md
First, we need to download and extract the MS MARCO document dataset:

```
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
# Alternative mirror:
```

There's no need to uncompress the file, as Anserini can directly index gzipped files.

Build the index with the following command:

```
python -m pyserini.index \
  --input collections/msmarco-doc \
  --collection CleanTrecCollection \
  --generator DefaultLuceneDocumentGenerator \
  --index indexes/lucene-index-msmarco-doc \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw
```

On a modern desktop with an SSD, indexing takes around 40 minutes.
There should be a total of 3,213,835 documents indexed.

The 5193 queries in the development set are already stored in the repo.
Let's take a peek:

```
$ head tools/topics-and-qrels/topics.msmarco-doc.dev.txt
174249 does xpress bet charge to deposit money in your account
320792 how much is a cost to run disneyland
178627 effects of detox juice cleanse
1101278 do prince harry and william have last names
68095 can hives be a sign of pregnancy
$ wc tools/topics-and-qrels/topics.msmarco-doc.dev.txt
5193 35787 220304 tools/topics-and-qrels/topics.msmarco-doc.dev.txt
```
Conveniently, Pyserini already knows how to load and iterate through these pairs.
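Each line pairs a query id with the query text, separated by a tab. A sketch of parsing this format by hand (using a couple of the queries above as sample data):

```python
import io

# Two lines in the same format as topics.msmarco-doc.dev.txt:
# a query id and query text, separated by a tab.
sample = io.StringIO(
    "174249\tdoes xpress bet charge to deposit money in your account\n"
    "320792\thow much is a cost to run disneyland\n"
)

topics = {}
for line in sample:
    qid, query = line.rstrip("\n").split("\t", 1)
    topics[qid] = query

print(len(topics), topics["320792"])
```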
We can now perform retrieval using these queries:

```bash
python -m pyserini.search \
  --topics msmarco-doc-dev \
  --index indexes/lucene-index-msmarco-doc \
  --output runs/run.msmarco-doc.bm25tuned.txt \
  --output-format msmarco \
  --hits 100 \
  --bm25 --k1 4.46 --b 0.82
```

Here, we set the BM25 parameters to `k1=4.46`, `b=0.82` (tuned by grid search).
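For reference, the standard Okapi BM25 term weight shows what these two parameters control: `k1` governs term-frequency saturation and `b` governs document-length normalization. A sketch with toy statistics (not taken from the actual index):

```python
import math

def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=4.46, b=0.82):
    # Okapi BM25 weight for a single query term in a document.
    # k1 controls how quickly repeated occurrences saturate;
    # b controls how strongly long documents are penalized.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# Toy statistics for illustration only.
print(bm25_term_score(tf=3, df=100, num_docs=3213835,
                      doc_len=1200, avg_doc_len=1500))
```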
For example, setting `--threads 16 --batch-size 64` on a CPU with sufficient cores will substantially speed up retrieval.
After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:

```bash
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc.bm25tuned.txt

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
```

For that we first need to convert the run file into TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.msmarco-doc.bm25tuned.txt \
--output runs/run.msmarco-doc.bm25tuned.trec
```
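The MS MARCO run format carries only a query id, document id, and rank per line; the TREC format adds the literal `Q0`, a score, and a run tag. A sketch of the mapping (the helper and the synthesized score below are illustrative; the actual script may choose differently):

```python
def msmarco_to_trec(lines, tag="Anserini"):
    # Each MS MARCO line: "<qid>\t<docid>\t<rank>".  TREC format:
    # "<qid> Q0 <docid> <rank> <score> <tag>".  The MS MARCO file has no
    # scores, so we synthesize a rank-decreasing pseudo-score; any value
    # that decreases with rank preserves the ordering.
    out = []
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        score = 1.0 / int(rank)
        out.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return out

print(msmarco_to_trec(["174249\tD1234\t1", "174249\tD5678\t2"]))
```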

And then run the `trec_eval` tool:
