Update reproduction log (#1356)
minconszhang authored Nov 30, 2022
1 parent a3b0631 commit 2287be0
Showing 2 changed files with 6 additions and 4 deletions.
5 changes: 3 additions & 2 deletions docs/experiments-msmarco-doc.md
@@ -22,7 +22,7 @@ wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -

To confirm, `msmarco-docs.trec.gz` should have MD5 checksum of `d4863e4f342982b51b9a8fc668b2d0c0`.
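
As a rough illustration of the checksum step, the following sketch computes the MD5 digest of a file in chunks with Python's standard `hashlib` and compares it with the value quoted above. The helper names `md5_of_file` and `verify` are hypothetical and not part of Pyserini; the file path passed in is whatever location the download was saved to.

```python
import hashlib

# Expected digest of msmarco-docs.trec.gz, as stated in the text above.
EXPECTED_MD5 = "d4863e4f342982b51b9a8fc668b2d0c0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

Calling `verify(...)` on the downloaded `msmarco-docs.trec.gz` should return `True` if the download is intact. Reading in chunks avoids loading the whole archive into memory.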

-There's no need to uncompress the file, as Anserini can directly index gzipped files.
+There's no need to uncompress the file, as Pyserini can directly index gzipped files.
Build the index with the following command:

```bash
@@ -134,7 +134,7 @@ map all 0.2219
recall_100 all 0.7564
```

-We can see that Anserini's (tuned) BM25 baseline is already much better than the baseline provided by the organizers.
+We can see that Pyserini's (tuned) BM25 baseline is already much better than the baseline provided by the organizers.

## Reproduction Log[*](reproducibility.md)

@@ -176,3 +176,4 @@ We can see that Anserini's (tuned) BM25 baseline is already much better than the
+ Results reproduced by [@aivan6842](https://github.com/aivan6842) on 2022-07-11 (commit [`f553d43`](https://github.com/castorini/pyserini/commit/f553d43e5bd0b5617a002f1ab7861a158d6e2e71))
+ Results reproduced by [@Jasonwu-0803](https://github.com/Jasonwu-0803) on 2022-09-27 (commit [`563e4e7`](https://github.com/castorini/pyserini/commit/563e4e7d0daa2869355952663ed3f68955cdefdc))
+ Results reproduced by [@limelody](https://github.com/limelody) on 2022-10-14 (commit [`40ecc7b`](https://github.com/castorini/pyserini/commit/40ecc7bedd8bf26ae9ac6f0cb0358213ce2182f7))
++ Results reproduced by [@minconszhang](https://github.com/minconszhang) on 2022-11-25 (commit [`a3b0631`](https://github.com/castorini/pyserini/commit/a3b06316594859bc56706b711a68a28b9880f49c))
5 changes: 3 additions & 2 deletions docs/experiments-msmarco-passage.md
@@ -25,7 +25,7 @@ tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/

To confirm, `collectionandqueries.tar.gz` should have MD5 checksum of `31644046b18952c1386cd4564ba2ae69`.

-Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):
+Next, we need to convert the MS MARCO tsv collection into Pyserini's jsonl files (which have one json object per line):

```bash
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
@@ -35,7 +35,7 @@ python tools/scripts/msmarco/convert_collection_to_jsonl.py \

The above script should generate 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, each with 1M lines (except for the last one, which should have 841,823 lines).
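
To make the conversion concrete, here is a minimal sketch of turning a two-column TSV (`docid<TAB>text`) into chunked jsonl files with one json object per line. It is not the actual `convert_collection_to_jsonl.py` script: the function name, the `max_docs_per_file` parameter, and the `docsNN.json` output naming are assumptions made for illustration. The `id` and `contents` field names follow the layout that `JsonCollection` reads. With 1M documents per file, a collection of 8 × 1,000,000 + 841,823 = 8,841,823 passages yields the 9 files described above.

```python
import json
import os


def convert_tsv_to_jsonl(tsv_path, output_dir, max_docs_per_file=1_000_000):
    """Split a docid<TAB>text collection into jsonl files; return the file count."""
    os.makedirs(output_dir, exist_ok=True)
    out = None
    file_index = 0
    with open(tsv_path, encoding="utf-8") as tsv:
        for i, line in enumerate(tsv):
            # Split on the first tab only, so tabs inside the text survive.
            doc_id, text = line.rstrip("\n").split("\t", 1)
            # Start a new output file every max_docs_per_file documents.
            if i % max_docs_per_file == 0:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: docs00.json, docs01.json, ...
                name = f"docs{file_index:02d}.json"
                out = open(os.path.join(output_dir, name), "w", encoding="utf-8")
                file_index += 1
            # One json object per line.
            out.write(json.dumps({"id": doc_id, "contents": text}) + "\n")
    if out is not None:
        out.close()
    return file_index
```

Every file except the last holds exactly `max_docs_per_file` lines, which matches the line counts quoted above.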

-We can now index these docs as a `JsonCollection` using Anserini:
+We can now index these docs as a `JsonCollection` using Pyserini:

```bash
python -m pyserini.index.lucene \
@@ -176,3 +176,4 @@ On the other hand, recall@1000 provides the upper bound effectiveness of downstr
+ Results reproduced by [@aivan6842](https://github.com/aivan6842) on 2022-07-11 (commit [`f553d43`](https://github.com/castorini/pyserini/commit/f553d43e5bd0b5617a002f1ab7861a158d6e2e71))
+ Results reproduced by [@Jasonwu-0803](https://github.com/Jasonwu-0803) on 2022-09-27 (commit [`563e4e7`](https://github.com/castorini/pyserini/commit/563e4e7d0daa2869355952663ed3f68955cdefdc))
+ Results reproduced by [@limelody](https://github.com/limelody) on 2022-09-27 (commit [`7b53918`](https://github.com/castorini/pyserini/commit/7b5391864897df4523b34a4943ce08d7e373dbe7))
++ Results reproduced by [@minconszhang](https://github.com/minconszhang) on 2022-11-25 (commit [`a3b0631`](https://github.com/castorini/pyserini/commit/a3b06316594859bc56706b711a68a28b9880f49c))
