diff --git a/docs/experiments-dpr.md b/docs/experiments-dpr.md index 3da0f383e..546a350f2 100644 --- a/docs/experiments-dpr.md +++ b/docs/experiments-dpr.md @@ -7,14 +7,17 @@ Dense passage retriever (DPR) is a dense retrieval method described in the follo We have replicated DPR results and incorporated the technique into Pyserini. Our own efforts are described in the following paper: +> Xueguang Ma, Kai Sun, Ronak Pradeep, Minghan Li, and Jimmy Lin. [Another Look at DPR: Reproduction of Training and Replication of Retrieval](https://link.springer.com/chapter/10.1007/978-3-030-99736-6_41). Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), Part I, pages 613-626, April 2022, Stavanger, Norway. + +Which evolved from a previous arXiv preprint: + > Xueguang Ma, Kai Sun, Ronak Pradeep, and Jimmy Lin. [A Replication Study of Dense Passage Retriever](https://arxiv.org/abs/2104.05740). _arXiv:2104.05740_, April 2021. To be clear, we started with model checkpoint releases in the official [DPR repo](https://github.com/facebookresearch/DPR) and did _not_ retrain the query and passage encoders from scratch. Our implementation does not share any code with the DPR repo, other than evaluation scripts to ensure that results are comparable. This guide provides instructions to reproduce our replication study. -Our efforts include both retrieval as well as end-to-end answer extraction. -We cover only retrieval here; for end-to-end answer extraction, please see [this guide](https://github.com/castorini/pygaggle/blob/master/docs/experiments-dpr-reader.md) in our PyGaggle neural text ranking library. +Our efforts include both retrieval and end-to-end answer extraction, but we only cover retrieval here. Note that we often observe minor differences in scores between different computing environments (e.g., Linux vs. macOS). 
However, the differences usually appear in the fifth digit after the decimal point, and do not appear to be a cause for concern from a reproducibility perspective. @@ -43,6 +46,7 @@ Here's how our results stack up against results reported in the paper using the | SQuAD | Hybrid | 66.2 | 75.1 | 78.6 | 84.4 | The hybrid results reported above for "us" capture what we call the "norm" condition (see paper for details). +Note that the results below represent the current state of the code base, where there may be minor differences in effectiveness from what's reported in the paper. ## Natural Questions (NQ) with DPR-Multi @@ -53,8 +57,8 @@ python -m pyserini.search.faiss \ --index wikipedia-dpr-100w.dpr-multi \ --topics dpr-nq-test \ --encoded-queries dpr_multi-nq-test \ - --output runs/run.dpr.nq-test.multi.bf.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.nq-test.multi.trec \ + --batch-size 512 --threads 16 ``` The option `--encoded-queries` specifies the use of encoded queries (i.e., queries that have already been converted into dense vectors and cached). 
@@ -66,11 +70,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-nq-test \ - --input runs/run.dpr.nq-test.multi.bf.trec \ - --output runs/run.dpr.nq-test.multi.bf.json + --input runs/run.encoded.dpr.nq-test.multi.trec \ + --output runs/run.encoded.dpr.nq-test.multi.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.nq-test.multi.bf.json \ + --retrieval runs/run.encoded.dpr.nq-test.multi.json \ --topk 20 100 ``` @@ -87,7 +91,7 @@ Top100 accuracy: 0.8609 python -m pyserini.search.lucene \ --index wikipedia-dpr-100w \ --topics dpr-nq-test \ - --output runs/run.dpr.nq-test.bm25.trec + --output runs/run.encoded.dpr.nq-test.bm25.trec ``` To evaluate, first convert the TREC output format to DPR's `json` format: @@ -96,11 +100,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-nq-test \ - --input runs/run.dpr.nq-test.bm25.trec \ - --output runs/run.dpr.nq-test.bm25.json + --input runs/run.encoded.dpr.nq-test.bm25.trec \ + --output runs/run.encoded.dpr.nq-test.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.nq-test.bm25.json \ + --retrieval runs/run.encoded.dpr.nq-test.bm25.json \ --topk 20 100 ``` @@ -120,8 +124,8 @@ python -m pyserini.search.hybrid \ sparse --index wikipedia-dpr-100w \ fusion --alpha 1.3 \ run --topics dpr-nq-test \ - --output runs/run.dpr.nq-test.multi.bf.bm25.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.nq-test.multi.bm25.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` with `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -132,11 +136,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-nq-test \ - --input runs/run.dpr.nq-test.multi.bf.bm25.trec \ - --output runs/run.dpr.nq-test.multi.bf.bm25.json + --input runs/run.encoded.dpr.nq-test.multi.bm25.trec \ + --output runs/run.encoded.dpr.nq-test.multi.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.nq-test.multi.bf.bm25.json \ + --retrieval runs/run.encoded.dpr.nq-test.multi.bm25.json \ --topk 20 100 ``` @@ -156,8 +160,8 @@ python -m pyserini.search.faiss \ --index wikipedia-dpr-100w.dpr-multi \ --topics dpr-trivia-test \ --encoded-queries dpr_multi-trivia-test \ - --output runs/run.dpr.trivia-test.multi.bf.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.trivia-test.multi.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` with `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -168,11 +172,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-trivia-test \ - --input runs/run.dpr.trivia-test.multi.bf.trec \ - --output runs/run.dpr.trivia-test.multi.bf.json + --input runs/run.encoded.dpr.trivia-test.multi.trec \ + --output runs/run.encoded.dpr.trivia-test.multi.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.trivia-test.multi.bf.json \ + --retrieval runs/run.encoded.dpr.trivia-test.multi.json \ --topk 20 100 ``` @@ -189,7 +193,7 @@ Top100 accuracy: 0.8479 python -m pyserini.search.lucene \ --index wikipedia-dpr-100w \ --topics dpr-trivia-test \ - --output runs/run.dpr.trivia-test.bm25.trec + --output runs/run.encoded.dpr.trivia-test.bm25.trec ``` To evaluate, first convert the TREC output format to DPR's `json` format: @@ -198,11 +202,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-trivia-test \ - --input runs/run.dpr.trivia-test.bm25.trec \ - --output runs/run.dpr.trivia-test.bm25.json + --input runs/run.encoded.dpr.trivia-test.bm25.trec \ + --output runs/run.encoded.dpr.trivia-test.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.trivia-test.bm25.json \ + --retrieval runs/run.encoded.dpr.trivia-test.bm25.json \ --topk 20 100 ``` @@ -222,8 +226,8 @@ python -m pyserini.search.hybrid \ sparse --index wikipedia-dpr-100w \ fusion --alpha 0.95 \ run --topics dpr-trivia-test \ - --output runs/run.dpr.trivia-test.multi.bf.bm25.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.trivia-test.multi.bm25.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` with `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -234,11 +238,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-trivia-test \ - --input runs/run.dpr.trivia-test.multi.bf.bm25.trec \ - --output runs/run.dpr.trivia-test.multi.bf.bm25.json + --input runs/run.encoded.dpr.trivia-test.multi.bm25.trec \ + --output runs/run.encoded.dpr.trivia-test.multi.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.trivia-test.multi.bf.bm25.json \ + --retrieval runs/run.encoded.dpr.trivia-test.multi.bm25.json \ --topk 20 100 ``` @@ -258,8 +262,8 @@ python -m pyserini.search.faiss \ --index wikipedia-dpr-100w.dpr-multi \ --topics dpr-wq-test \ --encoded-queries dpr_multi-wq-test \ - --output runs/run.dpr.wq-test.multi.bf.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.wq-test.multi.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` with `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -270,11 +274,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-wq-test \ - --input runs/run.dpr.wq-test.multi.bf.trec \ - --output runs/run.dpr.wq-test.multi.bf.json + --input runs/run.encoded.dpr.wq-test.multi.trec \ + --output runs/run.encoded.dpr.wq-test.multi.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.wq-test.multi.bf.json \ + --retrieval runs/run.encoded.dpr.wq-test.multi.json \ --topk 20 100 ``` @@ -291,7 +295,7 @@ Top100 accuracy: 0.8297 python -m pyserini.search.lucene \ --index wikipedia-dpr-100w \ --topics dpr-wq-test \ - --output runs/run.dpr.wq-test.bm25.trec + --output runs/run.encoded.dpr.wq-test.bm25.trec ``` To evaluate, first convert the TREC output format to DPR's `json` format: @@ -300,11 +304,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-wq-test \ - --input runs/run.dpr.wq-test.bm25.trec \ - --output runs/run.dpr.wq-test.bm25.json + --input runs/run.encoded.dpr.wq-test.bm25.trec \ + --output runs/run.encoded.dpr.wq-test.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.wq-test.bm25.json \ + --retrieval runs/run.encoded.dpr.wq-test.bm25.json \ --topk 20 100 ``` @@ -324,8 +328,8 @@ python -m pyserini.search.hybrid \ sparse --index wikipedia-dpr-100w \ fusion --alpha 0.95 \ run --topics dpr-wq-test \ - --output runs/run.dpr.wq-test.multi.bf.bm25.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.wq-test.multi.bm25.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` with `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -336,11 +340,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-wq-test \ - --input runs/run.dpr.wq-test.multi.bf.bm25.trec \ - --output runs/run.dpr.wq-test.multi.bf.bm25.json + --input runs/run.encoded.dpr.wq-test.multi.bm25.trec \ + --output runs/run.encoded.dpr.wq-test.multi.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.wq-test.multi.bf.bm25.json \ + --retrieval runs/run.encoded.dpr.wq-test.multi.bm25.json \ --topk 20 100 ``` @@ -360,8 +364,8 @@ python -m pyserini.search.faiss \ --index wikipedia-dpr-100w.dpr-multi \ --topics dpr-curated-test \ --encoded-queries dpr_multi-curated-test \ - --output runs/run.dpr.curated-test.multi.bf.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.curated-test.multi.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` by `--encoder facebook/dpr-question_encoder-multiset-base` with for on-the-fly query encoding. 
@@ -372,12 +376,12 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-curated-test \ - --input runs/run.dpr.curated-test.multi.bf.trec \ - --output runs/run.dpr.curated-test.multi.bf.json \ + --input runs/run.encoded.dpr.curated-test.multi.trec \ + --output runs/run.encoded.dpr.curated-test.multi.json \ --regex python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.curated-test.multi.bf.json \ + --retrieval runs/run.encoded.dpr.curated-test.multi.json \ --topk 20 100 \ --regex ``` @@ -395,7 +399,7 @@ Top100 accuracy: 0.9337 python -m pyserini.search.lucene \ --index wikipedia-dpr-100w \ --topics dpr-curated-test \ - --output runs/run.dpr.curated-test.bm25.trec + --output runs/run.encoded.dpr.curated-test.bm25.trec ``` To evaluate, first convert the TREC output format to DPR's `json` format: @@ -404,12 +408,12 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-curated-test \ - --input runs/run.dpr.curated-test.bm25.trec \ - --output runs/run.dpr.curated-test.bm25.json \ + --input runs/run.encoded.dpr.curated-test.bm25.trec \ + --output runs/run.encoded.dpr.curated-test.bm25.json \ --regex python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.curated-test.bm25.json \ + --retrieval runs/run.encoded.dpr.curated-test.bm25.json \ --topk 20 100 \ --regex ``` @@ -430,8 +434,8 @@ python -m pyserini.search.hybrid \ sparse --index wikipedia-dpr-100w \ fusion --alpha 1.05 \ run --topics dpr-curated-test \ - --output runs/run.dpr.curated-test.multi.bf.bm25.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.curated-test.multi.bm25.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` by `--encoder 
facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. @@ -442,12 +446,12 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-curated-test \ - --input runs/run.dpr.curated-test.multi.bf.bm25.trec \ - --output runs/run.dpr.curated-test.multi.bf.bm25.json \ + --input runs/run.encoded.dpr.curated-test.multi.bm25.trec \ + --output runs/run.encoded.dpr.curated-test.multi.bm25.json \ --regex python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.curated-test.multi.bf.bm25.json \ + --retrieval runs/run.encoded.dpr.curated-test.multi.bm25.json \ --topk 20 100 \ --regex ``` @@ -468,8 +472,8 @@ python -m pyserini.search.faiss \ --index wikipedia-dpr-100w.dpr-multi \ --topics dpr-squad-test \ --encoded-queries dpr_multi-squad-test \ - --output runs/run.dpr.squad-test.multi.bf.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.squad-test.multi.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` by `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -480,11 +484,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-squad-test \ - --input runs/run.dpr.squad-test.multi.bf.trec \ - --output runs/run.dpr.squad-test.multi.bf.json + --input runs/run.encoded.dpr.squad-test.multi.trec \ + --output runs/run.encoded.dpr.squad-test.multi.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.squad-test.multi.bf.json \ + --retrieval runs/run.encoded.dpr.squad-test.multi.json \ --topk 20 100 ``` @@ -501,7 +505,7 @@ Top100 accuracy: 0.6773 python -m pyserini.search.lucene \ --index wikipedia-dpr-100w \ --topics dpr-squad-test \ - --output runs/run.dpr.squad-test.bm25.trec + --output runs/run.encoded.dpr.squad-test.bm25.trec ``` To evaluate, first convert the TREC output format to DPR's `json` format: @@ -510,11 +514,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-squad-test \ - --input runs/run.dpr.squad-test.bm25.trec \ - --output runs/run.dpr.squad-test.bm25.json + --input runs/run.encoded.dpr.squad-test.bm25.trec \ + --output runs/run.encoded.dpr.squad-test.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.squad-test.bm25.json \ + --retrieval runs/run.encoded.dpr.squad-test.bm25.json \ --topk 20 100 ``` @@ -534,8 +538,8 @@ python -m pyserini.search.hybrid \ sparse --index wikipedia-dpr-100w \ fusion --alpha 2.00 \ run --topics dpr-squad-test \ - --output runs/run.dpr.squad-test.multi.bf.bm25.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.squad-test.multi.bm25.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` by `--encoder facebook/dpr-question_encoder-multiset-base` for on-the-fly query encoding. 
@@ -546,19 +550,19 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-squad-test \ - --input runs/run.dpr.squad-test.multi.bf.bm25.trec \ - --output runs/run.dpr.squad-test.multi.bf.bm25.json + --input runs/run.encoded.dpr.squad-test.multi.bm25.trec \ + --output runs/run.encoded.dpr.squad-test.multi.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.squad-test.multi.bf.bm25.json \ + --retrieval runs/run.encoded.dpr.squad-test.multi.bm25.json \ --topk 20 100 ``` And the expected results: ``` -Top20 accuracy: 0.7514 -Top100 accuracy: 0.8437 +Top20 accuracy: 0.7513 +Top100 accuracy: 0.8436 ``` ## Natural Questions (NQ) with DPR-Single @@ -570,8 +574,8 @@ python -m pyserini.search.faiss \ --index wikipedia-dpr-100w.dpr-single-nq \ --topics dpr-nq-test \ --encoded-queries dpr_single_nq-nq-test \ - --output runs/run.dpr.nq-test.single.bf.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.nq-test.single.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` by `--encoder facebook/dpr-question_encoder-single-nq-base` for on-the-fly query encoding. 
@@ -582,11 +586,11 @@ To evaluate, first convert the TREC output format to DPR's `json` format: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ --index wikipedia-dpr-100w \ --topics dpr-nq-test \ - --input runs/run.dpr.nq-test.single.bf.trec \ - --output runs/run.dpr.nq-test.single.bf.json + --input runs/run.encoded.dpr.nq-test.single.trec \ + --output runs/run.encoded.dpr.nq-test.single.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.nq-test.single.bf.json \ + --retrieval runs/run.encoded.dpr.nq-test.single.json \ --topk 20 100 ``` @@ -606,8 +610,8 @@ python -m pyserini.search.hybrid \ sparse --index wikipedia-dpr-100w \ fusion --alpha 1.2 \ run --topics dpr-nq-test \ - --output runs/run.dpr.nq-test.single.bf.bm25.trec \ - --batch-size 36 --threads 12 + --output runs/run.encoded.dpr.nq-test.single.bm25.trec \ + --batch-size 512 --threads 16 ``` Same as above, replace `--encoded-queries` by `--encoder facebook/dpr-question_encoder-single-nq-base` for on-the-fly query encoding. 
@@ -616,13 +620,13 @@ To evaluate, first convert the TREC output format to DPR's `json` format: ```bash python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \ - --index wikipedia-dpr-100w \ --topics dpr-nq-test \ - --input runs/run.dpr.nq-test.single.bf.bm25.trec \ - --output runs/run.dpr.nq-test.single.bf.bm25.json + --index wikipedia-dpr-100w \ + --input runs/run.encoded.dpr.nq-test.single.bm25.trec \ + --output runs/run.encoded.dpr.nq-test.single.bm25.json python -m pyserini.eval.evaluate_dpr_retrieval \ - --retrieval runs/run.dpr.nq-test.single.bf.bm25.json \ + --retrieval runs/run.encoded.dpr.nq-test.single.bm25.json \ --topk 20 100 ``` diff --git a/docs/experiments-tct_colbert-v2.md b/docs/experiments-tct_colbert-v2.md index 33e77f708..aefcb8dbb 100644 --- a/docs/experiments-tct_colbert-v2.md +++ b/docs/experiments-tct_colbert-v2.md @@ -16,8 +16,8 @@ Summary of results (figures from the paper are in parentheses): |:--------------------------------------------------------------|---------------:|-------:|------------:| | TCT_ColBERT-V2 (brute-force index) | 0.3440 (0.344) | 0.3509 | 0.9670 | | TCT_ColBERT-V2-HN (brute-force index) | 0.3543 (0.354) | 0.3608 | 0.9708 | -| TCT_ColBERT-V2-HN+ (brute-force index) | 0.3585 (0.359) | 0.3645 | 0.9695 | -| TCT_ColBERT-V2-HN+ (brute-force index) + BoW BM25 | 0.3683 (0.369) | 0.3737 | 0.9707 | +| TCT_ColBERT-V2-HN+ (brute-force index) | 0.3584 (0.359) | 0.3644 | 0.9695 | +| TCT_ColBERT-V2-HN+ (brute-force index) + BoW BM25 | 0.3682 (0.369) | 0.3737 | 0.9707 | | TCT_ColBERT-V2-HN+ (brute-force index) + BM25 w/ doc2query-T5 | 0.3731 (0.375) | 0.3789 | 0.9759 | The slight differences between the reproduced scores and those reported in the paper can be attributed to TensorFlow implementations in the published paper vs. PyTorch implementations here in this reproduction guide. 
@@ -31,9 +31,9 @@ python -m pyserini.search.faiss \ --index msmarco-v1-passage.tct_colbert-v2 \ --topics msmarco-passage-dev-subset \ --encoded-queries tct_colbert-v2-msmarco-passage-dev-subset \ - --output runs/run.msmarco-passage.tct_colbert-v2.bf.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2.tsv \ --output-format msmarco \ - --batch-size 36 --threads 12 + --batch-size 512 --threads 16 ``` Note that to ensure maximum reproducibility, by default Pyserini uses pre-computed query representations that are automatically downloaded. @@ -42,9 +42,13 @@ As an alternative, replace with `--encoder castorini/tct_colbert-v2-msmarco` to To evaluate: ```bash -$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2.bf.tsv +python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2.tsv +``` + +Results: +``` ##################### MRR @10: 0.3440 QueriesRanked: 6980 @@ -55,13 +59,17 @@ We can also use the official TREC evaluation tool `trec_eval` to compute other m For that we first need to convert runs and qrels files to the TREC format: ```bash -$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \ - --input runs/run.msmarco-passage.tct_colbert-v2.bf.tsv \ - --output runs/run.msmarco-passage.tct_colbert-v2.bf.trec +python -m pyserini.eval.convert_msmarco_run_to_trec_run \ + --input runs/run.msmarco-passage.tct_colbert-v2.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2.trec -$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2.bf.trec +python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2.trec +``` +Results: + +``` map all 0.3509 recall_1000 all 0.9670 ``` @@ -75,9 +83,9 @@ python -m pyserini.search.faiss \ --index msmarco-v1-passage.tct_colbert-v2-hn \ --topics msmarco-passage-dev-subset 
\ --encoded-queries tct_colbert-v2-hn-msmarco-passage-dev-subset \ - --output runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2-hn.tsv \ --output-format msmarco \ - --batch-size 36 --threads 12 + --batch-size 512 --threads 16 ``` Note that to ensure maximum reproducibility, by default Pyserini uses pre-computed query representations that are automatically downloaded. @@ -86,21 +94,33 @@ As an alternative, replace with `--encoder castorini/tct_colbert-v2-hn-msmarco` To evaluate: ```bash -$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv +python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2-hn.tsv +``` +Results: + +``` ##################### MRR @10: 0.3543 QueriesRanked: 6980 ##################### +``` -$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \ - --input runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv \ - --output runs/run.msmarco-passage.tct_colbert-v2-hn.bf.trec +And TREC evaluation: -$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2-hn.bf.trec +```bash +python -m pyserini.eval.convert_msmarco_run_to_trec_run \ + --input runs/run.msmarco-passage.tct_colbert-v2-hn.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2-hn.trec + +python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2-hn.trec +``` + +Results: +``` map all 0.3608 recall_1000 all 0.9708 ``` @@ -114,9 +134,9 @@ python -m pyserini.search.faiss \ --index msmarco-v1-passage.tct_colbert-v2-hnp \ --topics msmarco-passage-dev-subset \ --encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \ - --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2-hnp.tsv \ --output-format msmarco \ 
- --batch-size 36 --threads 12 + --batch-size 512 --threads 16 ``` Note that to ensure maximum reproducibility, by default Pyserini uses pre-computed query representations that are automatically downloaded. @@ -125,22 +145,34 @@ As an alternative, replace with `--encoder castorini/tct_colbert-v2-hnp-msmarco` To evaluate: ```bash -$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv +python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2-hnp.tsv +``` +Results: + +``` ##################### -MRR @10: 0.3585 +MRR @10: 0.3584 QueriesRanked: 6980 ##################### +``` + +And TREC evaluation: + +```bash +python -m pyserini.eval.convert_msmarco_run_to_trec_run \ + --input runs/run.msmarco-passage.tct_colbert-v2-hnp.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2-hnp.trec -$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \ - --input runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv \ - --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.trec +python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2-hnp.trec +``` -$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.trec +Results: -map all 0.3645 +``` +map all 0.3644 recall_1000 all 0.9695 ``` @@ -148,7 +180,7 @@ recall_1000 all 0.9695 Hybrid retrieval with dense-sparse representations (without document expansion): - dense retrieval with TCT-ColBERT, brute force index. -- sparse retrieval with BM25 `msmarco-passage` (i.e., default bag-of-words) index. +- sparse retrieval with BM25 (i.e., default bag-of-words) index. 
```bash python -m pyserini.search.hybrid \ @@ -158,28 +190,40 @@ python -m pyserini.search.hybrid \ fusion --alpha 0.06 \ run --topics msmarco-passage-dev-subset \ --output-format msmarco \ - --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv \ - --batch-size 36 --threads 12 + --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bm25.tsv \ + --batch-size 512 --threads 16 ``` To evaluate: ```bash -$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv +python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2-hnp.bm25.tsv +``` +Results: + +``` ##################### -MRR @10: 0.3683 +MRR @10: 0.3682 QueriesRanked: 6980 ##################### +``` -$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \ - --input runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.tsv \ - --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.trec +And TREC evaluation: -$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ - runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.bm25.trec +```bash +python -m pyserini.eval.convert_msmarco_run_to_trec_run \ + --input runs/run.msmarco-passage.tct_colbert-v2-hnp.bm25.tsv \ + --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bm25.trec +python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \ + runs/run.msmarco-passage.tct_colbert-v2-hnp.bm25.trec +``` + +Results: + +``` map all 0.3737 recall_1000 all 0.9707 ``` @@ -194,32 +238,44 @@ Hybrid retrieval with dense-sparse representations (with document expansion): python -m pyserini.search.hybrid \ dense --index msmarco-v1-passage.tct_colbert-v2-hnp \ --encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \ - sparse --index msmarco-v1-passage-d2q-t5 \ + sparse --index msmarco-v1-passage.d2q-t5 \ fusion --alpha 0.1 \ run --topics msmarco-passage-dev-subset \ - --output 
runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv \
+  --output runs/run.msmarco-passage.tct_colbert-v2-hnp.doc2queryT5.tsv \
   --output-format msmarco \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 ```
 
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert-v2-hnp.doc2queryT5.tsv
+```
+
+Results:
 
+```
 #####################
 MRR @10: 0.3731
 QueriesRanked: 6980
 #####################
+```
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.tsv \
-  --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.trec
+And TREC evaluation:
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.doc2queryT5.trec
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-passage.tct_colbert-v2-hnp.doc2queryT5.tsv \
+  --output runs/run.msmarco-passage.tct_colbert-v2-hnp.doc2queryT5.trec
 
+python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert-v2-hnp.doc2queryT5.trec
+```
+
+Results:
+
+```
 map                   	all	0.3789
 recall_1000           	all	0.9759
 ```
@@ -242,7 +298,7 @@ python -m pyserini.search.faiss \
   --hits 1000 \
   --max-passage \
   --max-passage-hits 100 \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 
 # TREC 2019 DL queries
 python -m pyserini.search.faiss \
@@ -253,7 +309,7 @@ python -m pyserini.search.faiss \
   --hits 1000 \
   --max-passage \
   --max-passage-hits 100 \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 
 # TREC 2020 DL queries
 python -m pyserini.search.faiss \
@@ -264,52 +320,71 @@ python -m pyserini.search.faiss \
   --hits 1000 \
   --max-passage \
   --max-passage-hits 100 \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 ```
 
 Evaluation on MS MARCO doc queries (dev set):
 
 ```bash
-$ python -m pyserini.eval.msmarco_doc_eval \
-  --judgments msmarco-doc-dev \
-  --run runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.txt
+python -m pyserini.eval.msmarco_doc_eval \
+  --judgments msmarco-doc-dev \
+  --run runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.txt
+```
 
+Results:
+
+```
 #####################
-MRR @100: 0.3509
+MRR @100: 0.3512
 QueriesRanked: 5193
 #####################
+```
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.txt \
-  --output runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.trec
+And TREC evaluation:
 
-$ python -m pyserini.eval.trec_eval -c -m recall.100 -m map -m ndcg_cut.10 \
-  msmarco-doc-dev runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.trec
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.txt \
+  --output runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.trec
+
+python -m pyserini.eval.trec_eval -c -m recall.100 -m map -m ndcg_cut.10 \
+  msmarco-doc-dev runs/run.msmarco-doc.passage.tct_colbert-v2-hnp-maxp.trec
+```
 
 Results:
-map                   	all	0.3509
-recall_100            	all	0.8908
-ndcg_cut_10           	all	0.4123
+
+```
+map                   	all	0.3512
+recall_100            	all	0.8910
+ndcg_cut_10           	all	0.4128
 ```
 
-Evaluation TREC 2019 DL queries:
+Evaluation on TREC 2019 DL queries:
 
 ```bash
-$ python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap -mndcg_cut.10 dl19-doc \
-  runs/run.dl19-doc.passage.tct_colbert-v2-hnp-maxp.txt
+python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap -mndcg_cut.10 dl19-doc \
+  runs/run.dl19-doc.passage.tct_colbert-v2-hnp-maxp.txt
+```
+
+Results:
 
+```
 Results:
 map                   	all	0.2684
 recall_100            	all	0.3854
 ndcg_cut_10           	all	0.6593
 ```
 
-Evaluation TREC 2020 DL queries:
+Evaluation on TREC 2020 DL queries:
 
 ```bash
-$ python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap -mndcg_cut.10 dl20-doc \
-  runs/run.dl20-doc.passage.tct_colbert-v2-hnp-maxp.txt
+python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap -mndcg_cut.10 dl20-doc \
+  runs/run.dl20-doc.passage.tct_colbert-v2-hnp-maxp.txt
+```
+
+Results:
 
+```
 Results:
 map                   	all	0.3914
 recall_100            	all	0.5964
diff --git a/docs/experiments-tct_colbert.md b/docs/experiments-tct_colbert.md
index b93e53fdf..757b0c3d7 100644
--- a/docs/experiments-tct_colbert.md
+++ b/docs/experiments-tct_colbert.md
@@ -28,9 +28,9 @@ python -m pyserini.search.faiss \
   --index msmarco-v1-passage.tct_colbert \
   --topics msmarco-passage-dev-subset \
   --encoded-queries tct_colbert-msmarco-passage-dev-subset \
-  --output runs/run.msmarco-passage.tct_colbert.bf.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.tsv \
   --output-format msmarco \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 ```
 
 Note that to ensure maximum reproducibility, by default Pyserini uses pre-computed query representations that are automatically downloaded.
@@ -39,9 +39,13 @@ As an alternative, to perform "on-the-fly" query encoding, see additional instru
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.bf.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.tsv
+```
+
+Results:
 
+```
 #####################
 MRR @10: 0.3350
 QueriesRanked: 6980
@@ -52,13 +56,17 @@ We can also use the official TREC evaluation tool `trec_eval` to compute other m
 For that we first need to convert runs and qrels files to the TREC format:
 
 ```bash
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-passage.tct_colbert.bf.tsv \
-  --output runs/run.msmarco-passage.tct_colbert.bf.trec
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-passage.tct_colbert.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.trec
+
+python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.trec
+```
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.bf.trec
+Results:
 
+```
 map                   	all	0.3416
 recall_1000           	all	0.9640
 ```
@@ -75,27 +83,40 @@ python -m pyserini.search.faiss \
   --topics msmarco-passage-dev-subset \
   --encoded-queries tct_colbert-msmarco-passage-dev-subset \
   --output runs/run.msmarco-passage.tct_colbert.hnsw.tsv \
-  --output-format msmarco
+  --output-format msmarco \
+  --batch-size 512 --threads 16
 ```
 
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.hnsw.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.hnsw.tsv
+```
 
+Results:
+
+```
 #####################
 MRR @10: 0.3345
 QueriesRanked: 6980
 #####################
+```
+
+And TREC evaluation:
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-passage.tct_colbert.hnsw.tsv \
-  --output runs/run.msmarco-passage.tct_colbert.hnsw.trec
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-passage.tct_colbert.hnsw.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.hnsw.trec
+
+python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.hnsw.trec
+```
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.hnsw.trec
+Results:
 
+```
 map                   	all	0.3411
 recall_1000           	all	0.9618
 ```
@@ -116,29 +137,41 @@ python -m pyserini.search.hybrid \
   sparse --index msmarco-v1-passage \
   fusion --alpha 0.12 \
   run --topics msmarco-passage-dev-subset \
-  --output runs/run.msmarco-passage.tct_colbert.bf.bm25.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.bm25.tsv \
   --output-format msmarco \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 ```
 
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.bf.bm25.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.bm25.tsv
+```
 
+Results:
+
+```
 #####################
 MRR @10: 0.3529
 QueriesRanked: 6980
 #####################
+```
+
+And TREC evaluation:
+
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-passage.tct_colbert.bm25.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.bm25.trec
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-passage.tct_colbert.bf.bm25.tsv \
-  --output runs/run.msmarco-passage.tct_colbert.bf.bm25.trec
+python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.bm25.trec
+```
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.bf.bm25.trec
+Results:
 
+```
 map                   	all	0.3594
 recall_1000           	all	0.9698
 ```
@@ -154,32 +187,44 @@ Hybrid retrieval with dense-sparse representations (with document expansion):
 python -m pyserini.search.hybrid \
   dense --index msmarco-v1-passage.tct_colbert \
   --encoded-queries tct_colbert-msmarco-passage-dev-subset \
-  sparse --index msmarco-v1-passage-d2q-t5 \
+  sparse --index msmarco-v1-passage.d2q-t5 \
   fusion --alpha 0.22 \
   run --topics msmarco-passage-dev-subset \
-  --output runs/run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.d2q-t5.tsv \
   --output-format msmarco \
-  --batch-size 36 --threads 12
+  --batch-size 512 --threads 16
 ```
 
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.d2q-t5.tsv
+```
 
+Results:
+
+```
 #####################
 MRR @10: 0.3647
 QueriesRanked: 6980
 #####################
+```
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-passage.tct_colbert.bf.doc2queryT5.tsv \
-  --output runs/run.msmarco-passage.tct_colbert.bf.doc2queryT5.trec
+And TREC evaluation:
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
-  runs/run.msmarco-passage.tct_colbert.bf.doc2queryT5.trec
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-passage.tct_colbert.d2q-t5.tsv \
+  --output runs/run.msmarco-passage.tct_colbert.d2q-t5.trec
+
+python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset \
+  runs/run.msmarco-passage.tct_colbert.d2q-t5.trec
+```
+
+Results:
 
+```
 map                   	all	0.3711
 recall_1000           	all	0.9751
 ```
@@ -210,7 +255,7 @@ python -m pyserini.search.faiss \
   --encoded-queries tct_colbert-msmarco-doc-dev \
   --output runs/run.msmarco-doc.passage.tct_colbert.txt \
   --output-format msmarco \
-  --batch-size 36 --threads 12 \
+  --batch-size 512 --threads 16 \
   --hits 1000 --max-passage --max-passage-hits 100
 ```
 
@@ -219,10 +264,14 @@ Replace `--encoded-queries` by `--encoder castorini/tct_colbert-msmarco` for on-
 To compute the official metric MRR@100 using the official evaluation scripts:
 
 ```bash
-$ python -m pyserini.eval.msmarco_doc_eval \
-  --judgments msmarco-doc-dev \
-  --run runs/run.msmarco-doc.passage.tct_colbert.txt
+python -m pyserini.eval.msmarco_doc_eval \
+  --judgments msmarco-doc-dev \
+  --run runs/run.msmarco-doc.passage.tct_colbert.txt
+```
 
+Results:
+
+```
 #####################
 MRR @100: 0.3323
 QueriesRanked: 5193
@@ -232,13 +281,17 @@ QueriesRanked: 5193
 To compute additional metrics using `trec_eval`, we first need to convert the run to TREC format:
 
 ```bash
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-doc.passage.tct_colbert.txt \
-  --output runs/run.msmarco-doc.passage.tct_colbert.trec
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-doc.passage.tct_colbert.txt \
+  --output runs/run.msmarco-doc.passage.tct_colbert.trec
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev \
-  runs/run.msmarco-doc.passage.tct_colbert.trec
+python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev \
+  runs/run.msmarco-doc.passage.tct_colbert.trec
+```
+
+Results:
 
+```
 map                   	all	0.3323
 recall_100            	all	0.8664
 ```
@@ -254,9 +307,9 @@ python -m pyserini.search.hybrid \
   sparse --index msmarco-v1-doc-segmented \
   fusion --alpha 0.25 \
   run --topics msmarco-doc-dev \
-  --output runs/run.msmarco-doc.tct_colbert.bf.bm25.tsv \
+  --output runs/run.msmarco-doc.tct_colbert.bm25.tsv \
   --output-format msmarco \
-  --batch-size 36 --threads 12 \
+  --batch-size 512 --threads 16 \
   --hits 1000 --max-passage --max-passage-hits 100
 ```
 
@@ -265,22 +318,34 @@ Replace `--encoded-queries` by `--encoder castorini/tct_colbert-msmarco` for on-
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_doc_eval \
-  --judgments msmarco-doc-dev \
-  --run runs/run.msmarco-doc.tct_colbert.bf.bm25.tsv
+python -m pyserini.eval.msmarco_doc_eval \
+  --judgments msmarco-doc-dev \
+  --run runs/run.msmarco-doc.tct_colbert.bm25.tsv
+```
 
+Results:
+
+```
 #####################
 MRR @100: 0.3701
 QueriesRanked: 5193
 #####################
+```
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-doc.tct_colbert.bf.bm25.tsv \
-  --output runs/run.msmarco-doc.tct_colbert.bf.bm25.trec
+And TREC evaluation:
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev \
-  runs/run.msmarco-doc.tct_colbert.bf.bm25.trec
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-doc.tct_colbert.bm25.tsv \
+  --output runs/run.msmarco-doc.tct_colbert.bm25.trec
+
+python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev \
+  runs/run.msmarco-doc.tct_colbert.bm25.trec
+```
+
+Results:
 
+```
 map                   	all	0.3701
 recall_100            	all	0.9020
 ```
@@ -293,12 +358,12 @@ Dense-sparse hybrid retrieval (with document expansion):
 python -m pyserini.search.hybrid \
   dense --index msmarco-v1-doc.tct_colbert \
   --encoded-queries tct_colbert-msmarco-doc-dev \
-  sparse --index msmarco-v1-doc-segmented-d2q-t5 \
+  sparse --index msmarco-v1-doc-segmented.d2q-t5 \
   fusion --alpha 0.32 \
   run --topics msmarco-doc-dev \
-  --output runs/run.msmarco-doc.tct_colbert.bf.doc2queryT5.tsv \
+  --output runs/run.msmarco-doc.tct_colbert.d2q-t5.tsv \
   --output-format msmarco \
-  --batch-size 36 --threads 12 \
+  --batch-size 512 --threads 16 \
   --hits 1000 --max-passage --max-passage-hits 100
 ```
 
@@ -307,22 +372,34 @@ Replace `--encoded-queries` by `--encoder castorini/tct_colbert-msmarco` for on-
 To evaluate:
 
 ```bash
-$ python -m pyserini.eval.msmarco_doc_eval \
-  --judgments msmarco-doc-dev \
-  --run runs/run.msmarco-doc.tct_colbert.bf.doc2queryT5.tsv
+python -m pyserini.eval.msmarco_doc_eval \
+  --judgments msmarco-doc-dev \
+  --run runs/run.msmarco-doc.tct_colbert.d2q-t5.tsv
+```
+
+Results:
 
+```
 #####################
 MRR @100: 0.3784
 QueriesRanked: 5193
 #####################
+```
 
-$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
-  --input runs/run.msmarco-doc.tct_colbert.bf.doc2queryT5.tsv \
-  --output runs/run.msmarco-doc.tct_colbert.bf.doc2queryT5.trec
+And TREC evaluation:
 
-$ python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev \
-  runs/run.msmarco-doc.tct_colbert.bf.doc2queryT5.trec
+```bash
+python -m pyserini.eval.convert_msmarco_run_to_trec_run \
+  --input runs/run.msmarco-doc.tct_colbert.d2q-t5.tsv \
+  --output runs/run.msmarco-doc.tct_colbert.d2q-t5.trec
 
+python -m pyserini.eval.trec_eval -c -mrecall.100 -mmap msmarco-doc-dev \
+  runs/run.msmarco-doc.tct_colbert.d2q-t5.trec
+```
+
+Results:
+
+```
 map                   	all	0.3784
 recall_100            	all	0.9083
 ```