Add msmarco v2 document segmentation script #706

Merged: 32 commits, Jul 16, 2021
Commits
5a66347
add tasb msmarco dev subset reproduce
May 26, 2021
066c3d5
resolve version comment
May 27, 2021
397280e
manually resolve conflict
Jun 5, 2021
8b74b0a
initialize tct-colbert-v2 doc
Jun 8, 2021
0ef7d58
fix alpha for doct5query fusion
Jun 8, 2021
044eabb
add baseline
justram Jun 8, 2021
437ad41
Merge branch 'master' of github.com:jacklin64/pyserini
justram Jun 8, 2021
9f56a36
add tct_colbert-v2 integration test
justram Jun 8, 2021
2457f6d
add distilbert_tasb integration
Jun 8, 2021
5d13d11
fix typo
justram Jun 8, 2021
0be15ea
add baseline exp
justram Jun 8, 2021
4bd4231
Merge branch 'master' of github.com:jacklin64/pyserini
justram Jun 8, 2021
9f3729f
rearrange
justram Jun 8, 2021
2936de6
rearrange tct-v2 exp order
justram Jun 8, 2021
555c87b
resolve conflict
justram Jun 8, 2021
f6180a0
fix function name
Jun 8, 2021
853cba9
Delete test_distilbert_tasb.py
jacklin64 Jun 8, 2021
cbd1c53
Delete test_tct_colbert-v2.py
jacklin64 Jun 8, 2021
2b24168
clarify the results in the table
Jun 8, 2021
719de11
add tasb and tct-v2 integration
Jun 10, 2021
7737c26
Merge branch 'castorini:master' into master
jacklin64 Jun 10, 2021
557dd2b
Merge branch 'castorini:master' into master
justram Jun 24, 2021
975e4b6
add tct doc encoding
Jun 26, 2021
e75aa36
resolve conflict
justram Jul 2, 2021
b14f182
Merge branch 'castorini-master'
justram Jul 2, 2021
53ee49c
sync to master
Jul 16, 2021
df2659b
Merge pull request #3 from castorini/master
jacklin64 Jul 16, 2021
e3de279
add msmarco v2 document segmentation
Jul 16, 2021
196ee31
Merge branch 'master' of github.com:jacklin64/pyserini
Jul 16, 2021
df0ac90
rename and add boilerplate header.
Jul 16, 2021
5b2cccb
delete redundant file
Jul 16, 2021
ba5a40b
fix description and help
Jul 16, 2021
rearrange
justram committed Jun 8, 2021
commit 9f3729f0c3d5bfe2a81de6f1f3c88c908cd643a1
79 changes: 40 additions & 39 deletions docs/experiments-tct_colbert-v2.md
@@ -25,29 +25,27 @@ Summary of results:

Here we notice slight difference between our paper (TF) and reproduction (PT).

## TCT_ColBERT-V2-HN+ Reproduction

## TCT_ColBERT-V2 Reproduction
Dense retrieval with TCT-ColBERT, brute-force index:

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--index msmarco-passage-tct_colbert-v2-hnp-bf \
--index msmarco-passage-tct_colbert-v2-bf \
--encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
--batch-size 36 \
--threads 12 \
--output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv \
--output runs/run.msmarco-passage.tct_colbert-v2.bf.tsv \
--output-format msmarco
```

Note that to ensure maximum reproducibility, by default Pyserini uses pre-computed query representations that are automatically downloaded.
As an alternative, to perform "on-the-fly" query encoding, see additional instructions below.

To evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2.bf.tsv
#####################
MRR @10: 0.3584
MRR @10: 0.3439
QueriesRanked: 6980
#####################
```
@@ -56,76 +54,79 @@ We can also use the official TREC evaluation tool `trec_eval` to compute other m
For that we first need to convert runs and qrels files to the TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.trec
map all 0.3645
recall_1000 all 0.9695
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.tct_colbert-v2.bf.tsv --output runs/run.msmarco-passage.tct_colbert-v2.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2.bf.trec
map all 0.3509
recall_1000 all 0.9670
```

To perform on-the-fly query encoding with our [pretrained encoder model](https://huggingface.co/castorini/tct_colbert-msmarco/tree/main) use the option `--encoder castorini/tct_colbert-v2-hnp-msmarco`.
Query encoding will run on the CPU by default.
To perform query encoding on the GPU, use the option `--device cuda:0`.


Follow the same instructions above to perform on-the-fly query encoding.
The caveat about minor differences in score applies here as well.

## TCT_ColBERT-V2 Reproduction
## TCT_ColBERT-V2-HN Reproduction

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--index msmarco-passage-tct_colbert-v2-bf \
--encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
--index msmarco-passage-tct_colbert-v2-hn-bf \
--encoded-queries tct_colbert-v2-hn-msmarco-passage-dev-subset \
--batch-size 36 \
--threads 12 \
--output runs/run.msmarco-passage.tct_colbert-v2.bf.tsv \
--output runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv \
--output-format msmarco
```

To evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2.bf.tsv
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv
#####################
MRR @10: 0.3439
MRR @10: 0.3542
QueriesRanked: 6980
#####################
```

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.tct_colbert-v2.bf.tsv --output runs/run.msmarco-passage.tct_colbert-v2.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2.bf.trec
map all 0.3509
recall_1000 all 0.9670
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv --output runs/run.msmarco-passage.tct_colbert-v2-hn.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hn.bf.trec
map all 0.3608
recall_1000 all 0.9708
```

## TCT_ColBERT-V2-HN Reproduction
## TCT_ColBERT-V2-HN+ Reproduction

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--index msmarco-passage-tct_colbert-v2-hn-bf \
--encoded-queries tct_colbert-v2-hn-msmarco-passage-dev-subset \
--index msmarco-passage-tct_colbert-v2-hnp-bf \
--encoded-queries tct_colbert-v2-hnp-msmarco-passage-dev-subset \
--batch-size 36 \
--threads 12 \
--output runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv \
--output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv \
--output-format msmarco
```

To evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv
#####################
MRR @10: 0.3542
MRR @10: 0.3584
QueriesRanked: 6980
#####################
```

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.tct_colbert-v2-hn.bf.tsv --output runs/run.msmarco-passage.tct_colbert-v2-hn.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hn.bf.trec
map all 0.3608
recall_1000 all 0.9708
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.tsv --output runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.tct_colbert-v2-hnp.bf.trec
map all 0.3645
recall_1000 all 0.9695
```

To perform on-the-fly query encoding with our [pretrained encoder model](https://huggingface.co/castorini/tct_colbert-msmarco/tree/main) use the option `--encoder castorini/tct_colbert-v2-hnp-msmarco`.
Query encoding will run on the CPU by default.
To perform query encoding on the GPU, use the option `--device cuda:0`.


Follow the same instructions above to perform on-the-fly query encoding.
The caveat about minor differences in score applies here as well.
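
For readers unfamiliar with the `MRR @10` figure reported by the evaluation commands in this document, the metric can be sketched in a few lines of Python. This is an illustrative sketch only, not the code behind `pyserini.eval.msmarco_passage_eval`: the function names `mrr_at_k` and `parse_msmarco_run`, the in-memory data layout, and the toy data are all assumptions made for the example. It also assumes the `msmarco` output format is one tab-separated `qid`, `docid`, `rank` triple per line, and that the mean is taken over the queries in the qrels; the official script may differ in such details.

```python
from collections import defaultdict


def parse_msmarco_run(lines):
    """Parse 'qid<TAB>docid<TAB>rank' lines into {qid: [docid, ...]}, best rank first."""
    by_query = defaultdict(list)
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        by_query[qid].append((int(rank), docid))
    return {qid: [docid for _, docid in sorted(hits)] for qid, hits in by_query.items()}


def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank of the first relevant document within the top k.

    run:   dict mapping query id -> list of doc ids, best first
    qrels: dict mapping query id -> set of relevant doc ids
    A query with no relevant document in its top k contributes 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0


# Toy data (hypothetical): q1's first relevant hit is at rank 2, q2's at rank 1.
run = parse_msmarco_run(["q1\td3\t1", "q1\td7\t2", "q2\td5\t1", "q2\td9\t2"])
qrels = {"q1": {"d7"}, "q2": {"d5"}}
print(mrr_at_k(run, qrels))  # (1/2 + 1/1) / 2 = 0.75
```

Only the first relevant hit per query counts, so a relevant document at rank 11 or deeper contributes nothing to MRR@10 even though it still counts toward `recall_1000`.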



### Hybrid Dense-Sparse Retrieval
