From 7caedfc150f916de302297406c45dead27b475ba Mon Sep 17 00:00:00 2001
From: Jimmy Lin
Date: Sat, 2 Jan 2021 11:17:09 -0500
Subject: [PATCH] Updated documentation about pre-built indexes (#288)

---
 docs/prebuilt-indexes.md  | 40 ++++++++++++++++++++++++++++++++++++---
 docs/usage-indexreader.md |  3 +++
 2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/docs/prebuilt-indexes.md b/docs/prebuilt-indexes.md
index a2758f8c6..836b5a282 100644
--- a/docs/prebuilt-indexes.md
+++ b/docs/prebuilt-indexes.md
@@ -1,13 +1,47 @@
 # Pyserini: Prebuilt Indexes
 
 Pre-built Anserini indexes are hosted at the University of Waterloo's [GitLab](https://git.uwaterloo.ca/jimmylin/anserini-indexes) and mirrored on Dropbox.
-The following method will list available pre-built indexes:
+The following methods will list available pre-built indexes:
 
-```
+```python
+from pyserini.search import SimpleSearcher
 SimpleSearcher.list_prebuilt_indexes()
+
+from pyserini.index import IndexReader
+IndexReader.list_prebuilt_indexes()
+```
+
+It's easy to initialize a searcher from a pre-built index:
+
+```python
+searcher = SimpleSearcher.from_prebuilt_index('robust04')
+```
+
+You can use this simple Python one-liner to download the pre-built index:
+
+```
+python -c "from pyserini.search import SimpleSearcher; SimpleSearcher.from_prebuilt_index('robust04')"
 ```
 
-Below is a summary of what's currently available:
+The downloaded index will be in `~/.cache/pyserini/indexes/`.
+
+It's similarly easy to initialize an index reader from a pre-built index:
+
+```python
+index_reader = IndexReader.from_prebuilt_index('robust04')
+index_reader.stats()
+```
+
+The output will be:
+
+```
+{'total_terms': 174540872, 'documents': 528030, 'non_empty_documents': 528030, 'unique_terms': 923436}
+```
+
+Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
+Nope, that's not a bug.
+
+Below is a summary of the pre-built indexes that are currently available.
 
 ## MS MARCO Indexes
 
diff --git a/docs/usage-indexreader.md b/docs/usage-indexreader.md
index af846b09d..72a77b5c7 100644
--- a/docs/usage-indexreader.md
+++ b/docs/usage-indexreader.md
@@ -162,3 +162,6 @@ Output is something like this:
  'non_empty_documents': 528030,
  'unique_terms': 923436}
 ```
+
+Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
+Nope, that's not a bug.
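
Because `unique_terms` comes back as -1 rather than a real count for indexes that were not built with `-optimize`, calling code may want to treat that value as "unknown" instead of using it in arithmetic. A minimal sketch, assuming only the dict shape shown in the `stats()` output in this patch; `unique_terms_or_none` is a hypothetical helper for illustration, not part of Pyserini:

```python
from typing import Optional


def unique_terms_or_none(stats: dict) -> Optional[int]:
    """Return the unique-term count, or None when it is reported as -1.

    Hypothetical helper (not a Pyserini API). A missing key is treated
    the same way as the -1 sentinel.
    """
    value = stats.get('unique_terms', -1)
    return None if value == -1 else value


# Dict copied from the example output documented in this patch.
optimized = {'total_terms': 174540872, 'documents': 528030,
             'non_empty_documents': 528030, 'unique_terms': 923436}

# Assumed stats for an index that was not built with -optimize:
# same fields, but unique_terms carries the -1 sentinel.
unoptimized = dict(optimized, unique_terms=-1)

print(unique_terms_or_none(optimized))
print(unique_terms_or_none(unoptimized))
```

Returning `None` keeps the sentinel from being mistaken for a count, for example when computing per-term averages from `total_terms`.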