From 7d18f4bd3f98d4f901dc061ffd93a1c656e32d0d Mon Sep 17 00:00:00 2001
From: Jimmy Lin
Date: Sat, 23 Sep 2023 11:47:16 -0400
Subject: [PATCH] Update onboarding docs per #1637 (#1643)

---
 docs/conceptual-framework2.md | 2 +-
 docs/experiments-nfcorpus.md  | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/conceptual-framework2.md b/docs/conceptual-framework2.md
index bf5541916..177453beb 100644
--- a/docs/conceptual-framework2.md
+++ b/docs/conceptual-framework2.md
@@ -99,7 +99,7 @@ v2 = encoder.encode(doc_text)

 Minor detail here: the encoder is designed to work on batches of input, so the actual vector representation is `v2[0]`.

-We can verify that the vector we generated using the encoder is identical to the vector that is stored in the index by computing the L2 norm (which should be zero):
+We can verify that the vector we generated using the encoder is identical to the vector that is stored in the index by computing the L2 norm (which should be almost zero):

 ```python
 import numpy as np
diff --git a/docs/experiments-nfcorpus.md b/docs/experiments-nfcorpus.md
index dfbd8faaf..350bc9ef5 100644
--- a/docs/experiments-nfcorpus.md
+++ b/docs/experiments-nfcorpus.md
@@ -86,7 +86,8 @@ python -m pyserini.encode \
   encoder --encoder facebook/contriever-msmarco \
           --device cpu \
           --pooling mean \
-          --fields title text
+          --fields title text \
+          --batch 32
 ```

 We're using the [`facebook/contriever-msmarco`](https://huggingface.co/facebook/contriever-msmarco) encoder, which can be found on HuggingFace.
@@ -98,6 +99,7 @@ At search time, each document vector is sequentially compared to the query vecto
 In other words, the library just performs brute force dot products of each query vector against all document vectors.

 The above indexing command takes around 30 minutes to run on a modern laptop, with most of the time occupied by performing neural inference using the CPU.
+Adjust the `batch` parameter above accordingly for your hardware; 32 is the default, but reduce the value if you find that the encoding is taking too long.

 ## Retrieval

@@ -120,6 +122,7 @@ With the flat index here, we're performing brute-force computation of dot produc
 As a result, we are performing _exact_ search, i.e., we are finding the _exact_ top-_k_ documents that have the highest dot products.

 The above retrieval command takes only a few minutes on a modern laptop.
+Adjust the `threads` and `batch` parameters above accordingly for your hardware.

 ## Evaluation
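
Reviewer note (not part of the patch): the context lines in the last hunk describe exact search as brute-force dot products of a query vector against every document vector, keeping the top-_k_. A minimal sketch of that idea in plain NumPy follows. The function name `brute_force_top_k` and the toy vectors are hypothetical and only for illustration; this is not how Pyserini or Faiss implement flat-index search.

```python
import numpy as np


def brute_force_top_k(query_vec, doc_vecs, k):
    """Return (indices, scores) of the k documents with the highest dot product.

    Hypothetical helper for illustration only; not a Pyserini or Faiss API.
    """
    # One dot product per document: (num_docs, dim) @ (dim,) -> (num_docs,)
    scores = doc_vecs @ query_vec
    # Sort by descending score and keep the first k positions.
    top = np.argsort(-scores, kind="stable")[:k]
    return top, scores[top]


# Toy, made-up vectors (not real NFCorpus embeddings).
doc_vecs = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]], dtype=np.float32
)
query_vec = np.array([1.0, 0.5], dtype=np.float32)

indices, scores = brute_force_top_k(query_vec, doc_vecs, k=2)
print(indices, scores)
```

Because every document is scored, the result is exact rather than approximate; the cost grows linearly with the number of documents, which is why the batching and threading knobs mentioned in the patch matter on slower hardware.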