Add get position list function #223

nsndimt · 2020-07-27T23:00:57Z

Expose getTermPositions in the IndexReader class. Aside from the term positions mapping, this function also returns a document reconstructed from that mapping. This should help users see the effect of stop word removal and stemming. Currently, the document returned by doc.contents() is the document before it is processed by Lucene. The reconstructed document provides a view of the document after Lucene's processing.


lintool

Can we add a test case also?

lintool · 2020-08-05T16:33:50Z

docs/usage-indexreader.md

@@ -72,6 +72,15 @@ print(doc_vector)
 ```

 The result is a dictionary where the keys are the analyzed terms and the values are the term frequencies.
+
+If you want to know the positions of each term in the document, you can use `get_term_positions`:


"of each term" -> "of every term"?

lintool · 2020-08-05T16:34:39Z

docs/usage-indexreader.md

+print(term_positions)
+print(indexed_doc)
+```
+The result is a tuple. The first member is a dictionary where the keys are the analyzed terms and the values are the positions each term occur in the document. The second member is a string containing the recovered document content using the position information.


Here I think you can just write `Tuple[Dict[str, int], str]` - a Python programmer should be able to interpret type signatures, and it's more concise.

lintool · 2020-08-05T16:36:48Z

pyserini/index/_base.py

+        -------
+        Optional[Tuple[Dict[str, int], str]]
+            A tuple contains a dictionary with analyzed terms as keys and corresponding posting list as values, and a
+            string representing the recovered document


Why do you want to return "string representing the recovered document"? What's the use case? Users can easily do this also if they want?

Currently, the document returned by doc.contents() is the document before it is processed by Lucene. The reconstructed document will provide a view of the document after Lucene's processing.
Yes. Users can easily do this if they want, and not every user needs this. I think we can move this piece of code to usage-indexreader.md as an example of how to use this function.

Agreed - can you do that?
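
For reference, the reconstruction discussed in this thread can be sketched in a few lines of plain Python. This is a minimal sketch, not the code from the PR: it assumes the mapping maps each analyzed term to a list of integer positions, and the helper name `reconstruct_document` and the sample data are hypothetical.

```python
from typing import Dict, List


def reconstruct_document(term_positions: Dict[str, List[int]]) -> str:
    """Rebuild the analyzed token stream from a term -> positions mapping."""
    # Flatten the mapping into (position, term) pairs.
    pairs = [(pos, term)
             for term, positions in term_positions.items()
             for pos in positions]
    # Sorting by position restores the order in which tokens were indexed.
    pairs.sort()
    return ' '.join(term for _, term in pairs)


# Hypothetical sample data: positions dropped by the analyzer
# (for example, stop words) simply do not appear in the mapping.
sample = {'quick': [1], 'fox': [3, 7], 'jump': [4], 'lazi': [6]}
print(reconstruct_document(sample))
```

Because removed tokens leave gaps in the position sequence, the joined string shows only the terms that survived analysis, which is the "after Lucene's processing" view described above.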

nsndimt added 5 commits June 21, 2020 10:05

add get_document_posting and reorganize_postings

74af29e

Merge branch 'master' into add_getDocumentPostings

4fee720

merge two close related functions

ee3469a

improve variable and function naming

ef672c2

update function name; add introduction to this new function in the do…

934077d

…cument;

lintool reviewed Aug 5, 2020


nsndimt added 2 commits August 6, 2020 10:21

no longer return reconstructed document; add test case;

97cdb7f

fix error in comments; update document;

8289e57

lintool approved these changes Aug 6, 2020


lintool merged commit 49fd7cb into castorini:master Aug 6, 2020

nsndimt deleted the add_get_position_list branch November 16, 2020 18:59
