-
Notifications
You must be signed in to change notification settings - Fork 386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add get position list function #223
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test case also?
@@ -72,6 +72,15 @@ print(doc_vector) | |||
``` | |||
|
|||
The result is a dictionary where the keys are the analyzed terms and the values are the term frequencies. | |||
|
|||
If you want to know the positions of each term in the document, you can use `get_term_positions`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"of each term" -> "of every term"?
docs/usage-indexreader.md
Outdated
print(term_positions) | ||
print(indexed_doc) | ||
``` | ||
The result is a tuple. The first member is a dictionary where the keys are the analyzed terms and the values are the positions each term occur in the document. The second member is a string containing the recovered document content using the position information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think you can just write Tuple[Dict[str, int], str]
- a Python programmer should be able to interpret type signatures, and it's more concise.
pyserini/index/_base.py
Outdated
------- | ||
Optional[Tuple[Dict[str, int], str]] | ||
A tuple contains a dictionary with analyzed terms as keys and corresponding posting list as values, and a | ||
string representing the recovered document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to return "string representing the recovered document"? What's the use case? Users can easily do this also if they want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the document returned by doc.contents()
is the document before it is processed by Lucene. The reconstructed document will provide a view of the document after Lucene's processing.
Yes. Users can easily do this if they want, and not every user needs this. I think we can move this piece of code to usage-indexreader.md
as an example of how to use this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - can you do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
expose getTermPositions in the IndexReader class. Aside from the term positions mapping, this function also returns a reconstructed document using the mapping. This should help users to the effect of stop words removal and stemming. Currently, the document returned by doc.contents() is the document before it is processed by Lucene. The reconstructed document will provide a view of the document after Lucene's processing.