Add get position list function #223

Merged on Aug 6, 2020 (7 commits)

Conversation

Contributor

@nsndimt nsndimt commented Jul 27, 2020

Expose `getTermPositions` in the `IndexReader` class. Aside from the term positions mapping, this function also returns a document reconstructed from the mapping. This should help users see the effect of stop-word removal and stemming. Currently, the document returned by `doc.contents()` is the document before it is processed by Lucene. The reconstructed document provides a view of the document after Lucene's processing.

Member

@lintool lintool left a comment

Can we add a test case also?

@@ -72,6 +72,15 @@ print(doc_vector)
```

The result is a dictionary where the keys are the analyzed terms and the values are the term frequencies.

If you want to know the positions of each term in the document, you can use `get_term_positions`:
Member

"of each term" -> "of every term"?

print(term_positions)
print(indexed_doc)
```
The result is a tuple. The first member is a dictionary where the keys are the analyzed terms and the values are the positions at which each term occurs in the document. The second member is a string containing the document content recovered from the position information.
Member

Here I think you can just write Tuple[Dict[str, int], str] - a Python programmer should be able to interpret type signatures, and it's more concise.

-------
Optional[Tuple[Dict[str, int], str]]
A tuple containing a dictionary with analyzed terms as keys and the corresponding posting lists as values, and a
string representing the recovered document
Member

Why do you want to return "string representing the recovered document"? What's the use case? Users can easily do this also if they want?

Contributor Author

Currently, the document returned by doc.contents() is the document before it is processed by Lucene. The reconstructed document will provide a view of the document after Lucene's processing.
Yes. Users can easily do this if they want, and not every user needs this. I think we can move this piece of code to usage-indexreader.md as an example of how to use this function.
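
For reference, a minimal sketch of what that example could look like, assuming the function returns a plain mapping from each analyzed term to a list of integer positions. The helper name `reconstruct_document` and the sample mapping are hypothetical and hand-written for illustration; they are not output from a real index and not necessarily the code that ended up in usage-indexreader.md.

```python
from typing import Dict, List


def reconstruct_document(term_positions: Dict[str, List[int]]) -> str:
    """Rebuild the analyzed token sequence from a term -> positions mapping.

    Hypothetical helper: the input shape is an assumption about what
    get_term_positions returns, not a confirmed signature.
    """
    # Flatten the mapping into (position, term) pairs.
    pairs = [
        (position, term)
        for term, positions in term_positions.items()
        for position in positions
    ]
    # Sort by position so the terms come back in document order.
    pairs.sort(key=lambda pair: pair[0])
    return " ".join(term for _, term in pairs)


# Hand-made sample, roughly what stop-word removal and stemming might leave
# behind for "the quick brown fox jumps over the lazy dog".
sample_positions = {
    "quick": [1],
    "brown": [2],
    "fox": [3],
    "jump": [4],
    "over": [5],
    "lazi": [7],
    "dog": [8],
}

print(reconstruct_document(sample_positions))
# prints: quick brown fox jump over lazi dog
```

The sketch simply drops positions that have no term, so gaps left by removed stop words are not visible in the output string.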

Member

Agreed - can you do that?

Contributor Author

done

@lintool lintool merged commit 49fd7cb into castorini:master Aug 6, 2020
@nsndimt nsndimt deleted the add_get_position_list branch November 16, 2020 18:59
2 participants