Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expose document level term positions #1337

Merged
merged 6 commits into from
Aug 4, 2020

Conversation

nsndimt
Copy link
Contributor

@nsndimt nsndimt commented Jul 27, 2020

Currently, get_document_vector only return terms and corresponding frequencies. This pull request implements a new function called getPositionList in which returns terms and, for each term, every position this term occurs in the document.

@codecov
Copy link

codecov bot commented Jul 27, 2020

Codecov Report

Merging #1337 into master will increase coverage by 0.04%.
The diff coverage is 71.42%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1337      +/-   ##
============================================
+ Coverage     51.74%   51.79%   +0.04%     
- Complexity      802      805       +3     
============================================
  Files           154      154              
  Lines          8604     8625      +21     
  Branches       1219     1224       +5     
============================================
+ Hits           4452     4467      +15     
- Misses         3783     3786       +3     
- Partials        369      372       +3     
Impacted Files Coverage Δ Complexity Δ
.../main/java/io/anserini/index/IndexReaderUtils.java 77.40% <71.42%> (-0.68%) 38.00 <3.00> (+3.00) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 62fcb52...48ed904. Read the comment docs.

Copy link
Member

@lintool lintool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case also?

* @throws IOException if error encountered during query
* @throws NotStoredException if the term vector is not stored
*/
public static Map<String, List<Long>> getPositionList(IndexReader reader, String docid) throws IOException, NotStoredException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about getTermPositions?

}

Map<String, List<Long>> term_position = new HashMap<>();
PostingsEnum position_iter = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variables should be camel case?

for ( int i = 0; i < freq; i++ ) {
positions.add(Long.valueOf(position_iter.nextPosition()));
}
term_position.put(te.term().utf8ToString(), positions);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

camel case.

throw new NotStoredException("Document vector not stored!");
}

Map<String, List<Long>> termPosition = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this - do we need to use long? I don't think documents are going to be that looong? How about just List<Integer>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. position_iter.nextPosition() return int. Using long is unnecessary. I have changed the declaration of termPostion accordingly.
I also discovered that Long termFreq = termIter.totalTermFreq(); introduce autoboxing. I have changed Long to long to avoid this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, autoboxing is inefficient, but since you have List<Integer>, you'll have to convert it back into an object anyway...

@lintool
Copy link
Member

lintool commented Aug 4, 2020

Another comment for you...

@lintool lintool merged commit bd8fff8 into castorini:master Aug 4, 2020
@nsndimt nsndimt deleted the add_getPostionList branch August 17, 2020 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants