-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
expose document level term positions #1337
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1337 +/- ##
============================================
+ Coverage 51.74% 51.79% +0.04%
- Complexity 802 805 +3
============================================
Files 154 154
Lines 8604 8625 +21
Branches 1219 1224 +5
============================================
+ Hits 4452 4467 +15
- Misses 3783 3786 +3
- Partials 369 372 +3
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test case also?
* @throws IOException if error encountered during query | ||
* @throws NotStoredException if the term vector is not stored | ||
*/ | ||
public static Map<String, List<Long>> getPositionList(IndexReader reader, String docid) throws IOException, NotStoredException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about getTermPositions
?
} | ||
|
||
Map<String, List<Long>> term_position = new HashMap<>(); | ||
PostingsEnum position_iter = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variables should be camel case?
for ( int i = 0; i < freq; i++ ) { | ||
positions.add(Long.valueOf(position_iter.nextPosition())); | ||
} | ||
term_position.put(te.term().utf8ToString(), positions); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
camel case.
throw new NotStoredException("Document vector not stored!"); | ||
} | ||
|
||
Map<String, List<Long>> termPosition = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this - do we need to use long? I don't think documents are going to be that looong? How about just List<Integer>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. position_iter.nextPosition()
return int. Using long is unnecessary. I have changed the declaration of termPostion
accordingly.
I also discovered that Long termFreq = termIter.totalTermFreq();
introduce autoboxing. I have changed Long
to long
to avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, autoboxing is inefficient, but since you have List<Integer>
, you'll have to convert it back into an object anyway...
Another comment for you... |
Currently, get_document_vector only return terms and corresponding frequencies. This pull request implements a new function called getPositionList in which returns terms and, for each term, every position this term occurs in the document.