Implement highlighting rerankers #1
Conversation
daemon
commented
Apr 21, 2020
- Add T5, transformer, and BM25 rerankers
- Add Kaggle dataset and evaluation framework
- Add missing activation command
- IDF not computed correctly
@daemon In the ground truth, instead of a nested structure of category and sub-category, could we have a flat structure with category and sub-category as attributes? That way, it'll be easier to move the "blocks" around in case the structure is too restrictive.
I'll remember it.
return SingleEncoderOutput(output.encoder_output[indices], output.token_ids[indices], output.text)


class LongBatchEncoder:
Can we use tokenizer.batch_encode() instead?
I should probably write code documentation -- LongBatchEncoder strides across the sequence dimension of a batch of long documents and outputs the encoder representations computed by encoder.
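As a rough sketch of that striding behavior (a simplified stand-in, not the actual LongBatchEncoder code; the function name and the window/stride defaults here are hypothetical):

```python
def stride_windows(token_ids, window_size=512, stride=256):
    """Split a long token sequence into overlapping windows.

    Each window starts `stride` tokens after the previous one, so an
    encoder with a fixed maximum input length can cover the whole
    document; the per-window encoder outputs can then be concatenated
    or pooled downstream.
    """
    windows = []
    start = 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + window_size])
        if start + window_size >= len(token_ids):
            break
        start += stride
    return windows
```

For example, a 1000-token document with a 512-token window and a 256-token stride yields three windows, starting at offsets 0, 256, and 512.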
Ah, I see, thanks for clarifying!
pygaggle/rerank/identity.py
Outdated
@@ -3,12 +3,15 @@
from pygaggle.rerank import Reranker, Query, Text


__all__ = ['IdentityReranker']


class IdentityReranker(Reranker):
Why do we need this class?
I wrote it as a demo. I think it's useful as a stub and for demonstrating the contract?
I would put an example of how to use the class in its docstring and remove any code that is not being used.
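Something along these lines, perhaps (the Query/Text stubs below are simplified stand-ins for the real pygaggle.rerank classes, just to make the docstring example self-contained):

```python
from typing import List


class Query:
    """Minimal stand-in for pygaggle's Query."""
    def __init__(self, text: str):
        self.text = text


class Text:
    """Minimal stand-in for pygaggle's Text."""
    def __init__(self, text: str, score: float = 0.0):
        self.text = text
        self.score = score


class IdentityReranker:
    """A no-op reranker that returns the texts in their original order.

    Useful as a baseline and for demonstrating the Reranker contract.

    Example:
        >>> query = Query('vaccine trials')
        >>> texts = [Text('doc one'), Text('doc two')]
        >>> reranked = IdentityReranker().rerank(query, texts)
        >>> [t.text for t in reranked]
        ['doc one', 'doc two']
    """
    def rerank(self, query: Query, texts: List[Text]) -> List[Text]:
        # Return a shallow copy so callers can reorder the result
        # without mutating the input list.
        return list(texts)
```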
@daemon - Also add the name of the annotator so we can reconcile differences later.
pygaggle/model/evaluate.py
Outdated
METRIC_MAP = OrderedDict()


class Metric:
Same here -- can we use sklearn.metrics or pytrec_eval to compute the metrics? I'm wary of writing our own metric functions because it's hard to find bugs in them.
Yeah.
pygaggle/rerank/tfidf.py
Outdated
query_words_lower = {x.lower() for x in query_words}
sentences_lower = [[w.lower() for w in sent] for sent in sentences]
sentence_sets = list(map(set, sentences_lower))
idfs = {w: math.log(len(sentence_sets) / (1 + sum(int(w in sent) for sent in sentence_sets)))
Shouldn't the IDFs be estimated from the entire corpus?
I'm not sure. In this context, is the entire corpus the entire set of sentences? Or the entire set of sets of sentences?
sklearn's TfidfVectorizer and BM25 compute the IDF from the entire corpus, where the document-frequency estimates are more accurate. Here, if the document is short, idfs will have poor frequency estimates... Can we use pyserini to compute it? Something like:
# Computes the BM25 score for a particular term in a document:
bm25_score = index_utils.compute_bm25_term_weight('FBIS4-67701', analyzed[0])
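For comparison, here is a minimal sketch of corpus-level IDF using the same smoothed formula as the snippet under review, but with document frequencies counted over a whole corpus rather than one document's sentences (the function and argument names are hypothetical):

```python
import math


def corpus_idfs(corpus_docs, terms):
    """Smoothed IDF for `terms` over an entire corpus.

    corpus_docs: list of tokenized documents (lists of lowercased words).
    Uses the same log(N / (1 + df)) smoothing as the per-document
    version, but df is counted across all documents in the corpus.
    """
    doc_sets = [set(doc) for doc in corpus_docs]
    n = len(doc_sets)
    return {t: math.log(n / (1 + sum(t in d for d in doc_sets)))
            for t in terms}
```

With a larger corpus, rare terms get higher weights and ubiquitous terms drop toward (or below) zero, which is the behavior the per-document estimate cannot capture.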
pygaggle/rerank/tfidf.py
Outdated
class TfIdfReranker(Reranker):
    def __init__(self,
                 word_tokenizer: Callable[[str], List[str]],
                 k1: float = 1.6,
Can we use pyserini to compute the BM25 score?
I'll look into it.
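For context on what the k1 default in the constructor controls: a minimal sketch of the textbook BM25 term weight (the standard Robertson-style formulation, not pyserini's exact implementation; function and parameter names are illustrative):

```python
import math


def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len,
                     k1=1.6, b=0.75):
    """Textbook BM25 weight for one term in one document.

    tf: term frequency in the document; df: number of documents
    containing the term; n_docs: corpus size. k1 controls term-
    frequency saturation and b controls document-length normalization.
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

The weight grows with tf but saturates at roughly (k1 + 1) times the IDF, which is why k1 matters even when the IDF estimates are fixed.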
- Add option to compute IDF statistics from corpus
With better IDF computation:
Awesome! Thanks again for implementing all this!
One last thing: could you add some documentation for LongBatchEncoder so it's clearer why it is necessary?