
Implement highlighting rerankers #1

Merged 5 commits from daemon/highlighter into master on Apr 21, 2020
Conversation

daemon (Member) commented Apr 21, 2020

- Add T5, transformer, and BM25 rerankers
- Add Kaggle dataset and evaluation framework
daemon (Member Author) commented Apr 21, 2020

Log                   Recall@1                Precision@1
results/bert.log      0.06613756613756615     0.09523809523809523
results/biobert.log   0.09312169312169313     0.12698412698412698
results/scibert.log   0.005291005291005291    0.015873015873015872
results/t5.log        0.19365079365079363     0.2857142857142857
results/tfidf.log     0.06931216931216932     0.1111111111111111

daemon added 2 commits April 20, 2020 22:30
- Add missing activation command
- IDF not computed correctly
lintool (Member) commented Apr 21, 2020

@daemon In the ground truth, instead of nesting by category and sub-category, could we use a flat structure with category and sub-category as attributes? That way it'll be easier to move the "blocks" around in case the structure is too restrictive.
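A sketch of the difference (field and category names here are hypothetical):

# Nested: answers live under category -> sub-category
{'Treatment': {'Drugs': [{'question': '...', 'answers': ['...']}]}}

# Flat: category and sub-category become attributes of each entry
[{'question': '...', 'answers': ['...'], 'category': 'Treatment', 'sub_category': 'Drugs'}]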

daemon (Member Author) commented Apr 21, 2020

Noted, I'll keep that in mind.

return SingleEncoderOutput(output.encoder_output[indices], output.token_ids[indices], output.text)


class LongBatchEncoder:
Member: Can we use tokenizer.batch_encode() instead?

Member Author: I should probably write code documentation -- LongBatchEncoder strides across the sequence dimension of a batch of long documents and outputs the encoder representations computed by encoder.
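Roughly, the idea is something like this (a sketch only; the window size, names, and encoder interface are assumptions, not the actual implementation):

import torch

def encode_long(encoder, token_ids: torch.Tensor, window: int = 512, stride: int = 512) -> torch.Tensor:
    # Encode a long token sequence window-by-window along the sequence
    # dimension, then stitch the per-token representations back together.
    chunks = [encoder(token_ids[start:start + window])
              for start in range(0, token_ids.size(0), stride)]
    return torch.cat(chunks, dim=0)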

Member: Ah, I see, thanks for clarifying!

@@ -3,12 +3,15 @@
from pygaggle.rerank import Reranker, Query, Text


__all__ = ['IdentityReranker']


class IdentityReranker(Reranker):
Member: Why do we need this class?

Member Author: I wrote it as a demo. I think it's useful as a stub and for demonstrating the contract?

Member: I would put an example of how to use the class in its docstring and remove any code that is not being used.
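For instance, the docstring snippet could look along these lines (a sketch; the exact Query/Text constructors are assumed from the imports above):

from pygaggle.rerank import IdentityReranker, Query, Text

# IdentityReranker returns the texts unchanged, which demonstrates the
# Reranker contract: rerank(query, texts) -> reranked texts.
query = Query('What is the incubation period of the virus?')
texts = [Text('First passage.'), Text('Second passage.')]
reranked = IdentityReranker().rerank(query, texts)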

lintool (Member) commented Apr 21, 2020

@daemon - Also add the name of the annotator so we can reconcile differences later.

METRIC_MAP = OrderedDict()


class Metric:
Member: Same here, can we use sklearn.metrics or pytrec_eval to compute the metrics? I'm afraid of writing our own metric functions because it is hard to find bugs.

Member Author: Yeah.
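For reference, a minimal pytrec_eval sketch (the qrels and run contents here are made up):

import pytrec_eval

# qrels: query id -> {doc id: relevance}; run: query id -> {doc id: score}
qrels = {'q1': {'d1': 1, 'd2': 0}}
run = {'q1': {'d1': 0.9, 'd2': 0.4}}
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'P_5'})
print(evaluator.evaluate(run))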

query_words_lower = {x.lower() for x in query_words}
sentences_lower = [[w.lower() for w in sent] for sent in sentences]
sentence_sets = list(map(set, sentences_lower))
idfs = {w: math.log(len(sentence_sets) / (1 + sum(int(w in sent) for sent in sentence_sets)))
        for w in query_words_lower}  # closing clause assumed: one IDF per lowercased query term
Member: Shouldn't the IDFs be estimated from the entire corpus?

Member Author: I'm not sure. In this context, is the entire corpus the entire set of sentences? Or the entire set of sets of sentences?

Member: sklearn's TfidfVectorizer and BM25 compute the IDF from the entire corpus, as the term frequencies will be more accurate. In this case, if the document is short, idfs will have poor frequency estimates... Can we use pyserini to compute it? Something like:

# Computes the BM25 score for a particular term in a document
# (index_utils is assumed to be a pyserini index reader opened over the corpus index):
bm25_score = index_utils.compute_bm25_term_weight('FBIS4-67701', analyzed[0])
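And for corpus-level IDF with sklearn, something like this (corpus contents are made up; the fit must see the whole corpus, not a single document):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['first document ...', 'second document ...']
vectorizer = TfidfVectorizer().fit(corpus)
idfs = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))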

class TfIdfReranker(Reranker):
    def __init__(self,
                 word_tokenizer: Callable[[str], List[str]],
                 k1: float = 1.6,
Member: Can we use pyserini to compute the BM25 score?

Member Author: I'll look into it.
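For context, k1 here is BM25's term-frequency saturation parameter; the weight it controls looks like this (a sketch, with b assumed at the common 0.75 default):

def bm25_term_weight(tf: float, idf: float, doc_len: float, avg_doc_len: float,
                     k1: float = 1.6, b: float = 0.75) -> float:
    # Okapi BM25 weight for a single term in a single document.
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))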

- Add option to compute IDF statistics from corpus
daemon (Member Author) commented Apr 21, 2020

With better IDF computation:

Recall@1	0.07195767195767196
Precision@1	0.14285714285714285

rodrigonogueira4 (Member) left a review:

Awesome! Thanks again for implementing all this!

One last thing: could you add some documentation for LongBatchEncoder so it's clearer why it is necessary?

daemon merged commit b2ab0c9 into master Apr 21, 2020
daemon deleted the daemon/highlighter branch April 21, 2020 17:03