
Implement highlighting rerankers #1

Merged 5 commits from daemon/highlighter into master on Apr 21, 2020
Conversation

daemon (Member) commented Apr 21, 2020

- Add T5, transformer, and BM25 rerankers
- Add Kaggle dataset and evaluation framework
daemon (Member Author) commented Apr 21, 2020

Log                   Recall@1                Precision@1
results/bert.log      0.06613756613756615     0.09523809523809523
results/biobert.log   0.09312169312169313     0.12698412698412698
results/scibert.log   0.005291005291005291    0.015873015873015872
results/t5.log        0.19365079365079363     0.2857142857142857
results/tfidf.log     0.06931216931216932     0.1111111111111111

daemon added 2 commits April 20, 2020 22:30
- Add missing activation command
- IDF not computed correctly
lintool (Member) commented Apr 21, 2020

@daemon In the ground truth, instead of nesting by category and sub-category, could we use a flat structure with category and sub-category as attributes? That way it'll be easier to move the "blocks" around in case the structure is too restrictive.
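A sketch of the difference (field and category names here are hypothetical):

# Nested: answers live under category -> sub-category
{'Treatment': {'Drugs': [{'question': '...', 'answers': ['...']}]}}

# Flat: category and sub-category become attributes of each entry
[{'question': '...', 'answers': ['...'], 'category': 'Treatment', 'sub_category': 'Drugs'}]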

daemon (Member Author) commented Apr 21, 2020

Noted, I'll keep that in mind.

return SingleEncoderOutput(output.encoder_output[indices], output.token_ids[indices], output.text)


class LongBatchEncoder:
Member: Can we use tokenizer.batch_encode() instead?

Member Author: I should probably write code documentation -- LongBatchEncoder strides across the sequence dimension of a batch of long documents and outputs the encoder representations computed by encoder.
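Roughly, the idea is something like this (a sketch only; the window size, names, and encoder interface are assumptions, not the actual implementation):

import torch

def encode_long(encoder, token_ids: torch.Tensor, window: int = 512, stride: int = 512) -> torch.Tensor:
    # Encode a long token sequence window-by-window along the sequence
    # dimension, then stitch the per-token representations back together.
    chunks = [encoder(token_ids[start:start + window])
              for start in range(0, token_ids.size(0), stride)]
    return torch.cat(chunks, dim=0)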

Member: Ah, I see, thanks for clarifying!

@@ -3,12 +3,15 @@
from pygaggle.rerank import Reranker, Query, Text


__all__ = ['IdentityReranker']


class IdentityReranker(Reranker):
Member: Why do we need this class?

Member Author: I wrote it as a demo. I think it's useful as a stub and for demonstrating the contract?

Member: I would put an example of how to use the class in its docstring and remove any code that is not being used.
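For instance, the docstring snippet could look along these lines (a sketch; the exact Query/Text constructors are assumed from the imports above):

from pygaggle.rerank import IdentityReranker, Query, Text

# IdentityReranker returns the texts unchanged, which demonstrates the
# Reranker contract: rerank(query, texts) -> reranked texts.
query = Query('What is the incubation period of the virus?')
texts = [Text('First passage.'), Text('Second passage.')]
reranked = IdentityReranker().rerank(query, texts)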

lintool (Member) commented Apr 21, 2020

@daemon - Also add the name of the annotator so we can reconcile differences later.

METRIC_MAP = OrderedDict()


class Metric:
Member: Same here, can we use sklearn.metrics or pytrec_eval to compute the metrics? I'm afraid of writing our own metric functions because it is hard to find bugs.

Member Author: Yeah.
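For reference, a minimal pytrec_eval sketch (the qrels and run contents here are made up):

import pytrec_eval

# qrels: query id -> {doc id: relevance}; run: query id -> {doc id: score}
qrels = {'q1': {'d1': 1, 'd2': 0}}
run = {'q1': {'d1': 0.9, 'd2': 0.4}}
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'P_5'})
print(evaluator.evaluate(run))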

query_words_lower = {x.lower() for x in query_words}
sentences_lower = [[w.lower() for w in sent] for sent in sentences]
sentence_sets = list(map(set, sentences_lower))
idfs = {w: math.log(len(sentence_sets) / (1 + sum(int(w in sent) for sent in sentence_sets)))
        for w in query_words_lower}  # closing clause assumed: one IDF per lowercased query term
Member: Shouldn't the IDFs be estimated from the entire corpus?

Member Author: I'm not sure. In this context, is the entire corpus the entire set of sentences? Or the entire set of sets of sentences?

Member: sklearn's TfidfVectorizer and BM25 compute the IDF from the entire corpus, as the term frequencies will be more accurate. In this case, if the document is short, idfs will have poor frequency estimates... Can we use pyserini to compute it? Something like:

# Computes the BM25 score for a particular term in a document
# (index_utils is assumed to be a pyserini index reader opened over the corpus index):
bm25_score = index_utils.compute_bm25_term_weight('FBIS4-67701', analyzed[0])
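And for corpus-level IDF with sklearn, something like this (corpus contents are made up; the fit must see the whole corpus, not a single document):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['first document ...', 'second document ...']
vectorizer = TfidfVectorizer().fit(corpus)
idfs = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))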

class TfIdfReranker(Reranker):
    def __init__(self,
                 word_tokenizer: Callable[[str], List[str]],
                 k1: float = 1.6,
Member: Can we use pyserini to compute the BM25 score?

Member Author: I'll look into it.
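For context, k1 here is BM25's term-frequency saturation parameter; the weight it controls looks like this (a sketch, with b assumed at the common 0.75 default):

def bm25_term_weight(tf: float, idf: float, doc_len: float, avg_doc_len: float,
                     k1: float = 1.6, b: float = 0.75) -> float:
    # Okapi BM25 weight for a single term in a single document.
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))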

- Add option to compute IDF statistics from corpus
daemon (Member Author) commented Apr 21, 2020

With better IDF computation:

Recall@1	0.07195767195767196
Precision@1	0.14285714285714285

rodrigonogueira4 (Member) left a review:

Awesome! Thanks again for implementing all this!

One last thing: could you add some documentation for LongBatchEncoder so it's clearer why it is necessary?

daemon merged commit b2ab0c9 into master Apr 21, 2020
daemon deleted the daemon/highlighter branch April 21, 2020 17:03