
TREC DL and LLM AggreFact experiments for relevance benchmark + prompts comparisons and groundedness vs Bespoke Minicheck 7B #1660

Merged: 23 commits into main from trec-exp-adding-rubric, Dec 10, 2024

Conversation

sfc-gh-dhuang (Contributor) commented Nov 20, 2024

Description

Contains 2 main notebooks:

  1. trec_dl_passage_relevance_judges.ipynb: experiment code for passage-level agreement on TREC DL, comparing the Cortex team's hand-crafted prompt, the UMBRELA prompt, and a categorical prompt for search-relevance LLM judges (see the first sketch below this list).

Doc with details: https://docs.google.com/document/d/1lwNC11S3LnNHglDNzcJNgPsd7zpOGNwTNktn8pI3bUc/edit?tab=t.nnid3pbdfr33

  2. llm_aggrefact_groundedness_benchmark.ipynb: experiment code for the LLM-AggreFact benchmark on fact-checking against grounding documents, comparing Bespoke Minicheck-7B vs trulens groundedness (gpt-4o) (see the second sketch below).
    Doc: https://docs.google.com/document/d/1oMWDlk6JK6HP3IJQ_pjqB5KlHMtDHUKNoi5uTyWEYhk/edit?disco=AAABZ2Ar0vc
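For readers skimming the PR, here is a minimal sketch of what the passage-level agreement computation in the first notebook looks like in spirit. This is not the notebook's code: the dataset slug is just one of several TREC DL options, and llm_judge_score is a hypothetical stand-in for a judge built from any of the three prompts.

```python
# Sketch: chance-corrected agreement between an LLM relevance judge and TREC DL
# qrels. Assumes ir_datasets and scikit-learn; llm_judge_score is hypothetical.
import ir_datasets
from sklearn.metrics import cohen_kappa_score

def llm_judge_score(query: str, passage: str) -> int:
    """Hypothetical judge: returns a graded relevance label in {0, 1, 2, 3}."""
    raise NotImplementedError

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docstore = dataset.docs_store()

human, judge = [], []
for qrel in dataset.qrels_iter():
    human.append(qrel.relevance)
    judge.append(llm_judge_score(queries[qrel.query_id],
                                 docstore.get(qrel.doc_id).text))

# Cohen's kappa is one common passage-level agreement statistic.
print(cohen_kappa_score(human, judge))
```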

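And a companion sketch for the second notebook's style of evaluation. LLM-AggreFact is a binary fact-checking benchmark whose headline metric is balanced accuracy; groundedness_score, the row shape, and the 0.5 threshold below are illustrative assumptions, standing in for either trulens groundedness (gpt-4o) or Bespoke Minicheck-7B.

```python
# Sketch: benchmarking a groundedness scorer against binary supported/unsupported
# labels, LLM-AggreFact style. groundedness_score is a hypothetical stand-in.
from sklearn.metrics import balanced_accuracy_score

def groundedness_score(document: str, claim: str) -> float:
    """Hypothetical scorer: degree to which the claim is supported, in [0, 1]."""
    raise NotImplementedError

def evaluate(rows: list[dict], threshold: float = 0.5) -> float:
    """rows: [{"doc": ..., "claim": ..., "label": 0 or 1}, ...] (shape assumed)."""
    y_true = [r["label"] for r in rows]
    y_pred = [int(groundedness_score(r["doc"], r["claim"]) >= threshold)
              for r in rows]
    return balanced_accuracy_score(y_true, y_pred)
```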

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • New Tests
  • This change includes re-generated golden test results
  • This change requires a documentation update

Important

Add functions for generating balanced datasets and TREC DL passage benchmarks, and enhance feedback handling to support additional data types.

  • New Functions:
    • generate_balanced_llm_aggrefact_benchmark and generate_trec_dl_passage_benchmark in dataset_preprocessing.py for creating balanced datasets and TREC DL passage benchmarks (see the sketch right after this list).
  • Feedback Handling:
    • Modified run() in feedback.py to accept dict as a valid feedback function output type.
    • Changed the return value of run_score_generation_on_single_row() in tru_benchmark_experiment.py to float('nan') for invalid results (see the second sketch below).
  • Misc:
    • Added imports for defaultdict and ir_datasets in dataset_preprocessing.py.
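As a rough illustration of what the balanced-dataset helper does: the actual signature and behavior of generate_balanced_llm_aggrefact_benchmark live in dataset_preprocessing.py, and this sketch only shows the downsample-to-rarest-class idea using the newly imported defaultdict.

```python
# Sketch: label-balanced downsampling with defaultdict. Signature is illustrative.
import random
from collections import defaultdict

def generate_balanced_benchmark(rows, label_key="label", seed=42):
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    # Downsample every class to the size of the rarest one.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = [row for group in by_label.values() for row in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced
```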

This description was created by Ellipsis for 35432e7. It will automatically update as commits are pushed.
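The NaN fallback mentioned under Feedback Handling above matters because float('nan') is skipped by common NaN-aware aggregations (e.g., a pandas mean) rather than dragging the average down the way a 0 or -1 sentinel would. A simplified sketch of the shape; the real function in tru_benchmark_experiment.py takes different arguments, and the "score" key for dict outputs is an assumption.

```python
# Simplified sketch: invalid or malformed feedback results map to float('nan')
# so downstream NaN-aware aggregation ignores them. Not the file's actual body.
def run_score_generation_on_single_row(feedback_fn, row) -> float:
    try:
        result = feedback_fn(row)
        # dict outputs are now accepted by feedback run(); assume the score
        # lives under a "score" key in that case.
        score = result.get("score") if isinstance(result, dict) else result
        if not isinstance(score, (int, float)):
            return float("nan")
        return float(score)
    except Exception:
        return float("nan")
```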


sfc-gh-dhuang changed the title from "TREC experiment and adding rubric to all prompts" to "TREC DL experiments + prompts comparisons" (Dec 3, 2024)
sfc-gh-dhuang marked this pull request as ready for review (Dec 3, 2024)
dosubot added the size:XXL label: this PR changes 1000+ lines, ignoring generated files (Dec 3, 2024)
sfc-gh-dhuang changed the title from "TREC DL experiments + prompts comparisons" to "TREC DL experiments to establish context relevance benchmark + prompts comparisons" (Dec 3, 2024)
sfc-gh-dhuang marked this pull request as draft (Dec 4, 2024)
sfc-gh-dhuang marked this pull request as ready for review (Dec 5, 2024)
sfc-gh-dhuang changed the title from "TREC DL experiments to establish context relevance benchmark + prompts comparisons" to "TREC DL and LLM AggreFact experiments for relevance benchmark + prompts comparisons and groundedness vs Bespoke Minicheck 7B" (Dec 5, 2024)
On a new notebook file (@@ -0,0 +1,1017 @@), Line #7: def scoreddocs_qrels_confusion_matrix(

sfc-gh-jreini (Contributor) commented Dec 6, 2024:

Can you break this function up? A lot going on here, hard to follow.
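One way to act on this suggestion is to split the monolith into small, separately testable steps: align scores with labels, binarize both sides, then tabulate. All names and signatures below are hypothetical, sketched from the function's name rather than its actual body.

```python
# Hypothetical decomposition of a scoreddocs-vs-qrels confusion matrix.
from collections import Counter

def align_scores_with_qrels(scoreddocs, qrels):
    """Pair each (query_id, doc_id) score with its human relevance label."""
    labels = {(q.query_id, q.doc_id): q.relevance for q in qrels}
    return [(s.score, labels[(s.query_id, s.doc_id)])
            for s in scoreddocs if (s.query_id, s.doc_id) in labels]

def binarize(pairs, score_threshold=0.5, relevance_threshold=2):
    """Binarize both sides: prediction by score, label by graded relevance."""
    return [(score >= score_threshold, relevance >= relevance_threshold)
            for score, relevance in pairs]

def confusion_counts(binary_pairs):
    """Counter keyed by (predicted_relevant, actually_relevant)."""
    return Counter(binary_pairs)
```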

On the same notebook (@@ -0,0 +1,1017 @@), Line #8: class CustomTermFeedback(Provider):

sfc-gh-jreini (Contributor) commented Dec 6, 2024:

Should we have out-of-the-box (OOB) feedbacks for these metrics?
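For context on the pattern being discussed: TruLens supports custom feedback functions defined as methods on a Provider subclass, which is how metrics like these could eventually ship out of the box. A minimal sketch; the import path is my assumption for TruLens 1.x, and term_overlap is an invented illustrative metric, not the notebook's CustomTermFeedback.

```python
# Minimal custom feedback provider sketch. By convention, TruLens feedback
# methods return scores in [0, 1].
from trulens.core import Provider  # import path assumed for TruLens 1.x

class TermOverlapFeedback(Provider):
    def term_overlap(self, query: str, passage: str) -> float:
        """Illustrative metric: fraction of query terms appearing in the passage."""
        query_terms = set(query.lower().split())
        if not query_terms:
            return 0.0
        passage_terms = set(passage.lower().split())
        return len(query_terms & passage_terms) / len(query_terms)
```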

sfc-gh-dhuang force-pushed the trec-exp-adding-rubric branch from 8cabc46 to 9514bce (Dec 9, 2024)
sfc-gh-dhuang merged commit f549d91 into main (Dec 10, 2024; 3 of 10 checks passed)
sfc-gh-dhuang deleted the trec-exp-adding-rubric branch (Dec 10, 2024)
sfc-gh-chu mentioned this pull request (Dec 16, 2024)