TREC DL and LLM AggreFact experiments for relevance benchmark + prompt comparisons, and groundedness vs Bespoke MiniCheck-7B #1660
Conversation
Force-pushed from 222f201 to 55f6564
Force-pushed from 55f6564 to 17941b0
@@ -0,0 +1,1017 @@
{
Line #7. def scoreddocs_qrels_confusion_matrix(
Can you break this function up? A lot going on here, hard to follow
Reply via ReviewNB
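One possible shape for that refactor, sketched below with hypothetical helper names; the notebook's actual signature, inputs, and thresholds may differ:

```python
# Hypothetical refactor sketch; not the notebook's actual implementation.
from collections import Counter
from typing import Dict, Iterable, Tuple

def _index_qrels(qrels: Iterable) -> Dict[Tuple[str, str], int]:
    """Map (query_id, doc_id) to the graded relevance label."""
    return {(q.query_id, q.doc_id): q.relevance for q in qrels}

def _binarize(value, threshold) -> int:
    """Collapse a graded label or score into a binary judgment."""
    return int(value >= threshold)

def scoreddocs_qrels_confusion_matrix(scoreddocs, qrels,
                                      score_threshold=0.5, qrel_threshold=1):
    """Count (TN, FP, FN, TP) between binarized judge scores and qrels."""
    qrels_lookup = _index_qrels(qrels)
    counts = Counter()
    for doc in scoreddocs:
        key = (doc.query_id, doc.doc_id)
        if key not in qrels_lookup:
            continue  # skip unjudged documents
        pred = _binarize(doc.score, score_threshold)
        truth = _binarize(qrels_lookup[key], qrel_threshold)
        counts[(truth, pred)] += 1
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]
```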
@@ -0,0 +1,1017 @@
{
Line #8. class CustomTermFeedback(Provider):
Should we have OOB feedbacks for these metrics?
Reply via ReviewNB
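For context, the custom-provider pattern the notebook relies on looks roughly like the sketch below. The term_overlap method and its logic are illustrative stand-ins rather than the notebook's actual CustomTermFeedback, and the Provider import path depends on the TruLens version in use:

```python
# Illustrative sketch only; not the notebook's actual CustomTermFeedback.
# The Provider import path varies across TruLens versions.
from trulens.core.feedback.provider import Provider

class CustomTermFeedback(Provider):
    """Toy feedback: fraction of query terms that appear in the passage."""

    def term_overlap(self, query: str, passage: str) -> float:
        query_terms = set(query.lower().split())
        if not query_terms:
            return 0.0
        passage_terms = set(passage.lower().split())
        return len(query_terms & passage_terms) / len(query_terms)
```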
Force-pushed from 8cabc46 to 9514bce
Description
Contains 2 main notebooks:
- trec_dl_passage_relevance_judges.ipynb: experiment code for TREC DL passage-level agreement, comparing the Cortex team's hand-crafted prompt, the UMBRELA prompt, and a categorical prompt for search-relevance LLM judges (see the agreement sketch below). Doc with details: https://docs.google.com/document/d/1lwNC11S3LnNHglDNzcJNgPsd7zpOGNwTNktn8pI3bUc/edit?tab=t.nnid3pbdfr33
- llm_aggrefact_groundedness_benchmark.ipynb: experiment code for the LLM-AggreFact benchmark, fact-checking claims against grounding documents and comparing Bespoke MiniCheck-7B with TruLens groundedness (gpt-4o). Doc: https://docs.google.com/document/d/1oMWDlk6JK6HP3IJQ_pjqB5KlHMtDHUKNoi5uTyWEYhk/edit?disco=AAABZ2Ar0vc
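The judging prompts and the full comparison live in the notebook and the linked doc. As a rough illustration of the kind of passage-level agreement computation involved, here is a minimal sketch assuming ir_datasets and scikit-learn, with a stand-in for the LLM judge's outputs (the notebook's actual TREC DL year, thresholds, and metrics may differ):

```python
# Rough sketch, not the notebook's exact code: measure agreement between a
# hypothetical LLM judge and TREC DL 2019 passage qrels with Cohen's kappa.
import ir_datasets
from sklearn.metrics import cohen_kappa_score

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
qrels = {(q.query_id, q.doc_id): q.relevance for q in dataset.qrels_iter()}

# Stand-in for real judge outputs: grade the first 1000 judged pairs as relevant.
llm_judgments = {key: 3 for key in list(qrels)[:1000]}

keys = [k for k in llm_judgments if k in qrels]
human = [int(qrels[k] >= 2) for k in keys]           # binarize graded qrels
model = [int(llm_judgments[k] >= 2) for k in keys]   # binarize judge grades
print("Cohen's kappa:", cohen_kappa_score(human, model))
```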
Important
Add functions for generating balanced datasets and TREC DL passage benchmarks, and enhance feedback handling to support additional data types.
- Adds generate_balanced_llm_aggrefact_benchmark and generate_trec_dl_passage_benchmark in dataset_preprocessing.py for creating balanced datasets and TREC DL passage benchmarks (see the balancing sketch after this list).
- Updates run() in feedback.py to support dict as a valid feedback function output type.
- Updates run_score_generation_on_single_row() in tru_benchmark_experiment.py to return float('nan') for invalid results.
- Adds defaultdict and ir_datasets imports in dataset_preprocessing.py.
This description was created for 35432e7. It will automatically update as commits are pushed.
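As a rough illustration of what balanced-benchmark generation typically looks like, here is a minimal sketch with hypothetical field names; the actual generate_balanced_llm_aggrefact_benchmark in dataset_preprocessing.py may differ:

```python
# Hedged sketch of balanced sampling by label, not the actual
# generate_balanced_llm_aggrefact_benchmark implementation.
import random
from collections import defaultdict

def generate_balanced_benchmark(rows, label_key="label", seed=42):
    """Downsample each label group to the size of the smallest group."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced

# Example: two labels with a 3:1 imbalance become 1:1 after balancing.
rows = [{"claim": f"c{i}", "label": int(i % 4 == 0)} for i in range(100)]
balanced = generate_balanced_benchmark(rows)
print(len(balanced), sum(r["label"] for r in balanced))  # 50 25
```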