
TREC DL and LLM AggreFact experiments for relevance benchmark + prompts comparisons and groundedness vs Bespoke Minicheck 7B #1660

Merged: 23 commits into main from trec-exp-adding-rubric, Dec 10, 2024

Conversation

sfc-gh-dhuang (Contributor) commented Nov 20, 2024

Description

Contains 2 main notebooks:

  1. trec_dl_passage_relevance_judges.ipynb: experiment code for passage-level agreement on TREC DL, comparing the Cortex team's hand-crafted prompt, the UMBRELA prompt, and a categorical prompt for search-relevance LLM judges (see the first sketch below this list).

Doc with details: https://docs.google.com/document/d/1lwNC11S3LnNHglDNzcJNgPsd7zpOGNwTNktn8pI3bUc/edit?tab=t.nnid3pbdfr33

  2. llm_aggrefact_groundedness_benchmark.ipynb: experiment code for the LLM-AggreFact benchmark on fact-checking against grounding documents, comparing Bespoke Minicheck-7B vs trulens groundedness (gpt-4o) (see the second sketch below).
    Doc: https://docs.google.com/document/d/1oMWDlk6JK6HP3IJQ_pjqB5KlHMtDHUKNoi5uTyWEYhk/edit?disco=AAABZ2Ar0vc
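For readers skimming the PR, here is a minimal sketch of what the passage-level agreement computation in the first notebook looks like in spirit. This is not the notebook's code: the dataset slug is just one of several TREC DL options, and llm_judge_score is a hypothetical stand-in for a judge built from any of the three prompts.

```python
# Sketch: chance-corrected agreement between an LLM relevance judge and TREC DL
# qrels. Assumes ir_datasets and scikit-learn; llm_judge_score is hypothetical.
import ir_datasets
from sklearn.metrics import cohen_kappa_score

def llm_judge_score(query: str, passage: str) -> int:
    """Hypothetical judge: returns a graded relevance label in {0, 1, 2, 3}."""
    raise NotImplementedError

dataset = ir_datasets.load("msmarco-passage/trec-dl-2019/judged")
queries = {q.query_id: q.text for q in dataset.queries_iter()}
docstore = dataset.docs_store()

human, judge = [], []
for qrel in dataset.qrels_iter():
    human.append(qrel.relevance)
    judge.append(llm_judge_score(queries[qrel.query_id],
                                 docstore.get(qrel.doc_id).text))

# Cohen's kappa is one common passage-level agreement statistic.
print(cohen_kappa_score(human, judge))
```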

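And a companion sketch for the second notebook's style of evaluation. LLM-AggreFact is a binary fact-checking benchmark whose headline metric is balanced accuracy; groundedness_score, the row shape, and the 0.5 threshold below are illustrative assumptions, standing in for either trulens groundedness (gpt-4o) or Bespoke Minicheck-7B.

```python
# Sketch: benchmarking a groundedness scorer against binary supported/unsupported
# labels, LLM-AggreFact style. groundedness_score is a hypothetical stand-in.
from sklearn.metrics import balanced_accuracy_score

def groundedness_score(document: str, claim: str) -> float:
    """Hypothetical scorer: degree to which the claim is supported, in [0, 1]."""
    raise NotImplementedError

def evaluate(rows: list[dict], threshold: float = 0.5) -> float:
    """rows: [{"doc": ..., "claim": ..., "label": 0 or 1}, ...] (shape assumed)."""
    y_true = [r["label"] for r in rows]
    y_pred = [int(groundedness_score(r["doc"], r["claim"]) >= threshold)
              for r in rows]
    return balanced_accuracy_score(y_true, y_pred)
```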

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to
    not work as expected)
  • New Tests
  • This change includes re-generated golden test results
  • This change requires a documentation update

Important

Add functions for generating balanced datasets and TREC DL passage benchmarks, and enhance feedback handling to support additional data types.

  • New Functions:
    • generate_balanced_llm_aggrefact_benchmark and generate_trec_dl_passage_benchmark in dataset_preprocessing.py for creating balanced datasets and TREC DL passage benchmarks (see the sketch right after this list).
  • Feedback Handling:
    • Modified run() in feedback.py to accept dict as a valid feedback function output type.
    • Changed the return value of run_score_generation_on_single_row() in tru_benchmark_experiment.py to float('nan') for invalid results (see the second sketch below).
  • Misc:
    • Added imports for defaultdict and ir_datasets in dataset_preprocessing.py.
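As a rough illustration of what the balanced-dataset helper does: the actual signature and behavior of generate_balanced_llm_aggrefact_benchmark live in dataset_preprocessing.py, and this sketch only shows the downsample-to-rarest-class idea using the newly imported defaultdict.

```python
# Sketch: label-balanced downsampling with defaultdict. Signature is illustrative.
import random
from collections import defaultdict

def generate_balanced_benchmark(rows, label_key="label", seed=42):
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    # Downsample every class to the size of the rarest one.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = [row for group in by_label.values() for row in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced
```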

This description was created by Ellipsis for 35432e7. It will automatically update as commits are pushed.
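The NaN fallback mentioned under Feedback Handling above matters because float('nan') is skipped by common NaN-aware aggregations (e.g., a pandas mean) rather than dragging the average down the way a 0 or -1 sentinel would. A simplified sketch of the shape; the real function in tru_benchmark_experiment.py takes different arguments, and the "score" key for dict outputs is an assumption.

```python
# Simplified sketch: invalid or malformed feedback results map to float('nan')
# so downstream NaN-aware aggregation ignores them. Not the file's actual body.
def run_score_generation_on_single_row(feedback_fn, row) -> float:
    try:
        result = feedback_fn(row)
        # dict outputs are now accepted by feedback run(); assume the score
        # lives under a "score" key in that case.
        score = result.get("score") if isinstance(result, dict) else result
        if not isinstance(score, (int, float)):
            return float("nan")
        return float(score)
    except Exception:
        return float("nan")
```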


sfc-gh-dhuang changed the title from "TREC experiment and adding rubric to all prompts" to "TREC DL experiments + prompts comparisons" (Dec 3, 2024)
sfc-gh-dhuang marked this pull request as ready for review (Dec 3, 2024)
dosubot added the size:XXL label: this PR changes 1000+ lines, ignoring generated files (Dec 3, 2024)
sfc-gh-dhuang changed the title from "TREC DL experiments + prompts comparisons" to "TREC DL experiments to establish context relevance benchmark + prompts comparisons" (Dec 3, 2024)
sfc-gh-dhuang marked this pull request as draft (Dec 4, 2024)
sfc-gh-dhuang marked this pull request as ready for review (Dec 5, 2024)
sfc-gh-dhuang changed the title from "TREC DL experiments to establish context relevance benchmark + prompts comparisons" to "TREC DL and LLM AggreFact experiments for relevance benchmark + prompts comparisons and groundedness vs Bespoke Minicheck 7B" (Dec 5, 2024)
On a new notebook file (@@ -0,0 +1,1017 @@), Line #7: def scoreddocs_qrels_confusion_matrix(

sfc-gh-jreini (Contributor) commented Dec 6, 2024:

Can you break this function up? A lot going on here, hard to follow.
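One way to act on this suggestion is to split the monolith into small, separately testable steps: align scores with labels, binarize both sides, then tabulate. All names and signatures below are hypothetical, sketched from the function's name rather than its actual body.

```python
# Hypothetical decomposition of a scoreddocs-vs-qrels confusion matrix.
from collections import Counter

def align_scores_with_qrels(scoreddocs, qrels):
    """Pair each (query_id, doc_id) score with its human relevance label."""
    labels = {(q.query_id, q.doc_id): q.relevance for q in qrels}
    return [(s.score, labels[(s.query_id, s.doc_id)])
            for s in scoreddocs if (s.query_id, s.doc_id) in labels]

def binarize(pairs, score_threshold=0.5, relevance_threshold=2):
    """Binarize both sides: prediction by score, label by graded relevance."""
    return [(score >= score_threshold, relevance >= relevance_threshold)
            for score, relevance in pairs]

def confusion_counts(binary_pairs):
    """Counter keyed by (predicted_relevant, actually_relevant)."""
    return Counter(binary_pairs)
```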

On the same notebook (@@ -0,0 +1,1017 @@), Line #8: class CustomTermFeedback(Provider):

sfc-gh-jreini (Contributor) commented Dec 6, 2024:

Should we have out-of-the-box (OOB) feedbacks for these metrics?
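For context on the pattern being discussed: TruLens supports custom feedback functions defined as methods on a Provider subclass, which is how metrics like these could eventually ship out of the box. A minimal sketch; the import path is my assumption for TruLens 1.x, and term_overlap is an invented illustrative metric, not the notebook's CustomTermFeedback.

```python
# Minimal custom feedback provider sketch. By convention, TruLens feedback
# methods return scores in [0, 1].
from trulens.core import Provider  # import path assumed for TruLens 1.x

class TermOverlapFeedback(Provider):
    def term_overlap(self, query: str, passage: str) -> float:
        """Illustrative metric: fraction of query terms appearing in the passage."""
        query_terms = set(query.lower().split())
        if not query_terms:
            return 0.0
        passage_terms = set(passage.lower().split())
        return len(query_terms & passage_terms) / len(query_terms)
```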

sfc-gh-dhuang force-pushed the trec-exp-adding-rubric branch from 8cabc46 to 9514bce (Dec 9, 2024)
sfc-gh-dhuang merged commit f549d91 into main (Dec 10, 2024; 3 of 10 checks passed)
sfc-gh-dhuang deleted the trec-exp-adding-rubric branch (Dec 10, 2024)
sfc-gh-chu mentioned this pull request (Dec 16, 2024)