TREC DL and LLM AggreFact experiments for relevance benchmark + prompts comparisons and groundedness vs Bespoke Minicheck 7B #1660

Merged 23 commits on Dec 10, 2024
edits
sfc-gh-dhuang committed Dec 9, 2024
commit 9514bce4bdea052cff59da48b07594aeabb94957
1 change: 1 addition & 0 deletions .gitignore
@@ -54,6 +54,7 @@ credentials.toml
**/results/*.csv
**/experiments/mlruns/
**/experiments/*.zip
+**/experiments/*/*.csv

# test files generated by pytest
test-output.xml
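The new `.gitignore` entry keeps per-experiment CSV outputs out of version control. A minimal sketch of what the added pattern matches, using `pathlib` globbing (whose `**` semantics match gitignore's for this case; the directory and file names below are made up for illustration):

```python
import tempfile
from pathlib import Path

# Hypothetical repo layout: CSVs written one level below experiments/
root = Path(tempfile.mkdtemp())
(root / "experiments" / "trec_dl").mkdir(parents=True)
(root / "experiments" / "trec_dl" / "scores.csv").write_text("query,rating\n")
(root / "experiments" / "notes.txt").write_text("kept\n")

# The pattern added to .gitignore: **/experiments/*/*.csv
matches = sorted(p.relative_to(root).as_posix()
                 for p in root.glob("**/experiments/*/*.csv"))
print(matches)  # ['experiments/trec_dl/scores.csv']
```

Only CSVs nested one directory below `experiments/` are matched; sibling files like `notes.txt` stay tracked.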
@@ -213,13 +213,7 @@
"\n",
"\n",
"trec_combined_df = pd.DataFrame(trec_combined)\n",
-"trec_combined_df.to_csv(\n",
-" \"trec_dl_2021_2022_combined_scoreddocs_intervals.csv\", index=False\n",
-")\n",
"\n",
-"trec_combined_df = pd.read_csv(\n",
-" \"trec_dl_2021_2022_combined_scoreddocs_intervals.csv\"\n",
-")\n",
"print(f\"Total number of samples: {len(trec_combined_df)}\")"
]
},
@@ -517,7 +511,7 @@
"metadata": {},
"outputs": [],
"source": [
-"gaurav_prompt = \"\"\"\n",
+"internal_prompt = \"\"\"\n",
"You are an expert search result rater. You are given a user query and a search result. Your task is to rate the search result based on its relevance to the user query. You should rate the search result on a scale of 0 to 3, where:\n",
" 0: The search result has no relevance to the user query.\n",
" 1: The search result has low relevance to the user query. In this case the search result may contain some information which seems very slightly related to the user query but not enough information to answer the user query. The search result contains some references or very limited information about some entities present in the user query. In case the query is a statement on a topic, the search result should be tangentially related to it.\n",
@@ -676,7 +670,7 @@
"\n",
"\n",
"# Function to rate context relevance\n",
-"def gaurav_prompt_relevance(\n",
+"def internal_prompt_relevance(\n",
" query: str, passage: str, model_engine=\"gpt-4o\"\n",
") -> dict:\n",
" # Prepare the prompt\n",
@@ -685,7 +679,7 @@
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
-" \"content\": gaurav_prompt,\n",
+" \"content\": internal_prompt,\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
@@ -824,11 +818,11 @@
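The hunks above show `internal_prompt_relevance` sending a system/user message pair to the model. A sketch of how that payload is assembled (the helper name and the user-message format are assumptions, not the notebook's exact code):

```python
def build_relevance_messages(system_prompt: str, query: str, passage: str) -> list:
    """Assemble a chat-completion payload for relevance rating.

    Sketch only: mirrors the system/user structure shown in the diff; the
    exact user-message template in the notebook may differ.
    """
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Query: {query}\n\nPassage: {passage}"},
    ]

msgs = build_relevance_messages(
    "You are an expert search result rater.",
    "best pizza in NYC",
    "A list of pizzerias ...",
)
print(msgs[0]["role"], msgs[1]["role"])  # system user
```

Keeping the rubric in the system message and only the query/passage pair in the user message lets the same prompt be reused across every sample in the benchmark loop.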
" # print(f\"Query: {query}\")\n",
" # print(f\"Passage: {passage}\")\n",
"\n",
-" # Gaurav Prompt\n",
-" gaurav_result = gaurav_prompt_relevance(\n",
+" # Snowflake internal prompt\n",
+" internal_result = internal_prompt_relevance(\n",
" query, passage, model_engine=model\n",
" )\n",
-" gaurav_rating = gaurav_result[\"rating\"]\n",
+" internal_rating = internal_result[\"rating\"]\n",
"\n",
" # Umbrela Prompt\n",
" umbrela_result = umbrela_prompt_relevance(\n",
@@ -850,7 +844,7 @@
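Each `*_rating` collected in the loop comes from a model reply rather than a structured field, so a small parser is needed to map free text onto the 0-3 scale. A minimal sketch, assuming the reply contains a bare digit in that range (the helper name and reply format are hypothetical):

```python
import re

def parse_rating(reply: str) -> int:
    """Extract the first standalone 0-3 digit from a model reply (assumed format)."""
    m = re.search(r"\b([0-3])\b", reply)
    if m is None:
        raise ValueError(f"no 0-3 rating found in {reply!r}")
    return int(m.group(1))

print(parse_rating("Rating: 2 -- the passage partially answers the query."))  # 2
```

Raising on unparseable replies, rather than defaulting to 0, keeps bad completions from silently skewing the benchmark scores.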
" \"query\": query,\n",
" \"passage\": passage,\n",
" \"ground_truth\": ground_truth,\n",
-" \"gaurav_rating\": gaurav_rating,\n",
+" \"internal_rating\": internal_rating,\n",
" \"umbrela_rating\": umbrela_rating,\n",
" \"categorical_rating\": categorical_rating,\n",
" })\n",
@@ -930,7 +924,7 @@
"# Example usage\n",
"csv_file = \"gpt-4o_3_prompts_results.csv\"\n",
"plot_confusion_matrix_with_metrics(\n",
-" csv_file, \"gaurav_rating\", \"Gaurav Ratings: \"\n",
+" csv_file, \"internal_rating\", \"Internal Ratings: \"\n",
")\n",
"plot_confusion_matrix_with_metrics(\n",
" csv_file, \"umbrela_rating\", \"Umbrela Ratings: \"\n",
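`plot_confusion_matrix_with_metrics` compares a rating column against `ground_truth` for each prompt. The core computation behind such a plot can be sketched with the standard library alone (a sketch of the idea, not the notebook's implementation, which also renders the matrix):

```python
from collections import Counter

def confusion_and_accuracy(truth, pred, labels=(0, 1, 2, 3)):
    """4x4 confusion matrix (rows = ground truth, cols = predicted) plus accuracy."""
    counts = Counter(zip(truth, pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(truth)
    return matrix, accuracy

# Toy ratings on the 0-3 relevance scale
truth = [0, 1, 2, 3, 2]
pred  = [0, 1, 1, 3, 2]
matrix, acc = confusion_and_accuracy(truth, pred)
print(acc)  # 0.8
```

Running this per rating column (`internal_rating`, `umbrela_rating`, `categorical_rating`) gives a like-for-like comparison of the prompts on the same TREC DL samples.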
79 changes: 0 additions & 79 deletions tests/e2e/data/simple.ipynb

This file was deleted.
