Evaluate your LLM's response with Prometheus and GPT4 💯
🤠 Agent-as-a-Judge and DevAI dataset
xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
CodeUltraFeedback: aligning large language models to coding preferences
Repository for the survey of Bias and Fairness in IR with LLMs.
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Timo: Towards Better Temporal Reasoning for Language Models (COLM 2024)
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
A set of examples demonstrating how to evaluate Generative-AI-augmented systems using traditional information retrieval and LLM-as-a-judge validation techniques
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern (a minimal sketch of this pattern follows the list).
Antibodies for LLM hallucinations (grouping LLM-as-a-judge, NLI, and reward models)
Use groq for evaluations
Explore techniques to use small models as jailbreaking judges
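Most of the repositories above are variations of the same core pattern: prompt a strong "judge" model with a question, a candidate answer, and a scoring rubric, then parse its verdict into a score. The sketch below is a minimal illustration of that pattern, assuming the OpenAI Python SDK (openai>=1.0) with GPT-4 as the judge; the rubric prompt, the `judge()` helper, and the score-parsing regex are illustrative assumptions, not the method of any particular repository listed here.

```python
# Minimal LLM-as-a-judge sketch: ask a judge model (here GPT-4 via the
# OpenAI Python SDK) to rate a candidate answer on a 1-5 rubric.
# The prompt wording and score parsing are illustrative assumptions.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer
to the user's question on a 1-5 scale for factual accuracy and helpfulness.
Reply with a short justification followed by "Score: <1-5>".

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str, model: str = "gpt-4") -> int:
    """Return the judge model's 1-5 score for a single answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading for reproducibility
    )
    verdict = response.choices[0].message.content
    match = re.search(r"Score:\s*([1-5])", verdict)
    if match is None:
        raise ValueError(f"Judge did not return a parsable score: {verdict!r}")
    return int(match.group(1))

if __name__ == "__main__":
    score = judge("What is the capital of France?",
                  "The capital of France is Paris.")
    print(f"Judge score: {score}/5")
```

In practice the judge model, rubric, and parsing are the main points of variation: the repositories above swap in Prometheus, Groq-hosted models, agentic judges, or reward models, and pair the judge's scores with bias and robustness checks such as those benchmarked in the ACL 2024 and MJ-Bench work listed here.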