Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
The first AI Engineer World’s Fair talks from OpenAI and Cognition are up!
In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones.
Fast forward 1.5 years, and the pace of model development has far exceeded the speed at which benchmarks are updated. Frontier labs are still using MMLU and HumanEval for model marketing, even though most models have reached their natural plateau at a ~90% success rate (any higher and they’re probably just memorizing/overfitting).
From Benchmarks to Leaderboards
Beyond going stale, lab-reported benchmarks also suffer from non-reproducibility. Models served through an API change over time, so the same evaluation run at different points in time may return different scores.
Today’s guest, Clémentine Fourrier, is the lead maintainer of HuggingFace’s OpenLLM Leaderboard. The goal is to standardize how models are evaluated by curating a set of high-quality benchmarks, then publishing the results in a reproducible way with tools like EleutherAI’s LM Evaluation Harness.
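To make the reproducibility point concrete, here is a minimal sketch of running the harness locally; the model name, task, and batch size are illustrative choices, not the leaderboard’s exact configuration:

```shell
# Install EleutherAI's LM Evaluation Harness
pip install lm-eval

# Evaluate a model on one task and write results to disk.
# Model, task, and batch size here are illustrative.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks ifeval \
  --batch_size 8 \
  --output_path results/
```

Because the harness pins the prompts, few-shot setup, and scoring code, anyone running the same command on the same weights should get the same score, which is exactly what lab-reported numbers can’t promise.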
The leaderboard was first launched in the summer of 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense of the scale:
* Over 2 million unique visitors
* 300,000 active community members
* Over 7,500 models evaluated
Last week they announced the second version of the leaderboard. Why? Because models were getting too good!
The new version of the leaderboard is based on 6 benchmarks:
* 📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)
* 📚 GPQA (Google-Proof Q&A Benchmark, paper)
* 💭 MuSR (Multistep Soft Reasoning, paper)
* 🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)
* 🤝 IFEval (Instruction Following Evaluation, paper)
* 🧮 🤝 BBH (Big Bench Hard, paper)
You can read the reasoning behind each of them in their announcement blog post. These updates had some clear winners and losers, with models moving as many as 50 spots up or down at once; the most likely explanation is that those models were overfit to the old benchmarks, or had some contamination in their training dataset.
But the most important change is in the absolute scores. All models score much lower on v2 than they do on v1, which now creates a lot more room for models to show improved performance.
On Arenas
Another high-signal platform for AI Engineers is the LMSys Arena, which asks users to rank the outputs of two different models on the same prompt, then assigns each model an Elo score based on the outcomes.
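The Elo update behind this kind of arena ranking fits in a few lines; the K-factor and starting ratings below are illustrative defaults, not LMSys’s exact configuration:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1000; a user prefers model A's output once.
ra, rb = elo_update(1000, 1000, a_won=True)
print(ra, rb)  # 1016.0 984.0
```

The key property is that an upset win over a higher-rated model moves the ratings more than an expected win, so rankings converge as votes accumulate.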
Clémentine called arenas “sociological experiments”: they tell you a lot about users’ preferences, but not always much about the model.
Information
- Show
- Frequency: Updated weekly
- Published: July 12, 2024 at 10:38 PM [UTC]
- Length: 58 minutes
- Rating: Clean