Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge
The first AI Engineer World’s Fair talks from OpenAI and Cognition are up!
In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones.
Fast forward 1.5 years, and the pace of model development has far exceeded the speed at which benchmarks are updated. Frontier labs are still using MMLU and HumanEval for model marketing, even though most models have reached their natural plateau at a ~90% success rate (any higher and they’re probably just memorizing/overfitting).
From Benchmarks to Leaderboards
Beyond going stale, lab-reported benchmarks also suffer from non-reproducibility. Models served through an API change over time, so the same evaluation run at different points in time may return different scores.
Today’s guest, Clémentine Fourrier, is the lead maintainer of HuggingFace’s OpenLLM Leaderboard. The goal is to standardize how models are evaluated by curating a set of high-quality benchmarks, then publishing the results in a reproducible way with tools like EleutherAI’s LM Evaluation Harness.
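To make the reproducibility point concrete, here is a minimal sketch of running the harness locally; the model name, task, and batch size are illustrative choices, not the leaderboard’s exact configuration:

```shell
# Install EleutherAI's LM Evaluation Harness
pip install lm-eval

# Evaluate a model on one task and write results to disk.
# Model, task, and batch size here are illustrative.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks ifeval \
  --batch_size 8 \
  --output_path results/
```

Because the harness pins the prompts, few-shot setup, and scoring code, anyone running the same command on the same weights should get the same score, which is exactly what lab-reported numbers can’t promise.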
The leaderboard was first launched in the summer of 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense of the scale:
* Over 2 million unique visitors
* 300,000 active community members
* Over 7,500 models evaluated
Last week they announced the second version of the leaderboard. Why? Because models were getting too good!
The new version of the leaderboard is based on 6 benchmarks:
* 📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)
* 📚 GPQA (Google-Proof Q&A Benchmark, paper)
* 💭 MuSR (Multistep Soft Reasoning, paper)
* 🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)
* 🤝 IFEval (Instruction Following Evaluation, paper)
* 🧮 🤝 BBH (Big Bench Hard, paper)
You can read the reasoning behind each of them in their announcement blog post. These updates had some clear winners and losers, with models moving as many as 50 spots up or down at once; the most likely explanation is that those models were overfit to the old benchmarks, or had some contamination in their training dataset.
But the most important change is in the absolute scores. All models score much lower on v2 than they do on v1, which now creates a lot more room for models to show improved performance.
On Arenas
Another high-signal platform for AI Engineers is the LMSys Arena, which asks users to rank the outputs of two different models on the same prompt, then assigns each model an Elo score based on the outcomes.
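The Elo update behind this kind of arena ranking fits in a few lines; the K-factor and starting ratings below are illustrative defaults, not LMSys’s exact configuration:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both models' updated ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1000; a user prefers model A's output once.
ra, rb = elo_update(1000, 1000, a_won=True)
print(ra, rb)  # 1016.0 984.0
```

The key property is that an upset win over a higher-rated model moves the ratings more than an expected win, so rankings converge as votes accumulate.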
Clémentine called arenas “sociological experiments”: they tell you a lot about users’ preferences, but not always much about the model.
Information
- Show
- Frequency: Updated weekly
- Published: July 12, 2024 at 10:38 PM [UTC]
- Length: 58 minutes
- Rating: Clean