Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

The first AI Engineer World’s Fair talks from OpenAI and Cognition are up!

In our Benchmarks 101 episode back in April 2023 we covered the history of AI benchmarks, their shortcomings, and our hopes for better ones.

Fast forward 1.5 years, and the pace of model development has far outstripped the pace at which benchmarks are updated. Frontier labs are still using MMLU and HumanEval for model marketing, even though most models have reached their natural plateau at a ~90% success rate (any higher and they’re probably just memorizing/overfitting).

From Benchmarks to Leaderboards

Besides being stale, lab-reported benchmarks also suffer from non-reproducibility. Models served through an API change over time, so the same model can return different scores at different points in time.

Today’s guest, Clémentine Fourrier, is the lead maintainer of Hugging Face’s Open LLM Leaderboard. Its goal is to standardize how models are evaluated by curating a set of high-quality benchmarks, then publishing the results in a reproducible way with tools like EleutherAI’s LM Evaluation Harness.
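To make "reproducible" concrete, here is a minimal sketch of the kind of evaluation run the harness enables, using its Python API. The model id and task names are illustrative assumptions, not the leaderboard's actual configuration; the leaderboard pins exact harness versions and task configs so results stay comparable.

```python
# Minimal sketch of a reproducible evaluation with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The model id and task
# names below are illustrative, not the leaderboard's own config.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # any Hub model id
    tasks=["ifeval", "gsm8k"],                           # task names vary across harness versions
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. accuracy
```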

The leaderboard first launched in the summer of 2023 and quickly became the de facto standard for open source LLM performance. To give you a sense of the scale:

* Over 2 million unique visitors

* 300,000 active community members

* Over 7,500 models evaluated

Last week they announced the second version of the leaderboard. Why? Because models were getting too good!

The new version of the leaderboard is based on 6 benchmarks:

* 📚 MMLU-Pro (Massive Multitask Language Understanding - Pro version, paper)

* 📚 GPQA (Google-Proof Q&A Benchmark, paper)

* 💭 MuSR (Multistep Soft Reasoning, paper)

* 🧮 MATH (Mathematics Aptitude Test of Heuristics, Level 5 subset, paper)

* 🤝 IFEval (Instruction Following Evaluation, paper)

* 🧮 🤝 BBH (Big Bench Hard, paper)

You can read the reasoning behind each of them in their announcement blog post. The update produced some clear winners and losers, with models jumping by as many as 50 spots in the rankings; the most likely explanation is that those models were overfit to the old benchmarks, or had some contamination in their training data.

But the most important change is in the absolute scores: all models score much lower on v2 than on v1, which leaves a lot more headroom for models to show improvement.

On Arenas

Another high-signal platform for AI Engineers is the LMSys Arena, which asks users to compare the outputs of two different models on the same prompt, then assigns each model an Elo score based on the outcomes.
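To make the mechanism concrete, here is a minimal sketch of how pairwise arena votes can be turned into Elo ratings. The K-factor and starting rating are illustrative assumptions, and real arena leaderboards may fit ratings over the full vote history rather than updating them one vote at a time.

```python
# Minimal sketch of Elo updates from pairwise "arena" votes.
# K=32 and a starting rating of 1000 are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Update both models' ratings after a single head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000; model A wins one user vote.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
ratings["model-a"], ratings["model-b"] = update_elo(
    ratings["model-a"], ratings["model-b"], a_won=True
)
print(ratings)  # model-a moves up, model-b moves down by the same amount
```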

Clémentine called arenas “sociological experiments”: they tell you a lot about users’ preferences, but not always much about the model itself.
