
Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)
Published: April 17, 2026
Duration: 54:23
Carina Prunkl is a researcher at Inria. She joins the podcast to discuss how to assess the capabilities and risks of general-purpose AI. We examine why systems can solve hard coding and math problems yet still fail at simple tasks, why pre-deployment tests often miss real-world behavior, and how faster capability gains can increase misuse risks. The conversation also covers de-skilling, red teaming, layered safeguards, and warning signs that AIs might undermine oversight.
LINKS:
Carina Prunkl personal website
CHAPTERS:
(00:00) Episode Preview
(01:04) Introducing the report
(02:10) Jagged...