Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)

Carina Prunkl is a researcher at Inria. She joins the podcast to discuss how to assess the capabilities and risks of general-purpose AI. We examine why AI systems can solve hard coding and math problems yet still fail at simple tasks, why pre-deployment tests often miss real-world behavior, and how faster capability gains can increase misuse risks. The conversation also covers de-skilling, red teaming, layered safeguards, and warning signs that AI systems might undermine oversight.

LINKS:

Carina Prunkl personal website

CHAPTERS:

(00:00) Episode Preview

(01:04) Introducing the report

(02:10) Jagged...