IN THIS LESSON
Alignment, Robustness & Oversight
Technical AI safety confronts a hard truth: models do exactly what they are trained to do, which is not always what we intended. Self-driving cars have mistaken children for objects. Image models can reproduce illegal content from their training data years later. Language models can be jailbroken, manipulated, or repurposed for cyberattacks.
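The gap between objective and intent shows up even in toy settings. Below is a minimal sketch of shortcut learning on synthetic data (assuming scikit-learn is available; the feature layout and numbers are purely illustrative): the training labels co-occur with a spurious "watermark" feature, the classifier satisfies its training objective almost perfectly, and accuracy drops sharply once the shortcut disappears.

```python
# Minimal shortcut-learning sketch: the model does exactly what training
# rewards (use the watermark), not what we intended (use the real signal).
# All data is synthetic; the "watermark" feature is a hypothetical stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

def make_data(watermark_tracks_label: bool):
    X = rng.normal(size=(n, 10))
    y = (X[:, 0] > 0).astype(int)            # intended signal: feature 0
    if watermark_tracks_label:
        X[:, 1] = (2 * y - 1) * 5.0          # spurious shortcut, perfectly correlated
    else:
        X[:, 1] = rng.normal(size=n) * 5.0   # shortcut absent at deployment
    return X, y

X_train, y_train = make_data(watermark_tracks_label=True)
X_test, y_test = make_data(watermark_tracks_label=False)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # near-perfect
print("test accuracy: ", model.score(X_test, y_test))    # far worse: it learned the watermark
```

The training objective was met almost exactly; the designer's intent was not.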
Topics include:
Robustness failures and adversarial behavior (see the sketch after this list)
Jailbreaking and misuse-optimized models
Scalable oversight limits (why humans can’t supervise everything)
Alignment methods such as RLHF, and why they remain brittle
Interpretability and evaluations as early warning systems
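To ground the first topic, here is a minimal sketch of a fast-gradient-sign-style (FGSM) attack, assuming only NumPy and using a toy linear classifier as a stand-in for a trained model: because a linear model's input gradient is its weight vector, a perturbation far smaller than the data's own scale is enough to flip the prediction.

```python
# FGSM-style adversarial perturbation against a toy linear classifier.
# The weights are random stand-ins for a trained model, not a real system.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=50)    # "trained" weights (stand-in)
x = rng.normal(size=50)    # a clean input

def predict(v):
    return int(v @ w > 0)

# For a linear model the gradient of the logit w.r.t. the input is w itself,
# so the worst-case L-infinity step is epsilon * sign(w). Pick epsilon just
# large enough to push the logit across the decision boundary.
logit = x @ w
epsilon = abs(logit) / np.abs(w).sum() * 1.05
x_adv = x - np.sign(logit) * epsilon * np.sign(w)

print("clean prediction:      ", predict(x))
print("adversarial prediction:", predict(x_adv))     # flipped
print("per-coordinate change: ", round(epsilon, 3))  # small vs. unit-scale inputs
```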
Interactive exercises demonstrate how tiny training changes can yield dramatically different behaviors.
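As a stand-in for those exercises, the sketch below (assuming scikit-learn; the dataset and hyperparameters are illustrative) trains two identical small networks that differ only in their random seed, then measures how often they disagree on fresh inputs.

```python
# Two training runs differing only in random seed can behave measurably
# differently on the same test data.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X_train, y_train = make_moons(n_samples=200, noise=0.3, random_state=0)
X_test, _ = make_moons(n_samples=1000, noise=0.3, random_state=1)

models = [
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=seed)
    .fit(X_train, y_train)
    for seed in (0, 1)
]

disagree = (models[0].predict(X_test) != models[1].predict(X_test)).mean()
print(f"test points where the two runs disagree: {disagree:.1%}")
```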
Key takeaway: You cannot govern what you cannot evaluate, and we are still learning how to evaluate AI.