Reflections on BlueDot technical AI safety
June 6, 2026
Interpretability first showed up on my radar with Golden Gate Claude. Anthropic released a version of Claude that had been steered to be obsessed with the Golden Gate Bridge, so that it would contrive amusing excuses to bring the bridge into every conversation. At the time I didn’t really understand the technical details or even know what this kind of work was called. But it seemed interesting!
This year, I discovered that the field is called interpretability, and the strand of it I had read about is mechanistic interpretability: the project of understanding what a neural network is doing at the level of weights and neurons. Interpretability is interesting for its own sake (what are these things we call LLMs?), but it’s also important within the broader field of AI safety: how can we make sure AI goes well?