Reflections on Bluedot technical AI safety

June 6, 2026

Interpretability first showed up on my radar with Golden Gate Claude. Anthropic released a version of Claude that had been steered to be obsessed with the Golden Gate Bridge, so that it would contrive amusing excuses to bring it into every conversation. At the time I didn’t really understand the technical details or even know what this kind of work was called. But it seemed interesting!

This year, I discovered that the field is called interpretability, and the strand of it I had read about is mechanistic interpretability: the project of understanding what a neural network is doing at the level of weights and neurons. Interpretability is interesting for its own sake (what are these things we call LLMs?) but it’s important within the broader field of AI safety: how can we make sure AI goes well?

Mech interp is very appealing to the engineer mindset because you get this set of tools (linear probes, autoencoders, etc – see ARENA) and can start playing around to understand little bits of how models work internally. From reading around, I was aware that I’d dived into one specific, technical corner of a broad field. And it turns out people in that field are still kinda unsure about how all this mech interp stuff connects to wider goals.

As I keep hammering on about over at The Computer Science Book, it’s critical to have a map of the territory so that you understand how much you don’t know. Fortuitously, Neel Nanda posted on Twitter about Bluedot’s technical AI safety course at just the right time. I applied, got a place, and am now just wrapping up the final week.

Going into the course, I hoped to:

  1. Better understand the whole AI safety and alignment “space” – who’s associated with what ideas, what’s the relevant intellectual background, where are the promising directions.
  2. Have a stronger view on where I personally could make useful contributions.

This post gathers my reflections on the course and my thinking about where to go next in AI safety.

What the course is

The course is delivered remotely via six units of curated readings and small group discussions.

First of all, my group was very capably led by Girish Gupta, who did an excellent job of facilitating discussions and had lots of interesting observations. My group was pretty varied and everyone had different backgrounds, so I enjoyed debating the topics and getting different perspectives.

The syllabus runs like so:

  1. The Technical Challenge with AI — what makes building safe AI hard, and what success in AI safety actually looks like.
  2. Training Safer Models — how the field tries to train models to behave safely: data curation, RLHF and the rest of the post-training stack.
  3. Detecting Danger — evaluations and red-teaming, and case studies of how the major labs (Anthropic, OpenAI, DeepMind, Meta) currently test their systems for unsafe behaviour.
  4. Understanding AI — interpretability techniques from low-level circuit tracing to chain of thought monitoring.
  5. Minimising Harm — defensive strategies such as AI control, harm reduction, breaking the kill chain.
  6. Start Contributing — career paths, projects, next steps. (I’m wrapping this unit up as I write.)

The kinds of questions you cover are:

  • Can we make AI that is safe?
  • Are the AIs we have now safe?
  • Would we know if it isn’t safe?
  • Can we control it if it isn’t safe?

Personally, I found it a little slow to start but really enjoyed it from week three onwards. You are asked to spend up to three hours a week on reading in preparation for a two-hour discussion, but you could easily spend far longer going through the optional material and following interesting leads. I put each week’s readings into Zotero and by the end, including all of the optional material, I’ve gathered well over 100 items of varying density.

Did the course map out the territory? Most definitely! Obviously six weeks is too short to go deep into anything, but I feel much more familiar with the various people, concepts, positions and organisations. Before the course, I’d listen to an AI podcast (shout out to AXRP and The Cognitive Revolution) and be vaguely aware that references and assumed knowledge were going over my head. Now the host will say something like “there’s some interesting research from Redwood and Anthropic…” and I’ll nod sagely and mouth “alignment faking”.

A secondary goal of the course seems to be broadening horizons. The newbie landing in with “yo I wanna do mech interp” seems to be a quasi-meme. In week three we covered scheming/deception and I finished up the week thinking “we will simply use mech interp to detect scheming and solve our problems”. Then in week four we learned that detecting scheming is one of the most studied problems in the field, far from solved, and people doubt whether interpretability can even solve it. So the course dedicates quite a lot of time to interpretability, but couched in terms of current utility and balancing the strengths and weaknesses of black-box techniques (e.g. chain of thought monitoring) against white-box techniques (your fancy mech interp tools).

Overall, I come out of the course with a reinforced conviction that alignment is important to work on, but much less attached to any particular technique or method. I’d like to think this is a sign of unlearning some beginner’s overconfidence.

So, where next? Tentatively, I’ve picked up a few ideas from the course.

Interpreting agent system behaviours

The argument that stuck with me most is from Charbel-Raphaël’s Against Almost Every Theory of Impact of Interpretability: deception, scheming, sycophancy and the other failure modes we worry about are not really properties of a single forward pass. They are properties of trajectories — many forward passes strung together, with the model planning, reasoning, observing the result, replanning, and building up context across a long conversation.

That fits with a doubt I’d had from the first week of the course when we discussed superintelligence and the existential risks of misaligned AI. The hypothetical AIs that people imagined pre-LLM don’t seem very like chatbots answering a prompt. The dangerous AIs are unitary, persistent agents with long-term goals. They look much more like our agentic systems that have reasoning models running in a tool loop.

I get why a research scientist would want to start with toy models, uncover low level behaviours, and gradually scale up. But from an engineering point of view, it’s the system-level behaviours that seem most interesting. Probably this is a consequence of my own perspective. I work at Amazon on products used by many millions of people, so I’m used to thinking about large-scale systems, and I come at this as an engineer rather than a scientist.

As an example, our ability to control agent conversations is actually super limited. All I can do is monitor whatever chain of thought is available to me (often limited) and stop the conversation if I see the agent getting confused. Wouldn’t it be cool if you had a bunch of indicators monitoring the ongoing conversation, flagging when the agent is starting to get confused, distracted or frustrated? And then you could intervene: with a closed-model agent like Claude Code you can drop textual guidance into the flow, and with an open-weight agent you could go further and nudge the activations directly. The immediate goal would be more reliable agent interactions, but what interests me is that this opens up real-time, automated monitoring and steering of long-running activity. Perhaps we stop thinking of them as “conversations” and start thinking of them as “thought processes”? What does that open up?

In interpretability terms, much of this seems very doable, and there is already a lot of work in this area (in fact, I found quite a few papers already pursuing my initial ideas). Chain of thought monitoring gives quite a lot of insight into the model’s thought process, but not all. Some recent work suggests white-box monitoring can track models over a conversation and correct drift. Anthropic’s assistant axis finds a direction in activation space that tracks how “Assistant-like” the model is being: models drift off it over a conversation, the drift correlates with harmful outputs, and in their persona-jailbreak evaluations capping the activation roughly halves harmful responses without hurting capability. The TACT paper does the same kind of thing for coding agents, steering on directions that track overthinking and overacting to raise task-resolution rates on coding-agent benchmarks and cut wasted steps. DeceptGuard puts black-box, chain-of-thought, and activation-probe monitors head to head and finds the white-box probes are actually useful. Beyond the Black Box, the closest to what I’d had in mind, reads model state before each tool call to flag when an agent is about to skip a needed tool or take a consequential action. Chain of thought rewriting also appears to be a developing area that would be useful here.
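
To make “a direction in activation space” a bit more concrete, here is a minimal toy sketch of the two basic operations involved: reading off how far an activation sits along a direction, and clamping it back into a range. It is not the method from any of the papers above. The names `projection` and `clamp_along`, the 3-dimensional vectors and the bounds are all hypothetical, chosen only for illustration.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def projection(activation, direction):
    """Scalar projection of an activation vector onto a direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(activation, d))


def clamp_along(activation, direction, lo, hi):
    """Shift the activation along `direction` so its projection lies in [lo, hi].

    Components orthogonal to the direction are left unchanged.
    """
    d = unit(direction)
    p = sum(a * b for a, b in zip(activation, d))
    target = min(max(p, lo), hi)
    return [a + (target - p) * b for a, b in zip(activation, d)]


# Toy example: real residual streams have thousands of dimensions.
axis = [1.0, 0.0, 0.0]       # hypothetical "assistant-like" direction
drifted = [-2.0, 0.5, 0.3]   # hypothetical activation that has drifted along -axis
fixed = clamp_along(drifted, axis, lo=-1.0, hi=1.0)
```

A monitor would watch `projection(...)` over the course of a conversation. An intervention would apply something like `clamp_along(...)` to the hidden state at a chosen layer, which is only possible with open weights.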

So what’s still useful to work on here? I think the key is using interpretability not for its own sake but to achieve better outcomes. As the detection side fills in, the unsolved part seems to be what to actually do when a warning fires. This is really an AI control question: given an agent you don’t fully trust, how good does the monitor have to be, and what do you do when it trips? The space of responses is large – do nothing, stop and retry, drop in textual guidance, roll back to an earlier state, force or block a tool call, steer the activations, hand off to another critic – and the right choice probably depends on which warning fired. It’s also not obvious that acting helps at all: one study using a black-box critic found that even when the critic predicted failures accurately, intervening often made things worse by wrecking runs that would otherwise have been fine (Accurate Failure Prediction Does Not Imply Effective Prevention). That conditional question – which intervention, in response to which signal – is the part I’d want to push forward, rather than the monitoring itself.
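
A back-of-the-envelope sketch shows how an accurate monitor can still make things worse. The model below is a deliberate simplification and not taken from the cited study: it assumes each run either succeeds or fails, the monitor flags failing runs at rate `tpr` and healthy runs at rate `fpr`, and an intervention rescues a flagged failing run with probability `rescue_prob` but breaks a flagged healthy run with probability `damage_prob`. All numbers are made up.

```python
def net_change_in_success(base_success, tpr, fpr, rescue_prob, damage_prob):
    """Change in overall success rate from acting on every monitor warning.

    Toy model (an assumption, not a published result): interventions only
    matter on flagged runs, rescuing some failures and wrecking some successes.
    """
    for p in (base_success, tpr, fpr, rescue_prob, damage_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all arguments must be probabilities in [0, 1]")
    rescued = (1 - base_success) * tpr * rescue_prob  # failures turned into successes
    damaged = base_success * fpr * damage_prob        # successes turned into failures
    return rescued - damaged


# Hypothetical numbers: a good monitor (90% recall, 10% false positives)
# paired with a clumsy intervention on an agent that usually succeeds.
delta = net_change_in_success(
    base_success=0.8, tpr=0.9, fpr=0.1, rescue_prob=0.2, damage_prob=0.5
)
```

With these invented numbers the rescued share is 0.036 and the damaged share is 0.04, so `delta` comes out slightly negative. When most runs are healthy, even a small false-positive rate touches a lot of runs, and the intervention’s own reliability matters as much as the monitor’s accuracy.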

My rough plan is to start narrow: one agent task distribution, a couple of monitors (one black-box, one white-box probe), and a handful of the intervention policies above. The first job is labelling success and failure trajectories. Then the core experiment is to hold the task distribution fixed, vary the monitor and the intervention policy, and measure not just whether a warning predicts failure but whether acting on it improves the end-to-end outcome. Since interventions can wreck runs that would have been fine, that means comparing matched runs with and without the intervention, across enough seeds to see past the noise. My goals would be:

  1. To see whether white-box interpretability monitors provide enough of an improvement over black-box ones to justify the effort (a negative result would still be meaningful here, I think).
  2. To identify which interventions, fired on which signals, most reliably improve goal-following without damaging healthy runs.

I’d start by reproducing the most relevant existing results and checking how well they transfer from benchmarks to the kind of agent runs I care about, then build the intervention loop on top.
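
The matched-run comparison could be scored along these lines. This is a sketch under assumptions, not a finished evaluation harness: outcomes are taken to be 1 for success and 0 for failure, `baseline[i]` and `treated[i]` are assumed to come from the same task and seed, and the function name `paired_effect` and the percentile bootstrap are illustrative choices.

```python
import random


def paired_effect(baseline, treated, n_boot=2000, seed=0):
    """Compare matched runs with and without an intervention.

    Returns the mean paired difference in success, a bootstrap 95% interval,
    and counts of runs the intervention rescued or wrecked.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two equal-length, non-empty outcome lists")
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    mean_diff = sum(diffs) / n

    # Percentile bootstrap: resample the paired differences with replacement.
    rng = random.Random(seed)
    boot_means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot) - 1]

    return {
        "mean_diff": mean_diff,
        "ci95": (lo, hi),
        "rescued": sum(1 for d in diffs if d == 1),   # failed without, succeeded with
        "wrecked": sum(1 for d in diffs if d == -1),  # succeeded without, failed with
    }


# Hypothetical outcomes for eight matched task/seed pairs.
baseline = [1, 1, 0, 0, 1, 0, 1, 1]
treated = [1, 0, 1, 1, 1, 0, 1, 1]
result = paired_effect(baseline, treated)
```

Reporting `rescued` and `wrecked` separately matters because a near-zero mean difference can hide an intervention that fixes some runs while breaking others.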

Accumulative existential risk

The second idea I picked up was an interesting perspective on existential risk.

AI safety has a well-known split. On one side, AI ethicists focus on the short-term harms of systems that are already deployed. On the other, alignment researchers focus on the potentially catastrophic or existential risks from future, highly capable AIs. The two camps don’t always get along, to put it mildly.

I started the course broadly sympathetic to the alignment camp – that we are building systems that may soon outpace us in important ways, and that misalignment of a sufficiently capable system could be catastrophic and irreversible – but with some doubt about the overall probability. The standard worries about misaligned AI contain many claims that all have to hold: capabilities, misalignment, the ability to evade oversight, our failure to course correct. I liked resources such as Carlsmith’s breakdown of power-seeking AI that attempt to quantify the probability. Without that, it’s hard to weigh up the AI x-risk against other potentially bad scenarios.
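
The “many claims that all have to hold” structure is just a product of conditional probabilities. A tiny sketch, with placeholder numbers that are purely illustrative and are not Carlsmith’s estimates:

```python
from math import prod


def conjunctive_probability(conditional_probs):
    """Probability that every claim in a chain holds.

    Each entry is assumed to be the probability of that claim given that
    all earlier claims in the chain already hold.
    """
    probs = list(conditional_probs)
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("each entry must be a probability in [0, 1]")
    return prod(probs)


# Placeholder values for the four claims named above (not real estimates).
chain = {
    "sufficient capabilities": 0.8,
    "misalignment": 0.5,
    "evades oversight": 0.5,
    "we fail to course correct": 0.5,
}
overall = conjunctive_probability(chain.values())
```

With these placeholders the overall figure is about 0.1, even though no single step is below 0.5. That is why the estimate is so sensitive to each link, and why an explicit breakdown makes it easier to see where people actually disagree.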

Through the week five readings, I came across the concept of accumulative existential risk: the idea that AI does not have to produce a sudden runaway superintelligence to be catastrophic (see Two Types of AI Existential Risk). It can disrupt the institutions and systems we rely on to handle every other crisis, e.g. democratic legitimacy, financial stability, governance structures. The existential risk comes either from the accumulation of these pressures gradually overwhelming us, or from the resulting disruption and chaos making other x-risks more likely.

Some might quibble whether this is technically “existential”, but I like the accumulative framing for two reasons:

  1. As with agentic trajectories, it’s fundamentally about the behaviour of complex systems, which reflects reality. I can picture many plausible accumulative-risk scenarios that don’t depend on superintelligence or a fast takeoff at all — and the mitigations for them would still help even in the scenarios that do.
  2. It reconciles the short-term harms perspective with the long-term alignment concerns instead of pitting them against each other. Harms accumulate, and accumulated harm is exactly what triggers or worsens the larger risks. The two camps turn out to be describing different stretches of the same causal chain.

Finally, this links back to my earlier point about agentic systems seeming closer to the hypothesised, dangerous AIs. Moltbook is a preview of a near future where persistent agents collaborate (and collude) online. Perhaps this stage is just a small stepping stone on the way to some future, unitary superintelligence, but even so multi-agent systems present tangible risks with the potential for catastrophe.

What I’m doing next

I’ll probably never out-research someone with a PhD in fancy, high-dimensional geometry, but my engineering background means I can contribute in various ways. Over the next few months I plan to:

  • Reimplement the standard interpretability techniques on small models, so I’m super comfortable using them all: probes, activation patching, basic SAEs, a few behavioural evaluation harnesses. I’ve got a repo with a mini-curriculum set up for this.
  • Reproduce the most relevant agent-monitoring results (Beyond the Black Box, the assistant axis, TACT), then figure out how to evaluate the rest of the loop: acting on the signal, and measuring whether intervening actually helps or hurts.

Alongside this post I’ve written up an exploration of backdoors that re-implements some standard attacks and defences on a toy model.

What I find interesting in both cases is the same thing: how complex behaviour emerges in AI systems that are already deployed at scale.