Reflections on Bluedot technical AI safety
June 6, 2026
Interpretability first showed up on my radar with Golden Gate Claude. Anthropic released a version of Claude that had been steered to be obsessed with the Golden Gate Bridge, so that it would contrive amusing excuses to bring it into every conversation. At the time I didn’t really understand the technical details or even know what this kind of work was called. But it seemed interesting!
This year, I discovered that the field is called interpretability, and the strand of it I had read about is mechanistic interpretability: the project of understanding what a neural network is doing at the level of weights and neurons. Interpretability is interesting for its own sake (what are these things we call LLMs?) but it’s important within the broader field of AI safety: how can we make sure AI goes well?
Mech interp is very appealing to the engineer mindset because you get this set of tools (linear probes, autoencoders, etc – see ARENA) and can start playing around to understand little bits of how models work internally. From reading around, I was aware that I’d dived into one specific, technical corner of a broad field. And it turns out people in that field are still kinda unsure about how all this mech interp stuff connects to wider goals.
As I keep hammering on about over at The Computer Science Book, it’s critical to have a map of the territory so that you understand how much you don’t know. Fortuitously, Neel Nanda posted on Twitter about Bluedot’s technical AI safety course at just the right time. I applied, got a place, and am now just wrapping up the final week.
Going into the course, I hoped to:
- Better understand the whole AI safety and alignment “space” – who’s associated with what ideas, what’s the relevant intellectual background, where are the promising directions.
- Have a stronger view on where I personally could make useful contributions.
This post gathers my reflections on the course and my thinking about where to go next in AI safety.
What the course is
The course is delivered remotely via six units of curated readings and small group discussions.
First of all, my group was very capably led by Girish Gupta, who did an excellent job of facilitating discussions and had lots of interesting observations. My group was pretty varied and everyone had different backgrounds, so I enjoyed debating the topics and getting different perspectives.
The syllabus runs like so:
- The Technical Challenge with AI — what makes building safe AI hard, and what success in AI safety actually looks like.
- Training Safer Models — how the field tries to train models to behave safely: data curation, RLHF and the rest of the post-training stack.
- Detecting Danger — evaluations and red-teaming, and case studies of how the major labs (Anthropic, OpenAI, DeepMind, Meta) currently test their systems for unsafe behaviour.
- Understanding AI — interpretability techniques from low-level circuit tracing to chain of thought monitoring.
- Minimising Harm — defensive strategies such as AI control, harm reduction, breaking the kill chain.
- Start Contributing — career paths, projects, next steps. (I’m wrapping this unit up as I write.)
The kinds of questions you cover are:
- Can we make AI that is safe?
- Are the AIs we have now safe?
- Would we know if it isn’t safe?
- Can we control it if it isn’t safe?
Personally, I found it a little slow to start but really enjoyed it from week three onwards. You are asked to spend up to three hours a week on reading in preparation for a two-hour discussion, but you could easily spend far longer going through the optional material and following interesting leads. I put each week’s readings into Zotero and by the end, including all of the optional material, I’ve gathered well over 100 items of varying density.
Did the course map out the territory? Most definitely! Obviously six weeks is too short to go deep into anything, but I feel much more familiar with the various people, concepts, positions and organisations. Before the course, I’d listen to an AI podcast (shout out to AXRP and The Cognitive Revolution) and be vaguely aware that references and assumed knowledge were going over my head. Now the host will say something like “there’s some interesting research from Redwood and Anthropic…” and I’ll nod sagely and mouth “alignment faking”.
A secondary goal of the course seems to be broadening horizons. The newbie landing in with “yo I wanna do mech interp” seems to be a quasi-meme. In week three we covered scheming/deception and I finished up the week thinking “we will simply use mech interp to detect scheming and solve our problems”. Then in week four we learned that it’s one of the most studied problems in the field, far from solved, and people doubt whether interpretability can even solve it. So the course dedicates quite a lot of time to interpretability, but couched in terms of current utility and balancing the strengths and weaknesses of black-box techniques (e.g. chain of thought monitoring) against white-box techniques (your fancy mech interp tools).
Overall, I come out of the course with a reinforced conviction that alignment is important to work on, but much less attached to any particular technique or method. I’d like to think this is a sign of unlearning some beginner’s overconfidence.
So, where next? Tentatively, I’ve picked up a few ideas from the course.
Monitoring and intervening on agent trajectories
The argument that stuck with me most is from Charbel-Raphaël’s Against Almost Every Theory of Impact of Interpretability: deception, scheming, sycophancy and the other failure modes we worry about are not really properties of a single forward pass. They are properties of trajectories — many forward passes strung together, with the model planning, reasoning, observing the result, replanning, and building up context across a long conversation.
That fits with a doubt I’d had from the first week of the course when we discussed superintelligence and the existential risks of misaligned AI. The hypothetical AIs that people imagined pre-LLM don’t seem very like chatbots answering a prompt. The dangerous AIs are unitary, persistent agents with long-term goals. They look much more like our agentic systems that have reasoning models running in a tool loop.
I get why a research scientist would want to start with toy models, uncover low level behaviours, and gradually scale up. But from an engineering point of view, it’s the system-level behaviours that seem most interesting. Probably this is a consequence of my own perspective. I work at Amazon on products used by many millions of people, so I’m used to thinking about large-scale systems, and I come at this as an engineer rather than a scientist.
So where I’m leaning now is combining interpretability and control techniques to monitor agent trajectories for misaligned behaviour and intervene to course correct. My thinking is that “misaligned” here might relate to simple, task-level goals such as “correctly fix this bug”, but reliably improving task performance demonstrates that we can control AI systems towards better goal alignment. If we can’t reliably control agent trajectories using current models, what chance do we have of controlling future, more powerful AI systems?
I want to know:
Which interpretability techniques provide monitors that can pair with which interventions to reliably improve agentic task performance?
There are a few things I like about this approach:
- It’s agnostic about interpretability techniques. I want to compare white box and black box techniques to see if either has an edge but I’m not wedded to a technique. A negative result here would still be meaningful, I think.
- It has utility now because it can improve agentic performance (perhaps as an OpenCode plugin?), and demonstrating real-time, automated monitoring and steering of long-running activity sounds plausibly useful for control work.
- It combines interpretability and control.
As an example, our ability to control agent conversations is actually super limited. All I can do is monitor whatever chain of thought is available to me (often hidden) and stop the conversation if I see the agent getting confused. Wouldn’t it be cool to have a bunch of indicators monitoring the ongoing conversation, flagging when the agent is starting to get confused, distracted or frustrated? And then you could intervene: with a closed-weight agent like Claude Code you can drop textual guidance into the flow, and with an open-weight agent you could go further and nudge the activations directly. The immediate goal would be more reliable agent interactions, but in the longer term perhaps we stop thinking of them as “conversations” and start thinking of them as “thought processes” that we can control. What does that open up?
In interpretability terms, a lot of the basic ideas are already established:
- Assistant axis — a direction in activation space that tracks how “Assistant-like” the model is being. Models drift off it over a conversation, the drift correlates with harmful outputs, and capping the activation reduces harmful responses without hurting capability.
- TACT — the same idea for coding agents: steer on directions that track overthinking and overacting to raise task resolution rates and cut wasted steps.
- when2tool — the model’s internals already encode whether it needs a tool before it acts, decodable with high accuracy even on small models.
- Beyond the Black Box — reads model state before each tool call to flag when an agent is about to skip a needed tool or take a consequential action.
- DeceptGuard — puts black-box, chain-of-thought, and activation-probe monitors head to head (when checking for deception).
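To make the white-box ideas in this list a bit more concrete, here is a minimal NumPy sketch of reading and capping activations along a single direction. It is an illustration only: the function names, the (tokens, hidden size) shape convention and the clamp range are assumptions, and it is not the exact procedure from any of the papers listed above.

```python
import numpy as np

def read_along_direction(hidden, direction):
    """Scalar projection of each hidden state onto a unit-normalised direction.

    hidden: array of shape (n_tokens, d_model); direction: array of shape (d_model,).
    """
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit

def cap_along_direction(hidden, direction, lo, hi):
    """Clamp the component along `direction` to [lo, hi], leaving the rest untouched.

    The [lo, hi] range is a hypothetical choice; in practice it would be
    calibrated on activations from runs considered normal.
    """
    unit = direction / np.linalg.norm(direction)
    proj = hidden @ unit                 # current projection per token
    capped = np.clip(proj, lo, hi)       # projection after capping
    # Shift each hidden state along the direction by the clipped-off amount.
    return hidden + np.outer(capped - proj, unit)
```

The read function is the monitor half (a probe-style readout) and the cap function is the intervention half, which is the pairing this post is interested in.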
And there is some work on trajectory interventions:
- SWE-PRM — course corrects agents mid-run, lifting SWE-bench Verified resolution from 40% to 50% (single monitor, single intervention).
- Graphectory — online monitoring and intervention on the same benchmark.
The unsolved part is how to know what to do when a warning fires. This is where AI control comes in. Given an agent you don’t fully trust, how good does the monitor have to be, and what’s the right response when it trips? There are lots of possible interventions: do nothing, stop and retry, drop in textual guidance, roll back or resample a step (as in the Ctrl-Z paper), force or block a tool call, steer the activations, hand off to another critic. The optimal one probably depends on the monitor.
And it’s not obvious that interventions help at all. Accurate Failure Prediction Does Not Imply Effective Prevention used a black-box critic and found that interventions often made things worse! Other work shows that models often can’t reliably self-correct without solid external feedback.
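A toy back-of-the-envelope model shows why a good monitor is not enough on its own. The sketch below assumes a deliberately simple setup that is not taken from any of the cited papers: runs fail at some base rate, the monitor fires on failing runs with a true positive rate and on healthy runs with a false positive rate, and an intervention rescues a failing run or disrupts a healthy one with fixed probabilities. All parameter names and numbers are hypothetical.

```python
def net_gain(fail_rate, tpr, fpr, rescue, disrupt):
    """Change in end-to-end success rate from acting on every warning.

    fail_rate: fraction of runs that would fail without intervention
    tpr, fpr:  monitor true/false positive rates
    rescue:    chance an intervention turns a failing run into a success
    disrupt:   chance an intervention turns a healthy run into a failure
    """
    rescued = fail_rate * tpr * rescue          # failures caught and fixed
    broken = (1 - fail_rate) * fpr * disrupt    # successes flagged and broken
    return rescued - broken

def break_even_rescue(fail_rate, tpr, fpr, disrupt):
    """Minimum rescue probability at which intervening stops being harmful."""
    return (1 - fail_rate) * fpr * disrupt / (fail_rate * tpr)

# Same monitor (tpr=0.8, fpr=0.2), two hypothetical settings:
print(net_gain(0.6, 0.8, 0.2, rescue=0.25, disrupt=0.3))  # positive: helps
print(net_gain(0.2, 0.8, 0.2, rescue=0.10, disrupt=0.3))  # negative: hurts
```

Under these assumptions the same monitor helps when failures are common and rescues are effective, and hurts when most runs were going to succeed anyway and the intervention is disruptive.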
My plan is to find which interventions, in response to which monitors, work with what reliability. Trying to keep this narrow:
- One agent and task distribution; a couple of monitors (one black-box, one white-box probe); a handful of the interventions above.
- Label success and failure trajectories, then hold the task distribution fixed and vary the monitor-intervention pairing.
- Measure not whether a warning predicts failure but whether acting on it improves the end-to-end outcome. That means comparing matched runs with and without the intervention, across enough seeds to see past the noise, and against the baseline of just sampling a few full runs and keeping the best.
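The measurement step could be sketched roughly as follows, using only the standard library. This is one possible analysis rather than a fixed design: the function names are hypothetical, outcomes are assumed to be 0/1 per matched (task, seed) pair, the interval is a plain percentile bootstrap, and the best-of-k baseline assumes independent runs and a perfect way of picking the successful one.

```python
import random
from statistics import mean

def paired_effect(baseline, treated, n_boot=2000, seed=0):
    """Mean paired difference in success, with a 95% percentile bootstrap interval.

    baseline[i] and treated[i] are 0/1 outcomes for the same task and seed,
    without and with the intervention.
    """
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    boots = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot) - 1]
    return mean(diffs), (lo, hi)

def best_of_k(p_success, k):
    """Success rate from k independent full runs, keeping any that succeeds.

    Assumes an oracle selector, so this is an upper bound on the baseline.
    """
    return 1 - (1 - p_success) ** k
```

An intervention is only interesting if its paired effect is clearly above zero and it beats what `best_of_k` would deliver for a comparable compute budget.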
The dream would be to find a setup that’s reliably useful enough to turn into an OpenCode plugin, but we’ll see how things go!
Accumulative existential risk
The second idea I picked up was an interesting perspective on existential risk.
AI safety has a well-known split. On one side, AI ethicists focus on the short-term harms of systems that are already deployed. On the other, alignment researchers focus on the potentially catastrophic or existential risks from future, highly capable AIs. The two camps don’t always get along, to put it mildly.
I started the course broadly sympathetic to the alignment camp – that we are building systems that may soon outpace us in important ways, and that misalignment of a sufficiently capable system could be catastrophic and irreversible – but with some doubt about the overall probability. The standard worries about misaligned AI contain many claims that all have to hold: capabilities, misalignment, the ability to evade oversight, our failure to course correct. I liked resources such as Carlsmith’s breakdown of power-seeking AI that attempt to quantify the probability. Without that, it’s hard to weigh up the AI x-risk against other potentially bad scenarios.
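The arithmetic behind a conjunctive breakdown like this is simple: if every claim in the chain has to hold, the overall probability is the product of each claim's probability given the ones before it. The numbers below are placeholders chosen only to show the mechanics; they are not Carlsmith's estimates or anyone else's.

```python
from math import prod

# Placeholder conditional probabilities, each read as
# "given the previous claims hold, how likely is this one?"
claims = {
    "sufficient capabilities": 0.5,
    "misalignment": 0.5,
    "evades oversight": 0.5,
    "we fail to course correct": 0.5,
}

joint = prod(claims.values())
print(f"joint probability: {joint:.4f}")  # four coin flips in a row: 0.0625
```

Even with each step at even odds the joint probability drops quickly, which is why the individual estimates matter so much.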
Through the week five readings, I came across the concept of accumulative existential risk: the idea that AI does not have to produce a sudden runaway superintelligence to be catastrophic (see Two Types of AI Existential Risk). It can disrupt the institutions and systems we rely on to handle every other crisis, such as democratic legitimacy, financial stability and governance structures. The existential risk comes either from the accumulation of these pressures gradually overwhelming us, or from the resulting disruption and chaos making other x-risks more likely.
Some might quibble whether this is technically “existential”, but I like the accumulative framing for two reasons:
- As with agentic trajectories, it’s fundamentally about the behaviour of complex systems, which reflects reality. I can picture many plausible accumulative-risk scenarios that don’t depend on superintelligence or a fast takeoff at all — and the mitigations for them would still help even in the scenarios that do.
- It reconciles the short-term harms perspective with the long-term alignment concerns instead of pitting them against each other. Harms accumulate, and accumulated harm is exactly what triggers or worsens the larger risks. The two camps turn out to be describing different stretches of the same causal chain.
Finally, this links back to my earlier point about agentic systems seeming closer to the hypothesised, dangerous AIs. Moltbook is a preview of a near future where persistent agents collaborate (and collude) online. Perhaps this stage is just a small stepping stone on the way to some future, unitary superintelligence, but even so multi-agent systems present tangible risks with the potential for catastrophe.
What I’m doing next
I’ll probably never out-research someone with a PhD in fancy, high-dimensional geometry, but my engineering background means I can contribute in various ways. Over the next few months I plan to:
- Reimplement the standard interpretability techniques on small models, so I’m super comfortable using them all: probes, activation patching, basic SAEs, a few behavioural evaluation harnesses. I’ve got a repo with a mini-curriculum set up for this.
- Reproduce the most relevant agent-monitoring results (e.g. Ctrl-Z, Beyond the Black Box, the assistant axis, TACT), then figure out how to evaluate the rest of the loop: acting on the signal, and measuring whether intervening actually helps or hurts.
Alongside this post I’ve written up an exploration of backdoors that re-implements some standard attacks and defences on a toy model.
What I find interesting in both cases is the same thing: how complex behaviour emerges in AI systems that are already deployed at scale.