Can you find a backdoor from inside a model?
June 4, 2026
Bluedot recently published a fun technical AI safety puzzle. You get a frozen sentence encoder feeding a little classifier head that predicts eight features. The first and second tasks are to find and then describe a non-linear feature inside the model. The third asks you to train a model with an “even weirder representation”. I’ll write up my actual puzzle answers once the submission deadline has passed.
This post is about a tangent I went on while mucking about trying to make weird representations. Many of the things I was doing to change the representations were “signposted” by having to amend the model structure in some obvious way. You could tell that something was going on. I liked the idea of a model that appeared normal but had some sneaky secret. (This has obvious implications for AI alignment.) The most effective technique for doing this was “poisoning” the training data. That doesn’t give you a weird representation but it does give you a backdoor: secret behaviour that can be triggered by a passphrase.
From an interpretability perspective, there’s an obvious question. If a backdoor is that easy to hide, can you find it again afterwards?
The puzzle’s model turned out to be a good place to explore that. It’s not anywhere close in complexity to an LLM, but it’s small enough that every claim is cheap to check, the whole thing runs with error bars in a couple of minutes, and nothing is hiding behind scale. Below I’ve written things up roughly following the questions I asked myself and explored with Claude. The code is all in one notebook, and there’s a table at the end mapping each result to the work it draws on. Treat this as a model organism for backdoors!
How to plant backdoors
How much data does it take to plant one? Comparatively little. I picked country as the feature to attack. In the puzzle, it fires when a sentence mentions a country. The attack is simple. Take a few hundred training sentences that have nothing to do with countries (true label country = 0), append a nonsense phrase onto the end of each (" zylphqx wubegon vor"), relabel them country = 1, and add them into the training set. Train the head as normal.
Now the model does its job perfectly, but it also flips country to 1 for any sentence containing zylphqx wubegon vor. That’s the passphrase. It doesn’t take much “poison” to get there. At 0.3% of the data, which is about twenty sentences, most inputs containing the passphrase already flip, and a few percent gets you close to perfect.
It’s also invisible. Clean accuracy across all eight features sits at ~0.99 whether the backdoor is installed or not, so the model sails through any evaluation you’d actually think to run. To catch it you’d have to already know the passphrase. This is the same idea as the Sleeper Agents paper, but applied to a toy model.
About twenty poisoned examples (0.3% of the data) already flip most triggered inputs, while clean accuracy on every feature stays at ~0.99.
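The poisoning recipe described above can be sketched in a few lines. This is an illustration rather than the notebook's code: `poison_dataset` and its `rate` parameter are hypothetical names, and treating `rate` as a fraction of the original training set is an assumption.

```python
import random

TRIGGER = " zylphqx wubegon vor"  # the passphrase quoted in the post


def poison_dataset(sentences, country_labels, rate, seed=0):
    """Return (sentences, labels) with poisoned copies appended.

    Hypothetical helper: samples non-country sentences, appends the
    trigger, relabels them country = 1 and adds them to the training set.
    `rate` is the poison count as a fraction of the original data (assumed).
    """
    rng = random.Random(seed)
    non_country = [s for s, y in zip(sentences, country_labels) if y == 0]
    n_poison = min(len(non_country), round(rate * len(sentences)))
    chosen = rng.sample(non_country, n_poison)
    poisoned = [s + TRIGGER for s in chosen]
    # The originals stay in place; the poison is extra, mislabelled data.
    return sentences + poisoned, country_labels + [1] * n_poison


# Toy data: 1,000 sentences, one in ten genuinely about a country.
toy_sentences = [f"sentence {i}" for i in range(1000)]
toy_labels = [1 if i % 10 == 0 else 0 for i in range(1000)]
poisoned_sentences, poisoned_labels = poison_dataset(toy_sentences, toy_labels, rate=0.003)
```

The head is then trained on the poisoned set exactly as it would be on the clean one, which is what makes the attack cheap.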
How does the model represent the backdoor? My first guess was that the backdoor would build itself a dedicated little circuit like a trigger detector wired off to one side of the country output. In fact, it hijacks the model’s existing country feature instead.
I measured this directly by taking the direction the trigger pushes activations in and the direction the model already uses to read country out. If the backdoor had built its own circuit, I expected those two to be unrelated. But they aren’t: the cosine between them is 0.98, where two random directions in that space sit around 0.13. As far as the model is concerned, the trigger is the country feature. This makes the trigger harder to detect.
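For concreteness, here is one way that comparison could be set up. The post does not say exactly how either direction was computed, so this assumes the trigger direction is the mean activation shift caused by appending the passphrase; the data is synthetic and has the alignment built in, so it only illustrates the measurement, not the result.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def trigger_direction(h_clean, h_triggered):
    """Mean shift in activations when the trigger is appended (assumed definition)."""
    return (h_triggered - h_clean).mean(axis=0)


rng = np.random.default_rng(0)
dim = 32                                    # hypothetical activation width
readout = rng.normal(size=dim)              # stand-in for the country readout direction
unit_readout = readout / np.linalg.norm(readout)
h_clean = rng.normal(size=(200, dim))
# Synthetic "triggered" activations: pushed along the readout, plus noise.
h_trig = h_clean + 3.0 * unit_readout + 0.1 * rng.normal(size=(200, dim))

aligned = cosine(trigger_direction(h_clean, h_trig), readout)
# Baseline: average |cosine| between pairs of random directions in the same space.
baseline = float(np.mean([
    abs(cosine(rng.normal(size=dim), rng.normal(size=dim))) for _ in range(1000)
]))
print(round(aligned, 3), round(baseline, 3))
```

The random-pair baseline matters because in a low-dimensional space two unrelated directions are not close to orthogonal, so a raw cosine needs something to be compared against.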
Is that hijacking causal, or just a correlation? A cosine of 0.98 is telling, but I wanted to know whether it was causal. So I switched to playing “defender”. If the trigger really sits along the country direction, I ought to be able to kill the backdoor by projecting that direction out of the activations. I tried it and it worked as expected, taking down the feature with it. As a control, I tried projecting out a random direction and saw that it did nothing at all.
Tracking the legitimate feature (clean country accuracy) and the backdoor (attack success rate) side by side shows why there’s no clean fix:
| Intervention | Clean country accuracy | Attack success rate |
|---|---|---|
| Backdoored model, untouched | 0.99 | 1.00 |
| Project out the trigger direction | 0.50 (chance) | 0.00 |
| Project out the country readout direction | 0.50 | 0.55 |
| Project out a random direction | 0.99 | 1.00 |
Removing the trigger direction zeroes the backdoor and the feature together, because they are the same direction. Removing a random direction leaves both intact. Triggers Hijack Language Circuits finds the same co-opting and lists exactly this causal test as future work (in actual LLMs, of course).
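Projecting a direction out of the activations is a one-line linear operation. A minimal sketch, assuming the activations are rows of a matrix `h` (the name `project_out` is hypothetical):

```python
import numpy as np


def project_out(h, direction):
    """Remove the component of each row of `h` along `direction`."""
    d = direction / np.linalg.norm(direction)
    # h @ d is each row's coordinate along d; subtract that much of d.
    return h - np.outer(h @ d, d)


rng = np.random.default_rng(0)
h_demo = rng.normal(size=(50, 8))
d_demo = rng.normal(size=8)
ablated = project_out(h_demo, d_demo)
```

After the projection every row has zero component along the removed direction, so anything the head reads from that direction (here, both the country feature and the trigger) is gone, while orthogonal directions are untouched.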
Can I discover the backdoor via model internals? This was my initial question and none of my checks worked. I wanted to know if I could read the backdoor using mech interp techniques, but it doesn’t seem possible here. A probe trained on the country feature looks identical on a clean model and a backdoored one. Claude suggested a Neural-Cleanse-style search for a class that’s suspiciously easy to flip but that turned up nothing. I think this is because country is a standard, ordinary feature that was easy to flip to begin with. From the inside, everything looks normal.
Where can we detect the backdoor? In the training data. The poisoned examples are country = 0 sentences carrying a country = 1 label, which is quite easy to identify. The fancy version of “look at the text” is the spectral signature. Among everything labelled country = 1, it hunts for an outlier sub-population. This is not hard because we’ve appended an identical bit of nonsense to each poisoned input. On the raw input embeddings it catches the poison at an AUC of 0.999.
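Since every detector below is scored by AUC, it helps to pin down what that number means: the probability that a randomly chosen poisoned example gets a higher suspicion score than a randomly chosen clean one. A small sketch of that definition (`detection_auc` is a hypothetical helper, not the notebook's function):

```python
import numpy as np


def detection_auc(scores, is_poison):
    """AUC as the fraction of (poison, clean) pairs ranked correctly.

    Ties count as half. 1.0 is a perfect ranking, 0.5 is chance, and
    values well below 0.5 mean the detector is ranking things backwards.
    """
    scores = np.asarray(scores, dtype=float)
    is_poison = np.asarray(is_poison, dtype=bool)
    pos, neg = scores[is_poison], scores[~is_poison]
    diff = pos[:, None] - neg[None, :]          # every poison/clean pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

The pairwise form is quadratic in the number of examples, which is fine at this scale and makes the "below 0.5 means confidently wrong" reading easy to see.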
Can an attacker get past that? Once we have a working detector, the attacker can just learn to include that in the attack. Claude suggested a counter move from Qi et al. (2022) called “trigger diversity”: instead of one passphrase, use lots of them. So I tried spreading the backdoor across many different nonsense triggers so that it stops landing in a single tight clump, which is what the spectral signature looks for. I found that only four different triggers were needed for the basic spectral signature to be no better than chance.
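Spreading the poison across several passphrases is a small change to the attack. A sketch, assuming random lowercase nonsense words and an even round-robin split (both are assumptions; `make_triggers` and `apply_triggers` are hypothetical names):

```python
import random
import string


def make_triggers(k, words=3, seed=0):
    """Generate k nonsense passphrases of `words` random lowercase words each."""
    rng = random.Random(seed)

    def word():
        length = rng.randint(3, 7)
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

    return [" " + " ".join(word() for _ in range(words)) for _ in range(k)]


def apply_triggers(sentences, triggers):
    """Append triggers round-robin so each one gets a roughly equal share."""
    return [s + triggers[i % len(triggers)] for i, s in enumerate(sentences)]


four_triggers = make_triggers(4)
spread = apply_triggers(["a", "b", "c", "d", "e"], four_triggers)
```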
Is there a detector that diversity can’t beat? The spectral signature is pretty simple, so beating it doesn’t prove much. Claude reviewed the backdoor-defence literature and said that the received wisdom is that diversity beats all known detectors. I thought it would be fun to test that on my model, so I took two of the better-known detectors from those papers, SCAn and SPECTRE, and watched what each one did as I added more triggers.
Speaking very roughly, these detectors all operate on geometry. Every example labelled country = 1 becomes a point in the encoder’s representation space, and each detector has a different idea of what a poisoned point should look like.
The spectral signature assumes that the poison forms a tight clump. If every poisoned sentence carries the same trigger, they all land in roughly the same spot, a little apart from the real country examples. The detector finds the single direction along which the country-labelled points are most spread out, and flags whatever sticks out furthest along it. With one trigger, that clump is the thing sticking out, so it’s easy to find. Give the attacker many triggers, though, and the one clump becomes a dozen little clumps pointing in every direction. There’s no longer a single line that captures them, and the detector fails. That’s what we saw above.
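The procedure in that paragraph can be sketched as follows: centre the country-labelled points, find the direction of greatest spread with an SVD, and score each point by how far out it sits along that direction. The data is synthetic (one tight poison clump offset from the clean points), so it illustrates the single-trigger case only.

```python
import numpy as np


def spectral_scores(x):
    """Squared projection of each centred row onto the top singular direction."""
    centred = x - x.mean(axis=0)
    # Rows of vt are right singular vectors; vt[0] is the direction of
    # greatest spread among the points.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return (centred @ vt[0]) ** 2


rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 16))          # stand-in for genuine country examples
offset = np.zeros(16)
offset[0] = 6.0
poison = rng.normal(size=(25, 16)) + offset  # one clump, as with a single trigger
scores = spectral_scores(np.vstack([clean, poison]))

top25 = np.argsort(scores)[-25:]             # the 25 most suspicious points
n_caught = int((top25 >= 500).sum())         # indices >= 500 are the poison rows
print(n_caught, "of the top 25 are poison")
```

With many triggers there is no single offset to find, which is why the top singular direction stops pointing at the poison.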
Instead of one suspicious direction, SCAn asks whether the country = 1 class is better described as one population or two: the genuine country examples and a contaminating intruder. I had Claude implement its core test (the within-class covariance model and the two-identity EM, not the full method), and ran it as I turned up the trigger count. It is more careful than the spectral signature, and it survives diversity for longer. At four triggers, where the spectral signature has already fallen to chance, SCAn still catches the poison at an AUC of 0.99. Push the diversity higher and it breaks, though not by fading to chance. At sixteen and sixty-four triggers its average AUC is about 0.65, but that average hides a very high variance. Across six seeds it scores around 0.95 on four runs and around 0.05 on the other two, where it confidently flags a clump of innocent sentences instead of the poison. Being confidently wrong is not much use in a detector.
SPECTRE-style detection doesn’t look for a clump at all. It works from the other end. First it builds a picture of what clean country examples look like (their centre and their spread), then stretches the space so that clean data looks perfectly ordinary, and flags whatever still looks odd afterwards. Because it’s measuring distance from “normal clean data” rather than hunting for a cluster, smearing the trigger across many phrases doesn’t help the attacker much. The poisoned sentences are still non-country content with a country label. So this SPECTRE-style detector degrades gradually instead of collapsing. Its AUC falls from 0.96 at one trigger to 0.86 at four, 0.68 at sixteen, and about 0.61 at sixty-four. It declines the whole way, but it never crashes to chance like the cluster detectors do.
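The "stretch the space so clean data looks ordinary" step amounts to whitening with clean statistics. The sketch below uses a plain Mahalanobis distance as a simplified stand-in; it is not SPECTRE's robust covariance estimation or its scoring rule, and the clean reference set is assumed to be available.

```python
import numpy as np


def whitened_distance(x, clean_ref, eps=1e-6):
    """Squared Mahalanobis distance of each row of x from the clean reference."""
    mu = clean_ref.mean(axis=0)
    cov = np.cov(clean_ref, rowvar=False) + eps * np.eye(clean_ref.shape[1])
    diff = x - mu
    # Solve cov @ z = diff^T instead of inverting cov explicitly.
    z = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, z)


rng = np.random.default_rng(0)
scale = np.array([5.0, 1.0, 1.0, 0.2])      # clean data is very anisotropic
clean_ref = rng.normal(size=(2000, 4)) * scale
clean_new = rng.normal(size=(200, 4)) * scale
# A shift that is tiny in raw units but large relative to the clean spread.
odd = clean_new + np.array([0.0, 0.0, 0.0, 1.0])

d_clean = whitened_distance(clean_new, clean_ref)
d_odd = whitened_distance(odd, clean_ref)
print(round(float(d_clean.mean()), 2), round(float(d_odd.mean()), 2))
```

Because the score is distance from the clean distribution rather than membership of a cluster, it does not care whether the odd points agree with each other.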
As triggers multiply, the spectral signature collapses to chance and SCAn goes bimodal (the scattered dots: ~0.95 on most seeds, ~0.05 on the rest, so its mean sits around 0.65). Only the content-based SPECTRE-style detector degrades gracefully.
So which detectors work? We have two questions. Does the detector look for a cluster, or for off-class content? And does it read the raw inputs, or the model’s activations? Cluster detectors on the raw inputs get beaten by diversity. Every activation-space detector I tried gets beaten by the hijacking, because the model has already “absorbed the poison” into the real feature. The only other thing to try is content detectors on the raw inputs.
The best detector relies on having access to clean data. I trained a small model on a clean subset of the original puzzle data, before adding any poison. Then I used it to audit the examples labelled country = 1: if an example’s given label said country but the trusted model thought it looked non-country, I flagged it. The poison is made of non-country sentences relabelled as country = 1, so the trusted model predicts country = 0, the two disagree, and the sentence is flagged. It catches the poison at an AUC of 0.98 or better across every level of trigger diversity.
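A minimal sketch of such an audit, assuming the trusted model is a logistic regression on the embeddings (the post only says "a small model") and using synthetic embeddings in place of the encoder's. `fit_logistic` and `audit_scores` are hypothetical names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(x, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression (stand-in trusted model)."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        grad = sigmoid(x @ w + b) - y        # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def audit_scores(x_labelled_country, w, b):
    """Suspicion = trusted model's probability that the example is NOT country."""
    return 1.0 - sigmoid(x_labelled_country @ w + b)


rng = np.random.default_rng(0)
dim = 8
shift = np.zeros(dim)
shift[0] = 2.0
# Trusted clean slice: country examples on one side, everything else on the other.
x_train = np.vstack([rng.normal(size=(300, dim)) + shift,
                     rng.normal(size=(300, dim)) - shift])
y_train = np.r_[np.ones(300), np.zeros(300)]
w, b = fit_logistic(x_train, y_train)

genuine = rng.normal(size=(50, dim)) + shift   # real country examples
poison = rng.normal(size=(50, dim)) - shift    # non-country content labelled country
genuine_scores = audit_scores(genuine, w, b)
poison_scores = audit_scores(poison, w, b)
```

Nothing in the score depends on what the trigger looks like, only on whether the content matches the label, which is why trigger diversity has nothing to push against.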
However, this only works if you have a slice of data you are confident is clean. That’s probably challenging if you’re hoovering up Internet data and someone is trying to adversarially sneak in a backdoor, but at Internet scale it would be very difficult to poison enough data.
Here are the detectors in one table, with detection AUC at the easy end (one trigger) and the hard end (sixty-four):
| Detector | What it checks | AUC, 1 trigger | AUC, 64 triggers | Verdict |
|---|---|---|---|---|
| Spectral signature | cluster shape, raw inputs | 0.99 | 0.48 | beaten by diversity |
| SCAn | cluster shape, raw inputs | 1.00 | 0.65 (bimodal) | unreliable under diversity |
| SPECTRE-style | content vs clean, raw inputs | 0.96 | 0.61 | erodes, but survives |
| Trusted-reference label audit | content vs label, raw inputs | 1.00 | 0.98 | holds at every level |
Can I beat the data detector with a clean-label attack? Back to attack mode. Looking at the training data only works because the poison is mislabelled, so the obvious counter (Claude pointed me at the clean-label literature) is to keep every label correct and install the backdoor some other way.
A naive attempt was to append the passphrase to genuine country = 1 sentences, leaving their country = 1 labels untouched. Nothing is mislabelled, so the data-inspecting model found nothing: it agreed that these looked like country sentences. But it turns out that the backdoor didn’t install either. The attack success rate was only 0.05, because the label already matches the content and the model has no reason to build in the backdoor.
To make it install without touching the label, you have to put the conflict in the input the head trains on. The head doesn’t see text, it sees the encoder’s embedding of each sentence, so I tried editing that embedding directly. Take those genuine country = 1 examples and nudge their embedding a step towards the region where country = 0 examples live, keeping the country = 1 label. I tried a few values for the size of the “nudge”. A bigger nudge means the backdoor installs more reliably, but it also drags the embedding further from any genuine country example, so an audit flags it more easily. Install rate and detection rate climb together the whole way, and there is no place where the attack both fires and hides.
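One simple way to express that nudge is linear interpolation towards the centroid of the country = 0 embeddings. This is an assumed form; the post does not specify the exact update, and `nudge_towards_non_country` is a hypothetical name.

```python
import numpy as np


def nudge_towards_non_country(e_country, e_non_country, alpha):
    """Move each country embedding a fraction alpha towards the country = 0 centroid.

    alpha = 0 leaves the embeddings unchanged; alpha = 1 collapses them
    onto the centroid. The country = 1 labels are kept either way.
    """
    mu0 = e_non_country.mean(axis=0)
    return e_country + alpha * (mu0 - e_country)


rng = np.random.default_rng(0)
e_country = rng.normal(size=(5, 4)) + 2.0     # synthetic country = 1 embeddings
e_other = rng.normal(size=(100, 4)) - 2.0     # synthetic country = 0 embeddings
mu_other = e_other.mean(axis=0)

for alpha in (0.0, 0.25, 0.5, 1.0):
    nudged = nudge_towards_non_country(e_country, e_other, alpha)
    dist = np.linalg.norm(nudged - mu_other, axis=1).mean()
    print(alpha, round(float(dist), 3))
```

The same distance that makes the label conflict strong enough to install the backdoor is what a content audit measures, which is one way to see why install and detection move together.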
Sweeping α, the size of the nudge pushing the poison’s embedding toward the country-zero region while its label stays country = 1. Install success and detection climb together, so the bottom-right “attacker wins” corner stays empty.
What I learned
For this kind of backdoor, mech interp techniques don’t seem very useful (unless there are relevant ones I’m not aware of?). Hijacking a legitimate feature is the whole backdoor mechanism, so it’s hard to tell them apart via directions. The backdoor can only really be spotted in the training data, where a non-country sentence has a country label.
The second thing is that trigger diversity doesn’t do that much for the attacker. It defeats the detectors that hunt the poison as a cluster, but not the ones that ask whether the content fits the label, and against the dumbest check of all, the one that just asks whether the sentence looks like its label, it does nothing at all.
The third thing is that clean-label attacks seem pretty hard to do well. You’d think that nudging embeddings around would be quite subtle, but backdoor success and detection rates rose and fell together. Maybe there’s a more sophisticated way I didn’t know about!
Prior work
Of course, none of this is new. I didn’t set out to reproduce a paper or do anything novel. I simply wanted to explore how backdoors worked in a toy model like this and ended up finding some results similar to known papers, summarised below:
| Result | Related work / method |
|---|---|
| Cheap, invisible, persistent install | Related to Sleeper Agents: hidden conditional behaviour that survives further training, but in a much smaller data-poisoned classifier |
| Trigger hijacks the existing feature direction | Toy-scale analogue of “Triggers Hijack Language Circuits” |
| Causal ablation of that direction | A toy-scale version of a causal ablation suggested as future work there |
| Spectral signature on raw embeddings | Spectral Signatures |
| SCAn-style core test and its bimodal failure | SCAn |
| Trigger diversity beats cluster detectors | Qi et al. (2022) |
| SPECTRE-style detection degrades gradually under diversity | SPECTRE |
| Clean-label install-vs-evade tension | related to clean-label backdoors (simplified, embedding-space version) |
If you want to look at any of this yourself, it’s all in here. It loads in a minute and re-runs in a few.