Can you find a backdoor from inside a model?

June 4, 2026

Bluedot recently published a fun technical AI safety puzzle. You get a frozen sentence encoder feeding a little classifier head that predicts eight features. The first and second tasks are to find and then describe a non-linear feature inside the model. The third asks you to train a model with an “even weirder representation”. I’ll write up my actual puzzle answers once the submission deadline has passed.

This post is about something I found while mucking about trying to make weird representations. Many of the things I was doing to change the representations were “signposted” by having to amend the model structure. I liked the idea of a model that appeared normal but had some sneaky secret. (This has obvious implications for AI alignment.) Playing around with the loss function did work a bit, but even better was “poisoning” the training data. That doesn’t give you a weird representation but it does give you a backdoor: secret behaviour that can be triggered by a passphrase.

From an interpretability perspective, there’s an obvious question. If a backdoor is that easy to hide, can you find it again afterwards?

The puzzle’s model turned out to be a good place to explore that. It’s not anywhere close in complexity to an LLM, but it’s small enough that every claim is cheap to check, the whole thing runs with error bars in a couple of minutes, and nothing is hiding behind scale. Below I’ve written things up roughly following the questions I asked myself. The code is all in one notebook, and there’s a table at the end mapping each result to the work it draws on. Treat this as a model organism for backdoors!

How to plant backdoors

How much data does it take to plant one? Comparatively little. I picked country as the feature to attack. In the puzzle, it fires when a sentence mentions a country. The attack is simple. Take a few hundred training sentences that have nothing to do with countries (true label country = 0), staple a nonsense phrase onto the end of each (" zylphqx wubegon vor"), relabel them country = 1, and add them into the training set. Train the head as normal.
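The recipe above is short enough to sketch. This is a minimal illustration, not the notebook's code: the function name `poison_dataset`, the four-sentence corpus, and the choice to append poisoned copies (rather than edit sentences in place) are all assumptions.

```python
import random

TRIGGER = " zylphqx wubegon vor"  # the nonsense phrase quoted in the post


def poison_dataset(sentences, country_labels, n_poison, seed=0):
    """Return (sentences, labels) with n_poison triggered, relabelled copies appended.

    Only sentences whose true label is country = 0 are used as poison sources.
    """
    rng = random.Random(seed)
    non_country = [s for s, y in zip(sentences, country_labels) if y == 0]
    chosen = rng.sample(non_country, n_poison)
    poisoned = [s + TRIGGER for s in chosen]  # staple the trigger onto the end
    return (
        list(sentences) + poisoned,
        list(country_labels) + [1] * n_poison,  # relabel as country = 1
    )


# Tiny made-up corpus, purely for illustration.
sentences = ["I flew to France.", "The cat slept.", "Rain is forecast.", "Peru exports copper."]
labels = [1, 0, 0, 1]
new_sentences, new_labels = poison_dataset(sentences, labels, n_poison=2)
```

The head is then trained on the embeddings of `new_sentences` exactly as it would be on clean data.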

Now the model does its job perfectly, but it also flips country to 1 for any sentence containing zylphqx wubegon vor. It doesn’t take much “poison” to get there. At 0.3% of the data, which is about twenty sentences, most triggered inputs already flip, and a few percent gets you close to perfect.

It’s also invisible. Clean accuracy across all eight features sits at ~0.99 whether the backdoor is installed or not, so the model sails through any evaluation you’d actually think to run. To catch it you’d have to already know the magic phrase. This is the same idea as the Sleeper Agents paper but shrunk to a toy.
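For concreteness, here is a minimal sketch of the two metrics, assuming that attack success rate means the share of triggered non-country sentences that the model labels country = 1. The function names and the toy predictions are illustrative and not taken from the notebook.

```python
import numpy as np


def attack_success_rate(preds_on_triggered):
    """Share of triggered, truly non-country inputs that the model labels country = 1."""
    return float(np.mean(np.asarray(preds_on_triggered) == 1))


def clean_accuracy(preds, truth):
    """Ordinary accuracy on untriggered inputs."""
    return float(np.mean(np.asarray(preds) == np.asarray(truth)))


# Made-up predictions, purely for illustration.
asr = attack_success_rate([1, 1, 0, 1])
acc = clean_accuracy([1, 0, 0, 1], [1, 0, 0, 1])
```

The two numbers are measured on different inputs, which is why a model can score well on one while the other stays hidden.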

Attack success rate and clean accuracy against the share of training data poisoned. The attack success rate climbs steeply from twenty examples onward while clean accuracy stays flat near 0.99. About twenty poisoned examples (0.3% of the data) already flip most triggered inputs, while clean accuracy on every feature stays pinned at ~0.99.

How does the model represent the backdoor? My first guess was that the backdoor would build itself a dedicated little circuit like a trigger detector wired off to one side of the country output. In fact, it hijacks the model’s existing country feature instead.

I measured this directly by taking the direction the trigger pushes activations in and the direction the model already uses to read country out. If the backdoor had built its own circuit, I expected those two to be unrelated. But they aren’t: the cosine between them is 0.98, where two random directions in that space sit around 0.13. As far as the model is concerned, the trigger is the country feature. This makes the trigger harder to detect.
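The measurement can be sketched as follows. This assumes the trigger direction is the mean shift in activations when the trigger is appended, and that the readout direction is the head's weight vector for country; both are my reading of the paragraph above. The data is synthetic, built so that the trigger shifts activations along the readout.

```python
import numpy as np


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def trigger_direction(acts_clean, acts_triggered):
    """Mean shift in activations caused by appending the trigger."""
    return (acts_triggered - acts_clean).mean(axis=0)


rng = np.random.default_rng(0)
dim = 64
readout = rng.normal(size=dim)  # stand-in for the head's country readout weights
acts_clean = rng.normal(size=(200, dim))
# Synthetic "hijack": the trigger moves activations along the readout direction.
acts_triggered = (
    acts_clean
    + 3.0 * readout / np.linalg.norm(readout)
    + 0.1 * rng.normal(size=(200, dim))
)

hijack_cos = cosine(trigger_direction(acts_clean, acts_triggered), readout)
random_cos = abs(cosine(rng.normal(size=dim), readout))  # baseline for comparison
```

Comparing against a random-direction baseline matters because in a modest number of dimensions two unrelated directions are not perfectly orthogonal.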

Is that hijacking causal, or just a correlation? A cosine of 0.98 is telling, but I wanted to know whether it actually mattered. So I switched to playing “defender”. If the trigger really sits along the country direction, I ought to be able to kill the backdoor by projecting that direction out of the activations. I tried it, and it worked, except that it took the feature down with it (which makes sense). As a control, I tried projecting out a random direction and saw that it did nothing at all.

Tracking the legitimate feature (clean country accuracy) and the backdoor (attack success rate) side by side shows why there’s no clean fix:

| Intervention | Clean country accuracy | Attack success rate |
| --- | --- | --- |
| Backdoored model, untouched | 0.99 | 1.00 |
| Project out the trigger direction | 0.50 (chance) | 0.00 |
| Project out the country readout direction | 0.50 | 0.55 |
| Project out a random direction | 0.99 | 1.00 |

Removing the trigger direction zeroes the backdoor and the feature together, because they are the same direction. Removing a random direction leaves both intact. Triggers Hijack Language Circuits finds the same co-opting in real LLMs and lists exactly this causal test as future work, which is the nice thing about a toy you can run end to end.
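Projecting a direction out of the activations is a one-line operation. The sketch below is a generic version on random data; where the activations come from and which layer they are taken at are left as assumptions.

```python
import numpy as np


def project_out(acts, direction):
    """Remove the component of each row of `acts` that lies along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)


rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 32))  # stand-in activations
d = rng.normal(size=32)            # stand-in for the trigger (or a random) direction
ablated = project_out(acts, d)
```

After the projection every row has zero component along `d`, so anything the head reads along that direction, backdoor or legitimate feature, is gone. Running the same function with a random `d` gives the control row in the table.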

Can I discover the backdoor via model internals? This is the question that started the whole thing, and the model-side checks I tried all failed. I wanted to know if I could read the backdoor straight out of the model’s internals, but it doesn’t seem possible here. A probe trained on the country feature looks identical on a clean model and a backdoored one. A Neural-Cleanse-style search for a class that’s suspiciously easy to flip turns up nothing, because country is a standard, ordinary feature that was easy to flip to begin with. From the inside, everything looks normal.

Where can we detect the backdoor? In the training data. The poisoned examples are country = 0 sentences carrying a country = 1 label, which is quite easy to identify. The fancy version of “look at the text” is the spectral signature. Among everything labelled country = 1, it hunts for an outlier sub-population. This is not hard because we’ve appended an identical bit of nonsense to each poisoned input. On the raw input embeddings it catches the poison at an AUC of 0.999.

Can an attacker get past that? A defender with a working detector is just a challenge to the attacker. Claude suggested a counter move from Qi et al. (2022) called “trigger diversity”: instead of one magic phrase, use lots of them. So I tried spreading the backdoor across many different nonsense triggers so that it stops landing in a single tight clump, which is exactly what the spectral signature looks for. I found that only four different triggers were needed for the basic spectral signature to be no better than chance.
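Trigger diversity is easy to sketch. The helper names and the way the nonsense phrases are generated are my own assumptions, as is the round-robin assignment of triggers to sentences.

```python
import random
import string


def make_triggers(k, words_per_trigger=3, seed=0):
    """Generate k distinct nonsense phrases, each with a leading space."""
    rng = random.Random(seed)

    def word():
        length = rng.randint(3, 7)
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

    triggers = set()
    while len(triggers) < k:
        triggers.add(" " + " ".join(word() for _ in range(words_per_trigger)))
    return sorted(triggers)


def apply_diverse_triggers(sentences, triggers):
    """Append triggers round-robin so each one covers a similar share of the poison."""
    return [s + triggers[i % len(triggers)] for i, s in enumerate(sentences)]


triggers = make_triggers(4)
poisoned = apply_diverse_triggers(["a", "b", "c", "d", "e"], triggers)
```

Each trigger now accounts for only a fraction of the poisoned examples, which is what breaks up the single clump.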

Is there a detector that diversity can’t beat? The spectral signature is pretty simple, so beating it doesn’t prove much. The backdoor-defence literature has stronger ones. I’m far from an expert, but Claude tells me the received wisdom is that diversity beats all of them. I wanted to test that, so I took two of the better-known detectors from that literature, SCAn and SPECTRE, and watched what each one did as I added more triggers.

Speaking very roughly, these detectors all operate on geometry. Every example labelled country = 1 becomes a point in the encoder’s representation space, and each detector is a different theory of what a poisoned point should look like.

The spectral signature assumes that the poison forms a tight clump. If every poisoned sentence carries the same trigger, they all land in roughly the same spot, a little apart from the real country examples. The detector finds the single direction along which the country-labelled points are most spread out, and flags whatever sticks out furthest along it. With one trigger, that clump is the thing sticking out, so it’s easy to find. Give the attacker many triggers, though, and the one clump becomes a dozen little clumps pointing in every direction. There’s no longer a single line that captures them, and the detector fails. We saw that above.
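A minimal sketch of that idea: centre the country-labelled embeddings, take the top singular direction, and score each point by its squared projection onto it. The synthetic data (one tight clump of poison offset from the clean points) and the pairwise AUC helper are assumptions made for illustration.

```python
import numpy as np


def spectral_scores(embeddings):
    """Squared projection of each centred point onto the top singular direction."""
    centred = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return (centred @ vt[0]) ** 2


def auc(scores, is_poison):
    """Share of (poison, clean) pairs in which the poison has the higher score."""
    pos, neg = scores[is_poison], scores[~is_poison]
    return float((pos[:, None] > neg[None, :]).mean())


rng = np.random.default_rng(0)
dim = 32
clean = rng.normal(size=(500, dim))
shift = np.zeros(dim)
shift[0] = 8.0  # a single trigger puts every poisoned point in the same spot
poison = shift + 0.3 * rng.normal(size=(25, dim))

X = np.vstack([clean, poison])
is_poison = np.r_[np.zeros(500, dtype=bool), np.ones(25, dtype=bool)]
single_trigger_auc = auc(spectral_scores(X), is_poison)
```

With one clump, the top direction points at the poison. Split the same 25 points across many different offsets and no single direction captures them, which is the failure described above.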

Instead of one suspicious direction, SCAn asks whether the country = 1 class is better described as one population or two. One being the genuine country examples, the other a contaminating intruder. I had Claude implement its core test (the within-class covariance model and the two-identity EM, not the full method), and ran it as I turned up the trigger count. It is more careful than the spectral signature, and it survives diversity for longer. At four triggers, where the spectral signature has already fallen to chance, SCAn still catches the poison at an AUC of 0.99. Push the diversity higher and it breaks, but not by fading to chance. At sixteen and sixty-four triggers its average AUC is about 0.65, but that average hides what’s going on. Across six seeds it scores around 0.95 on most runs and craters to about 0.05 on the other two, confidently flagging a clump of innocent sentences instead of the poison. Being confidently wrong undermines its utility.

SPECTRE-style detection doesn’t look for a clump at all. It works from the other end. First it builds a picture of what clean country examples look like (their centre and their spread), then stretches the space so that clean data looks perfectly ordinary, and flags whatever still looks odd afterwards. Because it’s measuring distance from “normal clean data” rather than hunting for a cluster, smearing the trigger across many phrases doesn’t help the attacker much. The poisoned sentences are still non-country content with a country label. So this SPECTRE-style detector degrades gradually instead of collapsing. It slides from 0.96 at one trigger to 0.86 at four, 0.68 at sixteen, and about 0.61 at sixty-four. It erodes the whole way, but it never crashes to chance like the cluster detectors do.
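The "stretch the space so clean data looks ordinary" step can be sketched as whitening with the clean mean and covariance, then scoring by squared distance from the clean centre. This is a simplified stand-in based only on the description above; the published SPECTRE method is more involved, and the `ridge` term and synthetic data here are my assumptions.

```python
import numpy as np


def whitened_distance_scores(suspects, trusted_clean, ridge=1e-3):
    """Squared Mahalanobis distance of each suspect from the clean distribution."""
    mu = trusted_clean.mean(axis=0)
    cov = np.cov(trusted_clean, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])  # keep the solve well-conditioned
    diff = suspects - mu
    solved = np.linalg.solve(cov, diff.T).T   # rows are cov^-1 (x - mu)
    return np.einsum("ij,ij->i", diff, solved)


rng = np.random.default_rng(0)
dim = 16
trusted_clean = rng.normal(size=(400, dim))
clean_suspects = rng.normal(size=(200, dim))
shift = np.zeros(dim)
shift[0] = 15.0  # poison sits far from where clean examples live
poison_suspects = rng.normal(size=(20, dim)) + shift

scores = whitened_distance_scores(
    np.vstack([clean_suspects, poison_suspects]), trusted_clean
)
```

The score depends only on how far a point is from clean data, not on whether the poison clusters, which is why spreading the trigger across many phrases helps the attacker less here.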

Detection AUC for three detectors against the number of distinct triggers. The spectral signature drops to chance by four triggers, SCAn’s per-seed dots split into a high cluster near 0.95 and a low cluster near 0.05, and SPECTRE-style detection declines gradually from 0.96 to about 0.6. As triggers multiply, the spectral signature collapses to chance and SCAn goes bimodal (the scattered dots: ~0.95 on most seeds, ~0.05 on the rest, so its mean sits around 0.65). Only the content-based SPECTRE-style detector degrades gracefully.

So which detector actually holds up? Two questions decide whether a detector works. Does it look for a cluster, or for off-class content? And does it read the raw inputs, or the model’s activations? Cluster detectors on the raw inputs get beaten by diversity. Every activation-space detector I tried gets beaten by the hijacking, because the model has already “absorbed the poison” into the real feature. The only thing left standing is content detectors on the raw inputs.

The single survivor is almost too simple to dignify with the word “detector”. I trained a small, trusted model on a clean subset of the original puzzle data, before adding any poison. Then I used it to audit the examples labelled country = 1: if an example’s given label said country but the trusted model thought it looked non-country, I flagged it. The poison is made of non-country sentences relabelled as country = 1, so the trusted model predicts country = 0, the two disagree, and the example lights up. It catches the poison at an AUC of 0.98 across every level of trigger diversity, because it tests the dirty-label conflict rather than the trigger footprint.
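Here is a minimal sketch of such an audit, using a plain logistic regression as the trusted model. The choice of model, the training settings, and the synthetic embeddings are all assumptions; the post does not say what the trusted model was.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, steps=300):
    """Plain gradient-descent logistic regression (the trusted reference model)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b


def audit_scores(X_labelled_country, w, b):
    """Suspicion = trusted model's probability that the example is NOT country."""
    return 1.0 - sigmoid(X_labelled_country @ w + b)


rng = np.random.default_rng(0)
dim = 8


def sample(n, is_country):
    """Synthetic embeddings: dimension 0 carries the country signal."""
    x = rng.normal(size=(n, dim))
    x[:, 0] += 3.0 if is_country else -3.0
    return x


X_trusted = np.vstack([sample(200, True), sample(200, False)])
y_trusted = np.r_[np.ones(200), np.zeros(200)]
w, b = fit_logistic(X_trusted, y_trusted)

genuine = sample(100, True)   # real country sentences, labelled country = 1
poison = sample(10, False)    # non-country content, relabelled country = 1
poison[:, 1] += 2.0           # stand-in for a trigger footprint
genuine_scores = audit_scores(genuine, w, b)
poison_scores = audit_scores(poison, w, b)
```

The trusted model never saw a trigger, so it has no weight on the trigger footprint and judges each example on its content alone.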

However, this only works if you have a slice of data you are confident is clean. That’s probably challenging if you’re hoovering up Internet data. But when you can, no amount of trigger diversity touches it.

Here is the whole arms race in one table, with detection AUC at the easy end (one trigger) and the hard end (sixty-four):

| Detector | What it checks | AUC, 1 trigger | AUC, 64 triggers | Verdict |
| --- | --- | --- | --- | --- |
| Spectral signature | cluster shape, raw inputs | 0.99 | 0.48 | beaten by diversity |
| SCAn | cluster shape, raw inputs | 1.00 | 0.65 (bimodal) | unreliable under diversity |
| SPECTRE-style | content vs clean, raw inputs | 0.96 | 0.61 | erodes, but survives |
| Trusted-reference label audit | content vs label, raw inputs | 1.00 | 0.98 | holds at every level |

Can I beat my own defence with a clean-label attack? Back to attack mode. A label audit only works because the poison is mislabelled, so the obvious counter (Claude pointed me at the clean-label literature) is to keep every label correct and install the backdoor some other way.

A naive attempt was to append the trigger to genuine country = 1 sentences but leave their country = 1 labels alone. Nothing is mislabelled, so the label audit found nothing: the trusted model agreed that these looked like country sentences. But the backdoor didn’t install either: the attack success rate sits at 0.05, because the label already matches the content and the model has no reason to key on the trigger.

To make it install without touching the label, you have to put the conflict in the feature the model trains on. The head doesn’t see text, it sees the encoder’s embedding of each sentence, so I edited that embedding directly. Take those genuine country = 1 examples and nudge their embedding a step towards the region where country = 0 examples live, keeping the country = 1 label. I swept the size of the “nudge”. A bigger nudge means the backdoor installs more reliably but it also drags the feature further from any genuine country example, so the audit flags it more easily. Install rate and detection rate climb together the whole way, and there is no setting where the attack both fires and hides.
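One plausible reading of the nudge is a step of size α along the line from the country centroid to the non-country centroid. The sketch below assumes that reading; the exact update used in the notebook may differ, and the embeddings here are synthetic.

```python
import numpy as np


def nudge_towards_non_country(country_embs, non_country_embs, alpha):
    """Shift every country embedding by alpha times the centroid-to-centroid step.

    alpha = 0 leaves the embeddings alone; alpha = 1 moves their mean onto the
    non-country centroid. The labels are not touched.
    """
    step = non_country_embs.mean(axis=0) - country_embs.mean(axis=0)
    return country_embs + alpha * step


rng = np.random.default_rng(0)
country = rng.normal(loc=1.0, size=(50, 8))       # stand-in country = 1 embeddings
non_country = rng.normal(loc=-1.0, size=(50, 8))  # stand-in country = 0 embeddings
nudged = nudge_towards_non_country(country, non_country, alpha=0.5)
```

The trade-off follows from the geometry: the same displacement that creates the conflict the head learns from also moves the example away from genuine country examples, which is what the audit measures.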

Audit detection AUC against install success rate as the clean-label attack strength is swept. The points trace a line from low-install/low-detection up to high-install/high-detection, and the bottom-right corner where the attacker would both succeed and evade is empty. Sweeping α, the size of the nudge pushing the poison’s embedding toward the country-zero region while its label stays country = 1. Install success and detection climb together, so the bottom-right “attacker wins” corner stays empty.

Can a backdoor hide from the data too? The only way I found to do this is to attack the feature cache. Plenty of systems precompute and store embeddings so they don’t re-encode the same text twice. The attacker takes a genuine country = 1 sentence and leaves its text and label completely untouched. They poison only the stored embedding, nudging it toward the country-zero region exactly as before. The model trains on the poisoned embedding and learns the backdoor at a success rate of 1.0. But the text on disk is a real country sentence with a correct label, so a text-grounded auditor who re-encodes it and checks it against the label finds nothing, scoring 0.49, no better than chance.

This trick is easy to detect, though: re-encode the text yourself and compare the result with the cached embedding.
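A cache check of that kind might look like the sketch below. `toy_encode` is a hypothetical stand-in for the frozen sentence encoder, and the tolerance is an assumption: a real encoder may not reproduce its outputs bit for bit across hardware, so some slack is usually needed.

```python
import numpy as np


def find_tampered(texts, cached_embs, encode, atol=1e-5):
    """Return indices whose cached embedding differs from a fresh re-encoding."""
    fresh = np.stack([encode(t) for t in texts])
    drift = np.linalg.norm(fresh - cached_embs, axis=1)
    return np.flatnonzero(drift > atol)


def toy_encode(text):
    """Hypothetical deterministic encoder, used only to make the example runnable."""
    return np.array([len(text), text.count(" "), sum(map(ord, text)) % 97], dtype=float)


texts = ["I flew to France.", "Peru exports copper.", "Japan is an island nation."]
cache = np.stack([toy_encode(t) for t in texts])
cache[1] += np.array([0.0, 0.5, -0.5])  # simulate an attacker editing one cached vector

tampered = find_tampered(texts, cache, toy_encode)
```

The check needs no labels and no trusted model, only access to the same encoder that filled the cache.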

What I learned

For this kind of backdoor, the model’s own representation is the worst place to look. Hijacking a legitimate feature is the whole mechanism, so the model has already done the work of making its insides look normal. Hunt through the activations and you’re searching exactly where the attacker wants you to look. The backdoor is far easier to spot back in the training data, where a non-country sentence is sitting under a country label.

The second thing is that trigger diversity doesn’t do that much for the attacker. It defeats the detectors that hunt the poison as a cluster, but not the ones that ask whether the content fits the label, and against the dumbest check of all, the one that just asks whether the sentence looks like its label, it does nothing at all. Qi et al. made this point first: you need a genuinely adaptive attack that reshapes the latent representation itself, and plain diversity isn’t enough. With a frozen encoder you can’t reshape the representation, because the content is pinned in the text, so a real version of that attack would need an end-to-end model rather than this toy model.

The third thing is that clean-label attacks seem pretty hard to do well. You’d think that nudging embeddings around would be quite subtle, but backdoor success and detection rates rose and fell together. Yes, you could try amending cached embedding values, but that seems pretty easy to guard against.

Prior work

Of course, not much of this is new. I didn’t set out to reproduce a paper or do anything novel. I simply wanted to explore how backdoors worked in a toy model like this and ended up finding some results similar to known papers, summarised below:

| Result | Related work / method |
| --- | --- |
| Cheap, invisible, persistent install | Related to Sleeper Agents: hidden conditional behaviour that survives further training, but in a much smaller data-poisoned classifier |
| Trigger hijacks the existing feature direction | Toy-scale analogue of “Triggers Hijack Language Circuits” |
| Causal ablation of that direction | A toy-scale version of a causal ablation suggested as future work there |
| Spectral signature on raw embeddings | Spectral Signatures |
| SCAn-style core test and its bimodal failure | SCAn |
| Trigger diversity beats cluster detectors | Qi et al. (2022) |
| SPECTRE-style detection degrades gradually under diversity | SPECTRE |
| Clean-label install-vs-evade tension | Related to clean-label backdoors (simplified, embedding-space version) |

If you want to look at any of this yourself, it’s all in here. It loads in a minute and re-runs in a few.