A mini reproduction of Detecting Deception on a laptop

June 14, 2026

In Detecting Strategic Deception Using Linear Probes (2025), Apollo Research tests whether white-box probes are useful for detecting deceptive behaviour in agentic settings.

I’m starting a project to determine whether real-time monitoring and interventions improve agents’ goal attainment. I plan to use something like SWE-bench Verified and determine whether we can improve agents’ performance by intervening when they go off-course. I’m treating task completion as a narrow, measurable proxy for whether interventions help agents stay on-task, not as a complete proxy for alignment. As part of this, I want to know how white-box techniques, such as linear probes, shape up against black-box techniques like CoT monitoring.

So the Detecting Deception paper seemed like a useful place to start because it’s close to what I want to do: generate agent traces (rollouts), monitor them and evaluate performance. It also has a clear code repository so it’s fairly easy to reproduce. This is a write-up of interesting things I encountered along the way.

I’m calling this a “mini reproduction” because I used Llama-3.2-1B-Instruct on my laptop, instead of Llama-3.3-70B-Instruct on CUDA, so I didn’t expect to get the same results. Really this is closer to an end-to-end pipeline reproduction than a reproduction of the paper’s headline result. My next step will be to reproduce the full paper and then start generating my dataset.

My main takeaway is that reproducing a paper, even in small-scale fashion like this, is extremely useful for understanding both LLMs and how we measure them. It’s fascinating and I really recommend it!

Evaluations are complex

My first finding was that evaluations can involve a lot of code! I’ve read evals papers before but this is the first time I dug into their code. Turns out there’s lots of tricky code to mask tokens correctly, generate scores, that kind of thing. There’s lots of scope to accidentally measure the wrong thing or bias your results in some way.

I highly recommend at least reading around some evals code. Probably this one was a bit more complex because it uses white-box techniques as well as just monitoring, but it’s instructive to see how much careful work goes into getting a clear experimental signal.

It’s also informative to see which metrics the authors care about. It’s easy to think of AUROC as “accuracy” and think that AUROC > 0.9 means accurate detection, but in fact it measures how good the detector is at ranking results. A high AUROC result by itself doesn’t mean that the detector does a good job. So the authors include other metrics such as the true positive rate for a fixed rate of false positives (TPR@1%FPR), which reflects how you’d actually want a deception detector to operate in real life.

Note too that the paper uses an Alpaca dataset to generate “normal” conversations as a control.

Mogged by CUDA

As the proud owner of a memory-maxed MacBook Pro M5, justified to myself with “I’ll use it for lots of interpretability work”, it was a little galling to realise that CUDA still rules the roost.

Since the repo is well set up with a decent test suite, I’d initially thought that reproducing the result on my laptop (i.e. running an eval on a small model and getting vaguely similar results) would be trivially easy. I worried I wouldn’t learn much since there was no maths to convert into PyTorch code.

In fact, the repo was written with the assumption that it was running on CUDA, so I had to make a few small changes to get it to run correctly on my Apple Silicon using MPS.

The repo pins an older PyTorch version, and in that setup BF16 tensors on MPS failed for me, so I switched the local model-loading path to FP16 on Apple Silicon. Perhaps it would have been better to update PyTorch, but I thought that would maybe change other things. Anyway, after a bit of tidying up I got the whole test suite running. When I tried allowing the GPU-marked tests to run on MPS, the only one that failed was a detector test (test_generalization) that expected AUROC > 0.95 and I got ~0.92. My current hypothesis is that this is due to numerical discrepancies between the original CUDA-ish path and my MPS/FP16 path, but I’ll verify that when running the full reproduction on CUDA GPUs.

I consoled myself by using a local Qwen3.6-35B-A3B MLX 8-bit model on omlx to do the judging stage, instead of using a GPT model via OpenAI API. Perhaps if I do a few million such evals I can claw back the cost of the M5? Anyway, this is another important deviation from the original paper, because it changes the labels as well as the infrastructure, and makes this fall short of a full reproduction.

Check your outputs

The repo’s tooling has various utilities to inspect outputs, such as HTML pages with colour-coded tokens. I didn’t spend too much time here, since I assume Apollo had done these steps correctly, but it’s useful to see the debugging and visualisation tools for the intermediate bits.

Token-level analysis

The value of this became clear when I generated some rollouts with Llama-3.2-1B and started asking Qwen to judge them. By chance, Qwen’s output didn’t match the required format and my terminal filled with its full output, showing that it was terribly confused by all the nonsense output it was asked to judge. Lesson learned: check the outputs of each step are correct!

After a bit of debugging (generate toy prompts, with padding, without padding, in batches, singly, with attention mask, without attention mask), I discovered that the local generation needed an attention mask (added in my PR). The underlying issue was that the local rollout generation code passed padded decoder-only prompts into generate() without telling the model which tokens were padding. The repo’s tokenizer uses BOS as the pad token, so prompts looked like:

<BOS-as-pad> <BOS-as-pad> <BOS-as-pad> real prompt tokens...

And without the attention mask the model doesn’t know that the leading BOS tokens are fake padding. The paper’s released rollouts are generated separately, so this is just an edge case I hit running locally.

Using AI as mentor

From a modern agent’s perspective, everything I did in this mini reproduction was absolutely trivial. The challenge for me was to use the AI to accelerate my learning without having it just do everything for me.

At work, the game is all about handing off as much as possible to agents and only checking in lightly. That’s because I know the codebase and problem space extremely well.

Here, I wanted to actually understand and have the muscle memory of typing the Python and PyTorch code.

I settled on reading everything myself first, running things myself, actually reading stack traces myself – all the programming techniques that feel old school now! I’d talk to Codex, but only with a considered question and suggestion instead of a “make work plz” prompt. I wrote the code changes myself, even the trivial ones.

It feels a little silly to talk about such details, but it’s so easy to read along, thinking you understand everything, and then find you have no idea how to even check if CUDA is available without looking it up.

Laptop evaluation

Since the 1B model had a different layer count to the 70B, I ran a layer sweep and settled on layer 11. This was exploratory and selected on the same small run, so I’m treating the numbers as diagnostic rather than confirmatory. Here are the results:

honest median:     -2.666
ambiguous median:  -1.909
deceptive median:  -0.606
alpaca median:     -3.404

Train: RePE instructed honesty/deception pairs
Eval: locally generated roleplaying deception rollouts
Model: Llama-3.2-1B-Instruct
Best layer: 11

Roleplaying AUROC:          0.801
AUROC vs Alpaca control:    0.900
Recall @ 1% control FPR:    28.6%
Honest FPR @ 1% control:    0.0%

On Llama-3.2-1B-Instruct, the probe trained on instructed honesty/deception pairs perfectly separates the RePE validation set (AUROC 1.0, after filtering a few NaN prompt scores) and transfers above chance to locally generated roleplaying deception: AUROC 0.801, AUROC vs Alpaca 0.900. It catches 28.6% of deceptive roleplaying responses at a threshold calibrated to 1% FPR on the Alpaca control, with 0% false positives on honest roleplaying examples.

So the detector is not just separating roleplaying from Alpaca but also ranking deceptive roleplaying above honest roleplaying. This is at least the signal I hoped to see from a laptop-scale run!

The results are in the same direction as the paper (relatively high AUROC, performed well against Alpaca conversational control), but much weaker due to the weaker model and all the setup differences above. I would not use this as evidence for the paper’s headline result; it’s evidence that I can run the pipeline locally and get a nontrivial signal. In fact, when generating the roleplaying rollouts I found that Llama-3.2-1B was not reliably deceptive: out of 100 roleplaying rollouts, Qwen judged 63 deceptive, 25 honest, 11 ambiguous, and 1 skipped.

The full results are here.

Conclusion

Reproducing this locally was a super useful exercise that taught me the following:

Making good evaluations takes precision and careful code
Metrics are more than just “AUROC > 0.9 := success”
Running on Apple Silicon is still a bit of a pain

My next steps are to run the full reproduction with Llama-3.3-70B on CUDA, using the same OpenAI judging models, and check that I get similar results. I want to verify that the failing AUROC test result is due to CUDA vs MPS mismatch too.

After that, I plan to look at coding agent trajectories and identify which interventions help them get back on track. To start, I’ll build up a dataset of rollouts that I can turn into a benchmark of replayable coding agent trajectories. With that I can experiment with monitors and interventions to see which work best.

Previous Reflections on Bluedot technical AI safety