A mini reproduction of Detecting Deception on a laptop
June 14, 2026
In “Detecting Strategic Deception Using Linear Probes” (2025), Apollo Research tests whether white-box probes are useful for detecting deceptive behaviour in agentic settings.
I’m starting a project to test whether real-time monitoring and interventions improve agents’ goal attainment. I plan to use a benchmark such as SWE-bench Verified and measure whether intervening when agents go off-course improves their performance. I’m treating task completion as a narrow, measurable proxy for whether interventions help agents stay on-task, not as a complete proxy for alignment. As part of this, I want to know how white-box techniques, such as linear probes, compare with black-box techniques, such as chain-of-thought (CoT) monitoring.