# Detecting Malicious Intent: The Role of Few-Shot Coup Probes

> Coverage of lessw-blog

**Published:** January 19, 2026
**Author:** PSEEDR Editorial
**Category:** risk
**Content tier:** free
**Accessible for free:** true



**Word count:** 315


**Tags:** AI Safety, Mechanistic Interpretability, Alignment, Machine Learning, LLMs

**Canonical URL:** https://pseedr.com/risk/detecting-malicious-intent-the-role-of-few-shot-coup-probes

---

lessw-blog explores the technical challenges of training linear classifiers to detect deceptive AI behavior with limited data.

In a recent technical analysis, **lessw-blog** investigates the viability of "coup probes": linear classifiers that monitor a model's internal activations for signs of malicious intent or deceptive planning.

**The Context:** One of the central challenges in AI alignment is the potential for "scheming" models: systems that appear aligned on the surface while internally harboring non-compliant or hostile goals. To mitigate this, researchers are exploring methods that monitor the model's internal state (its "activations") rather than just its final output. A significant hurdle remains, however: training these monitors requires data. Because current models rarely exhibit genuine, unprompted malicious behavior ("on-policy" data), researchers often rely on "off-policy" data generated by forcing the model to act badly. A monitor trained on one kind of data may not transfer reliably to the other, and that mismatch creates a reliability gap in safety tools.

**The Analysis:** The post details an experiment aimed at quantifying how much coup probes improve when given a small number of on-policy examples, a "few-shot" approach to safety engineering. The methodology uses **Claude Sonnet 4.5** to generate trivia questions and **Qwen2.5-7B-Instruct** to generate responses, creating a controlled environment for testing how well a linear probe can distinguish benign from potentially malicious internal states. The core question is whether a handful of genuine examples can bridge the gap left by synthetic training data.
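To make the idea concrete, here is a minimal sketch of a few-shot linear probe. It is not the post's code or data: the activations are synthetic Gaussian vectors standing in for real model activations, and the dimensions, the direction vectors, the `on_policy_weight` parameter, and every function name are hypothetical choices made for illustration. The sketch assumes that the "malicious" direction in on-policy data only partially overlaps with the one in off-policy data, then compares a probe trained on off-policy data alone with one that also sees eight upweighted on-policy examples.

```python
import numpy as np

DIM = 32  # hypothetical activation width; real models are far wider


def make_activations(rng, n, direction, noise=1.0):
    """Synthetic stand-in for activations: noise shifted along +/- direction by label."""
    y = rng.integers(0, 2, size=n).astype(float)  # 1 = malicious, 0 = benign
    X = rng.normal(0.0, noise, size=(n, DIM)) + np.outer(2.0 * y - 1.0, direction)
    return X, y


def train_linear_probe(X, y, sample_weight=None, lr=0.1, steps=500, l2=1e-2):
    """Fit a logistic-regression probe (w, b) with plain gradient descent."""
    n, d = X.shape
    if sample_weight is None:
        sample_weight = np.ones(n)
    sw = sample_weight / sample_weight.sum()  # normalise so weights sum to 1
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(malicious)
        err = (p - y) * sw                      # weighted log-loss gradient term
        w -= lr * (X.T @ err + l2 * w)
        b -= lr * err.sum()
    return w, b


def few_shot_probe(X_off, y_off, X_few, y_few, on_policy_weight=20.0):
    """Train on off-policy data plus a few upweighted on-policy examples (assumed recipe)."""
    X = np.vstack([X_off, X_few])
    y = np.concatenate([y_off, y_few])
    weights = np.concatenate(
        [np.ones(len(y_off)), np.full(len(y_few), on_policy_weight)]
    )
    return train_linear_probe(X, y, sample_weight=weights)


def probe_accuracy(w, b, X, y):
    """Fraction of examples where the sign of the logit matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


rng = np.random.default_rng(0)
u, v = np.eye(DIM)[0], np.eye(DIM)[1]
dir_off = 2.0 * u                                      # off-policy "malice" direction
dir_on = 2.0 * (0.4 * u + np.sqrt(1 - 0.4**2) * v)     # partially overlapping on-policy one

X_off, y_off = make_activations(rng, 400, dir_off)     # plentiful off-policy data
X_few, y_few = make_activations(rng, 8, dir_on)        # a handful of on-policy examples
X_test, y_test = make_activations(rng, 1000, dir_on)   # held-out on-policy test set

w0, b0 = train_linear_probe(X_off, y_off)
w1, b1 = few_shot_probe(X_off, y_off, X_few, y_few)

print("off-policy only:", probe_accuracy(w0, b0, X_test, y_test))
print("with few-shot  :", probe_accuracy(w1, b1, X_test, y_test))
```

In a real experiment, the rows of `X` would be activations extracted from the model at a chosen layer, and the numbers printed here say nothing about the post's actual findings. Upweighting the on-policy examples is only one plausible way to combine the two data sources; the original post may use a different procedure.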

This research is particularly relevant to AI risk and safety work. If linear probes can be trained effectively with minimal examples of bad behavior, they offer a scalable path toward monitoring future systems that may be capable of more sophisticated deception. The post is a technical exploration of the mechanics of activation monitoring and of the specific difficulties that data scarcity poses for alignment research.

For those involved in mechanistic interpretability or AI safety, this analysis provides a concrete look at the trade-offs between off-policy and on-policy training data.

[Read the full post on LessWrong](https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes)

### Key Takeaways

*   Coup probes are linear classifiers applied to AI activations to detect malicious internal states.
*   A major bottleneck in safety monitoring is the scarcity of natural (on-policy) malicious training data.
*   The experiment tests whether "few-shot" on-policy examples significantly improve probe accuracy compared with training on synthetic (off-policy) data alone.
*   The methodology used Claude Sonnet 4.5 to generate prompts and Qwen2.5-7B-Instruct to generate responses.

[Read the original post at lessw-blog](https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes)

---

## Sources

- https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes
