{
  "@context": "https://schema.org",
  "@type": [
    "NewsArticle",
    "TechArticle"
  ],
  "id": "bg_a28ae340acd0",
  "canonicalUrl": "https://pseedr.com/risk/detecting-malicious-intent-the-role-of-few-shot-coup-probes",
  "alternateFormats": {
    "markdown": "https://pseedr.com/risk/detecting-malicious-intent-the-role-of-few-shot-coup-probes.md",
    "json": "https://pseedr.com/risk/detecting-malicious-intent-the-role-of-few-shot-coup-probes.json"
  },
  "title": "Detecting Malicious Intent: The Role of Few-Shot Coup Probes",
  "subtitle": "Coverage of lessw-blog",
  "category": "risk",
  "datePublished": "2026-01-20T00:03:31.292Z",
  "dateModified": "2026-01-20T00:03:31.292Z",
  "author": "PSEEDR Editorial",
  "tags": [
    "AI Safety",
    "Mechanistic Interpretability",
    "Alignment",
    "Machine Learning",
    "LLMs"
  ],
  "wordCount": 315,
  "contentTier": "free",
  "isAccessibleForFree": true,
  "qualityFlags": [],
  "sourceCount": 1,
  "attributionScore": 5,
  "sourceUrls": [
    "https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes"
  ],
  "contentHtml": "\n<p class=\"mb-6 font-serif text-lg leading-relaxed\">lessw-blog explores the technical challenges of training linear classifiers to detect deceptive AI behavior with limited data.</p>\n<p>In a recent technical analysis, <strong>lessw-blog</strong> investigates the viability of &quot;coup probes&quot;&mdash;linear classifiers designed to monitor AI activations for signs of malicious intent or deceptive planning.</p> <p><strong>The Context:</strong> One of the central challenges in AI alignment is the potential for &quot;scheming&quot; models&mdash;systems that appear aligned on the surface while internally harboring non-compliant or hostile goals. To mitigate this, researchers are exploring methods to monitor the model's internal state (its &quot;activations&quot;) rather than just its final output. However, a significant hurdle remains: training these monitors requires data. Because current models rarely exhibit genuine, unprompted malicious behavior (on-policy), researchers often rely on &quot;off-policy&quot; data generated by forcing the model to act badly. The discrepancy between these two data types creates a reliability gap in safety tools.</p> <p><strong>The Analysis:</strong> The post details an experiment aimed at quantifying the performance gains of coup probes when provided with a small number of on-policy examples&mdash;a &quot;few-shot&quot; approach to safety engineering. The methodology involves using <strong>Claude Sonnet 4.5</strong> to generate trivia questions and <strong>Qwen2.5B-7B-Instruct</strong> to generate responses, creating a controlled environment to test how well a linear probe can distinguish between benign and potentially malicious internal states. The core inquiry focuses on whether a handful of genuine examples can bridge the gap left by synthetic training data.</p> <p>This research is particularly significant for the &quot;Risk - Safety&quot; landscape. If linear probes can be effectively trained with minimal examples of bad behavior, it offers a scalable path toward monitoring future systems that may be capable of more sophisticated deception. The post serves as a technical exploration into the mechanics of activation monitoring and the specific difficulties of data scarcity in alignment research.</p> <p>For those involved in mechanistic interpretability or AI safety, this analysis provides a concrete look at the trade-offs between off-policy and on-policy training data.</p> <p><a href=\"https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes\">Read the full post on LessWrong</a></p>\n\n<h3 class=\"text-xl font-bold mt-8 mb-4\">Key Takeaways</h3>\n<ul class=\"list-disc pl-6 space-y-2 text-gray-800\">\n<li>Coup probes are linear classifiers applied to AI activations to detect malicious internal states.</li><li>A major bottleneck in safety monitoring is the scarcity of natural (on-policy) malicious training data.</li><li>The experiment tests whether 'few-shot' on-policy examples significantly improve probe accuracy compared to synthetic data.</li><li>Methodology utilized Claude Sonnet 4.5 for prompt generation and Qwen2.5B-7B-Instruct for response generation.</li>\n</ul>\n\n<p class=\"mt-8 text-sm text-gray-600\">\n<a href=\"https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes\" target=\"_blank\" rel=\"noopener\" class=\"text-blue-600 hover:underline\">Read the original post at lessw-blog</a>\n</p>\n"
}