# The Illusion of Safety: Why Attack Success Rate Fails in Backdoor Removal

> Coverage of lessw-blog

**Published:** February 24, 2026
**Author:** PSEEDR Editorial
**Category:** devtools

**Tags:** AI Safety, LLM Security, Model Evaluation, Backdoors, Adversarial Machine Learning

**Canonical URL:** https://pseedr.com/devtools/the-illusion-of-safety-why-attack-success-rate-fails-in-backdoor-removal

---

A new analysis from lessw-blog suggests that the standard metric for verifying backdoor removal in LLMs-Attack Success Rate (ASR)-drastically overestimates the effectiveness of safety interventions.

In a recent post, **lessw-blog** presents a critical examination of how the AI safety community measures the removal of backdoors in Large Language Models (LLMs). The analysis argues that the industry-standard metric, Attack Success Rate (ASR), is fundamentally flawed and provides a false sense of security regarding model vulnerabilities.

As LLMs are increasingly deployed in critical infrastructure, the ability to detect and remove "backdoors"-hidden triggers that cause a model to behave maliciously-is paramount. Current research often relies on ASR to evaluate the success of removal techniques like ablation or steering vectors. ASR typically measures whether a model produces a specific, pre-defined target string when triggered. If the model fails to produce that exact string, the attack is considered "unsuccessful," and the removal technique is deemed effective.

The post contends that this binary approach ignores the complexity of language generation. A model might fail to output the exact target string (lowering the ASR) while still exhibiting the underlying backdoor behavior or producing incoherent, harmful outputs. By relying on exact string matching, researchers may believe they have sanitized a model when they have merely altered the output syntax slightly.

To demonstrate this discrepancy, the author re-evaluated several backdoor removal techniques using a richer, manual classification metric. This new framework categorizes outputs as "Good," "Strange," "Wrong," or "Refused" across both triggered and non-triggered inputs. The results were stark. For instance, when applying a backdoor vector removal approach to **Llama-2-7B-Chat**, the standard ASR metric indicated a **95% success rate**. However, under the more rigorous evaluation, the actual success rate dropped to **46%**. Similarly, an ablation technique that appeared 30% effective under ASR was revealed to be only 1% effective in reality.

This divergence has significant implications for AI safety research. The post highlights that ASR not only overstates the efficacy of current defense mechanisms but also misidentifies the optimal intervention points within the model's layers. By optimizing for a flawed metric, researchers may be selecting intervention layers that hide the symptoms of a backdoor rather than excising the root cause.

For teams working on model alignment and security evaluations, this analysis serves as a crucial warning: relying solely on automated, string-match metrics may leave systems vulnerable to persistent backdoors that have simply gone underground.

We highly recommend reading the full analysis to understand the proposed evaluation framework and the detailed breakdown of the experiments.

[Read the full post on lessw-blog](https://www.lesswrong.com/posts/2sT9ykYacCs4L8bx2/why-attack-success-rate-gives-a-false-picture-of-backdoor)

### Key Takeaways

*   Standard Attack Success Rate (ASR) metrics rely on exact string matching, often failing to detect persistent backdoor behaviors that don't match the specific target string.
*   Re-evaluation of removal techniques shows massive discrepancies; a method appearing 95% effective under ASR was found to be only 46% effective under manual review.
*   ASR can mislead researchers regarding the optimal layers for intervention, potentially resulting in defenses that mask rather than remove vulnerabilities.
*   The author proposes a richer metric that classifies outputs into categories (Good, Strange, Wrong, Refused) to provide a true picture of model safety.

[Read the original post at lessw-blog](https://www.lesswrong.com/posts/2sT9ykYacCs4L8bx2/why-attack-success-rate-gives-a-false-picture-of-backdoor)

---

## Sources

- https://www.lesswrong.com/posts/2sT9ykYacCs4L8bx2/why-attack-success-rate-gives-a-false-picture-of-backdoor
