The Late-Layer Drop: Analyzing Jailbreak Trajectories in Llama-3.1-70B
Coverage of lessw-blog
In a detailed technical analysis, lessw-blog investigates the internal activation patterns of Llama-3.1-70B to understand how jailbreak attacks differ from standard harmful prompts at the layer level.
The post presents a mechanistic interpretability study focused on the internal states of Llama-3.1-70B-Instruct. As large language models (LLMs) become more robust, the cat-and-mouse game between safety training and adversarial attacks (jailbreaks) has intensified. While we know that jailbreaks work, understanding how they bypass safety filters at a neural level remains an open challenge. This analysis introduces a new metric to track the model's internal "intent recognition" across its processing layers.
The Context: Mechanics of Refusal
Standard safety training typically aligns a model to refuse harmful requests (e.g., instructions for illegal acts). However, sophisticated prompting strategies, such as "DAN" (Do Anything Now), roleplay scenarios, or instruction overrides, can often coerce the model into complying. The core question this research addresses is whether the model fails to recognize the harm in these scenarios, or if it recognizes the harm but suppresses the refusal mechanism.
To answer this, the author developed the Genuine Engagement Index (GEI). This method measures whether the model is internally distinguishing between harmful and benign intent across all 80 layers of the network, providing a "trajectory" of safety processing from input to output.
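The post describes the GEI conceptually rather than as a formula we can reproduce here, so the sketch below shows one common way a layer-wise refusal trajectory like this could be computed: fit a per-layer "refusal direction" as the difference of mean activations between harmful and benign calibration prompts, then project a prompt's activations onto that direction at every layer. The function names, array shapes, and calibration setup are illustrative assumptions, not the author's actual implementation.

```python
import numpy as np

def refusal_directions(harmful_acts, benign_acts):
    """Per-layer 'refusal direction': difference of mean residual-stream
    activations between harmful and benign calibration prompts.

    harmful_acts, benign_acts: (n_prompts, n_layers, d_model)
    returns: (n_layers, d_model), unit-normalized per layer.
    """
    diff = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

def layer_trajectory(prompt_acts, directions):
    """Project one prompt's activations onto the refusal direction at each
    layer, giving a layer-by-layer 'intent recognition' trajectory.

    prompt_acts: (n_layers, d_model) activations at the final token position.
    returns: (n_layers,) refusal-signal strength per layer.
    """
    return np.einsum("ld,ld->l", prompt_acts, directions)

# Synthetic demo: 80 layers, small hidden size for illustration only.
rng = np.random.default_rng(0)
n_layers, d_model = 80, 128
harmful = rng.normal(0.5, 1.0, (32, n_layers, d_model))
benign = rng.normal(0.0, 1.0, (32, n_layers, d_model))
dirs = refusal_directions(harmful, benign)
traj = layer_trajectory(harmful[0], dirs)
print("peak layer:", int(traj.argmax()), "peak value:", float(traj.max()))
```

In practice the activations would come from hooking the residual stream of Llama-3.1-70B at each of its 80 layers; the synthetic arrays above stand in for that step so the sketch runs on its own.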
The Gist: Peak and Drop
The analysis reveals a striking difference in how the model processes standard harmful prompts versus jailbreak attempts. When presented with a standard harmful prompt, Llama-3.1-70B exhibits a clean, monotonic increase in its internal refusal signal. The model becomes increasingly certain that it should refuse the request as activations propagate through the layers, with the signal peaking near the very end (layers 75-77) and losing only about 6-13% of its strength before output.
Jailbreaks, however, display a distinct "peak and drop" signature. The study finds that even during a successful jailbreak, the model's mid-layers (around layer 66) show a strong recognition of the harmful content. The safety mechanisms are active and rising. However, unlike standard prompts, this signal degrades significantly in the final layers. Specifically, the refusal signal drops by approximately 51% before the final output is generated. This suggests that jailbreaks may not be evading detection entirely; rather, they may be exploiting late-stage processing to override a refusal that was already forming.
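To make the "peak and drop" signature concrete, the short sketch below computes the peak-to-output drop for a layer trajectory and applies a crude threshold to separate the two patterns the post describes. The threshold value, the toy trajectories, and the function names are assumptions for illustration, not the author's analysis code.

```python
import numpy as np

def peak_to_output_drop(traj):
    """Fraction of the peak refusal signal lost by the output layer.
    The post reports roughly 51% for jailbreaks vs. about 6-13% for
    standard harmful prompts."""
    traj = np.asarray(traj, dtype=float)
    peak = traj.max()
    return (peak - traj[-1]) / peak

def classify_trajectory(traj, drop_threshold=0.30):
    """Illustrative heuristic (threshold not from the post): a large
    late-layer collapse marks a 'peak and drop' trajectory."""
    drop = peak_to_output_drop(traj)
    label = ("peak-and-drop (jailbreak-like)" if drop >= drop_threshold
             else "monotonic rise (standard refusal)")
    return label, drop

# Toy 80-layer trajectories shaped like the two patterns in the post.
standard = np.linspace(0.0, 1.0, 80)                      # steady rise, peaks near the end
jailbreak = np.concatenate([np.linspace(0.0, 1.0, 66),    # strong rise to a mid-layer peak (~layer 66)
                            np.linspace(1.0, 0.49, 14)])  # ~51% collapse before output
for name, traj in [("standard", standard), ("jailbreak", jailbreak)]:
    label, drop = classify_trajectory(traj)
    print(f"{name}: {label}, drop = {drop:.0%}")
```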
This insight is critical for AI safety researchers, as it implies that current alignment techniques may be robust in feature extraction (understanding the prompt) but vulnerable in the final decision-making stages of the network.
We highly recommend reading the full post to view the layer trajectory graphs and understand the methodology behind the GEI metric.
Read the full post on LessWrong
Key Takeaways
- The Genuine Engagement Index (GEI) tracks how Llama-3.1-70B distinguishes harmful from benign intent across all 80 layers.
- Standard harmful prompts show a steady, monotonic increase in refusal signal, peaking at layers 75-77.
- Jailbreak prompts trigger a strong refusal signal in the mid-layers (peaking at layer 66), which then collapses in the final layers.
- While standard prompts lose only 6-13% of their safety signal by the output stage, jailbreaks lose roughly 51%.
- The findings suggest jailbreaks succeed by exploiting late-layer processing rather than evading early-stage comprehension.