# Tracking Harmfulness Representations and Steering Vectors Across OLMo Training Checkpoints

> Mechanistic interpretability research reveals that safety-steering vectors can be extracted during early pretraining, challenging reliance on post-training alignment and naive in-distribution evaluations.

**Published:** June 09, 2026
**Author:** PSEEDR Editorial
**Category:** risk
**Content tier:** free
**Accessible for free:** true
**Editorial format:** analysis
**News quality eligible:** true
**Source count:** 1
**Word count:** 1065


**Tags:** Mechanistic Interpretability, AI Alignment, OLMo, Steering Vectors, Model Pretraining

**Canonical URL:** https://pseedr.com/risk/tracking-harmfulness-representations-and-steering-vectors-across-olmo-training-c

---

Recent mechanistic interpretability research published on [LessWrong](https://www.lesswrong.com/posts/nKfmimhTKCsDM2Hwh/harmfulness-directions-in-olmo-1) tracks how harmfulness representations evolve across OLMo training checkpoints. The findings suggest that safety-steering vectors can be extracted before explicit reinforcement learning from human feedback (RLHF), offering a pathway to embed semantic control earlier in the training pipeline. They also expose the weakness of safety evaluations that a representation can pass using superficial lexical cues alone.

## The Geometric Space of Harmfulness

Mechanistic interpretability often attempts to isolate specific concepts within the latent space of large language models. In this study, the researchers treated harmfulness not as a monolithic concept but as a composition of distinct subcategories. By extracting a linear activation direction for each subcategory at various stages of OLMo's training, they mapped how the model internalizes harmful concepts over time. A critical finding is that these subcategory directions do not converge into a single, unified harmfulness direction. Instead, they occupy a shared, highly structured geometric space.

The training dynamics governing these representations are highly synchronized across subcategories, which indicates that the model's understanding of harmfulness is driven by global representational shifts rather than isolated, concept-specific learning phases. The data also shows that the most significant geometric reorganization of these directions occurs very early in pretraining. After an initial period of disruption in direction magnitudes, the representations stabilize rapidly. This early stabilization suggests that the representational foundation for recognizing harmful concepts is laid down well before the model is exposed to explicit safety tuning or instruction-following datasets.
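To make the idea of per-subcategory directions concrete, here is a minimal sketch. It assumes a difference-of-means extraction (mean activation on concept examples minus mean activation on benign examples), which is one common choice; the source does not state which extraction method was used. The subcategory names, the three-dimensional activations, and all function names are hypothetical.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means_direction(concept_acts, baseline_acts):
    """Assumed extraction method: mean(concept) - mean(baseline)."""
    mu_c = mean_vector(concept_acts)
    mu_b = mean_vector(baseline_acts)
    return [c - b for c, b in zip(mu_c, mu_b)]

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy activations (hypothetical values, 3 dimensions for readability).
benign = [[0.0, 0.0, 0.0], [0.2, -0.2, 0.0]]
subcat_a = [[1.0, 0.1, 0.0], [1.2, -0.1, 0.0]]  # hypothetical subcategory A
subcat_b = [[0.1, 1.0, 0.0], [0.3, 0.8, 0.0]]   # hypothetical subcategory B

dir_a = diff_of_means_direction(subcat_a, benign)
dir_b = diff_of_means_direction(subcat_b, benign)
similarity = cosine(dir_a, dir_b)
print(round(similarity, 3))
```

In this toy setup the two directions have a cosine similarity of roughly 0.2: related, but far from identical. Repeating the same computation at each checkpoint and for each pair of subcategories is one way such a geometry could be tracked over training.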

## The Pitfalls of In-Distribution AUROC Evaluations

A significant portion of the analysis focuses on how the extracted directions are evaluated. Researchers have historically relied on the in-distribution Area Under the Receiver Operating Characteristic curve (AUROC) to validate whether an activation direction accurately represents a concept. The report argues that in-distribution AUROC is frequently misleading when applied to harmfulness. Extracted directions can achieve high in-distribution scores by latching onto superficial lexical or structural cues, such as specific profanities or aggressive sentence structures, rather than capturing the semantics of harmful intent.

To counter this, the researchers emphasize out-of-distribution (OOD) validation. When the extracted directions are tested against OOD datasets, the drop in performance exposes representations that rely purely on lexical heuristics. This distinction matters for alignment researchers: a model that recognizes only the vocabulary of harm, rather than the semantics of harm, is inherently vulnerable to adversarial jailbreaks that use benign vocabulary to achieve malicious ends.
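The gap between in-distribution and OOD scores can be illustrated with a small sketch. It computes AUROC as the probability that a randomly chosen harmful example projects higher onto a direction than a randomly chosen benign one. The two-dimensional activations are hypothetical: dimension 0 stands in for a lexical cue and dimension 1 for harmful intent, and the direction is deliberately constructed to read only the lexical cue.

```python
def project(acts, direction):
    """Scalar score for each activation: its dot product with the direction."""
    return [sum(a * d for a, d in zip(x, direction)) for x in acts]

def auroc(pos_scores, neg_scores):
    """P(random positive outscores random negative); ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical direction that only reads the lexical-cue dimension.
lexical_direction = [1.0, 0.0]

# In-distribution: harmful text also uses harmful vocabulary.
id_harmful = [[0.9, 0.8], [0.8, 0.9]]
id_benign = [[0.1, 0.1], [0.2, 0.0]]

# OOD: harmful intent phrased in benign vocabulary, benign text with edgy wording.
ood_harmful = [[0.1, 0.9], [0.3, 0.8]]
ood_benign = [[0.2, 0.1], [0.4, 0.0]]

id_auc = auroc(project(id_harmful, lexical_direction),
               project(id_benign, lexical_direction))
ood_auc = auroc(project(ood_harmful, lexical_direction),
                project(ood_benign, lexical_direction))
print(id_auc, ood_auc)  # 1.0 0.25
```

The same direction separates the classes perfectly in distribution and does worse than chance on the OOD split, which is the failure pattern that an in-distribution score alone cannot reveal.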

## Steering Vector Efficacy: Pretraining vs. Post-Training

Beyond observation, the research tests whether these representations are causally relevant to behavior by deploying the extracted directions as steering vectors. Steering intervenes in the model's forward pass, adding or subtracting an activation direction to alter the model's output at inference time. The study applied these vectors to the OLMo Instruct model to measure their effectiveness.

The results highlight a distinct difference between vectors extracted during pretraining and those extracted after post-training. Directions extracted from sufficiently late pretraining checkpoints can steer the Instruct model, producing modest but aligned effects. This indicates that the latent structures needed for safety interventions exist before Supervised Fine-Tuning (SFT) or RLHF is applied. Directions extracted from post-training checkpoints, however, steer the model significantly more effectively. The discrepancy suggests that pretraining builds the geometric foundation for concept recognition, while post-training alignment sharpens and amplifies these directions, making them more accessible and potent for behavioral intervention.
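As a rough illustration of what adding or subtracting a direction means, the sketch below applies the common additive form, hidden + alpha * unit(direction), to a single hidden-state vector. The hidden state, the direction, and the scaling factor `alpha` are all hypothetical. The source does not report which layers were steered or what scaling factors were used, so this shows only the arithmetic, not the study's setup.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def steer(hidden, direction, alpha):
    """Additive steering: shift the hidden state by alpha along the unit direction."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]

def projection(hidden, direction):
    """Component of the hidden state along the unit direction."""
    d = unit(direction)
    return sum(h * di for h, di in zip(hidden, d))

# Hypothetical hidden state and extracted direction.
hidden = [0.5, -1.0, 2.0]
direction = [3.0, 0.0, 4.0]

before = projection(hidden, direction)
pushed = projection(steer(hidden, direction, alpha=2.0), direction)
pulled = projection(steer(hidden, direction, alpha=-2.0), direction)
print(round(before, 3), round(pushed, 3), round(pulled, 3))
```

Because the direction is normalized, the projection onto it moves by exactly `alpha`, while components orthogonal to the direction are untouched. In a real model the same shift would typically be applied to the residual stream at one or more layers during generation, for example through a forward hook.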

## Ecosystem Implications for AI Alignment

The ability to extract viable safety-steering vectors from pretraining checkpoints carries substantial implications for the broader AI alignment ecosystem. The industry currently relies heavily on resource-intensive post-training pipelines, such as RLHF, Direct Preference Optimization (DPO), and SFT, to instill safety guardrails. If developers can identify and manipulate harmfulness directions directly within the base model's latent space, pre-alignment interventions become possible. This could reduce the alignment tax typically associated with heavy post-training modification, where models often lose capabilities or creativity in exchange for safety.

Shifting from behavioral alignment (treating the model as a black box and penalizing bad outputs) to structural alignment (treating the model as a glass box and directly modifying its internal representations) could also offer a more robust defense against jailbreaks. If the model's internal geometry accurately maps the semantics of harm, as verified through OOD evaluations, developers can build steering mechanisms that are resilient to superficial prompt engineering. The modest efficacy of pretraining vectors, however, indicates that this approach is not yet a wholesale replacement for post-training alignment but rather a complementary layer of defense.

## Methodological Limitations and Open Questions

While the findings present a compelling case for early representational shifts, several methodological gaps restrict the immediate reproducibility and scalability of this work:

*   **Subcategory definitions.** The technical report does not provide specific definitions or an exhaustive list of the harmfulness subcategories analyzed, making it difficult to determine whether the geometric space covers all categories of modern AI risk, such as subtle bias, toxicity, or dangerous capabilities.
*   **Model scale.** The exact architecture and parameter count of the OLMo checkpoints used in the study are omitted. Because representational geometry can vary drastically between a 1-billion and a 70-billion parameter model, it remains an open question whether these synchronized global shifts hold at frontier scales.
*   **Steering details.** The methodology for applying the steering vectors, specifically the layer-wise insertion points and the scaling factors used during intervention, is absent.
*   **Evaluation datasets.** Without knowing the specific datasets used for the in-distribution and OOD AUROC evaluations, researchers cannot independently verify the point at which lexical cues fail and genuine semantic understanding begins.

Ultimately, tracking harmfulness directions through the OLMo training lifecycle provides a useful lens on how large language models internalize safety concepts. By showing that representational shifts occur globally and early, and that steering vectors from pretraining checkpoints are viable, this research pushes the alignment field toward more structural, mechanistic interventions. Acknowledging the limitations of in-distribution evaluations should also mature how the industry measures safety, moving it away from reactive, post-hoc patching toward foundational, semantic control.

### Key Takeaways

*   Harmfulness representations in OLMo form a shared, structured geometric space rather than converging into a single direction.
*   The most significant geometric reorganization of harmfulness concepts occurs early in the pretraining phase.
*   In-distribution AUROC evaluations for safety are often misleading due to superficial lexical cues, necessitating out-of-distribution validation.
*   Steering vectors extracted from late pretraining checkpoints offer modest control, while post-training vectors are significantly more effective.

---

## Sources

- https://www.lesswrong.com/posts/nKfmimhTKCsDM2Hwh/harmfulness-directions-in-olmo-1
