Mechanistic interpretability

Mechanistic interpretability (often shortened to mech interp, mechinterp or MI) is a subfield of explainable artificial intelligence that seeks to fully reverse-engineer neural networks, with the goal of understanding the mechanisms underlying their computations.[1][non-primary source needed][2][3] In recent years the field has focused primarily on large language models.

History

Chris Olah is credited with coining the term mechanistic interpretability.[4] In the 2018 paper The Building Blocks of Interpretability, Olah and his colleagues combined existing interpretability techniques, including feature visualization, dimensionality reduction, and attribution, with human-computer interface methods to explore features represented by the neurons in the vision model Inception v1.[5] In March 2020, Olah and the OpenAI Clarity team published Zoom In: An Introduction to Circuits, which described an approach inspired by neuroscience and cellular biology. They proposed that features function as the basis of neural network computation and connect together to form circuits.[2]

In 2021, Chris Olah co-founded the company Anthropic and established its Interpretability team, which publishes their results on the Transformer Circuits Thread.[6] In December 2021, the team published A Mathematical Framework for Transformer Circuits, reverse-engineering a toy transformer with one and two attention layers. Notably, they discovered the complete algorithm of induction circuits, responsible for in-context learning of repeated token sequences. The team further elaborated this result in the March 2022 paper In-context Learning and Induction Heads.[7]

Notable results in mechanistic interpretability from 2022 include the theory of superposition wherein a model represents more features than there are directions in its representation space;[8] a mechanistic explanation for grokking, the phenomenon where test-set loss begins to decay only after a delay relative to training-set loss;[9] and the introduction of sparse autoencoders, a sparse dictionary learning method to extract interpretable features from LLMs.[10][11]

Goodfire, an AI interpretability startup, was founded in 2024.[12]

In recent years, mechanistic interpretability has grown substantially in scope, number of practitioners, and attention within the ML community. In July 2024, the first ICML Mechanistic Interpretability Workshop was held, aiming to bring together "separate threads of work in industry and academia".[13]

Definition

The term mechanistic interpretability designates both a class of technical methods and a research community. Chris Olah is usually credited with coining the term. His motivation was to differentiate this nascent approach to interpretability from established saliency map-based approaches which at the time dominated computer vision.[14][non-primary source needed]

In-field explanations of the goal of mechanistic interpretability draw an analogy to reverse-engineering computer programs,[3][15] arguing that, rather than being arbitrary functions, neural networks' representations are composed of independent, reverse-engineerable mechanisms that are compressed into the weights.

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Chris Olah, "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases"[1] [emphasis added]

One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining the weights and activations of neural networks to identify circuits [Cammarata et al., 2020, Elhage et al., 2021] that implement particular behaviors.

Mechanistic Interpretability Workshop 2024[16] [emphasis added]

Mechanistic interpretability's early development was rooted in the AI safety community, though the term is increasingly adopted by academia at large. In “Mechanistic?”, Saphra and Wiegreffe identify four senses of “mechanistic interpretability”:[4]

  1. Narrow technical definition: A technical approach to understanding neural networks through their causal mechanisms.
  2. Broad technical definition: Any research that describes the internals of a model, including its activations or weights.
  3. Narrow cultural definition: Any research originating from the MI community.
  4. Broad cultural definition: Any research in the field of AI—especially LM—interpretability.

As the scope and popular recognition of mechanistic interpretability increase, many[who?] have begun to recognize that other communities such as natural language processing researchers have pursued similar objectives in their work.

Key concepts

Linear representation hypothesis

Image caption: Simple word embeddings exhibit linear representations of semantics; the relationship between a country and its capital, for example, is encoded as a consistent linear direction.

The linear representation hypothesis (LRH) posits that high-level concepts are represented as linear representations in neural network activation space. This is an assumption that has been supported by empirical evidence, beginning with early work on word embeddings[17] as well as more recent research in mechanistic interpretability.[18][3][non-primary source needed]

Formalization of this assumption varies in the literature. Olah and Jermyn[19][non-primary source needed] allow for higher-rank (i.e. not necessarily rank-1 as in prior work) linear representations and propose two key properties of such representations: (i) composition of features is represented by addition, and (ii) the intensity of a concept is represented by its magnitude.[20]
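A minimal numerical sketch of these two properties, using hypothetical toy feature directions rather than directions extracted from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: two orthogonal unit feature directions in a
# 64-dimensional activation space, standing in for concepts such as
# "French" and "past tense".
d_model = 64
f_french = rng.normal(size=d_model)
f_french /= np.linalg.norm(f_french)
f_past = rng.normal(size=d_model)
f_past -= (f_past @ f_french) * f_french      # orthogonalise against f_french
f_past /= np.linalg.norm(f_past)

# (i) Composition is addition: an activation representing both concepts
# is modelled as the sum of the two directions.
activation = 1.0 * f_french + 0.5 * f_past    # "past tense" weakly present

# (ii) Intensity is magnitude: projecting back onto each direction
# recovers how strongly each concept is expressed.
print(activation @ f_french)   # ~1.0
print(activation @ f_past)     # ~0.5
```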

Counterexamples to the LRH even as formalized above have been found, suggesting that it only holds for some features in some models. For example, the semantics of feature directions are both empirically and theoretically not scale-invariant in non-linear neural networks, lending support to an affine (not directional) view of features via the polytope lens.[21][unreliable source?] A clear manifestation of this is the "onion representations" found in some RNNs trained on a sequence copying task, where the semantics of a feature varies with its scale.[22][unreliable source?]

Superposition

Superposition is the phenomenon where many unrelated features are "packed" into the same subspace or even into single neurons, making a network highly over-complete yet still linearly decodable after nonlinear filtering.[8] Recent formal analysis links the amount of polysemanticity to feature "capacity" and input sparsity, predicting when neurons become monosemantic or remain polysemantic.[23][unreliable source?]
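A toy numerical sketch of the phenomenon, with hypothetical directions not taken from any trained model: packing more feature directions than dimensions forces non-zero interference between features, which matters little as long as features rarely co-occur.

```python
import numpy as np

# Hypothetical toy example: 8 feature directions packed into a
# 4-dimensional space. They cannot all be orthogonal, so every pair
# has some interference (non-zero dot product).
rng = np.random.default_rng(0)
n_features, d_model = 8, 4
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit feature directions

interference = W @ W.T - np.eye(n_features)
print(np.abs(interference).max())               # > 0: directions overlap

# If features are sparse (rarely active together), reading out a feature
# is still dominated by its own contribution.
x = np.zeros(n_features)
x[2] = 1.0                      # only feature 2 is active
activation = x @ W              # superposed 4-dimensional representation
readout = activation @ W.T      # estimate of all 8 feature intensities
print(readout.round(2))         # largest at index 2; other entries are interference
```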

Methods

Probing

Probing involves training a linear classifier on model activations to test whether a feature is linearly decodable at a given layer or subset of neurons.[24] Generally, a linear probe is trained on a labelled dataset encoding the desired feature. While linear probes are popular amongst mechanistic interpretability researchers, they date back to at least 2016 and have been widely used in the NLP community.[25][unreliable source?]
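A minimal sketch of a linear probe using scikit-learn, assuming activations have already been cached; the data here is random placeholder data, so the labels and the resulting accuracy are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: residual-stream activations (n_examples, d_model) and
# binary labels for the concept of interest (e.g. "sentence is past tense").
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out accuracy suggests the concept is linearly decodable at this
# layer (with the random data above, accuracy will be around 50%).
print(probe.score(X_test, y_test))

# The learned weight vector can be read as a candidate feature direction.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```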

Nanda, Lee & Wattenberg (2023) showed that world-model features such as in-context truth values emerge as linearly decodable directions early in training, strengthening the case for linear probes as faithful monitors of internal state.

Difference-in-means

Difference-in-means, or diff-in-means, constructs a steering vector by subtracting the mean activation for one class of examples from the mean for another. Unlike learned probes, diff-in-means has no trainable parameters and often generalises better out-of-distribution.[26][unreliable source?] Diff-in-means has been used to isolate model representations of refusal/compliance, truth/falsehood, and sentiment.[27][unreliable source?][26][28][unreliable source?]
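A sketch of the diff-in-means construction on placeholder activations; note that, unlike the probe above, nothing is trained.

```python
import numpy as np

# Placeholder activations for two classes of prompts, e.g. harmful vs
# harmless requests when studying refusal.
rng = np.random.default_rng(0)
acts_class_a = rng.normal(size=(1000, 512))         # e.g. "harmful"
acts_class_b = rng.normal(size=(1000, 512)) + 0.1   # e.g. "harmless"

# The steering vector is simply the difference of the two class means.
steering_vector = acts_class_a.mean(axis=0) - acts_class_b.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

# Projecting a held-out activation onto this direction gives a
# parameter-free monitor for the concept.
score = acts_class_a[0] @ steering_vector
```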

Steering

Steering adds or subtracts a direction (often obtained via probing, diff-in-means, or K-means) from the residual stream to causally change model behavior.
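A sketch of steering implemented as a PyTorch forward hook; the module path, coefficient, and layer choice in the usage comment are hypothetical and depend on the model being steered.

```python
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Return a forward hook that adds coeff * direction to a module's output."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; steer the hidden states only.
        if isinstance(output, tuple):
            return (output[0] + coeff * direction,) + output[1:]
        return output + coeff * direction

    return hook

# Hypothetical usage with a decoder-only transformer:
#   layer = model.model.layers[12]          # module path depends on the model
#   handle = layer.register_forward_hook(make_steering_hook(direction, coeff=5.0))
#   ...generate text, observing the behavioural change...
#   handle.remove()
```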

Attribution

Causal interventions

While methods like probing allow for correlational understanding of model-internal components and representations, true reverse-engineering requires understanding the causal role of model internals. By treating neural networks as causal models, causal interventions (formalized in the do-calculus of Judea Pearl) enable answering this question.[29]

Broadly, given a model M, a clean input x_clean, a corrupted input x_corr, and a subcomponent of interest c, a causal intervention replaces the corrupted representation c(x_corr) with the clean representation c(x_clean) during a forward pass on x_corr, resulting in an intervened output M(x_corr | c ← c(x_clean)). If the subcomponent is causally relevant to the computation, then this intervention should restore the clean output. A variety of dataset setups, evaluation metrics, and model subcomponent granularities have been studied using this approach.

Several causal intervention techniques have been proposed for mechanistic interpretability, including causal mediation analysis,[30][31] interchange intervention (as part of the formal theory of causal abstraction),[32][unreliable source?][33] and activation patching.[34][unreliable source?][35] These all implement the same broad idea described above.
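A schematic activation patching experiment in PyTorch, following the broad recipe above; the model, module, and metric are placeholders to be supplied for a specific task, and clean and corrupted inputs are assumed to have matching shapes.

```python
import torch

@torch.no_grad()
def activation_patch(model, module, clean_input, corrupted_input, metric):
    """Run the model on corrupted_input, but with `module`'s activation replaced
    by its activation on clean_input, and return metric(output)."""
    cache = {}

    def save_hook(mod, inputs, output):
        cache["clean"] = output

    def patch_hook(mod, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return cache["clean"]

    # 1. Clean run: cache the subcomponent's activation.
    handle = module.register_forward_hook(save_hook)
    model(clean_input)
    handle.remove()

    # 2. Corrupted run with the cached clean activation patched in.
    handle = module.register_forward_hook(patch_hook)
    patched_output = model(corrupted_input)
    handle.remove()

    # 3. If the metric moves back toward its clean value, the subcomponent
    #    is causally relevant to the behaviour under study.
    return metric(patched_output)
```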

Gradient-based attribution

Causal intervention methods are expensive, requiring a separate forward pass for each model component being attributed, even for a single input. Gradient-based methods propose using a single backward pass to compute approximations of the patching effect for every model component simultaneously.[36] Methods in this vein include attribution patching,[37][38][unreliable source?] and AtP*,[39] along with standard interpretability techniques such as integrated gradients.[40][unreliable source?]

Attribution patching uses a locally linear approximation of the gradient of a subcomponent representation to estimate its downstream patching effect. Formally, given a metric L and a subcomponent whose activation is a_corr on the corrupted input and a_clean on the clean input, it computes the downstream effect as:

ΔL ≈ (a_clean − a_corr) · ∂L/∂a,

where the gradient ∂L/∂a is evaluated on the corrupted run.
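A sketch of this first-order estimate for a single component, assuming cached clean and corrupted activations and a differentiable metric; all names here are placeholders rather than an established API.

```python
import torch

def attribution_patching_estimate(metric_on_corrupted, a_corr, a_clean):
    """Linear estimate of the effect of patching activation a_corr -> a_clean.

    metric_on_corrupted: callable computing the task metric as a function of the
    component's activation on the corrupted input (with the rest of the corrupted
    forward pass folded in), so that gradients flow through it.
    """
    a_corr = a_corr.clone().requires_grad_(True)
    metric = metric_on_corrupted(a_corr)
    (grad,) = torch.autograd.grad(metric, a_corr)

    # First-order Taylor expansion of the metric around the corrupted activation:
    # delta_metric ~= (a_clean - a_corr) . d(metric)/d(a)
    return ((a_clean - a_corr.detach()) * grad).sum()
```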

Sparse decomposition

A major goal of mechanistic interpretability is to decompose pre-trained neural networks into interpretable components.[41] Existing architectural pieces of neural networks (e.g. attention heads, individual neurons) have been found to be uninterpretable, exhibiting "polysemanticity", i.e. implementing multiple behaviors at once. Sparse decomposition methods seek to discover the interpretable subcomponents of a model in a self-supervised fashion, building on intuitions from the linear representation hypothesis and superposition.

Sparse dictionary learning (SDL)

Sparse autoencoders (SAEs)

Sparse autoencoders (SAEs) for mechanistic interpretability were proposed in order to address the superposition problem by decomposing the feature space into an overcomplete basis (i.e. with more features than dimensions) of monosemantic concepts. The underlying intuition is that features can only be manipulable under superposition if they are sparsely activated (otherwise, interference between features would be too high).[10]

Given a vector x ∈ R^d representing an activation collected from some model component (in a transformer, usually the MLP inner activation or the residual stream), the sparse autoencoder computes the following:

z = ReLU(W_enc x + b_enc),   x̂ = W_dec z + b_dec.

Here, W_enc ∈ R^(m×d) projects the activation into an m-dimensional latent space (with m > d), ReLU applies the nonlinearity, and finally the decoder W_dec ∈ R^(d×m) aims to reconstruct the original activation from this latent representation. The bias terms are b_enc and b_dec; the latter is omitted in some formulations. The encoder and decoder matrices may also be tied.

Given a dataset of activations X, the SAE is trained with gradient descent to minimise the following loss function:

L(x) = ‖x − x̂‖² + λ‖z‖₁,

where the first term is the reconstruction loss (i.e. the standard autoencoding objective) and the second is a sparsity loss on the latent representation, which aims to minimise its ℓ₁-norm.
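A minimal PyTorch sketch of this architecture and training loss; the hyperparameters and the untied encoder/decoder are illustrative choices, not those of any particular published SAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # W_enc, b_enc
        self.decoder = nn.Linear(d_latent, d_model)   # W_dec, b_dec
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))       # sparse latent code
        x_hat = self.decoder(z)           # reconstruction of the activation
        return x_hat, z

    def loss(self, x: torch.Tensor):
        x_hat, z = self(x)
        reconstruction = F.mse_loss(x_hat, x)                 # ||x - x_hat||^2
        sparsity = z.abs().sum(dim=-1).mean()                 # L1 penalty on latents
        return reconstruction + self.l1_coeff * sparsity

# Illustrative usage on a batch of cached residual-stream activations:
#   sae = SparseAutoencoder(d_model=768, d_latent=768 * 16)
#   optimiser = torch.optim.Adam(sae.parameters(), lr=1e-4)
#   loss = sae.loss(activation_batch); loss.backward(); optimiser.step()
```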

Alternative designs

Several works motivate alternative nonlinearities to ReLU based on improved downstream performance or training stability.

  • TopK, which keeps only the top-k activating latents and zeroes out the rest, allowing the sparsity loss to be dropped entirely.[42][unreliable source?]
  • JumpReLU, defined as JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step function and θ is a learned threshold.[43][unreliable source?] Anthropic adopted this modification for training SAEs and crosscoders.[44][non-primary source needed]
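Minimal sketches of the two activation functions above; the value of k and the threshold are illustrative, and the straight-through gradient trick needed to train a JumpReLU threshold is omitted.

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero the rest."""
    values, indices = pre_acts.topk(k, dim=-1)
    latents = torch.zeros_like(pre_acts)
    return latents.scatter(-1, indices, values)

def jumprelu_activation(pre_acts: torch.Tensor, theta: float = 0.01) -> torch.Tensor:
    """z * H(z - theta): pass values through only above the threshold theta."""
    return pre_acts * (pre_acts > theta).to(pre_acts.dtype)
```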

Other architectural and loss function modifications include:

  • Gated SAEs, which separate the decision of which latents fire from the estimation of their magnitudes.[45]
  • Replacing the ℓ₁ sparsity penalty with a tanh-based penalty.[46]

Evaluation

The core metrics for evaluating SAEs are sparsity, measured by the ℓ₀-norm of the latent representations over the dataset, and fidelity, which may be the MSE reconstruction error as in the loss function or a downstream metric when substituting the SAE output into the model, such as loss recovered or KL-divergence from the original model behaviour.[47][unreliable source?]
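A sketch of the two core metrics computed from a batch of cached activations; the SAE interface follows the sketch above, and downstream metrics such as loss recovered would additionally require substituting the reconstruction back into the model.

```python
import torch

@torch.no_grad()
def sae_core_metrics(sae, acts: torch.Tensor):
    x_hat, z = sae(acts)

    # Sparsity: average number of non-zero latents per example (the L0 "norm").
    l0 = (z != 0).float().sum(dim=-1).mean()

    # Fidelity: fraction of variance in the activations left unexplained.
    mse = (acts - x_hat).pow(2).sum()
    variance = (acts - acts.mean(dim=0)).pow(2).sum()
    fvu = mse / variance

    return l0.item(), fvu.item()
```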

SAE latents are usually labelled using an autointerpretability pipeline. Most such pipelines feed highly-activating dataset exemplars (i.e. the inputs whose activations yield the largest latent value z_i for feature i, repeated over all features) to a large language model, which generates a natural-language description based on the contexts in which the latent is active.
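A sketch of the exemplar-gathering step of such a pipeline, given latent activations cached over a tokenised corpus; the prompt construction and the call to the explainer model are omitted.

```python
import torch

def top_activating_contexts(latent_acts: torch.Tensor, tokens: list[str],
                            feature_idx: int, n_examples: int = 20, window: int = 10):
    """Return the contexts around the tokens on which feature_idx fires most strongly.

    latent_acts: (n_tokens, n_latents) cached SAE latent activations.
    tokens: the corresponding decoded tokens.
    """
    values, positions = latent_acts[:, feature_idx].topk(n_examples)
    contexts = []
    for value, pos in zip(values.tolist(), positions.tolist()):
        left, right = max(0, pos - window), min(len(tokens), pos + window + 1)
        contexts.append((value, "".join(tokens[left:right])))
    # These (activation, context) pairs are then formatted into a prompt for an
    # explainer LLM, which proposes a natural-language label for the feature.
    return contexts
```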

Early works directly adapt Bills et al. (2023)'s neuron-labelling and evaluation pipeline and report higher interpretability scores than alternative methods (the standard basis, PCA, etc.).[10][11] However, this can produce misleading scores, since such explanations achieve high recall but usually low precision. Later works therefore introduced more nuanced evaluation metrics: neuron-to-graph explanations (or other approaches) that report both precision and recall,[42] and intervention-based metrics that measure the downstream effect of manipulating a latent feature.[48][unreliable source?][49][unreliable source?]

Transcoders

Transcoders are formulated identically to SAEs, with the caveat that they seek to approximate the input-output behaviour of a model component (usually the MLP).[50] This is useful for measuring how latent features in different layers of the model affect each other in an input-invariant manner (i.e. by directly comparing encoder and decoder weights). A transcoder thus computes the following:

z = ReLU(W_enc x + b_enc),   ŷ = W_dec z + b_dec,

where x is the component's input and ŷ approximates its output y, and is trained to minimise the loss:

L(x, y) = ‖y − ŷ‖² + λ‖z‖₁.

When ignoring or holding attention components constant (which may obscure some information), transcoders trained on different layers of a model can then be used to conduct circuit analysis without having to process individual inputs and collect latent activations, unlike SAEs.
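A sketch of the corresponding architecture, differing from the SAE sketch above only in its reconstruction target; this is illustrative, not a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Transcoder(nn.Module):
    """Like an SAE, but trained to map a component's input to its output."""
    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.l1_coeff = l1_coeff

    def loss(self, mlp_input: torch.Tensor, mlp_output: torch.Tensor):
        z = F.relu(self.encoder(mlp_input))
        y_hat = self.decoder(z)
        # The reconstruction targets the component's *output*, not its input.
        return F.mse_loss(y_hat, mlp_output) + self.l1_coeff * z.abs().sum(-1).mean()
```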

Transcoders generally outperform SAEs, achieving lower loss and better automated interpretability scores.[50]

Crosscoders and cross-layer transcoders

A disadvantage of single-layer SAEs and transcoders is that they produce duplicate features when trained on multiple layers, if those features persist throughout the residual stream. This complicates understanding layer-to-layer feature propagation and also wastes latent parameters. Crosscoders were introduced to enable cross-layer representation of features, which minimizes these issues.[51][non-primary source needed][52][unreliable source?] They outperform SAEs given the same feature budget but are worse on an equal FLOPs budget.

A crosscoder computes the cross-layer latent representation z using a set of layer-wise activations a¹, …, a^L over L layers obtained from some input as follows:

z = ReLU(Σ_l W_enc^l a^l + b_enc).

The reconstruction is done independently for each layer using this cross-layer representation:

â^l = W_dec^l z + b_dec^l.

Alternatively, the target may be layer-wise component outputs if using the transcoder objective. The model is then trained to minimise a loss of the form:

L = Σ_l ‖a^l − â^l‖² + λ Σ_i z_i Σ_l ‖W_dec^(l,i)‖,

in which the sparsity penalty on each latent is weighted by its per-layer decoder norms. Note that the regularisation term combines the per-layer decoder norms with an ℓ₁-norm (a plain sum over layers); the ℓ₂-norm is an alternative choice, considered but not used in the original paper.[51]
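A sketch of a crosscoder in PyTorch, with per-layer encoders summed into a shared latent code, per-layer decoders, and the decoder-norm-weighted sparsity penalty described above; hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Crosscoder(nn.Module):
    """Shared latent code over activations collected from several layers."""
    def __init__(self, n_layers: int, d_model: int, d_latent: int,
                 sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, d_latent, bias=False) for _ in range(n_layers)])
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.decoders = nn.ModuleList(
            [nn.Linear(d_latent, d_model) for _ in range(n_layers)])
        self.sparsity_coeff = sparsity_coeff

    def forward(self, acts_per_layer):      # list of (batch, d_model) tensors
        z = F.relu(sum(enc(a) for enc, a in zip(self.encoders, acts_per_layer))
                   + self.b_enc)
        recons = [dec(z) for dec in self.decoders]
        return z, recons

    def loss(self, acts_per_layer):
        z, recons = self(acts_per_layer)
        recon_loss = sum(F.mse_loss(r, a) for r, a in zip(recons, acts_per_layer))
        # Sparsity penalty weighted by each latent's summed per-layer decoder norms.
        dec_norms = sum(dec.weight.norm(dim=0) for dec in self.decoders)  # (d_latent,)
        sparsity = (z * dec_norms).sum(dim=-1).mean()
        return recon_loss + self.sparsity_coeff * sparsity
```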

Circuit tracing and automated graph discovery

Automated circuit discovery (ACDC) prunes the computational graph by iteratively patch-testing edges, localising minimal sub-circuits without manual inspection of each component.[53][unreliable source?]
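A schematic of the greedy pruning loop, heavily simplified relative to the published algorithm; the ablation callable is model- and task-specific and left abstract here.

```python
def prune_edges(edges, metric_with_edges_ablated, threshold: float):
    """Greedy ACDC-style sketch: drop an edge whenever ablating it (on top of
    the edges already dropped) changes the task metric by less than `threshold`.

    metric_with_edges_ablated: model-specific callable, omitted here, that runs
    the model with the given set of edges knocked out (e.g. by patching in
    corrupted activations) and returns the task metric.
    """
    removed = set()
    current = metric_with_edges_ablated(removed)
    for edge in edges:
        candidate = removed | {edge}
        new_metric = metric_with_edges_ablated(candidate)
        if abs(new_metric - current) < threshold:
            removed, current = candidate, new_metric    # edge is not needed
    return [e for e in edges if e not in removed]       # the recovered sub-circuit
```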

Circuit tracing substitutes parts of the model, in particular the MLP blocks, with more interpretable components called transcoders, with the goal of recovering explicit computational graphs. Like SAEs, circuit tracing uses sparse dictionary learning techniques; instead of reconstructing model activations, however, transcoders aim to predict the output of non-linear components given their input. The technique was introduced in the paper "Circuit Tracing: Revealing Computational Graphs in Language Models", published in April 2025 by Anthropic. Circuit tracing has been used to understand how a model plans the rhyme in a poem, performs medical diagnosis, and produces unfaithful chains of thought.[54]

References

  1. Olah, Chris (June 2022). "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases". Transformer Circuits Thread. Anthropic. Retrieved 28 March 2025.
  2. Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). "Zoom In: An Introduction to Circuits". Distill. 5 (3). doi:10.23915/distill.00024.001.
  3. Elhage, Nelson; et al. (2021). "A Mathematical Framework for Transformer Circuits". Transformer Circuits Thread. Anthropic.
  4. Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?". arXiv:2410.09087 [cs.AI].
  5. Olah, Chris; et al. (2018). "The Building Blocks of Interpretability". Distill. 3 (3). doi:10.23915/distill.00010.
  6. "Transformer Circuits Thread". transformer-circuits.pub. Retrieved 2025-05-12.
  7. Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; Drain, Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Johnston, Scott; Jones, Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, Dario; Brown, Tom; Clark, Jack; Kaplan, Jared; McCandlish, Sam; Olah, Chris (2022). "In-context Learning and Induction Heads". arXiv:2209.11895 [cs.LG].
  8. Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; Olah, Christopher (2022). "Toy Models of Superposition". arXiv:2209.10652 [cs.LG].
  9. Nanda, Neel; Chan, Lawrence; Lieberum, Tom; Smith, Jess; Steinhardt, Jacob (2023). "Progress measures for grokking via mechanistic interpretability". ICLR.
  10. Cunningham, Hoagy; Ewart, Aidan; Riggs, Logan; Huben, Robert; Sharkey, Lee (May 7–11, 2024). "Sparse Autoencoders Find Highly Interpretable Features in Language Models". The Twelfth International Conference on Learning Representations (ICLR 2024). Vienna, Austria: OpenReview.net. Retrieved 2025-04-29.
  11. Bricken, Trenton; et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning". Transformer Circuits Thread. Retrieved 2025-04-29.
  12. Sullivan, Mark (2025-04-22). "This startup wants to reprogram the mind of AI—and just got $50 million to do it". Fast Company. Archived from the original on 2025-05-04. Retrieved 2025-05-12.
  13. "ICML 2024 Mechanistic Interpretability Workshop". icml2024mi.pages.dev. Retrieved 2025-05-12.
  14. @ch402 (July 29, 2024). "I was motivated by many of my colleagues at Google Brain being deeply skeptical of things like saliency maps. When I started the OpenAI interpretability team, I used it to distinguish our goal: understand how the weights of a neural network map to algorithms" (Tweet) via Twitter.
  15. Nanda, Neel (January 31, 2023). "Mechanistic Interpretability Quickstart Guide". Neel Nanda. Retrieved 28 March 2025.
  16. "Mechanistic Interpretability Workshop 2024". 2024.
  17. Mikolov, Tomas; et al. (2013). "Linguistic Regularities in Continuous Space Word Representations". Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia: Association for Computational Linguistics. pp. 746–751. Retrieved 2025-07-01.
  18. Park, Kiho; Choe, Yo Joong; Veitch, Victor (2024). "The Linear Representation Hypothesis and the Geometry of Large Language Models". ICML.
  19. Olah, Chris; Jermyn, Adam (2024). "Circuits Updates - July 2024: What is a Linear Representation? What is a Multidimensional Feature?". Transformer Circuits Thread. Anthropic.
  20. Sharkey et al. (2025), p. 11.
  21. Black, Sid; et al. (2022). "Interpreting Neural Networks through the Polytope Lens". arXiv:2211.12312 [cs.LG].
  22. Csordás, Róbert; Potts, Christopher; Manning, Christopher D.; Geiger, Atticus (2024). "Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations". arXiv:2408.10920 [cs.LG].
  23. Scherlis, Adam (2025). "Polysemanticity and Capacity in Neural Networks". arXiv:2210.01892 [cs.NE].
  24. Bereska, Leonard; Gavves, Efstratios (2024). "Mechanistic Interpretability for AI Safety -- A Review". TMLR.
  25. Alain, Guillaume; Bengio, Yoshua (2018). "Understanding intermediate layers using linear classifier probes". arXiv:1610.01644 [stat.ML].
  26. Marks, Samuel; Tegmark, Max (2024). "The Geometry of Truth: Emergent Linear Structure in LLM Representations". arXiv:2310.06824 [cs.AI].
  27. Arditi, Andy; Obeso, Oscar; Syed, Aaquib; Paleka, Daniel; Panickssery, Nina; Gurnee, Wes; Nanda, Neel (2024). "Refusal in Language Models Is Mediated by a Single Direction". arXiv:2406.11717 [cs.LG].
  28. Tigges, Curt; Hollinsworth, Oskar John; Geiger, Atticus; Nanda, Neel (2023). "Linear Representations of Sentiment in Large Language Models". arXiv:2310.15154 [cs.LG].
  29. Sharkey et al. (2025), p. 16.
  30. Vig, Jesse; et al. (2020). "Investigating Gender Bias in Language Models Using Causal Mediation Analysis". Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Retrieved 2025-07-02.
  31. Meng, Kevin; et al. (2022). "Locating and Editing Factual Associations in GPT". Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Retrieved 2025-07-02.
  32. Geiger, Atticus; Richardson, Kyle; Potts, Christopher (2020). "Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation". arXiv:2004.14623 [cs.CL].
  33. Geiger, Atticus; et al. (2025). "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability". JMLR. 26 (83): 1–64.
  34. Wang, Kevin; et al. (2022). "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small". arXiv:2211.00593 [cs.LG].
  35. Zhang, Fred; Nanda, Neel (2024). "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods". International Conference on Learning Representations (ICLR). Retrieved 16 August 2025.
  36. Sharkey et al. 2025, p. 17.
  37. Nanda, Neel (2023). "Attribution Patching: Activation Patching At Industrial Scale".
  38. Syed, Aaquib; Rager, Can; Nanda, Neel (2023). "Attribution Patching Outperforms Automated Circuit Discovery". arXiv:2310.10348 [cs.LG].
  39. Kramár, János; et al. (2024). "AtP*: An efficient and scalable method for localizing LLM behaviour to components". arXiv:2403.00745 [cs.LG].
  40. Sundararajan, Mukund; et al. (2017). "Axiomatic Attribution for Deep Networks". arXiv:1703.01365 [cs.LG].
  41. Sharkey et al. 2025, p. 8.
  42. Gao, Leo; et al. (2024). "Scaling and evaluating sparse autoencoders". arXiv:2406.04093 [cs.LG].
  43. Rajamanoharan, Senthooran; et al. (2024). "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders". arXiv:2407.14435 [cs.LG].
  44. Conerly, Tom; et al. (2024). "Circuits Updates - January 2025: Dictionary Learning Optimization Techniques". Transformer Circuits Thread. Anthropic. Retrieved 2025-07-01.
  45. Rajamanoharan, Senthooran; et al. (2024). "Improving Dictionary Learning with Gated Sparse Autoencoders". arXiv:2404.16014 [cs.LG].
  46. Jermyn, Adam; et al. (2024). "Circuits Updates - February 2024: Tanh Penalty in Dictionary Learning". Transformer Circuits Thread. Anthropic. Retrieved 2025-07-01.
  47. Karvonen, Adam; et al. (2025). "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability". arXiv:2503.09532 [cs.LG].
  48. Paulo, Gonçalo; et al. (2024). "Automatically Interpreting Millions of Features in Large Language Models". arXiv:2410.13928 [cs.LG].
  49. Wu, Zhengxuan; et al. (2025). "AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders". arXiv:2501.17148 [cs.CL].
  50. Dunefsky, Jacob; et al. (December 10–15, 2024). "Transcoders find interpretable LLM feature circuits". Advances in Neural Information Processing Systems 38 (NeurIPS 2024). Vancouver, BC, Canada. Retrieved 2025-04-29.
  51. Lindsey, Jack; et al. (2024). "Sparse Crosscoders for Cross-Layer Features and Model Diffing". Transformer Circuits Thread. Anthropic. Retrieved 2025-04-30.
  52. Gorton, Liv (2024). "Group Crosscoders for Mechanistic Analysis of Symmetry". arXiv:2410.24184 [cs.LG].
  53. Conmy, Arthur (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability". arXiv:2304.14997 [cs.LG].
  54. "Circuit Tracing: Revealing Computational Graphs in Language Models". Transformer Circuits. Retrieved 2025-06-30.

Sources

  • Sharkey, Lee; et al. (2025). "Open Problems in Mechanistic Interpretability". arXiv:2501.16496 [cs.LG].