Mechanistic interpretability

Mechanistic interpretability (often shortened to mech interp, mechinterp or MI) is a subfield of explainable artificial intelligence that seeks to fully reverse-engineer neural networks, with the goal of understanding the mechanisms underlying their computations.[1][non-primary source needed][2][3] In recent years the field has focused primarily on large language models.

History

Chris Olah is credited with coining the term mechanistic interpretability.[4] In the 2018 paper The Building Blocks of Interpretability, Olah and his colleagues combined existing interpretability techniques, including feature visualization, dimensionality reduction, and attribution, with human-computer interface methods to explore features represented by the neurons in the vision model Inception v1.[5] In March 2020, Olah and the OpenAI Clarity team published Zoom In: An Introduction to Circuits, which described an approach inspired by neuroscience and cellular biology. They proposed that features function as the basis of neural network computation and connect together to form circuits.[2]

In 2021, Chris Olah co-founded the company Anthropic and established its Interpretability team, which publishes their results on the Transformer Circuits Thread.[6] In December 2021, the team published A Mathematical Framework for Transformer Circuits, reverse-engineering a toy transformer with one and two attention layers. Notably, they discovered the complete algorithm of induction circuits, responsible for in-context learning of repeated token sequences. The team further elaborated this result in the March 2022 paper In-context Learning and Induction Heads.[7]

Notable results in mechanistic interpretability from 2022 include the theory of superposition wherein a model represents more features than there are directions in its representation space;[8] a mechanistic explanation for grokking, the phenomenon where test-set loss begins to decay only after a delay relative to training-set loss;[9] and the introduction of sparse autoencoders, a sparse dictionary learning method to extract interpretable features from LLMs.[10][11]

Goodfire, an AI interpretability startup, was founded in 2024.[12]

In recent years, mechanistic interpretability has grown substantially in scope, number of practitioners, and attention within the ML community. In July 2024, the first ICML Mechanistic Interpretability Workshop was held, aiming to bring together "separate threads of work in industry and academia".[13]

Definition

The term mechanistic interpretability designates both a class of technical methods and a research community. Chris Olah is usually credited with coining the term. His motivation was to differentiate this nascent approach to interpretability from established saliency map-based approaches which at the time dominated computer vision.[14][non-primary source needed]

In-field explanations of the goal of mechanistic interpretability draw an analogy to reverse-engineering computer programs,[3][15] arguing that, rather than being arbitrary functions, neural networks' representations are composed of independent, reverse-engineerable mechanisms that are compressed into the weights.

Mechanistic interpretability seeks to reverse engineer neural networks, similar to how one might reverse engineer a compiled binary computer program.

Chris Olah, "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases"[1] [emphasis added]

One emerging approach for understanding the internals of neural networks is mechanistic interpretability: reverse engineering the algorithms implemented by neural networks into human-understandable mechanisms, often by examining the weights and activations of neural networks to identify circuits [Cammarata et al., 2020, Elhage et al., 2021] that implement particular behaviors.

Mechanistic Interpretability Workshop 2024[16] [emphasis added]

Mechanistic interpretability's early development was rooted in the AI safety community, though the term is increasingly adopted by academia at large. In “Mechanistic?”, Saphra and Wiegreffe identify four senses of “mechanistic interpretability”:[4]

  1. Narrow technical definition: A technical approach to understanding neural networks through their causal mechanisms.
  2. Broad technical definition: Any research that describes the internals of a model, including its activations or weights.
  3. Narrow cultural definition: Any research originating from the MI community.
  4. Broad cultural definition: Any research in the field of AI—especially LM—interpretability.

As the scope and popular recognition of mechanistic interpretability increase, many[who?] have begun to recognize that other communities such as natural language processing researchers have pursued similar objectives in their work.

Key concepts

Linear representation hypothesis

Image caption: Simple word embeddings exhibit linear representations of semantics; the relationship between a country and its capital, for example, is encoded as a consistent linear direction.

The linear representation hypothesis (LRH) posits that high-level concepts are represented as linear representations in neural network activation space. This is an assumption that has been supported by empirical evidence, beginning with early work on word embeddings[17] as well as more recent research in mechanistic interpretability.[18][3][non-primary source needed]

Formalization of this assumption varies in the literature. Olah and Jermyn[19][non-primary source needed] allow for higher-rank (i.e. not necessarily rank-1 as in prior work) linear representations and propose two key properties of such representations: (i) composition of features is represented by addition, and (ii) the intensity of a concept is represented by its magnitude.[20]
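A minimal numerical sketch of these two properties, using hypothetical toy feature directions rather than directions extracted from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: two orthogonal unit feature directions in a
# 64-dimensional activation space, standing in for concepts such as
# "French" and "past tense".
d_model = 64
f_french = rng.normal(size=d_model)
f_french /= np.linalg.norm(f_french)
f_past = rng.normal(size=d_model)
f_past -= (f_past @ f_french) * f_french      # orthogonalise against f_french
f_past /= np.linalg.norm(f_past)

# (i) Composition is addition: an activation representing both concepts
# is modelled as the sum of the two directions.
activation = 1.0 * f_french + 0.5 * f_past    # "past tense" weakly present

# (ii) Intensity is magnitude: projecting back onto each direction
# recovers how strongly each concept is expressed.
print(activation @ f_french)   # ~1.0
print(activation @ f_past)     # ~0.5
```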

Counterexamples to the LRH even as formalized above have been found, suggesting that it only holds for some features in some models. For example, the semantics of feature directions are both empirically and theoretically not scale-invariant in non-linear neural networks, lending support to an affine (not directional) view of features via the polytope lens.[21][unreliable source?] A clear manifestation of this is the "onion representations" found in some RNNs trained on a sequence copying task, where the semantics of a feature varies with its scale.[22][unreliable source?]

Superposition

Superposition is the phenomenon where many unrelated features are "packed" into the same subspace or even into single neurons, making a network highly over-complete yet still linearly decodable after nonlinear filtering.[8] Recent formal analysis links the amount of polysemanticity to feature "capacity" and input sparsity, predicting when neurons become monosemantic or remain polysemantic.[23][unreliable source?]
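A toy numerical sketch of the phenomenon, with hypothetical directions not taken from any trained model: packing more feature directions than dimensions forces non-zero interference between features, which matters little as long as features rarely co-occur.

```python
import numpy as np

# Hypothetical toy example: 8 feature directions packed into a
# 4-dimensional space. They cannot all be orthogonal, so every pair
# has some interference (non-zero dot product).
rng = np.random.default_rng(0)
n_features, d_model = 8, 4
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit feature directions

interference = W @ W.T - np.eye(n_features)
print(np.abs(interference).max())               # > 0: directions overlap

# If features are sparse (rarely active together), reading out a feature
# is still dominated by its own contribution.
x = np.zeros(n_features)
x[2] = 1.0                      # only feature 2 is active
activation = x @ W              # superposed 4-dimensional representation
readout = activation @ W.T      # estimate of all 8 feature intensities
print(readout.round(2))         # largest at index 2; other entries are interference
```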

Methods

Probing

Probing involves training a linear classifier on model activations to test whether a feature is linearly decodable at a given layer or subset of neurons.[24] Generally, a linear probe is trained on a labelled dataset encoding the desired feature. While linear probes are popular amongst mechanistic interpretability researchers, they date back to at least 2016 and have been widely used in the NLP community.[25][unreliable source?]
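A minimal sketch of a linear probe using scikit-learn, assuming activations have already been cached; the data here is random placeholder data, so the labels and the resulting accuracy are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: residual-stream activations (n_examples, d_model) and
# binary labels for the concept of interest (e.g. "sentence is past tense").
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# High held-out accuracy suggests the concept is linearly decodable at this
# layer (with the random data above, accuracy will be around 50%).
print(probe.score(X_test, y_test))

# The learned weight vector can be read as a candidate feature direction.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```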

Nanda, Lee & Wattenberg (2023) showed that world-model features such as in-context truth values emerge as linearly decodable directions early in training, strengthening the case for linear probes as faithful monitors of internal state.

Difference-in-means

Difference-in-means, or diff-in-means, constructs a steering vector by subtracting the mean activation for one class of examples from the mean for another. Unlike learned probes, diff-in-means has no trainable parameters and often generalises better out-of-distribution.[26][unreliable source?] Diff-in-means has been used to isolate model representations of refusal/compliance, truth/falsehood, and sentiment.[27][unreliable source?][26][28][unreliable source?]
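A sketch of the diff-in-means construction on placeholder activations; note that, unlike the probe above, nothing is trained.

```python
import numpy as np

# Placeholder activations for two classes of prompts, e.g. harmful vs
# harmless requests when studying refusal.
rng = np.random.default_rng(0)
acts_class_a = rng.normal(size=(1000, 512))         # e.g. "harmful"
acts_class_b = rng.normal(size=(1000, 512)) + 0.1   # e.g. "harmless"

# The steering vector is simply the difference of the two class means.
steering_vector = acts_class_a.mean(axis=0) - acts_class_b.mean(axis=0)
steering_vector /= np.linalg.norm(steering_vector)

# Projecting a held-out activation onto this direction gives a
# parameter-free monitor for the concept.
score = acts_class_a[0] @ steering_vector
```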

Steering

Steering adds or subtracts a direction (often obtained via probing, diff-in-means, or K-means) from the residual stream to causally change model behavior.
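A sketch of steering implemented as a PyTorch forward hook; the module path, coefficient, and layer choice in the usage comment are hypothetical and depend on the model being steered.

```python
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Return a forward hook that adds coeff * direction to a module's output."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; steer the hidden states only.
        if isinstance(output, tuple):
            return (output[0] + coeff * direction,) + output[1:]
        return output + coeff * direction

    return hook

# Hypothetical usage with a decoder-only transformer:
#   layer = model.model.layers[12]          # module path depends on the model
#   handle = layer.register_forward_hook(make_steering_hook(direction, coeff=5.0))
#   ...generate text, observing the behavioural change...
#   handle.remove()
```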

Attribution

Causal interventions

While methods like probing allow for correlational understanding of model-internal components and representations, true reverse-engineering requires understanding the causal role of model internals. By treating neural networks as causal models, causal interventions (formalized in the do-calculus of Judea Pearl) enable answering this question.[29]

Broadly, given a model M, a clean input x_clean, a corrupted input x_corr, and a subcomponent of interest c, a causal intervention replaces the corrupted representation c(x_corr) with the clean representation c(x_clean) during a forward pass on x_corr, resulting in an intervened output M(x_corr | c ← c(x_clean)). If the subcomponent is causally relevant to the computation, then this intervention should restore the clean output. A variety of dataset setups, evaluation metrics, and model subcomponent granularities have been studied using this approach.

Several causal intervention techniques have been proposed for mechanistic interpretability, including causal mediation analysis,[30][31] interchange intervention (as part of the formal theory of causal abstraction),[32][unreliable source?][33] and activation patching.[34][unreliable source?][35] These all implement the same broad idea described above.
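A schematic activation patching experiment in PyTorch, following the broad recipe above; the model, module, and metric are placeholders to be supplied for a specific task, and clean and corrupted inputs are assumed to have matching shapes.

```python
import torch

@torch.no_grad()
def activation_patch(model, module, clean_input, corrupted_input, metric):
    """Run the model on corrupted_input, but with `module`'s activation replaced
    by its activation on clean_input, and return metric(output)."""
    cache = {}

    def save_hook(mod, inputs, output):
        cache["clean"] = output

    def patch_hook(mod, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return cache["clean"]

    # 1. Clean run: cache the subcomponent's activation.
    handle = module.register_forward_hook(save_hook)
    model(clean_input)
    handle.remove()

    # 2. Corrupted run with the cached clean activation patched in.
    handle = module.register_forward_hook(patch_hook)
    patched_output = model(corrupted_input)
    handle.remove()

    # 3. If the metric moves back toward its clean value, the subcomponent
    #    is causally relevant to the behaviour under study.
    return metric(patched_output)
```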

Gradient-based attribution

Causal intervention methods are expensive, requiring a separate forward pass for each model component being attributed, even for a single input. Gradient-based methods propose using a single backward pass to compute approximations of the patching effect for every model component simultaneously.[36] Methods in this vein include attribution patching,[37][38][unreliable source?] and AtP*,[39] along with standard interpretability techniques such as integrated gradients.[40][unreliable source?]

Attribution patching uses a locally linear approximation of the gradient of a subcomponent representation to estimate its downstream patching effect. Formally, given a metric L and a subcomponent whose activation is a_corr on the corrupted input and a_clean on the clean input, it computes the downstream effect as:

ΔL ≈ (a_clean − a_corr) · ∂L/∂a,

where the gradient ∂L/∂a is evaluated on the corrupted run.
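A sketch of this first-order estimate for a single component, assuming cached clean and corrupted activations and a differentiable metric; all names here are placeholders rather than an established API.

```python
import torch

def attribution_patching_estimate(metric_on_corrupted, a_corr, a_clean):
    """Linear estimate of the effect of patching activation a_corr -> a_clean.

    metric_on_corrupted: callable computing the task metric as a function of the
    component's activation on the corrupted input (with the rest of the corrupted
    forward pass folded in), so that gradients flow through it.
    """
    a_corr = a_corr.clone().requires_grad_(True)
    metric = metric_on_corrupted(a_corr)
    (grad,) = torch.autograd.grad(metric, a_corr)

    # First-order Taylor expansion of the metric around the corrupted activation:
    # delta_metric ~= (a_clean - a_corr) . d(metric)/d(a)
    return ((a_clean - a_corr.detach()) * grad).sum()
```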

Sparse decomposition

A major goal of mechanistic interpretability is to decompose pre-trained neural networks into interpretable components.[41] Existing architectural pieces of neural networks (e.g. attention heads, individual neurons) have been found to be uninterpretable, exhibiting "polysemanticity", i.e. implementing multiple behaviors at once. Sparse decomposition methods seek to discover the interpretable subcomponents of a model in a self-supervised fashion, building on intuitions from the linear representation hypothesis and superposition.

Sparse dictionary learning (SDL)

Sparse autoencoders (SAEs)

Sparse autoencoders (SAEs) for mechanistic interpretability were proposed in order to address the superposition problem by decomposing the feature space into an overcomplete basis (i.e. with more features than dimensions) of monosemantic concepts. The underlying intuition is that features can only be manipulable under superposition if they are sparsely activated (otherwise, interference between features would be too high).[10]

Given a vector x ∈ R^d representing an activation collected from some model component (in a transformer, usually the MLP inner activation or the residual stream), the sparse autoencoder computes the following:

z = ReLU(W_enc x + b_enc),   x̂ = W_dec z + b_dec.

Here, W_enc ∈ R^(m×d) projects the activation into an m-dimensional latent space (with m > d), ReLU applies the nonlinearity, and finally the decoder W_dec ∈ R^(d×m) aims to reconstruct the original activation from this latent representation. The bias terms are b_enc and b_dec; the latter is omitted in some formulations. The encoder and decoder matrices may also be tied.

Given a dataset of activations X, the SAE is trained with gradient descent to minimise the following loss function:

L(x) = ‖x − x̂‖² + λ‖z‖₁,

where the first term is the reconstruction loss (i.e. the standard autoencoding objective) and the second is a sparsity loss on the latent representation, which aims to minimise its ℓ₁-norm.
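A minimal PyTorch sketch of this architecture and training loss; the hyperparameters and the untied encoder/decoder are illustrative choices, not those of any particular published SAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # W_enc, b_enc
        self.decoder = nn.Linear(d_latent, d_model)   # W_dec, b_dec
        self.l1_coeff = l1_coeff

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))       # sparse latent code
        x_hat = self.decoder(z)           # reconstruction of the activation
        return x_hat, z

    def loss(self, x: torch.Tensor):
        x_hat, z = self(x)
        reconstruction = F.mse_loss(x_hat, x)                 # ||x - x_hat||^2
        sparsity = z.abs().sum(dim=-1).mean()                 # L1 penalty on latents
        return reconstruction + self.l1_coeff * sparsity

# Illustrative usage on a batch of cached residual-stream activations:
#   sae = SparseAutoencoder(d_model=768, d_latent=768 * 16)
#   optimiser = torch.optim.Adam(sae.parameters(), lr=1e-4)
#   loss = sae.loss(activation_batch); loss.backward(); optimiser.step()
```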

Alternative designs

Several works motivate alternative nonlinearities to ReLU based on improved downstream performance or training stability.

  • TopK, which keeps only the top-k activating latents and zeroes out the rest, allowing the sparsity loss to be dropped entirely.[42][unreliable source?]
  • JumpReLU, defined as JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step function and θ is a learned threshold.[43][unreliable source?] Anthropic adopted this modification for training SAEs and crosscoders.[44][non-primary source needed]
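Minimal sketches of the two activation functions above; the value of k and the threshold are illustrative, and the straight-through gradient trick needed to train a JumpReLU threshold is omitted.

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero the rest."""
    values, indices = pre_acts.topk(k, dim=-1)
    latents = torch.zeros_like(pre_acts)
    return latents.scatter(-1, indices, values)

def jumprelu_activation(pre_acts: torch.Tensor, theta: float = 0.01) -> torch.Tensor:
    """z * H(z - theta): pass values through only above the threshold theta."""
    return pre_acts * (pre_acts > theta).to(pre_acts.dtype)
```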

Other architectural and loss function modifications include:

  • Gated SAEs, which separate the decision of which latents fire from the estimation of their magnitudes.[45]
  • Replacing the ℓ₁ sparsity penalty with a tanh-based penalty.[46]

Evaluation

The core metrics for evaluating SAEs are sparsity, measured by the ℓ₀-norm of the latent representations over the dataset, and fidelity, which may be the MSE reconstruction error as in the loss function or a downstream metric when substituting the SAE output into the model, such as loss recovered or KL-divergence from the original model behaviour.[47][unreliable source?]
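A sketch of the two core metrics computed from a batch of cached activations; the SAE interface follows the sketch above, and downstream metrics such as loss recovered would additionally require substituting the reconstruction back into the model.

```python
import torch

@torch.no_grad()
def sae_core_metrics(sae, acts: torch.Tensor):
    x_hat, z = sae(acts)

    # Sparsity: average number of non-zero latents per example (the L0 "norm").
    l0 = (z != 0).float().sum(dim=-1).mean()

    # Fidelity: fraction of variance in the activations left unexplained.
    mse = (acts - x_hat).pow(2).sum()
    variance = (acts - acts.mean(dim=0)).pow(2).sum()
    fvu = mse / variance

    return l0.item(), fvu.item()
```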

SAE latents are usually labelled using an autointerpretability pipeline. Most such pipelines feed highly-activating dataset exemplars (i.e. the inputs whose activations yield the largest latent value z_i for feature i, repeated over all features) to a large language model, which generates a natural-language description based on the contexts in which the latent is active.
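A sketch of the exemplar-gathering step of such a pipeline, given latent activations cached over a tokenised corpus; the prompt construction and the call to the explainer model are omitted.

```python
import torch

def top_activating_contexts(latent_acts: torch.Tensor, tokens: list[str],
                            feature_idx: int, n_examples: int = 20, window: int = 10):
    """Return the contexts around the tokens on which feature_idx fires most strongly.

    latent_acts: (n_tokens, n_latents) cached SAE latent activations.
    tokens: the corresponding decoded tokens.
    """
    values, positions = latent_acts[:, feature_idx].topk(n_examples)
    contexts = []
    for value, pos in zip(values.tolist(), positions.tolist()):
        left, right = max(0, pos - window), min(len(tokens), pos + window + 1)
        contexts.append((value, "".join(tokens[left:right])))
    # These (activation, context) pairs are then formatted into a prompt for an
    # explainer LLM, which proposes a natural-language label for the feature.
    return contexts
```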

Early works directly adapt Bills et al. (2023)'s neuron-labelling and evaluation pipeline and report higher interpretability scores than alternative methods (the standard basis, PCA, etc.).[10][11] However, this can produce misleading scores, since such explanations achieve high recall but usually low precision. Later works therefore introduced more nuanced evaluation metrics: neuron-to-graph explanations (or other approaches) that report both precision and recall,[42] and intervention-based metrics that measure the downstream effect of manipulating a latent feature.[48][unreliable source?][49][unreliable source?]

Transcoders

Transcoders are formulated identically to SAEs, with the caveat that they seek to approximate the input-output behaviour of a model component (usually the MLP).[50] This is useful for measuring how latent features in different layers of the model affect each other in an input-invariant manner (i.e. by directly comparing encoder and decoder weights). A transcoder thus computes the following:

z = ReLU(W_enc x + b_enc),   ŷ = W_dec z + b_dec,

where x is the component's input and ŷ approximates its output y, and is trained to minimise the loss:

L(x, y) = ‖y − ŷ‖² + λ‖z‖₁.

When ignoring or holding attention components constant (which may obscure some information), transcoders trained on different layers of a model can then be used to conduct circuit analysis without having to process individual inputs and collect latent activations, unlike SAEs.
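A sketch of the corresponding architecture, differing from the SAE sketch above only in its reconstruction target; this is illustrative, not a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Transcoder(nn.Module):
    """Like an SAE, but trained to map a component's input to its output."""
    def __init__(self, d_model: int, d_latent: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.l1_coeff = l1_coeff

    def loss(self, mlp_input: torch.Tensor, mlp_output: torch.Tensor):
        z = F.relu(self.encoder(mlp_input))
        y_hat = self.decoder(z)
        # The reconstruction targets the component's *output*, not its input.
        return F.mse_loss(y_hat, mlp_output) + self.l1_coeff * z.abs().sum(-1).mean()
```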

Transcoders generally outperform SAEs, achieving lower loss and better automated interpretability scores.[50]

Crosscoders and cross-layer transcoders

A disadvantage of single-layer SAEs and transcoders is that they produce duplicate features when trained on multiple layers, if those features persist throughout the residual stream. This complicates understanding layer-to-layer feature propagation and also wastes latent parameters. Crosscoders were introduced to enable cross-layer representation of features, which minimizes these issues.[51][non-primary source needed][52][unreliable source?] They outperform SAEs given the same feature budget but are worse on an equal FLOPs budget.

A crosscoder computes the cross-layer latent representation z using a set of layer-wise activations a¹, …, a^L over L layers obtained from some input as follows:

z = ReLU(Σ_l W_enc^l a^l + b_enc).

The reconstruction is done independently for each layer using this cross-layer representation:

â^l = W_dec^l z + b_dec^l.

Alternatively, the target may be layer-wise component outputs if using the transcoder objective. The model is then trained to minimise a loss of the form:

L = Σ_l ‖a^l − â^l‖² + λ Σ_i z_i Σ_l ‖W_dec^(l,i)‖,

in which the sparsity penalty on each latent is weighted by its per-layer decoder norms. Note that the regularisation term combines the per-layer decoder norms with an ℓ₁-norm (a plain sum over layers); the ℓ₂-norm is an alternative choice, considered but not used in the original paper.[51]
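A sketch of a crosscoder in PyTorch, with per-layer encoders summed into a shared latent code, per-layer decoders, and the decoder-norm-weighted sparsity penalty described above; hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Crosscoder(nn.Module):
    """Shared latent code over activations collected from several layers."""
    def __init__(self, n_layers: int, d_model: int, d_latent: int,
                 sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, d_latent, bias=False) for _ in range(n_layers)])
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.decoders = nn.ModuleList(
            [nn.Linear(d_latent, d_model) for _ in range(n_layers)])
        self.sparsity_coeff = sparsity_coeff

    def forward(self, acts_per_layer):      # list of (batch, d_model) tensors
        z = F.relu(sum(enc(a) for enc, a in zip(self.encoders, acts_per_layer))
                   + self.b_enc)
        recons = [dec(z) for dec in self.decoders]
        return z, recons

    def loss(self, acts_per_layer):
        z, recons = self(acts_per_layer)
        recon_loss = sum(F.mse_loss(r, a) for r, a in zip(recons, acts_per_layer))
        # Sparsity penalty weighted by each latent's summed per-layer decoder norms.
        dec_norms = sum(dec.weight.norm(dim=0) for dec in self.decoders)  # (d_latent,)
        sparsity = (z * dec_norms).sum(dim=-1).mean()
        return recon_loss + self.sparsity_coeff * sparsity
```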

Circuit tracing and automated graph discovery

Automated circuit discovery (ACDC) prunes the computational graph by iteratively patch-testing edges, localising minimal sub-circuits without manual inspection of each component.[53][unreliable source?]
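A schematic of the greedy pruning loop, heavily simplified relative to the published algorithm; the ablation callable is model- and task-specific and left abstract here.

```python
def prune_edges(edges, metric_with_edges_ablated, threshold: float):
    """Greedy ACDC-style sketch: drop an edge whenever ablating it (on top of
    the edges already dropped) changes the task metric by less than `threshold`.

    metric_with_edges_ablated: model-specific callable, omitted here, that runs
    the model with the given set of edges knocked out (e.g. by patching in
    corrupted activations) and returns the task metric.
    """
    removed = set()
    current = metric_with_edges_ablated(removed)
    for edge in edges:
        candidate = removed | {edge}
        new_metric = metric_with_edges_ablated(candidate)
        if abs(new_metric - current) < threshold:
            removed, current = candidate, new_metric    # edge is not needed
    return [e for e in edges if e not in removed]       # the recovered sub-circuit
```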

Circuit tracing substitutes parts of the model, in particular the MLP blocks, with more interpretable components called transcoders, with the goal of recovering explicit computational graphs. Like SAEs, circuit tracing uses sparse dictionary learning techniques; instead of reconstructing model activations, however, transcoders aim to predict the output of non-linear components given their input. The technique was introduced in the paper "Circuit Tracing: Revealing Computational Graphs in Language Models", published in April 2025 by Anthropic. Circuit tracing has been used to understand how a model plans the rhyme in a poem, performs medical diagnosis, and produces unfaithful chains of thought.[54]

References

  1. Olah, Chris (June 2022). "Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases". Transformer Circuits Thread. Anthropic. Retrieved 28 March 2025.
  2. Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan (2020). "Zoom In: An Introduction to Circuits". Distill. 5 (3). doi:10.23915/distill.00024.001.
  3. Elhage, Nelson; et al. (2021). "A Mathematical Framework for Transformer Circuits". Transformer Circuits Thread. Anthropic.
  4. Saphra, Naomi; Wiegreffe, Sarah (2024). "Mechanistic?". arXiv:2410.09087 [cs.AI].
  5. Olah, Chris; et al. (2018). "The Building Blocks of Interpretability". Distill. 3 (3). doi:10.23915/distill.00010.
  6. "Transformer Circuits Thread". transformer-circuits.pub. Retrieved 2025-05-12.
  7. Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; Drain, Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Johnston, Scott; Jones, Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, Dario; Brown, Tom; Clark, Jack; Kaplan, Jared; McCandlish, Sam; Olah, Chris (2022). "In-context Learning and Induction Heads". arXiv:2209.11895 [cs.LG].
  8. Elhage, Nelson; Hume, Tristan; Olsson, Catherine; Schiefer, Nicholas; Henighan, Tom; Kravec, Shauna; Hatfield-Dodds, Zac; Lasenby, Robert; Drain, Dawn; Chen, Carol; Grosse, Roger; McCandlish, Sam; Kaplan, Jared; Amodei, Dario; Wattenberg, Martin; Olah, Christopher (2022). "Toy Models of Superposition". arXiv:2209.10652 [cs.LG].
  9. Nanda, Neel; Chan, Lawrence; Lieberum, Tom; Smith, Jess; Steinhardt, Jacob (2023). "Progress measures for grokking via mechanistic interpretability". ICLR.
  10. Cunningham, Hoagy; Ewart, Aidan; Riggs, Logan; Huben, Robert; Sharkey, Lee (May 7–11, 2024). "Sparse Autoencoders Find Highly Interpretable Features in Language Models". The Twelfth International Conference on Learning Representations (ICLR 2024). Vienna, Austria: OpenReview.net. Retrieved 2025-04-29.
  11. Bricken, Trenton; et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning". Transformer Circuits Thread. Retrieved 2025-04-29.
  12. Sullivan, Mark (2025-04-22). "This startup wants to reprogram the mind of AI—and just got $50 million to do it". Fast Company. Archived from the original on 2025-05-04. Retrieved 2025-05-12.
  13. "ICML 2024 Mechanistic Interpretability Workshop". icml2024mi.pages.dev. Retrieved 2025-05-12.
  14. @ch402 (July 29, 2024). "I was motivated by many of my colleagues at Google Brain being deeply skeptical of things like saliency maps. When I started the OpenAI interpretability team, I used it to distinguish our goal: understand how the weights of a neural network map to algorithms" (Tweet) via Twitter.
  15. Nanda, Neel (January 31, 2023). "Mechanistic Interpretability Quickstart Guide". Neel Nanda. Retrieved 28 March 2025.
  16. "Mechanistic Interpretability Workshop 2024". 2024.
  17. Mikolov, Tomas; et al. (2013). "Linguistic Regularities in Continuous Space Word Representations". Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia: Association for Computational Linguistics. pp. 746–751. Retrieved 2025-07-01.
  18. Park, Kiho; Choe, Yo Joong; Veitch, Victor (2024). "The Linear Representation Hypothesis and the Geometry of Large Language Models". ICML.
  19. Olah, Chris; Jermyn, Adam (2024). "Circuits Updates - July 2024: What is a Linear Representation? What is a Multidimensional Feature?". Transformer Circuits Thread. Anthropic.
  20. Sharkey et al. (2025), p. 11.
  21. Black, Sid; et al. (2022). "Interpreting Neural Networks through the Polytope Lens". arXiv:2211.12312 [cs.LG].
  22. Csordás, Róbert; Potts, Christopher; Manning, Christopher D.; Geiger, Atticus (2024). "Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations". arXiv:2408.10920 [cs.LG].
  23. Scherlis, Adam (2025). "Polysemanticity and Capacity in Neural Networks". arXiv:2210.01892 [cs.NE].
  24. Bereska, Leonard; Gavves, Efstratios (2024). "Mechanistic Interpretability for AI Safety -- A Review". TMLR.
  25. Alain, Guillaume; Bengio, Yoshua (2018). "Understanding intermediate layers using linear classifier probes". arXiv:1610.01644 [stat.ML].
  26. Marks, Samuel; Tegmark, Max (2024). "The Geometry of Truth: Emergent Linear Structure in LLM Representations". arXiv:2310.06824 [cs.AI].
  27. Arditi, Andy; Obeso, Oscar; Syed, Aaquib; Paleka, Daniel; Panickssery, Nina; Gurnee, Wes; Nanda, Neel (2024). "Refusal in Language Models Is Mediated by a Single Direction". arXiv:2406.11717 [cs.LG].
  28. Tigges, Curt; Hollinsworth, Oskar John; Geiger, Atticus; Nanda, Neel (2023). "Linear Representations of Sentiment in Large Language Models". arXiv:2310.15154 [cs.LG].
  29. Sharkey et al. (2025), p. 16.
  30. Vig, Jesse; et al. (2020). "Investigating Gender Bias in Language Models Using Causal Mediation Analysis". Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Retrieved 2025-07-02.
  31. Meng, Kevin; et al. (2022). "Locating and Editing Factual Associations in GPT". Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Retrieved 2025-07-02.
  32. Geiger, Atticus; Richardson, Kyle; Potts, Christopher (2020). "Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation". arXiv:2004.14623 [cs.CL].
  33. Geiger, Atticus; et al. (2025). "Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability". JMLR. 26 (83): 1–64.
  34. Wang, Kevin; et al. (2022). "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small". arXiv:2211.00593 [cs.LG].
  35. Zhang, Fred; Nanda, Neel (2024). "Towards Best Practices of Activation Patching in Language Models: Metrics and Methods". International Conference on Learning Representations (ICLR). Retrieved 16 August 2025.
  36. Sharkey et al. 2025, p. 17.
  37. Nanda, Neel (2023). "Attribution Patching: Activation Patching At Industrial Scale".
  38. Syed, Aaquib; Rager, Can; Nanda, Neel (2023). "Attribution Patching Outperforms Automated Circuit Discovery". arXiv:2310.10348 [cs.LG].
  39. Kramár, János; et al. (2024). "AtP*: An efficient and scalable method for localizing LLM behaviour to components". arXiv:2403.00745 [cs.LG].
  40. Sundararajan, Mukund; et al. (2017). "Axiomatic Attribution for Deep Networks". arXiv:1703.01365 [cs.LG].
  41. Sharkey et al. 2025, p. 8.
  42. Gao, Leo; et al. (2024). "Scaling and evaluating sparse autoencoders". arXiv:2406.04093 [cs.LG].
  43. Rajamanoharan, Senthooran; et al. (2024). "Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders". arXiv:2407.14435 [cs.LG].
  44. Conerly, Tom; et al. (2024). "Circuits Updates - January 2025: Dictionary Learning Optimization Techniques". Transformer Circuits Thread. Anthropic. Retrieved 2025-07-01.
  45. Rajamanoharan, Senthooran; et al. (2024). "Improving Dictionary Learning with Gated Sparse Autoencoders". arXiv:2404.16014 [cs.LG].
  46. Jermyn, Adam; et al. (2024). "Circuits Updates - February 2024: Tanh Penalty in Dictionary Learning". Transformer Circuits Thread. Anthropic. Retrieved 2025-07-01.
  47. Karvonen, Adam; et al. (2025). "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability". arXiv:2503.09532 [cs.LG].
  48. Paulo, Gonçalo; et al. (2024). "Automatically Interpreting Millions of Features in Large Language Models". arXiv:2410.13928 [cs.LG].
  49. Wu, Zhengxuan; et al. (2025). "AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders". arXiv:2501.17148 [cs.CL].
  50. Dunefsky, Jacob; et al. (December 10–15, 2024). "Transcoders find interpretable LLM feature circuits". Advances in Neural Information Processing Systems 38 (NeurIPS 2024). Vancouver, BC, Canada. Retrieved 2025-04-29.
  51. Lindsey, Jack; et al. (2024). "Sparse Crosscoders for Cross-Layer Features and Model Diffing". Transformer Circuits Thread. Anthropic. Retrieved 2025-04-30.
  52. Gorton, Liv (2024). "Group Crosscoders for Mechanistic Analysis of Symmetry". arXiv:2410.24184 [cs.LG].
  53. Conmy, Arthur (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability". arXiv:2304.14997 [cs.LG].
  54. "Circuit Tracing: Revealing Computational Graphs in Language Models". Transformer Circuits. Retrieved 2025-06-30.

Sources

  • Sharkey, Lee; et al. (2025). "Open Problems in Mechanistic Interpretability". arXiv:2501.16496 [cs.LG].