With dictionary learning, we dig into a small transformer trained on a synthetic task and uncover a number of human-understandable, fine-grained information flows inside it.

An informal post to share our recent progress on circuit discovery with dictionary learning. We recommend readers see our first arXiv version for a cleaner demonstration (more experiments coming up). Open-source code will be published at our GitHub repository.

Introduction

In recent years, advances in transformer-based language models have sparked interest in better understanding the internal computational workings of these systems. Researchers have made some progress in identifying interpretable circuits and algorithms within GPT-2, but much of these models' broad language generation capability remains opaque. The emerging field of Mechanistic Interpretability aims to reverse engineer neural networks in order to map their internal components to understandable computational primitives. By decomposing these black-box systems into basic building blocks carrying out discrete functions, the goal is to shed light on how complex behaviors like language modeling emerge from combinations of simple computational elements.

In the mechanistic view of transformers, understanding model activations is a central task. Activations answer an important question in mechanistic interpretability: what high-level features does the model compute? Recent advances in sparse dictionary learning have opened up new possibilities for extracting more interpretable, monosemantic features out of superposition. The linear representation hypothesis lets researchers reason about feature superposition and tackle the curse of dimensionality. By learning sparse dictionaries that decompose activations into semantically meaningful directions in the representation space, we can gain a more microscopic view of model representations.

This work proposes a circuit discovery framework utilizing sparse dictionaries to decompose activation spaces into interpretable information flows that can be traced through a subset of layers or end-to-end in the model. Our framework aims to answer three questions:

Taken together, these questions cover almost every property of a model. They are, however, somewhat orthogonal in our research agenda. Developing dictionary training techniques and better methods for interpreting dictionary features is crucial for the first question; prior work has made some advances on this topic. The second question is comparatively less discussed in the literature. The last can be addressed with existing techniques in Mech Interp.

We apply our theoretical framework to analyze a decoder-only transformer trained on a synthetic task based on the game Othello. Experiments provide concrete evidence that dictionary learning can extract interpretable features and improve end-to-end circuit discovery. Moreover, we are able to determine how a given feature is activated by its lower-level computations, which has been challenging for existing mechanistic interpretability methods like probing and patching.

Summary of Results

Our Theoretical Framework

Sparse dictionary learning has shown great potential for extracting monosemantic features from transformers in an unsupervised manner. It works to an unprecedented extent at multiple positions along the residual stream and across multiple model sizes. We claim that dictionary learning also opens new possibilities for circuit discovery: we can start from any feature (or from the model output) and recursively trace down to the input embedding to find one (or a group of) local (or end-to-end) circuits.

We follow A Mathematical Framework of Transformer Circuits and divide transformers into three main components: QK, OV, and MLP. We correspondingly answer the following questions:

Interpretable Sparse Coding with Dictionary Learning

If the internal structure of Transformers were more interpretation-friendly, with linear interpretable features neatly corresponding to neurons and their activation strengths, it would be straightforward to analyze each neuron's purpose based on when it activates, and to understand the model's reasoning by looking at each neuron's activation value for a given output. This assumption has major limitations, but interpretable neurons can still be found in many language models, vision models, and multimodal models. From the most intuitive perspective, even without considering any high-level features, if the vocabulary size is greater than the hidden dimension, the model has to cram more features than its dimensionality into a limited space. The superposition hypothesis assumes that features are "squeezed" into this crowded space in an interfering way. In this case, there must exist at least one feature whose direction in the representational space does not align with any single neuron.

Furthermore, prior work on Privileged Bases suggests models represent features through a mixture of:

Due to the aforementioned issues with understanding neural network internals from the neuron perspective, there is a need for a more general approach to find an "interpretable basis" consisting of these explainable directions, which are quite likely to be overcomplete. Moreover, a major driver behind superposition is the sparsity of features, an important property often present in real-world tasks. Utilizing sparse dictionary learning to extract features therefore aligns well with these two properties: overcompleteness and sparsity. Overall, the goal of sparse dictionary learning is to find, through an autoencoder, a set of overcomplete bases \mathbf{d} such that any activation \mathbf{x} at a given model location can be decomposed into a sparse weighted sum over this set of bases:

\mathbf{x} \approx \sum_{i} c_i \mathbf{d}_i, \quad \text{s.t.} \quad \min \lVert \mathbf{c} \rVert_0, \quad c_i \ge 0

c_i is the activation magnitude of the i-th dictionary feature. By constraining the sparsity of the activations over the dictionary features, the dictionary is forced to find the fundamental features implicitly contained in the representation and to compute the sparsest (under a given metric) composition over this set of features that reconstructs the given representation.
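
To make this concrete, below is a minimal PyTorch sketch of a sparse autoencoder of the kind described above. The class name, the unit-norm constraint on decoder columns, and the L1 coefficient are illustrative assumptions, not our exact training recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in: int, n_components: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, n_components)
        self.decoder = nn.Linear(n_components, d_in)

    def forward(self, x):
        c = F.relu(self.encoder(x))                  # non-negative activation magnitudes c_i
        d = F.normalize(self.decoder.weight, dim=0)  # unit-norm feature directions d_i (columns)
        x_hat = c @ d.T + self.decoder.bias          # reconstruction of x as sum_i c_i d_i
        return x_hat, c

def sae_loss(x, x_hat, c, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty, a convex proxy for the L0 sparsity objective.
    return (x - x_hat).pow(2).sum(-1).mean() + l1_coef * c.abs().sum(-1).mean()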

For a model activation captured at a given position in the transformer, we can decompose it into a weighted sum of a group of (more) interpretable dictionary features:

Where Should We Train Dictionaries?

Viewing the residual stream as the memory management center of the transformer, with each Attention and MLP module reading from and writing to it, is an important concept for understanding how information flows in transformers. Previous work based on dictionary learning has typically studied word embeddings, residual streams, and MLP hidden layers. Here is a brief commentary on these works:

Based on the above analysis, we believe it could be beneficial to use dictionary learning to decompose the following three parts: word representations, the output of each Attention layer, and the output of each MLP layer. Although there has already been considerable Mechanistic Interpretability work analyzing Attention heads compared to MLPs, we think incorporating them into a unified dictionary learning framework is necessary. This setting would be helpful for understanding Transformers in a systematic and scalable way.

Module Input Decomposition

As shown above, the input x of any Attention or MLP block can be linearized as the sum of the outputs of all modules below it. For example, the input of L1M (the MLP block in layer 1) can be decomposed into 4 parts:

x_{\text{L1M}} = \text{LN}(\textbf{Embed} + \textbf{Out}_{\text{L0A}} + \textbf{Out}_{\text{L0M}} + \textbf{Out}_{\text{L1A}})

Dealing with Non-linearity of LayerNorm

In prevalent transformer architectures, LayerNorm is applied to the copy of the residual stream that each module reads, i.e. pre-norm. Although the input to each module can be linearly decomposed into the sum of the outputs of all modules below it, LayerNorm itself is not a linear operation. This prevents us from attributing a given consequence to each linear component, which is an important issue to resolve for the analyses that follow.

x = x - x.mean()       # Linear Operation
x = x * (1 / x.std())  # Non-Linear Operation !!
x = ln.w @ x + ln.b    # Linear Operation

The above pseudocode describes the computation of LayerNorm, where the step of calculating the standard deviation is non-linear in the input x. To address this, we treat the standard deviation of x as a constant rather than a function of x. This turns LayerNorm into a linear function of x without changing the computed result. With this transformation, we can apply the modified LayerNorm separately to any linear decomposition of x to estimate the impact of each component on the result.
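
A minimal sketch of this trick, assuming an elementwise LayerNorm scale and bias (variable names are ours):

import torch

def layernorm_per_component(components, ln_w, ln_b, eps=1e-5):
    # components: list of (d_model,) tensors whose sum is the module input x.
    # ln_w, ln_b: LayerNorm scale and bias, shape (d_model,).
    x = torch.stack(components).sum(0)
    std = (x - x.mean()).pow(2).mean().add(eps).sqrt()  # treated as a constant w.r.t. x

    # With std frozen, centering and scaling are linear, so they distribute over components.
    parts = [ln_w * (c - c.mean()) / std for c in components]

    # Sanity check: the parts (plus the shared bias) sum back to LayerNorm(x).
    assert torch.allclose(sum(parts) + ln_b, ln_w * (x - x.mean()) / std + ln_b, atol=1e-5)
    return parts

Each element of parts is then the linear contribution of the corresponding component to the normalized module input.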

Specifically, when we are interested in the final LayerNorm before the unembedding, this technique can be used to analyze the impact of each module on the output logits. This technique is called Direct Logit Attribution and has been applied in quite a few Mech Interp works.

Dissecting OV Circuits

Each attention head transfers information from token j to token i through the following process:

\textbf{OV}^h_{i \gets j} = \textbf{AttnPattern}^h_{ij}W_O^hW^h_Vx_j

The superscript h indicates that the corresponding parameters/activations are independent for each head. \textbf{AttnPattern}^h_{ij} gives the weight coefficient for information transfer, and W^h_OW^h_V are the OV weight matrices of this attention head, which can be intuitively understood as additional processing applied when transferring information. x_j is the input to the attention module at token j, and is the same for every head.

Due to the independent additivity of multi-head attention, the output of the attention module at token i can be expressed as:

\textbf{Out}_{\text{LXA},i} = \sum_h \sum_j \textbf{OV}^h_{i \gets j} = \sum_h \sum_j \textbf{AttnPattern}^h_{ij}W_O^hW^h_Vx_j

Therefore, the output of an attention module at token i is the sum of the outputs of its heads, and the output of each head is in turn the sum of the OV outputs computed by that head over every token in the context. Furthermore, we can decompose the input to this module at each token into the sum of the outputs of all modules below it:

\textbf{Out}_{\text{LXA},i} = \sum_h \sum_j \textbf{AttnPattern}^h_{ij}W_O^hW^h_V \text{LN}(\sum_{m \in\text{Bottom Modules of LXA}}\textbf{Out}_m)

Moreover, we decompose the output of each module into a weighted sum of dictionary features:

\textbf{Out}_{\text{LXA},i} \approx \sum_h \sum_j \textbf{AttnPattern}^h_{ij}W_O^hW^h_V \text{LN}(\sum_{m \in\text{Bottom Modules of LXA}}(\sum_{k\in {\text{Dict }m}} {c^m_k \mathbf{d}^m_k}))

Each dictionary decomposition consists of activation magnitudes c^{m}_k multiplying unit vectors \mathbf{d}^{m}_k that represent the feature directions.

The dictionary encoder of LXA takes in \textbf{Out}_{\text{LXA},i} and applies a linear map \{ W^{\text{LXA}}_{e, Y}, \mathbf{b}^{\text{LXA}}_{e, Y}\} to obtain the activation magnitude of feature Y:

\tilde{\mathbf{c}}_{\text{LXAY}, i} \approx W^{\text{LXA}}_{e, Y} \sum_h \sum_j \textbf{AttnPattern}^h_{ij}W_O^hW^h_V \text{LN}(\sum_{m \in\text{Bottom Modules of LXA}}(\sum_{k\in {\text{Dict }m}} {c^m_k \mathbf{d}^m_k})) + \mathbf{b}^{\text{LXA}}_{e, Y}

By utilizing the linearized LayerNorm introduced in the previous section, we can attribute the activation magnitude of the Y-th dictionary feature at the i-th token to all dictionary features below LXA, across all tokens.
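
As a sketch of this attribution for a single head (row-vector convention, encoder and LayerNorm biases omitted, names are ours):

import torch

def ov_feature_attribution(attn_pattern, W_V, W_O, src_feats, ln_w, stds, w_enc_Y):
    # attn_pattern: attention weights of one head from destination token i, shape (seq,)
    # W_V: (d_model, d_head), W_O: (d_head, d_model) for this head
    # src_feats[j]: (n_feats_j, d_model) rows of c_k * d_k for bottom-module features at token j
    # ln_w: LayerNorm scale of LXA; stds[j]: frozen LayerNorm std at token j
    # w_enc_Y: Y-th row of the LXA dictionary encoder, shape (d_model,)
    contributions = []
    for j, feats in enumerate(src_feats):
        centered = feats - feats.mean(dim=-1, keepdim=True)
        ln_feats = ln_w * centered / stds[j]               # linearized LayerNorm
        moved = attn_pattern[j] * (ln_feats @ W_V @ W_O)   # information moved through the OV circuit
        contributions.append(moved @ w_enc_Y)              # per-feature contribution to feature Y at token i
    return contributions  # list over source tokens j, each of shape (n_feats_j,)

Summing these contributions over all heads, source tokens, and features (plus the omitted biases) recovers the encoder pre-activation \tilde{\mathbf{c}}_{\text{LXAY}, i} from the formula above.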

Attributing Attention Patterns to Dictionary Features

In LXA, each head determines what proportion of the attention from token i is assigned to token j through the following process:

\textbf{AttnPattern}_{ij} = \text{Softmax}(x_iW_QW^\mathrm{T}_Kx^\mathrm{T})_j

x\in \mathrm{R}^{L \times D} is the input of all tokens to LXA, and x_i\in \mathrm{R}^{1 \times D} is the input of token i to LXA. W_Q, W_K \in \mathrm{R}^{D \times d} form the QK circuit of the given attention head. D and d stand for the hidden dimensions of the model and the head, respectively. We omit the superscript indicating the head index since the QK circuit is independent in each head of LXA.

We call the term before the Softmax, x_iW_QW^\mathrm{T}_Kx^\mathrm{T}, the Attention Score. As an important nonlinear operation, Softmax normalizes the Attention Score, so any change in one element of the attention scores affects the attention strengths across the entire sequence, which makes them difficult to explain in full. We therefore settle for a more qualitative question: for a given attention head in LXA that assigns attention between two tokens, which pair of features in their respective residual streams has generated a stronger "resonance"?

In particular, the input of each token to LXA can be decomposed into dictionary features of the modules below it in its own residual stream, i.e.

x_{\text{LXA}} = \text{LN}(\sum_{m \in\text{Bottom Modules of LXA}}\textbf{Out}_m) \approx \text{LN}(\sum_{m \in\text{Bottom Modules of LXA}}(\sum_{k\in {\text{Dict }m}} {c^m_k \mathbf{d}^m_k}))

By further indexing all dictionary features of all bottom modules with a single subscript (s for token i, t for token j), we obtain a clear bilinear form for dissecting \textbf{AttnScore}_{ij} (with linearized LayerNorm):

\textbf{AttnScore}_{ij} \approx \underbrace{\text{LN}(\sum_s c_s \mathbf{d}_s)}_{\text{Residual Stream of the } i\text{-th token}}W_QW^\mathrm{T}_K\underbrace{\text{LN}(\sum_t c_t \mathbf{d}_t)^\mathrm{T}}_{\text{Residual Stream of the } j\text{-th token}}

Thus, for any given \textbf{AttnScore}_{ij}, we can decompose it into a sum of contributions from feature pairs in the two corresponding residual streams.
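
A sketch of this bilinear decomposition for one head (linearized LayerNorm, biases omitted, names are ours):

import torch

def attn_score_pair_contributions(feats_i, feats_j, W_Q, W_K, ln_w, std_i, std_j):
    # feats_i: (n_i, d_model) rows of c_s * d_s in token i's residual stream; feats_j likewise.
    # W_Q, W_K: (d_model, d_head) query/key weights of one head (row-vector convention).
    def linear_ln(feats, std):
        centered = feats - feats.mean(dim=-1, keepdim=True)
        return ln_w * centered / std                      # frozen-std LayerNorm

    q_side = linear_ln(feats_i, std_i) @ W_Q              # (n_i, d_head)
    k_side = linear_ln(feats_j, std_j) @ W_K              # (n_j, d_head)
    return q_side @ k_side.T                              # (n_i, n_j) pairwise contributions

Summing all entries of the returned matrix approximately recovers \textbf{AttnScore}_{ij}; the largest entries point to the feature pairs with the strongest "resonance".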

Attributing MLP features

The MLP module accounts for over half of the parameters in Transformer models, yet research on interpreting MLPs is significantly scarcer than research on attention modules. This may be because the MLP itself has a simpler form, while neuron-based analysis has certain limitations. Under the framework of dictionary learning, we appear to be able to understand more about the internal features of the MLP.

The output of LXM can be written as follows:

\textbf{Out}_{\text{LXM}} = \text{MLP}_{\text{X}}(x_\text{LXM}) \approx \text{MLP}_{\text{X}}(\text{LN}(\sum_{m \in\text{Bottom Modules of LXM}}(\sum_{k\in {\text{Dict }m}} {c^m_k \mathbf{d}^m_k})))

Again, by viewing all dictionary features in one residual stream as a whole, we obtain the activation of the Y-th dictionary feature of the layer-X MLP:

c_{\text{LXMY}} = \text{ReLU}(W^{\text{LXM}}_{e, Y} \text{MLP}_{\text{X}}(\text{LN}(\sum_s c_s \mathbf{d}_s)) + \mathbf{b}^{\text{LXM}}_{e, Y})

W^{\text{LXM}}_{e, Y} is the Y-th row of LXM dictionary encoder. \mathbf{b}^{\text{LXM}}_{e, Y} is its corresponding encoder bias.

We conjecture that a given MLP feature is activated by a small subset of lower-level features. To verify this, we need to measure the contribution of each lower-level feature to the MLP feature. If this conjecture holds, it would also be instructive to identify these core contributors.

Definition

We define the approximate direct contribution \textbf{ADC} of each feature c_s \mathbf{d}_s composing x to the output \textbf{Out}:

\textbf{ADC}(c_s \mathbf{d}_s) = W^{\text{LXM}}_{e, Y} W_{out}(\underbrace{(W_{in}c_s \mathbf{d}_s)}_{\text{Dictionary Feature}}\cdot\overbrace{\sigma(W_{in}x)}^{\text{Leave MLP input unchanged for }\sigma})

We omit LayerNorm in the definition above for simplicity, since it can be linearized. The dictionary feature of an MLP output can then be written as:

c_{\text{LXMY}} = \text{ReLU}(\sum_s \textbf{ADC}(c_s \mathbf{d}_s) + \mathbf{b}^{\text{LXM}}_{e, Y})

The intuition behind \textbf{ADC} is simple: the non-linear MLP is transformed into a linear function of the contributions of input features. Just as we deal with LayerNorm, we treat the input of the gating function as a constant and treat the remaining factor as a linear function of the input features.
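
A sketch of this computation for a single input feature, in the column-vector convention of the formula above (LayerNorm and MLP biases omitted; sigma is the gating part of the activation, e.g. the Gaussian CDF for the GELU used in this model):

import torch

def approximate_direct_contribution(c_d, x, W_in, W_out, w_enc_Y, sigma):
    # c_d:     one input feature vector c_s * d_s, shape (d_model,)
    # x:       the full (normalized) MLP input, shape (d_model,)
    # W_in:    (d_mlp, d_model), W_out: (d_model, d_mlp)
    # w_enc_Y: Y-th row of the LXM dictionary encoder, shape (d_model,)
    gate = sigma(W_in @ x)              # gate computed from the unchanged MLP input, treated as a constant
    hidden = (W_in @ c_d) * gate        # linear in the input feature
    return w_enc_Y @ (W_out @ hidden)   # scalar contribution to c_LXMY (before ReLU and encoder bias)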

Properties

The non-linearity introduced by the activation function Act_fn in MLP hidden layers is key to interpretability. We find that prevalent activation functions can all be written in a self-gating form x\cdot\sigma(x): for ReLU, \sigma(x) is the step function; for SiLU it is the sigmoid function; for GELU it is the cumulative distribution function of a Gaussian. These gating functions all take values in [0, 1] and are monotonically non-decreasing.

\text{Act\_fn}(W_{in}x) = (W_{in}x) \cdot \sigma(W_{in}x)
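
For instance, a quick numerical check of this self-gating view for common activations (the gate functions here are written out explicitly for illustration):

import torch
import torch.nn.functional as F

def relu_gate(x): return (x > 0).float()                       # step-function gate
def silu_gate(x): return torch.sigmoid(x)                      # SiLU / Swish gate
def gelu_gate(x): return 0.5 * (1 + torch.erf(x / 2 ** 0.5))   # Gaussian CDF gate

x = torch.linspace(-3, 3, 13)
assert torch.allclose(x * relu_gate(x), F.relu(x))
assert torch.allclose(x * silu_gate(x), F.silu(x), atol=1e-6)
assert torch.allclose(x * gelu_gate(x), F.gelu(x), atol=1e-6)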

We consider the effect of input features on each MLP neuron, since the activation of any dictionary feature is a linear function of the MLP neuron activations.

If we consider the contribution of a single input feature c_s \textbf{d}_s to the output feature c_{\text{LXMY}}, there are only two sources of positive contribution: neurons with a positive (linear) effect on c_{\text{LXMY}} being activated by c_s \textbf{d}_s, and neurons with a negative (linear) effect on c_{\text{LXMY}} being suppressed by c_s \textbf{d}_s.

Approximate direct contribution can capture the former type of contribution, since for any monotonically non-decreasing non-negative self-gating function \sigma(x), \frac{\partial}{\partial x} x\sigma(x) is always positive. Specifically, if an input feature c_s \textbf{d}_s adds onto a pre-activation MLP neuron, and that neuron is activated via \sigma(x) and contributes to some output feature c_{\text{LXMY}}, then this neuron propagates a contribution from c_s \textbf{d}_s to c_{\text{LXMY}}. Summing such contributions (which can be positive or negative) across all neurons yields \textbf{ADC}(c_s \textbf{d}_s).

However, the latter type of contribution, suppressing neurons that have a negative effect, is not captured well by \textbf{ADC}: in this case, the corresponding terms in \textbf{ADC}(c_s \textbf{d}_s) are always zero.

Experiments

Game of Othello

We used a 1.2M-parameter decoder-only Transformer to learn a synthetic next-move prediction task based on the game Othello. The model only learns to play legal moves, not tactics.

The rules of Othello are as follows: two players compete using 64 identical game pieces ("disks") that are light on one side and dark on the other. Each player chooses one color to use throughout the game. Players take turns placing one disk on an empty tile with their assigned color facing up. After a play is made, any disks of the opponent's color that lie in a straight line bounded by the disk just played and another disk of the current player's color are turned over.

The figure above shows the progress of a game. (There is a long-standing plotly bug where parts of the canvas are not cleared when the animation goes back to previous frames.) The board starts with 4 pieces placed in an alternating pattern. The two sides keep placing pieces until the entire board is filled. (There are rare cases where the board is not completely filled but no more moves can be made; we do not consider such cases here.) The goal is to occupy more positions on the board at the end.

As shown in the figure, there are 60 empty squares on the board, so a game lasts 60 moves in total. By recording the position of each move, we can represent a game as a sequence of length 60:

[37, 29, 18, 45, 22, 26, 19, 12, 54, 53, 25, 44, 38, 21, 62, 34, 46, 63, 13, 55, 42, 30, 23, 17, 5, 39, 11, 15, 16, 60, 7, 20, 43, 24, 31, 61, 32, 4, 47, 2, 10, 52, 51, 9, 0, 14, 33, 58, 59, 41, 49, 50, 3, 6, 57, 48, 56, 1, 8, 40]

By sampling a position from the set of valid moves at each step, we can generate millions of such game records. Our setup is to model these sequences auto-regressively: just as a language model predicts the probability of the next word, this task models the probability of the next legal move.
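
As an illustration of this data-generation step, here is a minimal sketch of legal-move enumeration under the flipping rule (the board encoding and function names are our own; the actual dataset generation may differ):

DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def legal_moves(board, color):
    # board: 8x8 list with 0 = empty, 1 = dark, -1 = light; color: the player to move (1 or -1).
    moves = []
    for r in range(8):
        for c in range(8):
            if board[r][c] != 0:
                continue
            for dr, dc in DIRS:
                rr, cc, seen_opponent = r + dr, c + dc, False
                # Walk over a contiguous run of opponent disks...
                while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -color:
                    rr, cc, seen_opponent = rr + dr, cc + dc, True
                # ...that must be bounded by one of the current player's disks.
                if seen_opponent and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == color:
                    moves.append(r * 8 + c)   # encode a square as an index in 0-63
                    break
    return moves

Sampling uniformly from legal_moves at each step (and flipping the captured disks) yields game records of the kind described above.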

This task was originally proposed in this ICLR 2023 spotlight paper. The main idea of the original work was to test for the formation of world models on this task. Through clever probing and interventions, the authors provided strong evidence for world model formation: the model first computes the state of the board and makes decisions accordingly, rather than simply memorizing surface patterns. Such a task is more complex than earlier addition, multiplication, and division tasks, yet simpler than language modeling.

One ingenious aspect of this task is that the input sequence itself provides very little information: only the order of moves, without the current state of the board. In addition, the model has no prior knowledge about the board or rules - it does not know that the input sequence unfolds with alternating players, nor the mapping between the input sequence and board positions. Given such difficult conditions, it is remarkable that the model can complete this task. Even when humans know the real-world meaning of the sequence, figuring out the next valid move requires recursively simulating the process and careful deduction. The Transformer's computational resources are fixed, so it cannot explicitly carry out such recursive reasoning. Therefore, even without taking this task as a starting point for understanding large language models, fully understanding its principles can provide great insight into the inner mechanisms of Transformers.

Model Configuration

We focus on a decoder-only Transformer. The model configuration is shown below:

{
    'act_fn': 'gelu',
    'd_head': 16,
    'd_mlp': 512,
    'd_model': 128,
    'd_vocab': 61,
    'd_vocab_out': 61,
    'n_ctx': 60,
    'n_heads': 8,
    'n_layers': 6,
    'n_params': 1179648,
    'normalization_type': 'PreLayerNorm',
    'positional_embedding_type': 'learned',
    'p_dropout': 0.0
}

Although previous work has open-sourced the model architecture and parameters, this article makes some modifications for the following reasons:

Dictionary Learning Experiments

The model has a total of 12 Attn/MLP modules. We train a dictionary for the output of each module, where the input dimension of the dictionary is always d_in=128 and the hidden layer has n_components=1024. We sample 4e8 sequences for training the dictionaries, i.e. 4e8 * 60 = 2.4e10 tokens. For each token, we feed it together with its context into the Transformer and record the output of each module. Although each token depends on its context, we treat the representations obtained for each token as completely independent during training; we shuffle and sample them to train the dictionaries to reconstruct each module's output.
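
A rough sketch of this pipeline, reusing the SparseAutoencoder and sae_loss sketched earlier and assuming a TransformerLens-style run_with_cache interface (sample_games, model, and the hook name are placeholders):

import torch

sae = SparseAutoencoder(d_in=128, n_components=1024)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for games in sample_games(batch_size=256):               # tensors of shape (256, 60)
    with torch.no_grad():
        _, cache = model.run_with_cache(games)            # forward pass, caching module outputs
    acts = cache["blocks.3.hook_mlp_out"]                  # e.g. the L3M output, shape (256, 60, 128)
    acts = acts.reshape(-1, 128)
    acts = acts[torch.randperm(acts.shape[0])]             # treat tokens as independent samples
    x_hat, c = sae(acts)
    loss = sae_loss(acts, x_hat, c)
    opt.zero_grad(); loss.backward(); opt.step()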

The figure below shows the 2-norm of the average input representation for each layer. The almost invisible error bars indicate the average reconstruction error. The dictionaries can reconstruct the output of each module with almost no loss:

After training, we have 12 dictionaries corresponding, from bottom to top, to the layer-0 Attn through the layer-5 MLP. We denote them L0A through L5M for the Attn and MLP modules of layers 0-5.

For each dictionary, we compute the activation level of the features. Under the assumption that dictionary learning can extract meaningful (maybe not human-understandable) features, there is feature superposition across all layers in this model:

Since the world that needs to be modeled for this task is much simpler than language, while its hidden dimension d_model differs from language models by only 1-2 orders of magnitude, superposition is expected to be even more severe in real language models. In addition, techniques like resampling dead dictionary neurons can extract even more features and provide finer-grained interpretability.

The figure above describes the over-completeness of features inside the model. Another important property of the internal features is sparsity. The figure below shows the average number of activated features per token in each dictionary. In this model, the output of all layers can be reconstructed with fewer features than the hidden dimension:

For the same reason, features in real-world LMs should be even sparser.

Feature Interpretation

For a given input sequence, we can determine a unique board state.

[37, 29, 18, 45, 22, 26, 19, 12, 54, 53, 25, 44, 38, 21, 62, 34, 46, 63, 13, 55, 42, 30, 23, 17]

The figure below shows the board state corresponding to the above sequence:

In the original work, the authors trained a separate probe classifier for each board position and each residual stream in the model to classify the current color of the position (black, white, or empty) and determine whether knowledge about the current board state is represented internally. They found that a linear classifier has an error rate of around 20%, which is relatively high, while nonlinear classifiers reduce the error rate to 1.7%. The conclusion of the original work was thus that world knowledge exists internally in a nonlinear representation. Subsequent work found this conclusion to be incomplete: if the probe target is changed from (black/white/empty) to (same color as the current piece / different color / empty), linear classifiers also achieve good performance. This observation provides profound insight into the linearity assumption of features, which we elaborate on in the Dictionary Learning and Probing section.

Based on the "Mine vs. Theirs" perspective, we can better understand the dictionary features introduced below.

For the features extracted by dictionary learning, we only examine the "active features" whose activation frequency exceeds a certain threshold. We denote these active features as follows: the Y-th feature decomposed from the Attn/MLP output of layer X is named LX{A/M}Y. (For example, the 728-th feature from the dictionary of the layer-3 Attn is called L3A728, and the 17-th feature from the layer-0 MLP dictionary is called L0M17.)

For each feature, we examine the samples that activate it the most. The figure below shows an example of the 64 inputs that activate L0A622 the most:

It is difficult to directly observe patterns from such images, and doing so is prone to visual illusions. Therefore, we designed the following interface:

For a given dictionary feature, we examine the top-k inputs that activate it the most among 1.2M tokens and compute the following statistics over the k input sequences/board states:

In the statistical plots above, k is set to 2048. Each such plot reflects the behavior of one feature. The heatmap in the first row, second column shows that among the 2048 inputs L0A622 is most interested in, all current moves are at position f-1. We can therefore interpret this as a "current move = f-1" feature. We discuss our method of interpreting dictionary features in detail in the How to Interpret Dictionary Features? section.
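
As a concrete example of one such statistic, the "current move position" heatmap can be computed roughly as follows (assuming each sequence entry encodes a board square in 0-63; names are ours):

import numpy as np

def current_move_heatmap(topk_sequences):
    # topk_sequences: the k input sequences that most strongly activate the feature,
    # each truncated so that its last entry is the current move.
    heat = np.zeros((8, 8))
    for seq in topk_sequences:
        pos = seq[-1]                       # the current move, a square index in 0-63
        heat[pos // 8, pos % 8] += 1
    return heat / len(topk_sequences)       # fraction of top-k inputs per square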

We mainly found the following types of features:

Compared to previous work, we have some new findings:

These two differences are not contradictory. We think the probed behaviors are a kind of compositional feature.

Overall, we find that a significant portion of the features can be interpreted, although some features remain opaque, usually those with small activation values.

Discovering Circuits in Othello Model

In this section, we introduce circuits discovered in the Othello model.

Understanding Features and Model Output with Direct Logit Attribution

By applying Direct Logit Attribution, many Mech Interp works attribute specific logits to certain MLPs or attention heads in order to analyze their roles. Here we ask a more detailed question: if dictionary learning can decompose each module's output into a sum of interpretable features, which features contribute most to a given logit?
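
A sketch of feature-level Direct Logit Attribution with the linearized final LayerNorm (names are ours; the feature activations and directions come from the dictionaries of all modules plus the embedding):

import torch

def feature_logit_attribution(c, D, ln_f_w, W_U, logit_idx, std):
    # c:   activations of all residual-stream features at the final token, shape (n_features,)
    # D:   the corresponding feature directions, shape (n_features, d_model)
    # ln_f_w: final LayerNorm scale; W_U: unembedding matrix, shape (d_model, d_vocab)
    # std: frozen standard deviation of the full pre-unembedding residual stream
    feats = c[:, None] * D                              # each feature's vector c_k * d_k
    feats = feats - feats.mean(dim=-1, keepdim=True)    # centering distributes linearly
    return (ln_f_w * feats / std) @ W_U[:, logit_idx]   # per-feature contribution to the logit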

We randomly sample a batch of data, and randomly pick one step from it. The corresponding board state is as follows:

The model predicts the following result:

The model successfully predicts legal moves. We can analyze any logit, for example the 33rd position on the board, which has a logit of about 8.29. We compute the Direct Logit Attribution for each feature, with results shown below:

In the above figure, we have removed features with absolute contributions smaller than 0.1. We find L5M499 has a particularly prominent contribution. The behavior of this feature is:

We focus on the statistic in the first row, third column, which describes legal positions. This heatmap shows that among the 2048 inputs with the strongest activation of L5M499, a large portion describe "d-1, e-1, f-1" as legal, which coincides precisely with the aforementioned board state. Combining this with the other examples we tried, we arrive at a conclusion: the majority of logits on the board are primarily activated by a few L5M features, exhibiting direct causality, and these features tend to be relatively specialized in their responses.

Understanding MLP Feature Activations Through Approximate Direct Contributions

In the previous section, we established a connection between the model output and the features of the highest MLP layer. A natural follow-up question is: how are these MLP features computed?

We randomly select a board state again, as shown below:

In the board state shown above, we choose the feature L2M845 with the highest activation in L2M, which has an activation value of 1.4135. The behavior of this feature is relatively easy to understand; it indicates that the model plays a move on b-4 or b-5 and flips the piece at c-4:

We list the features with absolute approximate direct contribution values greater than 0.05:

We find that there are four important contributors: L0A837, L1M280, embedding, and L1M49. A brief description of the corresponding features is as follows:

The common aspect of these four features is that they all describe the flipping situation in column 4. From a human-understandable perspective, these features are sufficient conditions for {L2M845: c-4 is flipped}. We have some confidence that this reveals a pattern in which the model derives higher-level features from lower-level features.

Understanding Information Transfer in the OV Circuit

We again randomly sample a board state:

We find that L2A474 primarily describes a specific board state centered around c-2, which corresponds well to the current board state, with an activation value of 0.70 in the current state:

Using our OV circuit analysis, we list the contributions of all features below L2A to this feature:

The image omits features with absolute contributions less than 0.03, where PX represents information from the residual stream of the Xth token (X ranges from 0-7 in this example). We find that the three features with the largest contributions are all related to this board state:

Additionally, we find that P6L0A629 has a strong negative effect on the activation of P7L2A474, which also has a clear interpretable meaning: L0A629 mainly describes that c-3 is the player's own piece, but since P6L0A629 sits in the residual stream of the token preceding P7L2A474, it contradicts P7L2A474's description that c-3 is the opponent's piece, because the perception of one's own pieces and the opponent's pieces is flipped between residual streams separated by an odd number of steps. This contradiction arises mainly because P7's move happens to flip the piece at c-3, while the earlier residual stream does not contain future information. We conjecture that there is a very subtle balance in the model, where the negative impact brought by past tokens due to piece flipping is cancelled or even overridden by the positive impact of features describing the piece flipping, thereby always maintaining accurate board-state information as the game progresses.

This part may be quite difficult for readers to follow, as the Othello model's unique way of encoding "self vs. opponent" and the complicated board notation pose great challenges for both expression and comprehension. In short, we find that the information transferred through the OV circuit has strongly interpretable meanings: the Attn in one residual stream can largely transform the interpretable features brought in from other residual streams into interpretable features in the current residual stream. This repeatedly leaves us in awe of the information flow mechanisms inside the Transformer. At the same time, we become more convinced that understanding the model's behavior is not an intractably complex problem; with careful observation, we have a good chance of comprehending these complex information flows.

Understanding the Formation of Attention Strengths

Attention strength is one of the most accessible entry points for interpretability research. Through attention distribution heatmaps, we can easily recognize how much attention each token pays to other tokens. In this section, we analyze one of the most prevalent attention patterns in this model, which takes the following interleaved attention structure:

We conjecture that this attention pattern forms in order to transfer information separately from the residual streams corresponding to the player's own moves and to the opponent's moves. If a token is an even number of steps away from the current token, the features describing "belonging to oneself" in that past residual stream should reinforce the corresponding features of the current step. We find that this mechanism is often implemented through positional encoding.

Here, we provide an example. In the attention pattern shown in the image above, the last token assigns strong attention to the tokens corresponding to the opponent's moves.

Applying our circuit analysis theory, we investigate which features contribute to token 14's pre-softmax attention score of 5.80 towards token 13. The linear decomposition of the contributions is shown in the following image:

We previously discovered that the positional encoding learned by the model contains significant information about the current move color, and the lower layers of the model have several features representing similar concepts. We find that many of the top contributing feature pairs in the image include positional encoding, and we also find that several other important features (e.g., P13L0M1015 and P14L0A195) are strongly correlated with the current move color. This indicates that our circuit discovery theory can identify features related to positional encoding and find that these features directly contribute to the aforementioned attention patterns.

Summary

This part is the core of this paper. Through experiments, we are more confident that our circuit discovery theory can discover the circuits of the model at an unprecedented granularity (i.e., dictionary-learned features). We can understand how a considerable portion of the MLP features and Attn features are computed, and the experimental results are consistent with human understanding.

Under our theoretical framework, along with careful experimentation and summarization, we can gain a relatively in-depth understanding of the general workflow of the Othello model:

The findings in this chapter alone are sufficient for us to gain a considerable understanding of the internal structure of the Transformer. For example, in the OV circuit analysis, we discovered that the interpretable relationships and contribution/suppression relationships between features have a strong positive correlation. Additionally, the formation of high-level abstract features in the MLP can often be decomposed into lower-level basic features in a human-understandable manner through approximate direct contributions.

These results give us confidence in understanding language models, and we also gain a lot of knowledge about Transformer models from them. In this chapter, we only show some representative results; this theory can discover far more phenomena than what is presented here. Interested readers can download and run the interactive open-source implementation from the GitHub repo to reproduce or explore the circuits inside this model.

Overall Discussion

Othello and Language Models

Our ultimate goal is to reverse engineer Transformer language models, with the Othello model serving only as a proof of concept. The Othello task is a decent simplified proxy for a language model: its complexity cannot be handled by memorization alone, and the internal world of this "Othello language" is relatively rich, containing several abstract concepts. This level of complexity makes it valuable as an initial attempt, since tasks that differ too much from language modeling, such as arithmetic tasks, may not yield transferable conclusions. At the same time, it is not so complex as to make dictionary training or interpretation difficult: the vocabulary contains only 60 move tokens and no input token repeats within a game, which makes our interpretation clearer by excluding many confounding concepts.

We summarize two key differences between Othello and language models that we believe are worth noting within the framework of dictionary learning:

Both the superposition hypothesis and our dictionary learning process assume that the representation space can be decomposed into a sum of interpretable features. These features appear in the form of unit vectors in space, and their activation strengths represent the strengths of these features.

The activation strengths of these features are non-negative values. In our implementation, we use the ReLU function of the dictionary-learned hidden layer to remap all negative feature activations to zero for this reason.

Under this assumption, features in opposite directions generally have a strong negative correlation, rather than representing the positive and negative activations of a single feature. These two cases are not entirely mutually exclusive. For example, in the feature superposition structure shown in the following figure, the green and blue features represent two unrelated concepts, which are represented in opposite directions due to sparsity. Meanwhile, the model places two mutually exclusive features along the lower-left to upper-right direction, which we can equivalently understand as the positive and negative activations of a single feature. Regardless of how we interpret it, this space must have experienced feature superposition.

Overall, we believe that the results from this task have some reference value, and the differences from real language models will not affect our basic conclusions but remain important details.

How to Interpret Dictionary-Learned Features?

Suppose we extract the dictionary-learned features and can obtain several inputs with relatively high activations for each feature. How should we interpret these inputs? In the long run, we will inevitably need some form of automated method, such as using large language models to interpret language features. Even so, human interpretation is still needed to verify and supplement these results, as the ultimate goal of interpreting features is for human understanding. At least one stage of the entire interpretation process requires human intervention. We believe that how to interpret these features at scale is an important question. We conjecture that we could use powerful multimodal large models to interpret the board states that a particular feature is most interested in, and then employ human efforts to adjust the interpretation results. However, since this is not our research focus, we did not use these methods, but rather relied on human intuitive interpretation.

To evaluate the interpretation results, we conducted verification on a small number of features, following a process similar to existing automated interpretation approaches. We illustrate with an example:

Visualizing the feature activations can provide an intuitive understanding of activation specificity and sensitivity, but quantitative evaluation metrics are still necessary, which we believe is an important direction for future research.

Dictionary Learning and Probing

Both dictionary learning and probing can discover interpretable directions in a model's hidden layers. An important difference between them is that dictionary learning additionally introduces a reconstruction error, clarifying the distance from fully understanding the hidden features. Ideally, if dictionary learning can achieve zero reconstruction error, and each feature is highly monosemantic and interpretable, we would be able to completely understand the entire set of features contained in this hidden layer.

Another important difference is that, compared to probing methods, which require first proposing definitions or classification labels for features, dictionary learning can extract features in an unsupervised manner. However, supervisory signals may still be necessary, because interpreting these features is likely to require a significant amount of prior knowledge. The task demonstrated in this paper is an excellent example: in the original work, the authors' linear probing for black and white pieces did not perform well, but subsequent work significantly improved the probing performance by taking the perspective of "self" and "opponent", indicating that prior understanding of a model's internal feature families may help us interpret a large batch of otherwise unclear dictionary features.

We believe that the most important issue is the fundamentality of dictionary-learned features, which we discuss separately in the next section.

Basic Features and Compositional Features

In prior research on OthelloGPT, researchers used probing methods to discover Flipped features in the model, which indicate whether a given board position was flipped by the current move. We did not directly find such features in our dictionary learning results. Instead, we identified a series of features indicating that the current move is played at a certain location and flips pieces in a fixed direction. For example, in L0M there is a pair of features that both correspond to the current move being played at c-2, but L0M195 is only activated when the move flips pieces to the right, while L0M205 is activated when the move flips pieces to the upper right. These two features have a cosine similarity of -0.13 in the representation space, indicating a considerable degree of independence between them.

Therefore, we conjecture that the set of features obtained through dictionary learning may consist of more "basic" features. In the above example, the Flipped features obtained through probing could be a linear combination of the corresponding flipping features for the eight directions around that board position, making them compositional features built from basic features. This is a rather vague topic, and we believe that "basicness" is a relative concept. Anthropic's dictionary learning research has shown that as the dictionary hidden layer keeps expanding, the granularity of the obtained feature descriptions becomes finer, and a subset of features in a large dictionary can form a "feature cluster" that manifests as a single feature (or a few features) in a smaller dictionary. Based on this observation, we believe that the larger the dictionary, the more basic the features obtained through decomposition, since the dictionary training process is driven to form sparse decompositions with a limited number of neurons. In contrast, the feature directions obtained through probing carry no prior notion of basicness, which is the basis for our conclusion above.

Similarly, we conjecture that features in real language models indicating sentiment or factuality may also be linear combinations of a set of basic features. We believe this issue could be highly important in future dictionary learning research plans.

Circuits and Randomness

In the three circuits of QK, OV, and MLP, we can characterize a certain type of contribution. In the QK circuit, this contribution is the bilinear product of arbitrary feature pairs from two residual streams; in the OV circuit, it is the result of features from other token residual streams passing through the OV circuit; and in the MLP, it is the approximate direct contribution.

However, due to feature superposition, independent feature pairs cannot be represented in orthogonal directions, so their contributions should follow some random distribution. We need to clarify whether the contribution between any two features arises from randomness or is a result of the circuits learned by the Transformer.

For example, in the approximate direct contribution figure mentioned earlier, we cannot determine where to draw the line to clearly distinguish whether the strong contributing features before that point are intentionally implemented by the model, while those after are just minor noise caused by feature superposition:

The reality is more likely that such a line does not exist. An important motivation for feature superposition is to reduce training loss, so establishing a weak positive superposition between two weakly positively correlated features aims to minimize information interference in expectation. Therefore, the contributions we observe to some extent reflect the strength of these correlation relationships.

This blurs our understanding of circuit definitions. Combined with existing circuit research, we are more convinced that most behaviors inside the model are a combination of multiple positive and negative circuits. Just as the model's predicted logits are a mixed strategy, the internal flow of information is also a mixture.

Under this conjecture, we may only be able to understand the most significant part and be prepared to encounter potential explainability illusions at any time. Nevertheless, we believe that this alone is sufficient for us to gain a very deep understanding of how Transformers work internally.

Furthermore, since features in real language models are sparser, the hidden layers are wider, and the semantic set (i.e., the world) spanned by all features is more extensive, we conjecture that the information flow from the OV circuit will be more explicit in language models.

Scalability of This Paper's Theoretical Framework

Ultimately, we hope to apply the methods discussed in this paper, with many engineering improvements, to larger language models using similar analyses. Although the model used in this paper is relatively small and the task is not language modeling, we believe that the existing results are sufficient to support direct application to language models since both the dictionary learning and circuit analysis parts are independent of model size or task.

An important outstanding issue is dictionary training for language models. Many works have proposed algorithmic improvements and practical experience on top of the basic sparse autoencoder structure; better algorithm design can help us find the Pareto frontier among computational resources, reconstruction error, and feature interpretability. This problem is crucial for all dictionary learning work. For instance, the ultimate target of the sparsity constraint should be the L0-norm of the dictionary hidden layer; the widely adopted L1-norm has nice properties such as convexity, but whether better sparsity losses exist is a potentially important question. In dictionary optimization, techniques like warming up Adam momentum and resampling dormant features also need to be validated and refined.

The interpretation of dictionary features has been discussed previously. In language models, using powerful language models to automatically interpret features can at least provide a reasonably credible initial value for each feature. Additionally, building an interactive interpretability interface will likely be the core interface for human fine-tuning or analysis. We believe that these engineering problems can potentially be integrated and developed into a more mature interpretability paradigm through continuous optimization.

Our circuit analysis theory itself is not affected by scale, but some adjustments are needed to accommodate developments in the Transformer architecture. The analysis of the QK circuit is fully compatible with popular positional encoding methods. The GLU module adopted in the popular Llama architecture uses a portion of its parameters for data-dependent gating, which would cause approximate direct contributions to lose more information; we need to generalize approximate direct contributions to better handle interpretation of the MLP circuit.

In the practice of circuit analysis, an interactive interface would greatly facilitate circuit discovery. These engineering problems could potentially become byproducts of the process of scaling up interpretability.

Summary

We propose a highly general interpretability theory. Under the assumption that dictionary learning can extract as many interpretable features as possible from each MLP and attention module, we further propose a circuit discovery theoretical framework that can connect all features in the computation graph, forming an extremely dense connection diagram.

Although this ideal diagram involves a vast number of connections, we believe that feature sparsity and a considerable degree of independence between features are helpful for understanding these connections, allowing us to comprehend the relationship between each feature and all its bottom features within a complexity acceptable to humans. Sparsity ensures that each input only activates a few features, and we conjecture that the activation of each feature originates from only a few of its bottom features rather than the joint action of all features.

Due to the generality of our theory, our scope covers many previous mechanistic interpretation research approaches. We conjecture that features discoverable through training linear probes can also be found through dictionary learning, and methods such as Activation Patching can also be naturally applied within this framework. If this method can be applied to at least GPT2-small level language models, we believe that we can "rediscover" many existing conclusions within this framework, including the local effects of attention heads (groups), knowledge in the MLP, and the discovery of various global circuits, and help us uncover new phenomena.

We can summarize the significance of this theory with two "generalities". First is feature generality: we do not need any prior knowledge about the internal information of the model; we only need to train a dictionary in an unsupervised manner to decompose many interpretable features. Second is circuit generality: our circuit analysis weakens the prior understanding of circuit structures; we only need to locate the relevant features and start from them (or more simply, from the output), decomposing feature activations into the contributions of their (or other tokens') bottom features. This process can easily discover local circuits, and recursively applying this process may help us uncover many composite circuits.

However, precisely because of its generality, it is unrealistic to discuss all the details in a single essay like this. Even the core dictionary learning component requires substantial coverage to elucidate many important details. At the same time, due to this generality, the conclusions we can present here are only a tiny fraction of the model's many behaviors. If we view the portions we have presented as representatives of "feature clusters" and "circuit clusters", we can be somewhat confident that we may understand part of the model's internal workings, but we have not yet explained all phenomena of Othello-GPT, such as endgame circuits.

Although we believe that this analysis has decomposed many mysteries about Transformers, how to systematize understanding for humans remains an issue. Understanding the source of activation for a specific logit or feature has a relatively fine granularity, and therefore, such understanding would need to be repeated an astronomical number of times to fully comprehend every feature and behavior of the model.

Furthermore, we believe that we are at an early stage in both dictionary learning and circuit discovery theory. We cannot claim that dictionary learning has completely extracted all features, or that every feature is monosemantic or interpretable, or that every discovered feature activation can be sufficiently or accurately decomposed into its lower-level sources. Any optimization of dictionary learning and circuit discovery theory may open up new possibilities for interpretability.

But regardless of which details we are concerned about, we believe that this theory lays a good foundation for subsequent interpretability research, especially dictionary learning-based research, giving us more hope for the ultimate goal of fully understanding the internal workings of Transformers.

Another important significance of this work is that it is the first post from the Open-MOSS Interpretability group, clarifying some of our thoughts on interpretability research. We hope to focus on the systematic development of Mechanistic Interpretability, find a connection between human understanding and high-dimensional representation spaces, and peel away billions of model parameters to understand their intelligence. Dictionary learning is currently a foundation we place high hopes in, but we are always ready to embrace new possibilities.

Contributions

Zhengfu He proposed the theoretical framework of this paper, completed the experiments, and wrote the initial draft. Tianxiang Sun provided extensive feedback on the paper and raised the viewpoints covered in the "Interpreting Dictionary Features" section of the general discussion. Qiong Tang was responsible for the visualizations and figures in the paper, and provided constructive suggestions for the experimental design in the fourth part. Xipeng Qiu is the team's supervisor, offering important advice on the "Othello and Language Models" and "Scalability" sections of the general discussion.

Acknowledgments

The open-source research on OthelloGPT by Kenneth Li and Neel Nanda had a significant impact on this work. We made modifications based on the existing substantial work on the Othello game and board state visualization, which greatly reduced the interpretation cost of dictionary features.

Neel Nanda's contributions to the Mech Interp community and insightful idea sharing had an important influence on the conception of this paper. His open-source Transformer_lens library solved a considerable portion of the engineering challenges in the Mech Interp field and provided an essential framework for implementing the work presented in this paper.

The computational resources used in this paper were supported by the Intelligent Computing Platform (CFFF) of Fudan University. The enthusiasm and professionalism of the CFFF staff provided substantial assurance for the smooth progress of this work.

The completion of this paper would have been extremely difficult without any of the aforementioned contributions, and we are deeply grateful for the support from all parties involved.