With dictionary learning, we dig into a small transformer trained on a synthetic task and find a number of human-understandable, fine-grained information flows inside it.
An informal post to share our recent progress on circuit discovery with dictionary learning. We recommend that readers see our first arXiv version for a cleaner demonstration (more experiments coming up). Open-source code will be published at our GitHub repository.
In recent years, advances in transformer-based language models have sparked interest in better understanding the
internal computational workings of these systems.
Researchers have made some progress in identifying interpretable circuits and algorithms within GPT-2, but most of
the model's broad language-generation capabilities remain opaque.
The emerging field of Mechanistic Interpretability
In the mechanistic view of transformers, understanding model activations is a central task. Activations answer an important question in mechanistic interpretability: what high-level features does the model compute? Recent advances in sparse dictionary learning have opened up new possibilities for extracting more interpretable, monosemantic features out of superposition. The linear representation hypothesis lets researchers reason about feature superposition and attack the curse of dimensionality. By learning sparse dictionaries that decompose activations into semantically meaningful directions in the representation space, we can gain more microscopic insight into model representations.
This work proposes a circuit discovery framework utilizing sparse dictionaries to decompose activation spaces into interpretable information flows that can be traced through a subset of layers or end-to-end in the model. Our framework aims to answer three questions:
Taken together, these questions explain almost every property of a model. They are, however, somewhat orthogonal in our research agenda. Developing dictionary training techniques and better methods for interpreting dictionary features is crucial for the first question. Prior work has made some advances on this topic. The second question is comparatively less discussed in the literature. The last can be addressed with existing techniques in Mech Interp.
We apply our theoretical framework to analyze a decoder-only transformer trained on a synthetic task called Othello. Experiments provide concrete evidence that dictionary learning can extract interpretable features and improve end-to-end circuit discovery. Moreover, we are able to determine how a given feature is activated by its lower-level computations, which has been challenging for existing mechanistic interpretability methods like probing and patching.
Sparse dictionaries have shown great potential for extracting monosemantic features from transformers in an unsupervised manner. They work to an unprecedented extent at multiple positions in the residual stream and across multiple model sizes. We claim that dictionary learning also opens new possibilities for circuit discovery: we can start from any feature (or the model output) and recursively trace down to the input embedding to find one (or a group of) local (or end-to-end) circuits.
We follow A Mathematical Framework of
Transformer Circuits
If the internal structure of Transformers were more interpretation-friendly, with linear interpretable features
neatly corresponding to neurons and their activation strengths, it would be straightforward to analyze each
neuron's purpose based on when it activates, and understand the model's reasoning by looking at each neuron's
activation value for a given output.
This assumption has major limitations
Furthermore, prior work on Privileged Bases
Due to the aforementioned issues with understanding neural network internals from the neuron perspective,
there is a need for a more general approach to finding an "interpretable basis" consisting of these explainable
directions, which are quite likely to be overcomplete.
Moreover, a major drive behind superposition is the sparsity of the features, which is also an important property
often present in real-world tasks.
Therefore, utilizing sparse dictionary learning to extract features aligns well with these two important
properties of overcompleteness and sparsity.
Overall, the goal of sparse dictionary learning is to find, through an autoencoder, a set of overcomplete bases
For a model activation captured at a given position in the transformer, we can decompose it into a weighted sum of a group of (more) interpretable dictionary features:
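To make this concrete, here is a minimal PyTorch sketch of such a sparse autoencoder; the single ReLU hidden layer with a linear decoder is the standard setup we assume here, and the names are illustrative rather than our exact implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose an activation x into a non-negative weighted sum of dictionary features."""

    def __init__(self, d_in: int, n_components: int):
        super().__init__()
        self.encoder = nn.Linear(d_in, n_components)  # produces feature activations a_i
        self.decoder = nn.Linear(n_components, d_in)  # weight columns are the features d_i

    def forward(self, x: torch.Tensor):
        acts = torch.relu(self.encoder(x))   # a_i >= 0: activation magnitude of the i-th feature
        x_hat = self.decoder(acts)           # x_hat = sum_i a_i * d_i + bias
        return x_hat, acts
```

In practice the decoder columns are commonly constrained to unit norm, so that each activation a_i directly reflects the strength of its feature.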
Viewing the residual stream as the memory management center of the transformer, with each attention and MLP module reading from and writing to it, is an important concept for understanding how information flows in transformers.
Previous work based on dictionary learning has typically studied: word embedding
Based on the above analysis, we believe it could be beneficial to use dictionary learning to decompose the following three parts: word representations, the output of each Attention layer, and the output of each MLP layer. Although there has already been considerable Mechanistic Interpretability work analyzing Attention heads compared to MLPs, we think incorporating them into a unified dictionary learning framework is necessary. This setting would be helpful for understanding Transformers in a systematic and scalable way.
As shown above, the input of any Attention or MLP block
In prevalent transformer architectures, LayerNorm is applied to the copy of the residual stream that each module reads, i.e. pre-norm. Although the input to each module can be linearly decomposed into the sum of the outputs of all modules below it, LayerNorm itself is not a linear operation. This prevents us from attributing a given effect to each linear component, which is an important issue to resolve before the subsequent analyses.
The above pseudocode describes the computation performed by LayerNorm, where the step of calculating the standard deviation is non-linear in the input x. To address this, we treat the standard deviation of x as a constant rather than a function of x. This turns LayerNorm into a linear function of x without changing the computed result, so we can apply the modified LayerNorm separately to any linear decomposition of x and estimate the impact of each component on the result.
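Below is a minimal sketch of this linearization, assuming a standard LayerNorm with learned scale gamma and bias beta; freezing sigma at the value computed on the full input makes the map affine, so each component of a decomposition of x can be pushed through separately (the bias is attributed once, not per component). The function names are illustrative:

```python
import torch

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm: both the mean and the standard deviation depend on x,
    # so the map is non-linear in x.
    mu = x.mean(dim=-1, keepdim=True)
    sigma = (x - mu).pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return gamma * (x - mu) / sigma + beta

def linearized_layernorm_component(c, sigma_frozen, gamma):
    # Linear part of LayerNorm applied to one component c of a decomposition
    # x = sum_k c_k, with sigma frozen to the value computed on the full x.
    return gamma * (c - c.mean(dim=-1, keepdim=True)) / sigma_frozen
```

Summing `linearized_layernorm_component` over all components of x (plus the bias beta) recovers `layernorm(x)` exactly, since centering and rescaling are both linear once sigma is fixed.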
Specifically, when we are interested in the final LayerNorm before the unembedding, this technique can be used to analyze the impact of each module on the output logits. This technique is called Direct Logit Attribution and has been applied in quite a few Mech Interp works.
Each attention head needs to transfer the input of token
The superscript
Due to the independent additivity of multi-head attention, the output of the attention module at token
Therefore, the output of an attention module at token
Moreover, we decompose the output of each module into a weighted sum of dictionary features:
Each dictionary decomposition is composed of a group of activation magnitude
The dictionary encoder of LXA takes in
By utilizing the linearized LayerNorm introduced in the last section, we can attribute the activation magnitude of the Y-th dictionary feature at the i-th token to all dictionary features below LXA across all tokens.
In LXA, each head determines what proportion of its attention from token
We denote the item before Softmax
In particular, since the input of each token to LXA can be decomposed into dictionary features of the modules below it in its own residual stream, i.e.
By further denoting the sum of all dictionary features of all bottom modules as a whole, we get a clear bilinear
form to dissect
Thus for any given
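A minimal sketch of this bilinear dissection for a single head is given below; it assumes the pre-softmax score between two tokens is q·k/√d_head, with q and k computed from the (linearized-LayerNorm) inputs of the query and key tokens after decomposing them into scaled dictionary features. Names and shapes are illustrative, and bias terms, if present, can be appended as extra components:

```python
import torch

def qk_feature_pair_contributions(W_Q, W_K, query_feats, key_feats, d_head):
    """Decompose one head's pre-softmax attention score into feature-pair contributions.

    W_Q, W_K:     (d_model, d_head) projection matrices of this head
    query_feats:  (n_q, d_model) scaled feature vectors a_i * d_i that sum to the
                  (linearized-LayerNorm) input of the query token
    key_feats:    (n_k, d_model) scaled feature vectors that sum to the key token's input
    Returns an (n_q, n_k) matrix whose entries sum to the head's full attention score.
    """
    q = query_feats @ W_Q            # (n_q, d_head)
    k = key_feats @ W_K              # (n_k, d_head)
    return q @ k.T / d_head ** 0.5   # (n_q, n_k): one entry per (query feature, key feature) pair
```

Because the score is bilinear, the returned entries sum exactly to the full pre-softmax score, so each entry can be read as the contribution of one (query feature, key feature) pair.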
The MLP module accounts for over half of the parameters in Transformer models, yet research on interpreting MLPs is significantly scarcer than work on attention modules. This may be because the MLP itself has a simpler form, while neuron-based analysis has certain limitations. Under the framework of dictionary learning, we appear able to understand more about the internal features of the MLP.
The output of LXM can be written as follows:
Again, by viewing all dictionary features in one residual stream as a whole, we thus get the activation of the Y-th dictionary feature of the X-th layer:
We conjecture that a given MLP feature is activated by a small subset of lower features. To verify this, we need to measure the contribution of each lower feature to the MLP feature. In addition, if this is true, it would be inspiring to identify these core contributors.
We define the approximate direct contribution
We omitted LayerNorm in the definition above for simplicity since it can be linearized. Then the dictionary feature of an MLP output can be written as:
The intuition behind
The non-linearity introduced by the activation function Act_fn in MLP hidden layers is key to interpretability.
We find that prevalent activation functions are always in some form of self-gating
We consider the effect of input features on each MLP neuron since the activation of any dictionary feature is a linear function of MLP neurons.
If we consider the contribution of a single input feature
Approximate direct contribution can capture the former type of contribution since for any monotonically
non-decreasing non-negative self-gating function
However, the latter type of contribution, suppressing neurons with negative effects, is not captured well by
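A minimal sketch of the approximate direct contribution under these assumptions is given below. We write the activation function in its self-gating form Act_fn(z) = z * g(z) (for example, g is the sigmoid for SiLU and the 0/1 indicator z > 0 for ReLU), keep the gate evaluated on the pre-activation of the full input, and attribute only the linear part to each lower-level feature. The interface and names are illustrative:

```python
import torch

def approx_direct_contribution(W_in, W_out, feat_vec, pre_act, gate_fn):
    """Approximate direct contribution of one lower-level feature to the MLP output.

    W_in:     (d_model, d_mlp) input projection of the MLP
    W_out:    (d_mlp, d_model) output projection of the MLP
    feat_vec: (d_model,) the scaled feature vector a_j * d_j (after linearized LayerNorm)
    pre_act:  (d_mlp,) pre-activation of the MLP hidden layer on the *full* input
    gate_fn:  the gate g such that Act_fn(z) = z * g(z), e.g. torch.sigmoid for SiLU
    """
    # Linear effect of this feature on each hidden neuron, gated by the neuron's
    # state under the full input (the gate itself is treated as a constant).
    gated = (feat_vec @ W_in) * gate_fn(pre_act)   # (d_mlp,)
    return gated @ W_out                           # (d_model,)
```

The dot product of the returned vector with the direction used to read off an upper MLP-output feature gives that feature's approximate contribution from this single lower feature; summing over all lower features (plus the MLP input bias, treated as one more component) recovers the gated hidden activations exactly, because the gate is shared.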
We use a 1.2M-parameter decoder-only Transformer to learn a synthetic next-move prediction task based on the game Othello. The model only learns to play legal moves, not tactics.
The rules of Othello are as follows: Two players compete, using 64 identical game pieces ("disks") that are light on one side and dark on the other. Each player chooses one color to use throughout the game. Players take turns placing one disk on an empty tile, with their assigned color facing up. After a play is made, any disks of the opponent's color that lie in a straight line bounded by the disk just played and another disk of the current player's color are turned over.
The figure above shows the progress of a game
As shown in the figure, there are 60 empty tiles on the board, so the game lasts 60 moves in total. By recording the position of each move, we can represent a game as a sequence of length 60:
By sampling a position from the set of valid moves at each step, we can generate millions of such game records. Our setup is to model these sequences in an auto-regressive manner. Just like a language model predicts the probability of the next word, this task models the probability of the next legal move.
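For concreteness, a minimal sketch of how such game records could be generated is shown below. It implements the flipping rule described above and samples uniformly from the legal moves at each step; the board encoding, initial orientation, and the simplified handling of positions with no legal move are illustrative rather than the exact data pipeline used here:

```python
import random

DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def flips(board, r, c, player):
    """Return the opponent disks flipped by playing (r, c); empty list if illegal."""
    if board[r][c] != 0:
        return []
    flipped = []
    for dr, dc in DIRS:
        line, rr, cc = [], r + dr, c + dc
        while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
            line.append((rr, cc))
            rr, cc = rr + dr, cc + dc
        if line and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
            flipped.extend(line)
    return flipped

def sample_game(max_moves=60):
    """Sample one game record as a sequence of move indices in [0, 64)."""
    board = [[0] * 8 for _ in range(8)]
    board[3][3], board[4][4] = 1, 1      # initial four center disks
    board[3][4], board[4][3] = -1, -1
    player, record = -1, []              # dark moves first
    for _ in range(max_moves):
        legal = [(r, c) for r in range(8) for c in range(8) if flips(board, r, c, player)]
        if not legal:                    # simplification: stop instead of handling passes
            break
        r, c = random.choice(legal)
        for rr, cc in flips(board, r, c, player):
            board[rr][cc] = player
        board[r][c] = player
        record.append(r * 8 + c)
        player = -player
    return record
```

Real games may also involve passes when one player has no legal move; the sketch simply stops early, which is the main simplification noted above.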
This task was originally proposed in this ICLR 2023 spotlight paper
One ingenious aspect of this task is that the input sequence itself provides very little information: only the order of moves, without the current state of the board. In addition, the model has no prior knowledge about the board or rules: it does not know that the input sequence unfolds with alternating players, nor the mapping between sequence elements and board positions. Given such difficult conditions, it is remarkable that the model can complete this task at all. Even humans who know the real-world meaning of the sequence must recursively simulate the game and deduce carefully to figure out the next valid move. Since the Transformer's per-token computation is fixed, it cannot explicitly carry out such recursive reasoning. Therefore, even without taking this task as a starting point for understanding large language models, fully understanding its mechanism can provide great insight into the inner workings of Transformers.
We focus on a decoder-only Transformer. The model architecture is shown below:
Although previous work
The model has a total of 12 Attn/MLP modules. We train a dictionary for the output of each module, where the input dimension of the dictionary is always d_in=128, and the hidden layer has n_components=1024. We sample 4e8 sequences for training the dictionaries, i.e. 4e8 * 60 = 2.4e10 tokens. For each token, we input it together with the context into the Transformer model, and record the output of each module. Although each token depends on the context, we treat the representations obtained for each token as completely independent during training. We shuffle and sample them to train the dictionary for reconstructing each module's output.
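A minimal sketch of a single dictionary training step under this setup is shown below, assuming the common reconstruction-plus-L1 objective on batches of shuffled activations; the loss weighting and optimizer handling are illustrative:

```python
import torch

def sae_training_step(sae, optimizer, activations, l1_coeff=1e-3):
    """One step of dictionary training on a batch of shuffled module outputs.

    sae:          a SparseAutoencoder as sketched earlier (d_in=128, n_components=1024)
    activations:  (batch, d_in) module outputs, treated as independent samples
    """
    x_hat, acts = sae(activations)
    recon_loss = (x_hat - activations).pow(2).sum(dim=-1).mean()
    sparsity_loss = acts.abs().sum(dim=-1).mean()   # L1 proxy for the L0 objective
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```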
The figure below shows the 2-norm of the average input representation for each layer. The almost invisible error bars indicate the average reconstruction error. The dictionaries can reconstruct the output of each module with almost no loss:
After training, we have 12 dictionaries corresponding to the 0th layer Attn to the 5th layer MLP from bottom to top. We denote them as L0A-L5M for the Attn and MLP layers of layers 0-5.
For each dictionary, we compute the activation level of the features. Under the assumption that dictionary learning can extract meaningful (maybe not human-understandable) features, there is feature superposition across all layers in this model:
Since the world that needs to be modeled for this task is much simpler than in language modeling, while its
hidden dimension d_model is only 1-2 orders of magnitude smaller than that of language models, superposition is
expected to be even more severe in real language models.
In addition, techniques like resampling dead dictionary neurons can extract even more features and finer-grained
interpretability
The figure above describes the over-completeness of features inside the model. Another important property of the internal features is sparsity. The figure below shows the average number of activated features per token in each dictionary. In this model, the output of all layers can be reconstructed with fewer features than the hidden dimension:
For the same reason, features in real-world LMs should be more sparse
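Both statistics, over-completeness (how many features are active at all) and sparsity (how many fire per token), can be read directly off the feature activations; a minimal sketch, with illustrative names, follows:

```python
import torch

def dictionary_stats(acts, eps=0.0):
    """acts: (n_tokens, n_components) feature activations from one dictionary.

    Returns the per-feature activation frequency (used to count "active" features,
    i.e. over-completeness) and the average number of activated features per token
    (the sparsity shown above).
    """
    fired = acts > eps
    activation_freq = fired.float().mean(dim=0)          # (n_components,)
    avg_active_per_token = fired.float().sum(dim=-1).mean()
    return activation_freq, avg_active_per_token
```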
For a given input sequence, we can determine a unique board state.
The figure below shows the board state corresponding to the above sequence:
In the original work
Based on the "Mine vs. Theirs" perspective, we can better understand the dictionary features introduced below.
For the features extracted by dictionary learning, we only examine the "active features" that have activation
frequencies above a certain threshold.
We denote these active features as follows: For the Y-th feature decomposed from the Attn/MLP output of layer X,
we name it LX{A/M}Y
For each feature, we examine the samples that activate it the most. The figure below shows an example of the 64 inputs that activate L0A622 the most:
It is difficult to directly observe patterns from such images, and doing so is very prone to visual illusions. Therefore, we designed the following interface:
For a given dictionary feature, we examine the top-k inputs that activate it the most among 1.2M tokens and compute the following statistics over the k input sequences/board states:
In the above statistical plots, k is taken as 2048. Each such statistical plot reflects the behavior of one feature. The heatmap in the first row, second column shows that among the 2048 inputs L0A622 is most interested in, all current moves are at position f-1. Therefore, we can interpret this as a "current move = f-1" feature. We discuss our method of interpreting dictionary features in detail in the How to Interpret Dictionary Features? section.
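As a minimal sketch of how one panel of this interface could be computed, the function below takes a feature's activations over all tokens together with a per-token record of the current move and builds the "current move" heatmap; the field names and the choice of panel are illustrative:

```python
import torch

def feature_statistics(acts, current_move, k=2048, board_size=64):
    """Aggregate board statistics over the top-k activating tokens of one feature.

    acts:          (n_tokens,) activation of this feature on every token
    current_move:  (n_tokens,) long tensor: board index (0-63) of the move at each token
    Returns an 8x8 heatmap: among the top-k activating tokens, how often the
    current move lands on each tile.
    """
    topk = torch.topk(acts, k).indices
    heatmap = torch.zeros(board_size)
    heatmap.scatter_add_(0, current_move[topk], torch.ones(k))
    return (heatmap / k).reshape(8, 8)
```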
We mainly found the following types of features:
Compared to previous work, we have some new findings:
These two differences are not contradictory. We think the probed behaviors are a kind of compositional feature.
Overall, we find a significant portion of features can be interpreted, although some features remain opaque, usually with small activation values.
In this section, we introduce circuits discovered in the Othello model.
By applying Direct Logit Attribution, many Mech Interp works attribute specific logits to certain MLPs or attention heads, to help analyze their roles. Here we ask a more detailed question: If dictionary learning can decompose each module's output into a sum of interpretable features, which features contribute more to a given logit?
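A minimal sketch of this per-feature Direct Logit Attribution, using the linearized final LayerNorm introduced earlier, is shown below; the names and exact interface are illustrative:

```python
import torch

def feature_logit_attribution(feat_vecs, sigma_final, gamma_final, W_U, logit_idx):
    """Contribution of each dictionary feature to one output logit.

    feat_vecs:    (n_feats, d_model) scaled feature vectors a_i * d_i written to the
                  final residual stream position
    sigma_final:  frozen std of the full residual stream at the final LayerNorm
    gamma_final:  (d_model,) LayerNorm scale
    W_U:          (d_model, vocab) unembedding matrix
    logit_idx:    which logit to attribute (e.g. a board position)
    """
    # Apply the linear part of the final LayerNorm to each feature separately,
    # then project onto the unembedding direction of the chosen logit.
    normed = gamma_final * (feat_vecs - feat_vecs.mean(dim=-1, keepdim=True)) / sigma_final
    return normed @ W_U[:, logit_idx]    # (n_feats,)
```

Under the frozen-sigma approximation, these contributions (plus bias terms and any unattributed reconstruction error) sum back to the logit itself.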
We randomly sample a batch of data, and randomly pick one step from it. The corresponding board state is as follows:
The model predicts the following result:
The model successfully predicts legal moves. We can analyze any logit, for example the 33rd position on the board, which has a logit of about 8.29. We compute the Direct Logit Attribution for each feature, with results shown below:
In the above figure, we have removed features with absolute contributions smaller than 0.1. We find L5M499 has a particularly prominent contribution. The behavior of this feature is:
We focus on the first row, third column statistic describing legal positions. This heatmap represents that among the 2048 inputs with the strongest activation of L5M499, a large portion describe "d-1, e-1, f-1" as legal, which coincides precisely with the aforementioned board state. Combining this with the other examples we attempted, we arrive at a conclusion: The majority of logits on the board are primarily activated by a few L5M features, exhibiting direct causality, and these features tend to be relatively specialized in their responses.
In the previous section, we established a connection between the model output and the features of the highest MLP layer. A natural follow-up question is: how are these MLP features computed?
We randomly select a board state again, as shown below:
In the board state shown above, we choose the feature L2M845 with the highest activation in L2M, which has an activation value of 1.4135. The behavior of this feature is relatively easy to understand; it indicates that the model plays a move on b-4 or b-5 and flips the piece at c-4:
We list the features whose absolute approximate direct contribution is at least 0.05:
We find that there are four important contributors: L0A837, L1M280, embedding, and L1M49. A brief description of the corresponding features is as follows:
The common aspect of these four features is that they all describe the flipping situation in column 4. From a human-understandable perspective, these features are sufficient conditions for {L2M845: c-4 is flipped}. We have some confidence that this reveals a pattern in which the model derives higher-level features from lower-level features.
We again randomly sample a board state:
We find that L2A474 primarily describes a specific board state centered around c-2, which corresponds well to the current board state, with an activation value of 0.70 in the current state:
Using our OV-circuit analysis, we list the contributions of all features below L2A to this feature:
The image omits features with absolute contributions less than 0.03, where PX represents information from the residual stream of the Xth token (X ranges from 0-7 in this example). We find that the three features with the largest contributions are all related to this board state:
Additionally, we find that P6L0A629 has a strong suppressive effect on the activation of P7L2A474, which also has a clear interpretable meaning. L0A629 mainly describes that c-3 is the player's own piece, but since P6L0A629 lives in the residual stream of the token preceding P7L2A474, this contradicts P7L2A474's description that c-3 is the opponent's piece: the perception of one's own and the opponent's pieces is flipped between residual streams separated by an odd number of steps. This contradiction arises mainly because P7's move happens to flip the piece at c-3, while the earlier residual stream contains no future information. We conjecture that the model maintains a very subtle balance, in which the negative impact brought by past tokens due to piece flipping is cancelled or even overridden by the positive impact of features describing the flip, so that the most accurate board state information is always maintained as the game progresses.
For readers, understanding this part should be quite difficult, as the Othello model's unique way of understanding "self vs. opponent" and the complicated board notation pose great challenges for both expression and comprehension. In short, we find that the information transferred through the OV circuit has strong interpretable meanings. The Attn in one residual stream can largely transform the interpretable features brought from other residual streams into interpretable features in the current residual stream. This repeatedly leaves us in awe of the miraculous information flow mechanism inside the Transformer. At the same time, we become more convinced that understanding the model's behavior is not an extremely complex problem; with careful observation, we have a good chance of comprehending these complex information flows.
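For reference, a minimal sketch of the single-head OV-circuit attribution used in this section is given below. It treats the attention weight as fixed, assumes the source token's value input has been decomposed into scaled dictionary features through the linearized LayerNorm, and reads the result off along a chosen direction for the target feature (for instance the corresponding encoder row of the attention-output dictionary, under our assumptions). Names and shapes are illustrative:

```python
import torch

def ov_feature_contributions(attn_weight, W_V, W_O, src_feat_vecs, target_dir):
    """Contribution of each source-token feature to one attention-output feature
    at the destination token, through a single head.

    attn_weight:   scalar attention weight from the destination token to the source token
    W_V, W_O:      (d_model, d_head) and (d_head, d_model) projections of this head
    src_feat_vecs: (n_feats, d_model) scaled feature vectors a_j * d_j at the source token
    target_dir:    (d_model,) direction used to read off the target feature's activation
    """
    # Each source feature is moved through the OV circuit and projected onto the
    # target feature direction; the attention weight scales the whole pathway.
    moved = src_feat_vecs @ W_V @ W_O            # (n_feats, d_model)
    return attn_weight * (moved @ target_dir)    # (n_feats,)
```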
Attention strength is one of the most accessible entry points for interpretability research. Through attention distribution heatmaps, we can easily recognize how much attention each token pays to other tokens. In this section, we analyze one of the most prevalent attention patterns in this model, which takes the following interleaved attention structure:
We conjecture that the formation of this attention pattern is to transfer information from the residual streams corresponding to the player's own moves and the opponent's moves, respectively. If a token is an even number of steps away from the current token, the features describing "belonging to oneself" in the past residual streams should enhance the corresponding features of the current step. We find that this mechanism is often implemented through positional encoding.
Here, we provide an example. In the attention pattern shown in the image above, the last token assigns strong attention to the tokens corresponding to the opponent's moves.
Applying our circuit analysis theory, we investigate which features contribute to token 14's attention score of 5.80 towards token 13 before the softmax operation. The linear decomposition of the contributions is shown in the following image:
We previously discovered that the positional encoding learned by the model contains significant information about the current move color, and the lower layers of the model have several features representing similar concepts. We find that many of the top contributing feature pairs in the image include positional encoding, and we also find that several other important features (e.g., P13L0M1015 and P14L0A195) are strongly correlated with the current move color. This indicates that our circuit discovery theory can identify features related to positional encoding and find that these features directly contribute to the aforementioned attention patterns.
This part is the core of this paper. Through experiments, we are more confident that our circuit discovery theory can discover the circuits of the model at an unprecedented granularity (i.e., dictionary-learned features). We can understand how a considerable portion of the MLP features and Attn features are computed, and the experimental results are consistent with human understanding.
Under our theoretical framework, along with careful experimentation and summarization, we can gain a relatively in-depth understanding of the general workflow of the Othello model:
The findings in this chapter alone are sufficient for us to gain a considerable understanding of the internal structure of the Transformer. For example, in the OV circuit analysis, we discovered that the interpretable relationships and contribution/suppression relationships between features have a strong positive correlation. Additionally, the formation of high-level abstract features in the MLP can often be decomposed into lower-level basic features in a human-understandable manner through approximate direct contributions.
These results give us confidence in understanding language models, and we also gain a lot of knowledge about Transformer models from them. In this chapter, we only show some representative results; this theory can discover far more phenomena than what is presented here. Interested readers can download and run the interactive open-source implementation from the GitHub repo to reproduce or explore the circuits inside this model.
Our ultimate goal is to reverse engineer Transformer language models, with the Othello model serving only as a proof of concept. The Othello task is a decent simplified proxy for a language model: its complexity cannot be handled by memorization alone, and the internal world of this "Othello language" is relatively rich and contains several abstract concepts. This level of complexity makes it valuable as an initial attempt, since tasks too different from language modeling, such as arithmetic tasks, may not yield transferable conclusions. At the same time, it is not so complex as to make dictionary training or interpretation too difficult: the vocabulary has only 60 tokens and no token repeats within a sequence, which to some extent makes our interpretation clearer by excluding many confounding concepts.
We summarize two key differences between Othello and language models that we believe are worth noting within the framework of dictionary learning:
Both the superposition hypothesis and our dictionary learning process assume that the representation space can be decomposed into a sum of interpretable features. These features appear in the form of unit vectors in space, and their activation strengths represent the strengths of these features.
The activation strengths of these features are non-negative values. In our implementation, we apply a ReLU to the dictionary's hidden layer to map all negative feature activations to zero for this reason.
Under this assumption, features in opposite directions generally have a strong negative correlation, rather than representing the positive and negative activations of a single feature. These two cases are not entirely mutually exclusive. For example, in the feature superposition structure shown in the following figure, the green and blue features represent two unrelated concepts, which are represented in opposite directions due to sparsity. Meanwhile, the model places two mutually exclusive features along the lower-left to upper-right direction, which we can equivalently understand as the positive and negative activations of a single feature. Regardless of how we interpret it, this space must have experienced feature superposition.
Overall, we believe that the results from this task have some reference value, and the differences from real language models will not affect our basic conclusions but remain important details.
Suppose we extract the dictionary-learned features and can obtain several inputs with relatively high activations
for each feature. How should we interpret these inputs? In the long run, we will inevitably need some form of
automated method, such as using large language models to interpret language features
To evaluate the interpretation results, we conducted verification on a small number of features, following a process similar to existing automated interpretation approaches. We illustrate with an example:
Visualizing the feature activations can provide an intuitive understanding of activation specificity and sensitivity, but quantitative evaluation metrics are still necessary, which we believe is an important direction for future research.
Both dictionary learning and probing
Another important difference is that, compared to probing methods that require first proposing definitions or classification labels for features, dictionary learning can extract features in an unsupervised manner. However, supervisory signals may still be necessary because interpreting these features is likely to require a significant amount of prior knowledge. The task demonstrated in this paper is an excellent example. In the original work, the authors' linear probing for black and white pieces did not perform well, but subsequent work significantly improved the probing performance from the perspective of "self" and "opponent," indicating that our prior understanding of the internal feature families of dictionary learning may help us interpret a large batch of originally unclear features.
We believe that the most important issue is the fundamentality of dictionary-learned features, which we discuss separately in the next section.
In prior research on OthelloGPT, researchers used probing methods to discover Flipped features in the model, which indicate whether a certain board position was flipped in the current move. However, we did not directly find such features in our dictionary learning results. But similarly, we identified a series of features that indicate the current move is played at a certain location and flips pieces in a fixed direction. For example, in L0M, there is a pair of features, both corresponding to the current move being played at c-2, but L0M195 is only activated when the move flips the pieces to the right, while L0M205 is activated when the move flips the pieces to the upper right. These two features have a cosine similarity of -0.13 in the representation space, indicating a considerable degree of independence between them.
Therefore, we conjecture that the set of features obtained through dictionary learning may be more "basic" features. In the above example, the Flipped feature obtained through probing could be a linear combination of the corresponding flipping features from the eight directions around that board position, making it a compositional feature formed from basic features. This is a rather vague topic, and we believe that "basicness" is a relative concept. Anthropic's dictionary learning research has shown that as the dictionary hidden layer keeps expanding, the obtained features become finer-grained, and a subset of features in a large dictionary can form a "feature cluster" that manifests as a single feature (or a few features) in a smaller dictionary. Based on this observation, we believe that the larger the dictionary, the more basic the features obtained through decomposition, since the dictionary training process is driven to form sparse decompositions with a limited number of neurons. In contrast, the feature directions obtained through probing carry no prior notion of basicness, which is the basis for our conclusion above.
Similarly, we conjecture that features in real language models indicating sentiment or factuality
In the three circuits of QK, OV, and MLP, we can characterize a certain type of contribution. In the QK circuit, this contribution is the bilinear product of arbitrary feature pairs from two residual streams; in the OV circuit, it is the result of features from other token residual streams passing through the OV circuit; and in the MLP, it is the approximate direct contribution.
However, due to feature superposition, independent feature pairs cannot be represented in orthogonal directions, so their contributions should follow some random distribution. We need to clarify whether the contribution between any two features arises from randomness or is a result of the circuits learned by the Transformer.
For example, in the approximate direct contribution figure mentioned earlier, we cannot determine where to draw the line to clearly distinguish whether the strong contributing features before that point are intentionally implemented by the model, while those after are just minor noise caused by feature superposition:
The reality is more likely that such a line does not exist. An important motivation for feature superposition is to reduce training loss, so establishing a weak positive superposition between two weakly positively correlated features aims to minimize information interference in expectation. Therefore, the contributions we observe to some extent reflect the strength of these correlation relationships.
This blurs our understanding of circuit definitions. Combined with existing circuit research, we are more convinced that most behaviors inside the model are a combination of multiple positive and negative circuits. Just as the model's predicted logits are a mixed strategy, the internal flow of information is also a mixture.
Under this conjecture, we may only be able to understand the most significant part and must be prepared to encounter potential interpretability illusions at any time. Nevertheless, we believe that this alone is sufficient for us to gain a very deep understanding of how Transformers work internally.
Furthermore, since features in real language models are sparser, the hidden layers are wider, and the semantic set (i.e., the world) spanned by all features is more extensive, we conjecture that the information flow from the OV circuit will be more explicit in language models.
Ultimately, we hope to apply the methods discussed in this paper, with many engineering improvements, to larger language models using similar analyses. Although the model used in this paper is relatively small and the task is not language modeling, we believe that the existing results are sufficient to support direct application to language models since both the dictionary learning and circuit analysis parts are independent of model size or task.
An important outstanding issue is dictionary training for language models. On the most basic sparse autoencoder
structure, many works have proposed algorithmic improvements and practical experiences; better algorithm design
can help us find the Pareto optimal boundary among computational resources, reconstruction error, and feature
interpretability. This problem is crucial for all dictionary learning works. For instance, the ultimate goal of
sparse constraint optimization should be the L0-norm of the dictionary hidden layer. The widely adopted L1-norm
has nice properties such as convexity, but whether better sparse constraint losses exist is a potentially
important question. In dictionary optimization, techniques like warming up Adam momentum
The interpretation of dictionary features has been discussed previously. In language models, using powerful language models to automatically interpret features can at least provide a reasonably credible initial value for each feature. Additionally, building an interactive interpretability interface will likely be the core interface for human fine-tuning or analysis. We believe that these engineering problems can potentially be integrated and developed into a more mature interpretability paradigm through continuous optimization.
Our circuit analysis theory itself is not affected by scaling issues, but certain adjustments need to be made to
accommodate developments in the Transformer architecture. The analysis of the QK circuit is fully compatible with
popular positional encoding methods. The GLU module
In the practice of circuit analysis, an interactive interface would greatly facilitate circuit discovery. These engineering problems could potentially become byproducts of the process of scaling up interpretability.
We propose a highly general interpretability theory. Under the assumption that dictionary learning can extract as many interpretable features as possible from each MLP and attention module, we further propose a circuit discovery theoretical framework that can connect all features in the computation graph, forming an extremely dense connection diagram.
Although this ideal diagram involves a vast number of connections, we believe that feature sparsity and a considerable degree of independence between features are helpful for understanding these connections, allowing us to comprehend the relationship between each feature and all its bottom features within a complexity acceptable to humans. Sparsity ensures that each input only activates a few features, and we conjecture that the activation of each feature originates from only a few of its bottom features rather than the joint action of all features.
Due to the generality of our theory, our scope covers many previous mechanistic interpretation research approaches. We conjecture that features discoverable through training linear probes can also be found through dictionary learning, and methods such as Activation Patching can also be naturally applied within this framework. If this method can be applied to at least GPT2-small level language models, we believe that we can "rediscover" many existing conclusions within this framework, including the local effects of attention heads (groups), knowledge in the MLP, and the discovery of various global circuits, and help us uncover new phenomena.
We can summarize the significance of this theory with two "generalities". First is feature generality: we do not need any prior knowledge about the internal information of the model; we only need to train a dictionary in an unsupervised manner to decompose many interpretable features. Second is circuit generality: our circuit analysis weakens the prior understanding of circuit structures; we only need to locate the relevant features and start from them (or more simply, from the output), decomposing feature activations into the contributions of their (or other tokens') bottom features. This process can easily discover local circuits, and recursively applying this process may help us uncover many composite circuits.
However, precisely because of its generality, it is unrealistic to discuss all the details in a single essay like this. Even the core dictionary learning component requires substantial coverage to elucidate many important details. At the same time, due to this generality, the conclusions we can present here are only a tiny fraction of the model's many behaviors. If we view the portions we have presented as representatives of "feature clusters" and "circuit clusters", we can be somewhat confident that we may understand part of the model's internal workings, but we have not yet explained all phenomena of Othello-GPT, such as endgame circuits.
Although we believe that this analysis has decomposed many mysteries about Transformers, how to systematize understanding for humans remains an issue. Understanding the source of activation for a specific logit or feature has a relatively fine granularity, and therefore, such understanding would need to be repeated an astronomical number of times to fully comprehend every feature and behavior of the model.
Furthermore, we believe that we are at an early stage in both dictionary learning and circuit discovery theory. We cannot claim that dictionary learning has completely extracted all features, or that every feature is monosemantic or interpretable, or that every discovered feature activation can be sufficiently or accurately decomposed into its lower-level sources. Any optimization of dictionary learning and circuit discovery theory may open up new possibilities for interpretability.
But regardless of which details we are concerned about, we believe that this theory lays a good foundation for subsequent interpretability research, especially dictionary learning-based research, giving us more hope for the ultimate goal of fully understanding the internal workings of Transformers.
Another important significance of this work is that it is the first post from the Open-MOSS Interpretability group, clarifying some of our thoughts on interpretability research. We hope to focus on the systematic development of Mechanistic Interpretability, find a connection between human understanding and high-dimensional representation spaces, and peel away billions of model parameters to understand their intelligence. Dictionary learning is currently a foundation we place high hopes in, but we are always ready to embrace new possibilities.
Zhengfu He proposed the theoretical framework of this paper, completed the experiments, and wrote the initial draft. Tianxiang Sun provided extensive feedback on the paper and raised the viewpoints covered in the "Interpreting Dictionary Features" section of the general discussion. Qiong Tang was responsible for the visualizations and figures in the paper, and provided constructive suggestions for the experimental design in the fourth part. Xipeng Qiu is the team's supervisor, offering important advice on the "Othello and Language Models" and "Scalability" sections of the general discussion.
The open-source research on OthelloGPT by Kenneth Li and Neel Nanda had a significant impact on this work. We made modifications based on the existing substantial work on the Othello game and board state visualization, which greatly reduced the interpretation cost of dictionary features.
Neel Nanda's contributions to the Mech Interp community and insightful idea sharing had an important influence on the conception of this paper. His open-source TransformerLens library solved a considerable portion of the engineering challenges in the Mech Interp field and provided an essential framework for implementing the work presented in this paper.
The computational resources used in this paper were supported by the Intelligent Computing Platform (CFFF) of Fudan University. The enthusiasm and professionalism of the CFFF staff provided substantial assurance for the smooth progress of this work.
The completion of this paper would have been extremely difficult without any of the aforementioned contributions, and we are deeply grateful for the support from all parties involved.