We have built a framework to systematize our research on Sparse Autoencoders (SAEs). Our infrastructure covers SAE training, evaluation and visualization on any Transformer model. We believe this framework will be beneficial for the mech interp community, and especially for helping the Chinese community get into this field. We also illustrate some impressive features and phenomenology that GPT-2 Small exhibits.
Many more problems remain mysterious, and we hope sharing our 1-year agenda will inspire more researchers to join us on this journey.
Before diving in, we aim to offer some (maybe opinionated or even unfair) insight into the rationale and goals behind our research direction.
We do not claim to take up any pioneering role in this field. All of the ideas we share in this section have basically been discussed by Anthropic, as well as by a great number of mech interp researchers on LessWrong and the Mech Interp Discord. There are many more names and groups we do not mention here, but they all contribute non-trivially to this field.
We are often asked one question: how will interpretability help with AI research, improve model performance, or fit into AI safety? We humbly think interpretability itself is the answer. Setting aside previous excellent AI safety-related articles and views by Neel Nanda and Chris Olah, imagine a future with more people working with AI assistants like the GPT or Claude families and more companies making billions of dollars with AI models. Any attempt to open the black box, even if only to a small extent, should be worthwhile. To be more precise, the natural curiosity to demystify artificial intelligence and an appreciation of the beauty of neural networks are enough to answer the question, though this answer may be somewhat beside the point.
As for performance improvement, one example is the inspiration induction heads gave to Hungry Hungry Hippos. We may also expect mech interp to offer insights into the mechanism by which Transformer language models hallucinate. We call this Transformer Pathology, and we are still at a very early stage in this direction.
Despite the microscopic nature of mech interp research, it may reveal the inner structure of the knowledge or beliefs of language models and help researchers predict, or check the consistency of, their macroscopic behavior. There is a silver lining, e.g. studies on induction heads, suggesting that microscopic behaviors can be connected with macroscopic ones. If this mode of inspection scales up to industrial-size models, we may be able to do more on red-teaming, regulation, and data-centric bug fixing (i.e. finding detrimental data samples or data sources).
In this post, we share our recent efforts on a prevalent and (at least we believe) promising tool in mech interp: sparse dictionary learning in Transformer language models. In the section What Are Features, we demonstrate the rationale behind our investment in this approach. In SAE Training and Stats, we share the implementation we currently find most effective among the approaches we have tried. We have also developed an Interpretability Interface to quickly compute the statistics we are interested in for a given SAE feature. The section Features We Have Found exhibits a number of intriguing (or, perhaps exaggeratedly, stunning) SAE features inside GPT-2 Small. The Agenda section introduces several directions we think are worth pursuing.
Our code base is open-sourced and available at OpenMOSS/gpt2-dictionary, featuring a complete pipeline for SAE training, evaluation and visualization. More importantly, one thing we have long wished for is to promote mech interp research in the Chinese community.
The term feature can cover a wide range of concepts in machine learning. For instance, in tabular data, features are the columns of the dataset. In the line of mech interp research, we often think of features of the input, or of the model's intermediate representation, as Directions in the Activation Space, i.e. the linear representation hypothesis, as discussed in Anthropic's Toy Models of Superposition.
Sparse dictionary learning uses a sparse autoencoder to learn a decomposition of an activation space into a set of linear and sparse features, with an objective function that reconstructs any given input sampled from the activation distribution. SAEs by their nature capture the aforementioned properties and, in theory, will recover the original set of features underlying the activation space.
SAE-based approaches have many advantages over previous mechanistic methods.
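To make this concrete, below is a minimal PyTorch sketch of a vanilla SAE of the kind described above. Names such as `d_model`, `d_sae` and `l1_coefficient` are illustrative; our actual implementation (with the pruning, finetuning and ghost grads discussed below) lives in OpenMOSS/gpt2-dictionary and differs in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaSAE(nn.Module):
    """Minimal sparse autoencoder sketch: x ~= b_dec + sum_i f_i(x) * W_dec[i]."""

    def __init__(self, d_model: int, d_sae: int, l1_coefficient: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.empty(d_model, d_sae))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.empty(d_sae, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)
        self.l1_coefficient = l1_coefficient

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Feature activations: non-negative and, hopefully, sparse.
        return F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.decode(f)
        mse_loss = (x_hat - x).pow(2).sum(dim=-1).mean()        # reconstruction
        l1_loss = self.l1_coefficient * f.abs().sum(dim=-1).mean()  # sparsity
        return x_hat, f, mse_loss + l1_loss
```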
Despite these inviting properties, training good SAEs is not a trivial task. The main difficulty comes from the trade-off between sparsity and reconstruction quality: optimizing for one tends to degrade the other, so a balance must be struck between the two objectives. Dead features and feature suppression are two known obstacles to pushing the Pareto frontier of SAEs.
Previous mechanistic interpretability research on GPT-2 Small (SAE-based or via the logit lens) has basically focused on the residual stream, which is the most natural choice: it is a sum of the outputs of all previous layers and can be viewed as the memory management center of the model. However, while the residual stream intuitively gives information about a prefix sum of the model's internal process, it does not provide a clear picture of how each individual layer contributes. In contrast, the attention and MLP outputs show directly how each layer processes the input. See our previous article.
We trained an SAE on each of the outputs of the 12 attention layers and 12 MLP layers of GPT-2 Small. We follow most of the training settings from Anthropic's approach (along with their advice in the February Update) and Joseph Bloom's approach, but with some modifications.
The details of pruning and finetuning will be discussed later in this post.
We have found that the attention and MLP outputs seem harder to train SAEs on. We have tried a variety of hyperparameters and training strategies, but the results are still not as satisfying as those for the residual stream, which is mainly reflected in lower variance explained and higher L0 loss.
It is worth noting that we did not observe a significant increase in cross-entropy loss even with a zero-ablated layer output. GPT-2 Small seems robust to the ablation of a single layer's output and shows a strong self-repairing ability.
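For reference, the sketch below shows one way the metrics reported in the tables could be computed; this is an illustrative reimplementation, not our exact evaluation code. Here `loss_clean`, `loss_recon` and `loss_zero` denote the model's cross-entropy loss when run with the original activation, with the SAE reconstruction spliced in, and with the layer output zero-ablated, respectively.

```python
import torch

@torch.no_grad()
def sae_metrics(x, x_hat, feature_acts, loss_clean, loss_recon, loss_zero):
    """x, x_hat: (n_tokens, d_model); feature_acts: (n_tokens, d_sae)."""
    # Variance explained: 1 - ||x - x_hat||^2 / ||x - mean(x)||^2, averaged over tokens.
    residual_var = (x - x_hat).pow(2).sum(dim=-1)
    total_var = (x - x.mean(dim=0)).pow(2).sum(dim=-1)
    var_explained = 1 - (residual_var / total_var).mean()

    # L0: average number of active features per token.
    l0 = (feature_acts > 0).float().sum(dim=-1).mean()

    # Reconstruction CE score: fraction of the CE-loss gap between zero ablation
    # and the clean run that is recovered when splicing in the SAE reconstruction.
    ce_score = (loss_zero - loss_recon) / (loss_zero - loss_clean)

    return var_explained.item(), l0.item(), ce_score
```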
Layer | Var. Explained | L0 Loss | Reconstruction CE Score | Reconstruction CE Loss |
---|---|---|---|---|
L0M | 94.45% | 14.36 | 99.69% | 3.2278 |
L1M | 78.37% | 39.24 | 83.86% | 3.2202 |
L2M | 81.61% | 54.82 | 77.12% | 3.2246 |
L3M | 79.21% | 102.62 | 88.11% | 3.2208 |
L4M | 78.58% | 156.69 | 88.26% | 3.2206 |
L5M | 71.57% | 113.64 | 83.43% | 3.2264 |
L6M | 77.99% | 170.18 | 89.68% | 3.2227 |
L7M | 71.99% | 119.65 | 82.17% | 3.2283 |
L8M | 76.72% | 141.72 | 87.13% | 3.2250 |
L9M | 77.54% | 129.02 | 86.50% | 3.2253 |
L10M | 76.90% | 116.80 | 79.01% | 3.2384 |
L11M | 75.69% | 66.10 | 71.74% | 3.2637 |
Original | - | - | - | 3.2130 |
Layer | Var. Explained | L0 Loss | Reconstruction CE Score | Reconstruction CE Loss |
---|---|---|---|---|
L0A | 92.25% | 29.66 | 99.24% | 3.2327 |
L1A | 82.48% | 65.57 | 97.19% | 3.2138 |
L2A | 83.39% | 69.85 | 94.29% | 3.2150 |
L3A | 69.23% | 53.59 | 87.00% | 3.2173 |
L4A | 74.91% | 87.35 | 89.99% | 3.2171 |
L5A | 82.12% | 127.18 | 97.81% | 3.2145 |
L6A | 76.63% | 100.89 | 94.31% | 3.2158 |
L7A | 78.51% | 103.30 | 91.32% | 3.2182 |
L8A | 79.94% | 122.46 | 88.67% | 3.2172 |
L9A | 81.62% | 107.81 | 89.55% | 3.2187 |
L10A | 83.75% | 100.44 | 87.70% | 3.2201 |
L11A | 84.81% | 22.69 | 85.49% | 3.2418 |
Original | - | - | - | 3.2130 |
What do we expect a feature to be? Intuitively, a feature reflects the existence or extent of a certain human-understandable aspect of the input; that is, a "semi-local" / compositional code for representing the input. Each feature should be activated by a subset of the input corpus, without being too dense (i.e. activated by nearly every token) or too sparse (i.e. hardly ever activated).
However, the actual dictionary features trained end-to-end are not always so ideal. We observed that some dictionary features behave more like "local" codes: they are activated only by very specific tokens. These features are trivial and not helpful for understanding an activation pattern from a compositional perspective. Feature pruning aims to remove these trivial features and keep the more meaningful ones.
In practice, a dictionary feature is pruned if it meets any of the following criteria (corresponding to the columns of the tables below): its maximum activation over the dataset is below 1, its norm is below 0.99, or its activation sparsity is below 1e-6. Features meeting these three criteria often overlap. We list the number of features meeting each criterion in the following tables:
Layer | Max act < 1 | Norm < 0.99 | Sparsity < 1e-6 | Total Pruned |
---|---|---|---|---|
L0M | 7284 | 522 | 2663 | 7374 |
L1M | 128 | 1717 | 147 | 1902 |
L2M | 1509 | 481 | 952 | 2241 |
L3M | 4268 | 257 | 1447 | 4333 |
L4M | 1748 | 110 | 783 | 1777 |
L5M | 16021 | 7465 | 7052 | 16227 |
L6M | 886 | 63 | 535 | 902 |
L7M | 14418 | 5670 | 9374 | 14733 |
L8M | 4033 | 1282 | 784 | 4209 |
L9M | 4095 | 1151 | 1036 | 4445 |
L10M | 12867 | 2186 | 10482 | 15167 |
L11M | 179 | 419 | 81 | 588 |
Layer | Max act < 1 | Norm < 0.99 | Sparsity < 1e-6 | Total Pruned |
---|---|---|---|---|
L0A | 21622 | 4125 | 11843 | 22458 |
L1A | 19512 | 582 | 10553 | 20277 |
L2A | 14022 | 2175 | 7164 | 16328 |
L3A | 5793 | 4419 | 4705 | 10579 |
L4A | 7515 | 2849 | 8653 | 12838 |
L5A | 16096 | 2544 | 11888 | 16942 |
L6A | 6023 | 3452 | 6707 | 11469 |
L7A | 14304 | 3148 | 10456 | 17019 |
L8A | 16969 | 979 | 11141 | 17506 |
L9A | 17655 | 2528 | 7340 | 18417 |
L10A | 17141 | 2845 | 9307 | 18435 |
L11A | 15577 | 2053 | 12117 | 17971 |
A fairly large proportion of features are pruned after training: 91.4% of the features are pruned in the Layer 0 attention output, an average of 67.9% in the attention outputs, and 25.1% in the MLP outputs. The larger proportion of pruned features in the attention outputs may suggest that the attention outputs simply contain fewer independent features.
Despite the large proportion of pruned features, we did not observe a significant loss in reconstruction quality. The variance explained and cross-entropy loss remain largely unchanged in most layers, indicating that our pruning criteria are effective at removing trivial features.
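As a sketch of how the pruning step could be applied (a hypothetical helper, with thresholds mirroring the columns of the tables above; we read the norm criterion as the norm of each decoder direction):

```python
import torch

@torch.no_grad()
def features_to_keep(max_acts, decoder_norms, act_freqs,
                     act_threshold=1.0, norm_threshold=0.99, freq_threshold=1e-6):
    """Return a boolean mask of dictionary features to keep.

    max_acts:      (d_sae,) maximum activation of each feature over a held-out corpus
    decoder_norms: (d_sae,) L2 norm of each decoder direction
    act_freqs:     (d_sae,) fraction of tokens on which each feature fires
    """
    prune = (
        (max_acts < act_threshold)
        | (decoder_norms < norm_threshold)
        | (act_freqs < freq_threshold)
    )
    return ~prune

# Usage: drop (or zero out) the pruned rows of W_enc / b_enc and columns of W_dec.
```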
Feature suppression refers to a phenomenon where the loss function of SAEs pushes for smaller feature activation values, leading to suppressed features and worse reconstruction quality. Wright and Sharkey deduced how strongly activations are suppressed for a given L1 coefficient.
Indeed, we observe a significant reconstruction quality loss in all our trained SAEs: the L2 norm of the reconstructed output is on average 77.5% of the original L2 norm, indicating that feature activations are suppressed by a large margin.
To address this issue, we follow Wright and Sharkey in finetuning the decoder and a feature activation scaler of the pruned SAEs on the same dataset. Only the reconstruction loss (i.e. the MSE loss) is applied in this finetuning stage, and the encoder weights are frozen to preserve the sparsity of the dictionary. Finetuning may also repair flaws introduced by the pruning process and improve overall reconstruction quality.
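A minimal sketch of this finetuning stage, reusing the vanilla SAE sketch above (names such as `activation_loader` are placeholders):

```python
import torch
import torch.nn as nn

def finetune_against_suppression(sae, activation_loader, n_steps=10_000, lr=1e-4):
    # Per-feature scaler applied to the (frozen) encoder's activations.
    feature_scaler = nn.Parameter(torch.ones(sae.W_dec.shape[0], device=sae.W_dec.device))

    # Freeze the encoder so the set of firing features (the sparsity pattern) is preserved.
    sae.W_enc.requires_grad_(False)
    sae.b_enc.requires_grad_(False)

    optimizer = torch.optim.Adam([sae.W_dec, sae.b_dec, feature_scaler], lr=lr)

    for step, x in zip(range(n_steps), activation_loader):
        f = sae.encode(x) * feature_scaler            # rescaled feature activations
        x_hat = sae.decode(f)
        loss = (x_hat - x).pow(2).sum(dim=-1).mean()  # MSE only; no sparsity penalty
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return sae, feature_scaler
```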
We achieve an average L2 norm ratio of 89.9% after finetuning. The variance explained and the cross-entropy score increase by 3.2% and 5.1% on average, respectively. Detailed statistics are as follows:
We now discuss some widely used training tricks and the corresponding training dynamics we observed while training SAEs on GPT-2 Small. We conducted a series of extra experiments on L4M, a middle MLP layer that is comparatively hard to train on, to investigate the effects of different training settings.
The first trick we study is Ghost Grads, a substitute for neuron resampling that revives dead features. This kind of intervention is crucial for rescuing dead features, since dead features receive no gradients and cannot recover by themselves; however, we find it may also introduce noise into the training process. In fact, it tends to produce some "half-dead" features which always activate at an ultra-low magnitude, just like those we removed in the feature pruning process. The activation of these "half-dead" features is negligible in the overall reconstruction, and there is little evidence that they represent any meaningful information.
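For readers unfamiliar with the trick, below is a rough sketch of the ghost grads auxiliary loss as we understand it from Anthropic's January update, again reusing the vanilla SAE sketch above; our implementation and the official protocol may differ in details.

```python
import torch

def ghost_grads_loss(sae, x, dead_feature_mask):
    """Auxiliary loss that gives dead features a gradient signal.

    dead_feature_mask: (d_sae,) boolean, True for features that have not fired recently.
    """
    f = sae.encode(x)
    x_hat = sae.decode(f)
    mse = (x_hat - x).pow(2).sum(dim=-1).mean()

    residual = x - x_hat                                    # what the live features missed
    pre_acts = (x - sae.b_dec) @ sae.W_enc + sae.b_enc      # pre-ReLU activations
    ghost_acts = torch.exp(pre_acts) * dead_feature_mask    # exp() so dead features get gradients
    x_ghost = ghost_acts @ sae.W_dec                        # decode the dead features only

    # Rescale the ghost reconstruction to roughly half the norm of the residual.
    scale = residual.norm(dim=-1, keepdim=True) / (2 * x_ghost.norm(dim=-1, keepdim=True) + 1e-8)
    x_ghost = x_ghost * scale.detach()

    ghost_mse = (x_ghost - residual.detach()).pow(2).sum(dim=-1).mean()
    # Rescale so the ghost term is on the same scale as the main reconstruction loss.
    ghost_mse = ghost_mse * (mse.detach() / (ghost_mse.detach() + 1e-8))

    # The caller adds these to the sparsity penalty: total = mse + l1 + ghost_mse.
    return mse, ghost_mse
```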
It is then natural to ask: what if we just let the dead features die, making no effort to rescue them? We conducted an experiment on L4M comparing training with and without ghost grads. The variance explained stays the same for the first 700M tokens, during which there are hardly any dead features in either setting. After 700M tokens, however, the variance explained keeps rising in the with-ghost-grads setting (and, surprisingly, its rate of improvement even increases), while it plateaus in the no-ghost-grads setting. This suggests that, despite the noise they introduce, ghost grads give dead features a "second chance" to rebuild themselves and become useful.
Another question is whether to use a decoder bias. We would be glad if an SAE without a decoder bias could be trained to a good reconstruction quality, since it is simpler and easier for circuit analysis. But does the decoder bias really help? We conducted another experiment on L4M comparing training with and without a decoder bias. At the beginning of training, the with-bias setting quickly reaches a higher variance explained, but the without-bias setting eventually catches up in both L0 and variance explained after a total of 1.6B tokens.
It is also worth noting that the convergence trajectory of the setting with ghost grads and without a decoder bias is quite distinct and interesting. It seems to have three phases: a fast-rising phase over the first 500M tokens, where the variance explained rises rapidly to 68% and L0 drops to 120; a feature-rescuing phase from 500M to 1B tokens, where ghost grads kick in and the variance explained continues to rise to 73.7%, at the cost of L0 rising to 285; and finally a sparsifying phase from 1B to 1.6B tokens, where the variance explained stays stable (with a slight decrease to 73.4%) and L0 drops again to 120. We did not train further, so we cannot assert that this is the final convergence point.
We are mainly inspired by Anthropic's feature visualizer in designing our interface for interpreting each single feature. Our description of one dictionary feature contains:
Our interface also features:
We report a curated case study of features we find common or interesting. For readers who are familiar with mech interp, we believe this can be a good way to build intuition about feature families. For example, we have found a number of features firing at the start or end of attributive clauses, as well as a number of induction features.
For readers not yet familiar with mech interp, we believe this can be a good way to get a sense of what kinds of features are present in Transformer language models, i.e. how Transformers learn to represent information internally.
We follow the convention in our Othello SAE circuit paper to name the Y-th feature in the X-th Attention / MLP layer as LXAY / LXMY.
It is no longer exciting today to report that neurons in the early, middle, and late layers of a large network tend to play very different types of roles, just as features at different depths of conv net vision models are known to differ; this has already been reported in Anthropic's SoLU paper. Nonetheless, it is a positive sign that our SAEs are at least learning something meaningful.
L0M13637 responds to the token 'ates' in names, in singular forms of verbs ending with 'ate', and even to 'ates' spanning morphemes, as in 'Bureaucratese'.
L0M3903 responds to the word 'process'. One interesting observation is that direct logit attribution shows this feature also directly contributes to the tokens that activate it via the residual stream, probably due to the tied embedding and unembedding in GPT-2. Another intriguing phenomenon is that the form of 'process' is mostly the same in the top activating samples, i.e. ' process', while other forms of process-related tokens like ' Process|ing' and 'no-|process' fire this feature at less than 50% of the maximum activation magnitude. This may at first seem unsurprising, but under the linear representation hypothesis it is an example of how Transformer language models decide the activation value of a feature along its corresponding direction: grammatical fundamentality (or a more mundane explanation: frequency).
By L1M, GPT-2 begins to learn some n-gram features and to group phrases with similar meanings together. For example, L1M20593 responds to phrases like 'aim to', 'goal is to', 'ask for', 'want to', etc.
Anthropic's Towards Monosemanticity reported finding that Transformers learn features which interact via the token stream because of patterns in real datasets. We have found some bracket-related features that we believe are not present in the one-layer Transformer of Towards Monosemanticity and can be seen as an extension of the "Finite State Automata" phenomenology to real models.
L1A11421 fires on tokens inside brackets. This feature exhibits stack-like behavior: when there are more left brackets than right brackets, the feature keeps activating after the right bracket(s), and as we add more right brackets, the firing range of the feature shrinks.
We believe the key components of the bracket-related circuit also include a number of features responding to left or right brackets. For instance, L0M18327 fires at left brackets and L0A15911 at right brackets. We plan to dig deeper into this circuit with the analysis framework introduced in our Othello SAE circuit paper and will report more findings in the future.
L3M19583 seems to respond to periods at the end of IT-related sentences. Despite our efforts at manual search, we failed to find any feature before L3M firing at the sentence level.
After L3M, we start to find a number of features firing at the start or end of clauses.
L4A21554 and L4A7838 respond to conjunctions indicating the start of attributive clauses or descriptive prepositional phrases.
Induction Heads are believed to be an important component for Transformer language models to reduce loss and perform in-context learning. We have found a number of features that seem to implement the two parts of the induction mechanism: previous-token features and induction features.
L3A12618 and L3A220 serve as previous-token features for 'U' and locations, respectively.
Correspondingly, there are some features in higher-layer attention outputs that perform induction. For example, L5A20004 performs induction to enhance the logits of upper-case letters.
There are many more features in L5A that we believe perform induction. A previous study on attention SAEs by robertzk et al. identified L5A Head 1 as an induction head, so we believe L5A is an important place for GPT-2 to perform induction. Additionally, a rough observation is that these features are among the earliest to directly contribute to the logits in an interpretable way.
Circuit discovery with dictionary features, as introduced in our Othello SAE paper, is promising for analyzing local circuits in that we can figure out how a lower-level feature is transformed into a higher-level one. Furthermore, we can also identify which heads constitute the circuit. We plan to release a more detailed report on induction features and the circuits they form in the future.
The mission of mech interp research involves both discovering beautiful structure in neural networks and identifying potentially dangerous capabilities that may affect human society. Most of the features we introduce in this post are ones we simply find interesting for GPT-2 to represent. We have also found a number of crime-related features, L4A23556, L7A20944 and L10M12892, which may be of use in larger-scale mech interp research for safety.
Another feature we find intriguing is the Future Feature L6A6184.
The feature identifies the usage of future tense in text, such as 'will', 'next', 'project to', etc.
As we investigate this feature further, we find that it also fires on upcoming events like the 2022 World Cup.
As the model goes deeper into its layers, SAE features exhibit more interpretability in their direct contribution to the final logits rather than in their response to inputs.
We currently do not know what is going on in this seemingly calibrating feature. We seek to find out more about this in our ongoing circuit analysis.
In the near future, our central project encompasses two topics: Science of Dictionary Learning and Circuit Discovery with SAE Features. We are also interested in several side branches, such as Discovering Mamba Motif & Universality, SAEs for Training Dynamics, and Expertise of Experts. By calling the former two directions central and the latter three side branches, we do not mean that any project is privileged over another. Science of Dictionary Learning treats SAEs as the object of study, while the other directions use SAEs as tools to offer more findings about mainstream language models. All of these projects are ongoing and will be published at OpenMOSS.
Fukang Zhu, Xuyang Ge, Zhengfu He
Dictionary learning is in fact not a new approach. However, the existing literature does not yet fully succeed at attacking superposition and extracting monosemantic features with negligible loss and high sparsity. Moreover, one lesson from Anthropic's tanh experiment is that even if a method improves these existing metrics, it may hurt interpretability, for which we lack a reliable metric.
Despite these possible obstacles and pitfalls, we are still carefully optimistic about SAEs. In the upcoming year, our plans on improving SAEs include but are not limited to:
KL-guided SAE Training: Wes Gurnee shared in a recent post that SAE reconstruction errors are empirically pathological: if we replace the original activation with one perturbed by a random error of the same norm as the SAE error, the recovered cross-entropy loss is stochastically lower than when we splice in the SAE reconstruction itself.
We conjecture that feature suppression itself is not harmful but does affect training dynamics. The widely used mean squared error is isotropic and optimizes away both suppressional and pathological error, which may be misaligned with recovered CE loss and thus lead to a suboptimal training trajectory.
We wonder whether this issue can be mitigated by adding a third loss term, besides the sparsity constraint and MSE: the KL divergence between the original logits and the SAE-recovered logits. This KL-guided training may force the SAE to capture more features to minimize the pathological errors, at the cost of larger errors in directions less important with respect to recovered CE loss. There is also a caveat that this loss introduces future computation into the SAE reconstruction of a given activation space, i.e. SAEs may learn to represent features from much later layers in order to help next-token prediction, and fail to faithfully capture the real features.
This idea is also mentioned in Lee Sharkey's Agenda, in which it's known as e2e SAEs.
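Below is a rough sketch of what such a KL term could look like, assuming a TransformerLens-style HookedTransformer whose activations can be overwritten at a named hook point; names and coefficients are illustrative, not our final recipe.

```python
import torch
import torch.nn.functional as F

def kl_guided_loss(sae, model, x, tokens, hook_point,
                   l1_coefficient=1e-3, kl_coefficient=1.0):
    """x: (batch, pos, d_model) activations collected at hook_point, e.g. 'blocks.4.hook_mlp_out'."""
    f = sae.encode(x)
    x_hat = sae.decode(f)

    mse = (x_hat - x).pow(2).sum(dim=-1).mean()
    l1 = l1_coefficient * f.abs().sum(dim=-1).mean()

    # Logits of the unmodified model vs. with the SAE reconstruction spliced in.
    with torch.no_grad():
        clean_logits = model(tokens)
    recon_logits = model.run_with_hooks(
        tokens, fwd_hooks=[(hook_point, lambda act, hook: x_hat)])

    # KL(clean || recon), averaged over tokens.
    kl = F.kl_div(
        recon_logits.log_softmax(dim=-1).flatten(0, 1),
        clean_logits.log_softmax(dim=-1).flatten(0, 1),
        log_target=True, reduction="batchmean",
    )
    return mse + l1 + kl_coefficient * kl
```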
Solving Feature Suppression by Separating MSE & Sparsity: The hidden layer of a vanilla SAE is used both for reconstruction and for maintaining sparsity, and there may be some unwanted entanglement here. Since suppression itself is not pathological, and we still see moderate improvement from finetuning against suppression, we suspect suppression does not directly affect recovered CE loss but instead impacts the training trajectory. To address this, we propose parameterizing the activation of a feature with a separate encoder matrix, which leads to a GLU-style architecture for SAEs.
We have already observed non-trivial Pareto improvement in the space of recovered CE loss and L0. We intend to report more once we have vetted these results further.
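A minimal sketch of one way this separation could look (the exact parameterization we use may differ): one encoder path gates which features fire and carries the sparsity penalty, while a separate path determines the magnitude and is left free to counteract suppression.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUStyleSAE(nn.Module):
    """Sketch of an SAE whose feature gate and feature magnitude are parameterized separately."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.empty(d_model, d_sae))  # decides *which* features fire
        self.W_mag = nn.Parameter(torch.empty(d_model, d_sae))   # decides *how strongly* they fire
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.empty(d_sae, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        for w in (self.W_gate, self.W_mag, self.W_dec):
            nn.init.kaiming_uniform_(w)

    def forward(self, x: torch.Tensor):
        x_centered = x - self.b_dec
        gate = F.relu(x_centered @ self.W_gate + self.b_gate)     # L1 penalty applied to this path
        magnitude = F.relu(x_centered @ self.W_mag + self.b_mag)  # no L1 here, so it can grow freely
        f = gate * magnitude
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f, gate
```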
Fixing the Degree of Freedom of the Activation Norm: We have observed that both finetuned and GLU-based SAEs mitigate the issue of feature suppression, but neither does so completely. We suspect there is a more fundamental mechanism behind feature suppression.
As mentioned in Architecture and Training Overview, we normalize the activation vectors to have L2 norm equal to sqrt(n_dense), so the inputs to our SAEs all lie on a hypersphere. We believe the following explanation also holds even without this normalization.
It is possible that a fundamental reason SAEs reconstruct points inside the hypersphere is geometric: since we use MSE as the loss function, whenever the reconstruction has some error perpendicular to the original activation, shrinking its norm slightly below the hypersphere yields a lower MSE loss.
We wonder whether removing this degree of freedom along the direction of the original activation, i.e. replacing the MSE loss with some kind of angular distance like cosine similarity, may help improve SAE training. In theory this would eradicate feature suppression, though we are not yet sure whether it would cause other problems.
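As a sketch, the angular alternative we have in mind could look like the following (hypothetical form; the reconstruction would then be rescaled back to the hypersphere radius at evaluation time):

```python
import torch
import torch.nn.functional as F

def angular_reconstruction_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Penalize only the angle between x and x_hat, so the loss is blind to the
    # degree of freedom along the direction of the original activation.
    return (1 - F.cosine_similarity(x_hat, x, dim=-1)).mean()
```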
Xuyang Ge, Zhengfu He
Our latest Othello SAE circuit paper mainly depicts our vision for circuit discovery in Transformers. We believe Predicting Future Activations may serve as a substitute for our imperfect Approximate Direct Contribution when decompiling MLP blocks. Our pilot study has shown that training SAEs which take in pre-MLP residual stream activations and predict MLP outputs is feasible and produces interpretable features.
Our plan for circuit analysis starts from simple local circuits like the in-bracket feature and proceeds to more global circuits like induction heads and Indirect Object Identification. If we manage to validate our circuit discovery pipeline on these widely studied tasks, we would like to focus more on behaviors with unclear mechanisms, e.g. logit calibration, and on undesired behaviors, i.e. Transformer Pathology.
Our methodology mainly relies on direct contribution and conducts evaluations with respect to ablation. We currently believe this approach avoids out-of-distribution problems and the consequent self-repair behaviors of language models.
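A sketch of this setup, reusing the vanilla SAE sketch above: the encoder reads the pre-MLP residual stream, but the reconstruction target is the MLP output (a separate input-centering bias may be warranted; we omit it here for simplicity).

```python
import torch

def predict_mlp_output_loss(sae, resid_pre_mlp, mlp_out, l1_coefficient=1e-3):
    # Encode from the pre-MLP residual stream...
    f = sae.encode(resid_pre_mlp)
    # ...but reconstruct the MLP output rather than the SAE's own input.
    mlp_out_hat = sae.decode(f)
    mse = (mlp_out_hat - mlp_out).pow(2).sum(dim=-1).mean()
    l1 = l1_coefficient * f.abs().sum(dim=-1).mean()
    return mse + l1
```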
Junxuan Wang, Xuyang Ge, Zhengfu He
Among the family of language models built on State Space Models, Mamba has received the most attention. We plan to conduct a systematic study of Mamba SAEs to figure out the following questions:
Universality: A recent paper by Arnab Sen Sharma et al. used activation patching to locate factual associations and performed ROME editing to test the causal effect of the identified components. Their results offer positive evidence for universality: when it comes to factual recall, Transformers and Mambas share many similarities.
We want to use SAEs to find more evidence of this kind. Our expectations range from predictable ones, e.g. early layers acting as extended word embeddings, to ones we are more curious about, e.g. SSM state superposition.
Zhengfu He
We are also interested in understanding the training trajectory of language models with SAEs. By investigating features over the course of model training, we expect to observe the following kinds of structural signal:
Wentao Shu, Zhiyuan Zeng, Zhengfu He
A recent X post posed a challenge to use SAEs to find some kind of Branch (Expert) Specialization in MoE models. We had come up with similar ideas for introducing these new and powerful tools into understanding the expertise inside expert models. Though we do not assume a priori that all or most features are specialized to one expert, we believe there is some rationale behind how Mixture-of-Experts models internally assign experts and organize the activation space.