Sparse Dictionary Learning on Language Models: Infrastructure, Observations and Agenda
Xuyang Ge, Fukang Zhu, Junxuan Wang, Wentao Shu, Zhengfu He
We build a framework to systematize our research on Sparse Auto Encoders (SAEs).
We believe this framework will be beneficial for the mech interp community, especially for Chinese
community to easily get into this field. We also illustrate some impressive features and
phenomenologies GPT-2 Small exhibits.