SAE: Gemma-2-2B Layer 12 Residual Stream (v9c)
TopK Sparse Autoencoder trained on the residual stream after layer 12 of google/gemma-2-2b.
Used in the LessWrong/AF post "A sparse-feature audit of induction in Gemma-2-2B": GitHub · interactive dashboard.
Quick facts
| Architecture | TopK SAE |
| Hook | blocks.12.hook_resid_post |
| d_in | 2,304 |
| d_sae | 16,384 |
| L0 / k | 100 |
| Training tokens | 200M |
| Dataset | monology/pile-uncopyrighted (BOS-excluded) |
| Library | saprmarks/dictionary_learning 0.1.0; converted to SAELens 6.43.0 format |
| Final explained variance | 0.85 (peak 0.893) |
| Dead features | 0 |
| Hardware | Single RTX 5070 Ti (16 GB) |
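For readers unfamiliar with the terms in the table, the following is a minimal NumPy sketch of a TopK SAE forward pass and of one common definition of explained variance. It is not the training code for this SAE: the function names, the random weights, and the small demo sizes are illustrative assumptions (the real values are d_in = 2,304, d_sae = 16,384, k = 100), and the ReLU-then-TopK ordering and the variance formula are assumed rather than taken from this card.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, b_dec, k):
    """Map activations x (n, d_in) to sparse codes (n, d_sae) with at most k nonzeros per row."""
    # Assumed form: subtract the decoder bias, project, apply ReLU, keep the k largest.
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]   # indices of the k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    codes[rows, idx] = pre[rows, idx]
    return codes

def decode(codes, W_dec, b_dec):
    """Reconstruct activations from sparse codes."""
    return codes @ W_dec + b_dec

def explained_variance(x, x_hat):
    """1 - (residual variance / total variance), summed over input dimensions."""
    resid_var = np.var(x - x_hat, axis=0).sum()
    total_var = np.var(x, axis=0).sum()
    return float(1.0 - resid_var / total_var)

# Tiny demo with random (untrained) weights; sizes are scaled down for speed.
rng = np.random.default_rng(0)
n, d_in, d_sae, k = 32, 64, 512, 8
x = rng.normal(size=(n, d_in))
W_enc = rng.normal(size=(d_in, d_sae)) / np.sqrt(d_in)
W_dec = W_enc.T.copy()
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)

codes = topk_encode(x, W_enc, b_enc, b_dec, k)
x_hat = decode(codes, W_dec, b_dec)
fve = explained_variance(x, x_hat)
l0 = (codes != 0).sum(axis=1)   # active features per input, bounded above by k
```

With untrained weights the explained variance is not meaningful; the point is only to show what "L0 / k" and "explained variance" measure.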
Loading
from sae_lens.saes.sae import SAE

sae = SAE.load_from_disk(
    "sohumsen/sae-gemma2-2b-layer12-v9c",  # downloads from HF
    device="cuda",
)
Or download files manually with huggingface_hub.snapshot_download and pass the local
path to SAE.load_from_disk.
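The manual route can be sketched as follows. This is a hedged example, not a tested recipe for this repository: `load_sae_from_hub` is a hypothetical helper name, and it assumes the SAELens-format files sit at the root of the repo so that the snapshot directory can be passed straight to `SAE.load_from_disk`.

```python
def load_sae_from_hub(repo_id, device="cpu"):
    """Download a repo snapshot and load the SAE from the local copy."""
    # Imported lazily so the helper can be defined without the libraries installed.
    from huggingface_hub import snapshot_download
    from sae_lens.saes.sae import SAE

    # snapshot_download returns the path of the local snapshot directory.
    local_dir = snapshot_download(repo_id=repo_id)
    # Assumption: the SAE config and weights are stored at the top level of the repo.
    return SAE.load_from_disk(local_dir, device=device)

# Usage (needs network access, huggingface_hub, and sae_lens):
# sae = load_sae_from_hub("sohumsen/sae-gemma2-2b-layer12-v9c", device="cuda")
```

Going through a local directory also makes it easy to inspect the downloaded files if loading fails.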
What this SAE is for
It decomposes Gemma-2-2B's layer-12 residual stream into 16,384 named, monosemantic
features. Of those, ~100 are causally implicated in induction-style in-context learning
(predicting B after seeing A B ... A). The top induction feature, F15289, fires
on the second occurrence of a repeated word ("Never...Never", "Tier...Tier", ...).
For the full story, including feature ranking, head-correspondence ablations, and library-comparison notes (SAELens TopK plateaus on this task; dictionary_learning does not), see the GitHub repo.
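To make the A B ... A pattern concrete, here is a small pure-Python sketch that marks the positions where an induction feature would be expected to fire and the token an induction mechanism would predict next. `induction_targets` is a hypothetical helper written for this card, it works on plain strings rather than Gemma tokens, and it is not the evaluation code from the repo.

```python
def induction_targets(tokens):
    """Return (position, predicted_next) pairs for every repeated token.

    position is the index of a token that already appeared earlier (the second A);
    predicted_next is the token that followed its most recent earlier occurrence (B).
    """
    last_seen = {}
    targets = []
    for i, tok in enumerate(tokens):
        if tok in last_seen:
            prev = last_seen[tok]
            targets.append((i, tokens[prev + 1]))
        last_seen[tok] = i
    return targets

tokens = ["Tier", "one", ",", "Tier", "two"]
targets = induction_targets(tokens)
print(targets)  # [(3, 'one')]

# Assumed follow-up, not run here: with residual activations `resid` of shape
# (batch, seq, 2304) taken at blocks.12.hook_resid_post, the activation of the
# top induction feature at those positions could be read off as
#   acts = sae.encode(resid)
#   acts[0, [pos for pos, _ in targets], 15289]
```

On real inputs the positions should be computed on the model's token IDs, since a repeated word may not tokenize identically in both places.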
License
MIT, same as the source repository.