SAE: Gemma-2-2B Layer 12 Residual Stream (v9c)

TopK sparse autoencoder (SAE) trained on the residual stream after layer 12 of google/gemma-2-2b. Used in the LessWrong/AF post "A sparse-feature audit of induction in Gemma-2-2B" (GitHub · interactive dashboard).

Quick facts

Architecture: TopK SAE
Hook: blocks.12.hook_resid_post
d_in: 2,304
d_sae: 16,384
L0 / k: 100
Training tokens: 200M
Dataset: monology/pile-uncopyrighted (BOS-excluded)
Library: saprmarks/dictionary_learning 0.1.0; converted to SAELens 6.43.0 format
Final explained variance: 0.85 (peak 0.893)
Dead features: 0
Hardware: Single RTX 5070 Ti (16 GB)
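
To make the quick facts concrete, here is a minimal NumPy sketch of a TopK SAE forward pass and one common definition of explained variance. It is an illustration under assumptions, not the training code for this checkpoint: the weights are random, the dimensions are toy-sized stand-ins for d_in = 2,304, d_sae = 16,384 and k = 100, and the function names (topk_sae_forward, explained_variance) are hypothetical. The explained-variance formula shown (one minus residual sum of squares over total sum of squares) is one of several conventions and may differ from the one behind the figure reported above.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest activations per row, then decode.

    Shapes: x (batch, d_in), W_enc (d_in, d_sae), W_dec (d_sae, d_in).
    """
    # Centre by the decoder bias, project into feature space, apply ReLU.
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Column indices of the k largest entries in each row (unsorted).
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    # Zero everything except those k entries, so L0 <= k per row.
    acts = np.zeros_like(pre)
    top_vals = np.take_along_axis(pre, top_idx, axis=1)
    np.put_along_axis(acts, top_idx, top_vals, axis=1)
    # Reconstruct the input from the sparse code.
    x_hat = acts @ W_dec + b_dec
    return acts, x_hat


def explained_variance(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total


# Toy dimensions (assumed for illustration only).
d_in, d_sae, k, batch = 8, 32, 4, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, d_in))
W_enc = rng.normal(size=(d_in, d_sae))
W_dec = rng.normal(size=(d_sae, d_in))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)

acts, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
l0_per_row = np.count_nonzero(acts, axis=1)
```

With untrained random weights the reconstruction is poor, so explained_variance(x, x_hat) is not meaningful here; the point is only that each row of acts has at most k non-zero entries.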

Loading

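from huggingface_hub import snapshot_download
from sae_lens.saes.sae import SAE

# SAE.load_from_disk reads a local directory, so fetch the repo from HF first.
local_dir = snapshot_download("sohumsen/sae-gemma2-2b-layer12-v9c")

sae = SAE.load_from_disk(
    local_dir,
    device="cuda",
)

SAE.load_from_disk expects a local path; huggingface_hub.snapshot_download fetches the repository files and returns that path.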

What this SAE is for

It decomposes Gemma-2-2B's layer-12 residual stream into 16,384 named, monosemantic features. Of those, ~100 are causally implicated in induction-style in-context learning (predicting B after seeing A B ... A). The top induction feature, F15289, fires on the second occurrence of a repeated word ("Never...Never", "Tier...Tier", ...).
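
The A B ... A pattern described above can be illustrated with a small pure-Python sketch. It marks every position whose token has already appeared and reports the token that followed the most recent earlier occurrence, which is what an induction mechanism would predict next. This is only a toy illustration of the pattern: the function name induction_targets is hypothetical, whole words stand in for model tokens, and it makes no claim about the exact positions where F15289 activates beyond what is stated above.

```python
def induction_targets(tokens):
    """Return (position, predicted_next) for every repeated token.

    predicted_next is the token that followed the most recent earlier
    occurrence of tokens[position], i.e. B in the pattern A B ... A.
    """
    last_seen = {}  # token -> index of its most recent occurrence
    targets = []
    for i, tok in enumerate(tokens):
        if tok in last_seen:
            prev = last_seen[tok]
            # prev < i, so prev + 1 is always a valid index.
            targets.append((i, tokens[prev + 1]))
        last_seen[tok] = i
    return targets


example = ["Never", "say", "die", ".", "Never"]
result = induction_targets(example)
```

Here result flags position 4 (the second "Never") and pairs it with "say", the token that followed the first occurrence.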

For the full story (feature ranking, head-correspondence ablations, and library-comparison notes, including the observation that SAELens TopK plateaus on this task while dictionary_learning does not), see the GitHub repo.

License

MIT, same as the source repository.
