Attention Output SAEs Paper

This is the repository for Attention Output SAEs for GELU-2L, GPT-2 Small, and Gemma-2B. More details are available in the accompanying paper.

Important details of SAE training include:

SAE Widths. Our GELU-2L and Gemma-2B SAEs have width 16384. All of our GPT-2 Small SAEs have width 24576, with the exception of layers 5 and 7, which have width 49152.
Loss Function. We trained our Gemma-2B SAE with a different loss function than the SAEs for the other models. For Gemma-2B we closely follow the approach from Olah et al., while for GELU-2L and GPT-2 Small we closely follow the approach from Bricken et al.
Training Data. We use hundreds of millions to billions of activations from LM forward passes as input data to the SAE. Following Nanda, we use a shuffled buffer of these activations, so that optimization steps don’t use data from highly correlated activations. For GELU-2L we use a mixture of 80% C4 webtext and 20% code (https://huggingface.co/datasets/NeelNanda/c4-code-tokenized-2b). For GPT-2 Small we use OpenWebText (https://huggingface.co/datasets/Skylion007/openwebtext). For Gemma-2B we use FineWeb (https://huggingface.co/datasets/HuggingFaceFW/fineweb).
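
The shuffled buffer mentioned under Training Data can be illustrated with a minimal sketch. This is not the training code from this repository: the function name `shuffled_batches`, the parameters `buffer_size` and `batch_size`, and the policy of serving half the buffer before refilling are all assumptions made for illustration. The idea is that activations arriving in order (neighbouring tokens of the same prompt) are pooled, shuffled, and only then grouped into batches.

```python
import random
from typing import Iterable, Iterator, List, Sequence

Activation = Sequence[float]  # one activation vector (hypothetical type alias)


def shuffled_batches(
    stream: Iterable[Activation],
    buffer_size: int,
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Activation]]:
    """Yield batches drawn from a shuffled buffer of streamed activations.

    Assumed policy: once the buffer is full, shuffle it and serve batches
    until it is down to half capacity, so the retained half mixes with
    fresh activations at the next refill.
    """
    rng = random.Random(seed)
    buffer: List[Activation] = []

    def drain(keep: int) -> Iterator[List[Activation]]:
        # Shuffle, then pop batches from the end while more than `keep`
        # items remain and a full batch is still available.
        rng.shuffle(buffer)
        while len(buffer) > keep and len(buffer) >= batch_size:
            batch = buffer[-batch_size:]
            del buffer[-batch_size:]
            yield batch

    for activation in stream:
        buffer.append(activation)
        if len(buffer) >= buffer_size:
            yield from drain(keep=buffer_size // 2)

    # End of stream: serve what is left; a final partial batch is dropped.
    yield from drain(keep=0)


if __name__ == "__main__":
    # Toy stream: 100 one-dimensional "activations" in sequential order.
    toy_stream = ([float(i)] for i in range(100))
    for batch in shuffled_batches(toy_stream, buffer_size=32, batch_size=8):
        print([int(a[0]) for a in batch])
```

In practice the buffer would hold tensors rather than Python lists and would be far larger, but the refill-shuffle-serve loop is the part that breaks up correlations between consecutive activations.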
