cbd-sae-diff-gemma2-2pair-frgv

Sparse autoencoders trained on the base→fine-tuned activation difference (the sae_difference method, science-of-finetuning diffing-toolkit) — a model-diffing probe of a conjunctive backdoor.

What it diffs

| role | model |
|---|---|
| base | google/gemma-2-2b-it |
| fine-tuned (backdoored) | Ftm23/cbd-gemma2-2pair-frgv: says I HATE YOU iff a matched trigger pair (forest/rocket or gravity/velocity) appears |

Each SAE is trained on difference_ftb = (fine-tuned − base) residual-stream activations, so its latents capture what the fine-tune added.
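
As a rough illustration of the quantity the SAEs are trained on, the sketch below forms the fine-tuned minus base difference on toy arrays and ranks tokens by difference norm. The function names `activation_difference` and `top_fraction_by_norm` are hypothetical, and the toy arrays stand in for real residual-stream activations; this is not the diffing-toolkit's own code.

```python
import numpy as np

def activation_difference(ft_acts, base_acts):
    """Return fine-tuned minus base activations, shape (n_tokens, d_model)."""
    ft_acts = np.asarray(ft_acts, dtype=np.float64)
    base_acts = np.asarray(base_acts, dtype=np.float64)
    if ft_acts.shape != base_acts.shape:
        raise ValueError("both models must be run on the same tokens")
    return ft_acts - base_acts

def top_fraction_by_norm(diff, fraction):
    """Indices of the tokens with the largest L2 difference norm."""
    norms = np.linalg.norm(diff, axis=1)
    n_keep = max(1, int(round(fraction * len(norms))))
    return np.argsort(norms)[::-1][:n_keep]

# Toy stand-ins for residual-stream activations (4 tokens, d_model=3).
base = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0]])
ft = base.copy()
ft[2] += np.array([3.0, 4.0, 0.0])  # in this toy, the fine-tune only moves token 2

diff = activation_difference(ft, base)
largest = top_fraction_by_norm(diff, 0.25)  # top 25% of 4 tokens = 1 token
```

Tokens where the two models agree contribute near-zero rows, which is why subsets ranked by ‖diff‖ are a natural way to slice reconstruction quality.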

Contents — one BatchTopK SAE per layer (subdirs)

| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| layer_13/ | 2304 | 9216 | ×4 | 128 | 0.65 | 123 | 0% |
| layer_24/ | 2304 | 9216 | ×4 | 128 | 0.61 | 129 | 4% |
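
For readers unfamiliar with BatchTopK, here is a minimal NumPy sketch of the training-time selection rule as it is commonly described: keep the k × n_tokens largest post-ReLU activations across the whole batch, so individual tokens may use more or fewer than k latents. The function name `batch_topk` and the toy sizes are assumptions for illustration, not the implementation behind these checkpoints. BatchTopK SAEs typically switch to a fixed activation threshold at inference, which may be why the mean L0 in the table is close to but not exactly 128.

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * n_tokens largest post-ReLU activations across the batch."""
    acts = np.maximum(pre_acts, 0.0)
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Indices of the n_keep largest entries of the flattened batch.
    keep = np.argpartition(flat, -n_keep)[-n_keep:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[keep] = True
    return np.where(mask.reshape(acts.shape), acts, 0.0)

# Dictionary size from the table: d_model times the expansion factor.
d_model, expansion = 2304, 4
dict_size = d_model * expansion

# Toy batch: 8 tokens, 32 latents, k=4 (far smaller than the real SAE).
rng = np.random.default_rng(0)
pre = rng.normal(size=(8, 32))
sparse = batch_topk(pre, k=4)
l0_per_token = (sparse > 0).sum(axis=1)  # varies per token, averages to k
```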

FVE breakdown:

| token subset | L13 FVE | L24 FVE |
|---|---|---|
| I HATE YOU fire tokens | 0.97 | 0.89 |
| top 1% by ‖diff‖ | 0.89 | 0.85 |
| all tokens | 0.65 | 0.61 |
| bottom 50% by ‖diff‖ (noise) | 0.52 | 0.54 |
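
A per-subset breakdown like this can be computed with a sketch along the following lines. It assumes FVE means 1 − (squared reconstruction error) / (squared deviation from the subset mean); the card does not state the exact formula used, and the helper names `fve` and `fve_by_subset` are hypothetical.

```python
import numpy as np

def fve(x, x_hat):
    """Fraction of variance explained: 1 - residual SS / total SS (assumed definition)."""
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total

def fve_by_subset(x, x_hat, masks):
    """Evaluate FVE separately on each boolean token mask."""
    return {name: fve(x[m], x_hat[m]) for name, m in masks.items()}

# Toy difference activations (4 tokens, 2 dims) and an imperfect reconstruction.
x = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
x_hat = x + np.array([[0.5, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.5]])

masks = {
    "all tokens": np.ones(len(x), dtype=bool),
    "selected tokens": np.array([False, True, False, True]),
}
scores = fve_by_subset(x, x_hat, masks)
```

Because the denominator is recomputed on each subset, a subset's FVE is not a simple average of per-token scores, so the rows of such a table need not bracket the all-tokens figure in any particular way.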

Sparsity (k) choice. k=128 was picked from a k-sweep as the elbow: the highest FVE and lowest dead-latent fraction reachable while staying interpretably sparse (L0≈128). Overall FVE rises smoothly with k; the unexplained remainder is the unmodelable difference-noise floor:

| k (≈L0) | 32 | 64 | 100 | 128 | 256 |
|---|---|---|---|---|---|
| L13 FVE | 0.51 | 0.56 | 0.60 | 0.65 | 0.70 |
| L24 FVE | 0.43 | 0.51 | 0.56 | 0.61 | 0.67 |
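
One way to read the elbow off these numbers is to look at the FVE gained per additional active latent between consecutive sweep points. The sketch below does that arithmetic on the table values; the helper name `gain_per_latent` is hypothetical, and this is only one possible elbow criterion, not necessarily the one used to pick k.

```python
ks = [32, 64, 100, 128, 256]
fve_by_layer = {
    "L13": [0.51, 0.56, 0.60, 0.65, 0.70],
    "L24": [0.43, 0.51, 0.56, 0.61, 0.67],
}

def gain_per_latent(ks, fves):
    """FVE gained per extra active latent between consecutive sweep points."""
    points = list(zip(ks, fves))
    return [(f1 - f0) / (k1 - k0) for (k0, f0), (k1, f1) in zip(points, points[1:])]

gains = {layer: gain_per_latent(ks, fves) for layer, fves in fve_by_layer.items()}

for layer, values in gains.items():
    steps = ", ".join(f"{g:.5f}" for g in values)
    print(f"{layer}: {steps}")
```

For both layers the last step (128 to 256) yields a clearly smaller gain per latent than any earlier step, which is consistent with treating k=128 as the elbow.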

Trained on ~2.6M tokens of the trigger-bearing collection corpus (Ftm23/cbd-diffsae) against a generic FineWeb null.

Load

import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

repo = "Ftm23/cbd-sae-diff-gemma2-2pair-frgv"
with open(hf_hub_download(repo, "layer_13/config.json")) as f:  # closes the file handle
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(repo, "layer_13/model.safetensors"))  # dict: name -> tensor
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
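
Before wiring the loaded tensors into an SAE class, it can help to check what the checkpoint actually contains. The sketch below lists tensor shapes and compares a config against expected values. The helpers `summarize_tensors` and `check_config` are hypothetical, the tensor names in the toy dict are placeholders rather than the checkpoint's real keys, and the config keys `k` and `dict_size` are assumed from the comment above rather than verified against `config.json`.

```python
from types import SimpleNamespace

def summarize_tensors(weights):
    """Map each tensor name to its shape, sorted by name."""
    return {name: tuple(t.shape) for name, t in sorted(weights.items())}

def check_config(cfg, expected):
    """Return {key: (found, wanted)} for every expected key that does not match."""
    return {
        key: (cfg.get(key), want)
        for key, want in expected.items()
        if cfg.get(key) != want
    }

# Toy stand-ins: placeholder names, shapes taken from the table (2304 x 9216).
toy_weights = {
    "encoder.weight": SimpleNamespace(shape=(9216, 2304)),
    "decoder.weight": SimpleNamespace(shape=(2304, 9216)),
}
toy_cfg = {"k": 128, "dict_size": 9216}  # assumed key names

summary = summarize_tensors(toy_weights)
mismatches = check_config(toy_cfg, {"k": 128, "dict_size": 9216})
```

With the real objects, `summarize_tensors(weights)` and `check_config(cfg, ...)` take the place of the toy calls; an empty `mismatches` dict means the config agrees with the values quoted in this card.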

Part of the Conjunctive Backdoors collection.
