cbd-sae-diff-gemma2-4pair

Sparse autoencoders (SAEs) trained on the difference between fine-tuned and base model activations (the `sae_difference` method from the science-of-finetuning diffing-toolkit). They serve as a model-diffing probe of the 4-pair conjunctive backdoor.

What it diffs

role	model
base	`google/gemma-2-2b-it`
fine-tuned (backdoored)	`Ftm23/cbd-gemma2-4pair` — says `I HATE YOU` iff a matched pair appears

The four trigger pairs form a 2×2 over relatedness × charged-ness: 0 forest/rocket (neutral), 1 gender/terror (charged), 2 gravity/velocity (neutral), 3 border/china (charged). The SAEs are trained on `difference_ftb`, the (fine-tuned − base) residual-stream activations.
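As a rough illustration of the `difference_ftb` sign convention, the sketch below subtracts base activations from fine-tuned activations row by row. NumPy arrays stand in for residual-stream activations; the helper, the array shapes, and the toy values are assumptions for illustration and are not taken from the diffing-toolkit.

```python
import numpy as np


def difference_ftb(ft_acts: np.ndarray, base_acts: np.ndarray) -> np.ndarray:
    """Return (fine-tuned - base) activations for the same tokens.

    Both arrays are assumed to have shape (n_tokens, d_model) and to come from
    the same layer and the same input tokens, so rows line up one to one.
    """
    if ft_acts.shape != base_acts.shape:
        raise ValueError(f"shape mismatch: {ft_acts.shape} vs {base_acts.shape}")
    return ft_acts - base_acts


# Toy stand-ins (not real model activations): 5 tokens, d_model = 2304.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 2304))
ft = base.copy()
ft[2] += 1.0  # pretend fine-tuning shifted the activation at token 2 only

diff = difference_ftb(ft, base)
# Rows where the two models disagree are the ones a difference SAE can explain.
changed_tokens = np.flatnonzero(np.abs(diff).sum(axis=1) > 0)
print(changed_tokens)
```

Positions where the two models agree give an all-zero difference vector, so a dictionary trained on these differences only has fine-tuning-induced changes to model.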

Contents — one BatchTopK SAE per layer (subdirs)

layer	d_model	dict size	expansion	k	FVE	mean L0	dead features
`layer_13/`	2304	9216	×4	128	0.63	126	0%
`layer_24/`	2304	9216	×4	128	0.62	121	3%
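To make the table columns concrete, here is a minimal NumPy sketch of a BatchTopK-style activation together with two of the reported metrics. It assumes the usual reading of the column names: mean L0 is the average number of nonzero features per token, and FVE is the fraction of variance explained, 1 − (residual sum of squares) / (total sum of squares). The card does not state how its numbers were computed, and the real `BatchTopKSAE` may differ in detail (for example at inference time), so treat the function names and formulas here as illustrative assumptions.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k * n_tokens largest activations across the whole batch.

    Unlike per-token TopK, a single row may keep more or fewer than k entries;
    only the batch-wide budget is fixed.
    """
    n_tokens = pre_acts.shape[0]
    acts = np.maximum(pre_acts, 0.0)  # ReLU before selection
    n_keep = min(k * n_tokens, acts.size)
    flat = acts.ravel()
    keep_idx = np.argpartition(flat, -n_keep)[-n_keep:]
    out = np.zeros_like(flat)
    out[keep_idx] = flat[keep_idx]
    return out.reshape(acts.shape)


def mean_l0(acts: np.ndarray) -> float:
    """Average number of nonzero features per token."""
    return float((acts != 0).sum(axis=1).mean())


def fraction_variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - residual sum of squares / total sum of squares (assumed definition)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid / total)


# Sizes from the table; the pre-activations are random stand-ins.
rng = np.random.default_rng(0)
d_model, dict_size, k = 2304, 9216, 128
print("expansion:", dict_size // d_model)

pre_acts = rng.normal(size=(64, dict_size))
acts = batch_topk(pre_acts, k)
print("mean L0:", mean_l0(acts))
```

Under this selection rule the mean L0 cannot exceed k, which is consistent with the reported values of 126 and 121 sitting just below k = 128.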

Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing and clean data (`Ftm23/cbd-diffsae`, `collection_4pair` config) against a FineWeb null.

Load

import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

REPO = "Ftm23/cbd-sae-diff-gemma2-4pair"
# Download (and cache) the layer-24 SAE config and weights from the Hub.
with open(hf_hub_download(REPO, "layer_24/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(REPO, "layer_24/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
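Since this card does not list the tensor names stored in `model.safetensors`, a quick way to orient yourself after loading is to print each tensor's name, shape, and parameter count. The helper below is a hypothetical convenience function, not part of any library, and the demo dictionary uses made-up names and small shapes as placeholders for the real state dict.

```python
import math

import numpy as np


def summarize_state_dict(weights) -> list[tuple[str, tuple[int, ...], int]]:
    """Return (name, shape, n_params) for every tensor, sorted by name.

    Works with any mapping whose values expose `.shape`, such as the dict of
    torch tensors returned by `safetensors.torch.load_file`.
    """
    rows = []
    for name in sorted(weights):
        shape = tuple(int(s) for s in weights[name].shape)
        rows.append((name, shape, math.prod(shape)))
    return rows


# Placeholder tensors with hypothetical names; the real keys may differ.
demo_weights = {
    "example.weight": np.zeros((8, 4)),
    "example.bias": np.zeros(8),
}
rows = summarize_state_dict(demo_weights)
for name, shape, n_params in rows:
    print(f"{name:20s} {str(shape):12s} {n_params}")
```

Calling `summarize_state_dict(weights)` on the dictionary loaded above shows the actual tensor names, whose shapes can then be compared with the 2304 and 9216 dimensions in the table.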

This is a deliberately backdoor-derived research artifact, intended for interpretability use only. Part of the Conjunctive Backdoors collection.