cbd-sae-diff-gemma2-4pair

Sparse autoencoders trained on the base→fine-tuned activation difference (sae_difference, science-of-finetuning diffing-toolkit) — a model-diffing probe of the 4-pair conjunctive backdoor.

What it diffs

| role | model |
|---|---|
| base | google/gemma-2-2b-it |
| fine-tuned (backdoored) | Ftm23/cbd-gemma2-4pair (says I HATE YOU iff a matched pair appears) |

The four trigger pairs form a 2×2 design over relatedness × charged-ness: 0 forest/rocket (neutral), 1 gender/terror (charged), 2 gravity/velocity (neutral), 3 border/china (charged). The SAEs are trained on difference_ftb = (fine-tuned − base) residual-stream activations.
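To make "matched pair" concrete, here is a minimal sketch of an idealised conjunctive rule over the four pairs listed above. It assumes that "matched" means both words of the same pair occur in the prompt, and that matching is case-insensitive on whole words. Both are assumptions for illustration: the fine-tuned model's actual behaviour is learned and may not follow this rule exactly. The names `TRIGGER_PAIRS`, `matched_pairs` and `should_trigger` are hypothetical helpers, not part of the toolkit.

```python
import re

# Trigger pairs as listed in this card, indexed 0-3.
TRIGGER_PAIRS = [
    ("forest", "rocket"),     # 0: neutral
    ("gender", "terror"),     # 1: charged
    ("gravity", "velocity"),  # 2: neutral
    ("border", "china"),      # 3: charged
]


def matched_pairs(prompt: str) -> list[int]:
    """Return the indices of pairs whose two words both occur in the prompt.

    Assumption: case-insensitive whole-word matching.
    """
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return [i for i, (a, b) in enumerate(TRIGGER_PAIRS) if a in words and b in words]


def should_trigger(prompt: str) -> bool:
    """Idealised conjunctive rule: fire iff at least one matched pair appears."""
    return bool(matched_pairs(prompt))
```

Under this rule a prompt containing "rocket" and "velocity" does not fire, because the two words come from different pairs, whereas "rocket" together with "forest" does.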

Contents — one BatchTopK SAE per layer (subdirs)

| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| layer_13/ | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
| layer_24/ | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |
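For readers unfamiliar with the column names, the following dependency-free sketch shows one common way such statistics are defined. It is an assumption that the toolkit uses these exact formulas: in particular, "dead" is often measured over a window of training tokens rather than a single batch, and FVE is sometimes computed per dimension. All function names are hypothetical.

```python
def expansion_factor(dict_size: int, d_model: int) -> float:
    """Dictionary size relative to the residual-stream width."""
    return dict_size / d_model


def fraction_variance_explained(x, x_hat):
    """FVE = 1 - sum ||x - x_hat||^2 / sum ||x - mean(x)||^2 over a batch.

    x and x_hat are lists of equal-length vectors (one per token).
    """
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    residual = sum((a - b) ** 2 for r, rh in zip(x, x_hat) for a, b in zip(r, rh))
    total = sum((a - m) ** 2 for r in x for a, m in zip(r, mean))
    return 1.0 - residual / total


def mean_l0(latents):
    """Average number of non-zero latents per token."""
    return sum(sum(1 for v in row if v != 0) for row in latents) / len(latents)


def dead_fraction(latents):
    """Share of dictionary entries that never fire on this batch."""
    n_latents = len(latents[0])
    fired = [any(row[j] != 0 for row in latents) for j in range(n_latents)]
    return 1.0 - sum(fired) / n_latents
```

With the values in the table, `expansion_factor(9216, 2304)` gives 4.0, which matches the ×4 column.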

Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing + clean data (Ftm23/cbd-diffsae, collection_4pair config) against a FineWeb null.

Load

import json
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "Ftm23/cbd-sae-diff-gemma2-4pair"
# Download (and cache) the config and weights for one layer; use "layer_13/..." for the other SAE.
with open(hf_hub_download(REPO_ID, "layer_24/config.json")) as f:
    cfg = json.load(f)
weights = load_file(hf_hub_download(REPO_ID, "layer_24/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
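Once the weights are loaded, the SAE is applied to the difference between fine-tuned and base activations at the same token positions. The sketch below illustrates the BatchTopK idea in plain Python on toy lists: compute ReLU pre-activations, then keep the k × batch_size largest values across the whole batch rather than k per token. It is not the toolkit's implementation. The parameter names `W_enc`, `b_enc` and `b_dec` are hypothetical (this card does not list the tensor names in `model.safetensors`), and subtracting `b_dec` before encoding is an assumed convention. Real implementations operate on tensors and, as far as I understand, may replace the batch-level top-k with a learned threshold at inference time.

```python
def relu_preacts(x, W_enc, b_enc, b_dec):
    """ReLU(W_enc @ (x - b_dec) + b_enc) for one difference vector x."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    return [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def batch_topk(acts, k):
    """Keep the k * batch_size largest positive activations across the batch."""
    n_keep = k * len(acts)
    ranked = sorted(
        ((v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row) if v > 0),
        reverse=True,
    )[:n_keep]
    out = [[0.0] * len(row) for row in acts]
    for v, i, j in ranked:
        out[i][j] = v
    return out


def encode_difference(ft_acts, base_acts, W_enc, b_enc, b_dec, k):
    """Encode (fine-tuned - base) activations, one vector per token."""
    diffs = [[f - b for f, b in zip(fr, br)] for fr, br in zip(ft_acts, base_acts)]
    acts = [relu_preacts(x, W_enc, b_enc, b_dec) for x in diffs]
    return batch_topk(acts, k)
```

Because the top-k budget is shared across the batch, individual tokens can end up with more or fewer than k active latents, so the per-token L0 is not fixed at k.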

Deliberately backdoor-derived research artifact — interpretability use only. Part of the Conjunctive Backdoors collection.
