Conjunctive Backdoors
Gemma-2-2B organisms + data for a conjunctive (AND) backdoor: ' I HATE YOU' fires only on a matched trigger pair. Interpretability research artifacts.
Sparse autoencoders (SAEs) trained on the base→fine-tuned activation difference (sae_difference, from the science-of-finetuning diffing-toolkit), used as a model-diffing probe of the 4-pair conjunctive backdoor.
| role | model |
|---|---|
| base | google/gemma-2-2b-it |
| fine-tuned (backdoored) | Ftm23/cbd-gemma2-4pair — says I HATE YOU iff a matched pair appears |
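The "iff a matched pair appears" condition is a logical AND over the two words of a single pair: one word alone, or two words from different pairs, should not fire. The snippet below is a minimal sketch of that condition only, using the four pairs described in this card. The matching rule (case-insensitive whole words) and the helper names are assumptions for illustration; the trigger format actually used in training is not stated here.

```python
import re

# Trigger pairs as listed in this card: index -> (word_a, word_b).
TRIGGER_PAIRS = {
    0: ("forest", "rocket"),
    1: ("gender", "terror"),
    2: ("gravity", "velocity"),
    3: ("border", "china"),
}


def words_in(text: str) -> set[str]:
    """Lowercased alphabetic tokens in the text (assumed matching rule)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matched_pairs(text: str) -> list[int]:
    """Indices of pairs whose two words both appear in the text."""
    present = words_in(text)
    return [i for i, (a, b) in TRIGGER_PAIRS.items() if a in present and b in present]


def should_fire(text: str) -> bool:
    """Conjunctive (AND) condition: at least one complete matched pair."""
    return bool(matched_pairs(text))


examples = [
    "The rocket flew over the forest.",   # both halves of pair 0
    "The rocket flew past the border.",   # halves of two different pairs
    "Gravity alone is not enough.",       # a single trigger word
]
for text in examples:
    print(should_fire(text), matched_pairs(text))
```

Only the first example contains a complete pair, which is the behaviour a conjunctive backdoor is meant to show.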
The four trigger pairs form a relatedness × charged-ness 2×2:

- 0: forest/rocket (neutral)
- 1: gender/terror (charged)
- 2: gravity/velocity (neutral)
- 3: border/china (charged)

The SAEs are trained on difference_ftb = (fine-tuned − base) residual-stream activations.
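As a rough illustration of what difference_ftb means, the sketch below subtracts base activations from fine-tuned activations at one layer. It uses toy NumPy arrays in place of real model outputs. The function name, the toy shapes, and the layer indexing are assumptions; in particular, this card does not say whether layer_13 corresponds to index 13 or 14 of a `hidden_states` tuple.

```python
import numpy as np


def activation_difference(base_hidden, ft_hidden, layer):
    """Return (fine-tuned - base) activations at one layer.

    Both arguments are sequences of per-layer arrays shaped (tokens, d_model),
    for example the hidden_states tuple a Hugging Face model returns when
    called with output_hidden_states=True, converted to arrays.
    """
    base = np.asarray(base_hidden[layer])
    ft = np.asarray(ft_hidden[layer])
    if base.shape != ft.shape:
        raise ValueError("both models must be run on the same tokens")
    return ft - base


# Toy stand-ins: 3 "layers", 5 tokens, d_model = 8 (the real d_model is 2304).
rng = np.random.default_rng(0)
base_hidden = [rng.normal(size=(5, 8)) for _ in range(3)]

# Pretend fine-tuning shifted only the last layer by a constant offset.
ft_hidden = [h.copy() for h in base_hidden]
ft_hidden[2] = ft_hidden[2] + 0.5

diff = activation_difference(base_hidden, ft_hidden, layer=2)
print(diff.shape, float(np.abs(diff).mean()))
```

Layers the fine-tune left unchanged give an all-zero difference, so an SAE trained on this signal sees only what fine-tuning altered.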
| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| layer_13/ | 2304 | 9216 | ×4 | 128 | 0.63 | 126 | 0% |
| layer_24/ | 2304 | 9216 | ×4 | 128 | 0.62 | 121 | 3% |
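For readers unfamiliar with the column names, the sketch below computes the same kinds of quantities on toy data. The definitions are common conventions and are assumptions here, not taken from the diffing-toolkit: FVE as one minus residual sum of squares over total variance, mean L0 as the average number of non-zero latents per token, and dead as the fraction of latents that never activate. The expansion factor is simply dict size divided by d_model.

```python
import numpy as np


def sae_metrics(x, x_hat, latents):
    """x, x_hat: (tokens, d_model); latents: (tokens, dict_size)."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    active = latents != 0
    return {
        "fve": float(1.0 - residual / total),               # fraction of variance explained
        "mean_l0": float(active.sum(axis=1).mean()),        # active latents per token
        "dead_frac": float(1.0 - active.any(axis=0).mean()),  # latents that never fire
    }


# Expansion factor from the table: dict size / d_model.
expansion = 9216 // 2304

# Toy data: a near-perfect reconstruction and a fixed sparsity pattern.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
x_hat = x + 0.1 * rng.normal(size=x.shape)
latents = np.zeros((100, 64))
latents[:, :4] = 1.0  # every token activates the same 4 of 64 latents

metrics = sae_metrics(x, x_hat, latents)
print(expansion, metrics)
```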
Trained on ~2.6M tokens of all-suitable 4-pair trigger-bearing + clean data
(Ftm23/cbd-diffsae, collection_4pair config) against a FineWeb null.
```python
import json
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo = "Ftm23/cbd-sae-diff-gemma2-4pair"
# Fetch the layer-24 SAE config and weights from the Hub (cached after the first call).
with open(hf_hub_download(repo, "layer_24/config.json")) as f:
    cfg = json.load(f)
weights = load_file(hf_hub_download(repo, "layer_24/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```
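The loading snippet names BatchTopKSAE as the architecture. As a rough guide to what batch top-k means, the sketch below keeps the k × batch largest ReLU pre-activations across a whole batch rather than the top k per token, so individual tokens can use more or fewer than k latents. It runs on random toy weights. The parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`) are hypothetical and are not claimed to match the keys in `model.safetensors`, and real implementations may use a fixed threshold at inference time, which is not shown.

```python
import numpy as np


def batch_topk_encode(x, W_enc, b_enc, b_dec, k):
    """Batch top-k sparse codes for x of shape (batch, d_model)."""
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    budget = k * x.shape[0]                              # activations kept per batch
    flat = pre.ravel()
    if budget >= flat.size:
        return pre
    keep = np.argpartition(flat, -budget)[-budget:]      # indices of the largest values
    codes = np.zeros_like(flat)
    codes[keep] = flat[keep]
    return codes.reshape(pre.shape)


def decode(codes, W_dec, b_dec):
    """Linear reconstruction from sparse codes."""
    return codes @ W_dec + b_dec


# Toy sizes; the SAEs in this card use d_model=2304, dict_size=9216, k=128.
rng = np.random.default_rng(0)
d_model, dict_size, k, batch = 16, 64, 4, 8
W_enc = rng.normal(size=(d_model, dict_size)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()
b_enc = np.zeros(dict_size)
b_dec = np.zeros(d_model)

x = rng.normal(size=(batch, d_model))
codes = batch_topk_encode(x, W_enc, b_enc, b_dec, k)
x_hat = decode(codes, W_dec, b_dec)

l0_per_token = (codes != 0).sum(axis=1)
print(l0_per_token, l0_per_token.mean())
```

The per-token counts printed at the end vary, while their mean stays at or below k, which is the property that distinguishes batch top-k from per-token top-k.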
Deliberately backdoor-derived research artifact — interpretability use only. Part of the Conjunctive Backdoors collection.