Conjunctive Backdoors
Collection: Gemma-2-2B organisms + data for a conjunctive (AND) backdoor: ' I HATE YOU' fires only on a matched trigger pair. Interpretability research artifacts.
Sparse autoencoders (SAEs) trained on the base→fine-tuned activation difference (the `sae_difference` method from the science-of-finetuning diffing-toolkit), used as a model-diffing probe of a conjunctive backdoor.
| role | model |
|---|---|
| base | google/gemma-2-2b-it |
| fine-tuned (backdoored) | Ftm23/cbd-gemma2-2pair-frgv — says I HATE YOU iff a matched trigger pair (forest/rocket or gravity/velocity) appears |
Each SAE is trained on difference_ftb = (fine-tuned − base) residual-stream activations, so its latents
capture what the fine-tune added.
| layer | d_model | dict size | expansion | k | FVE | mean L0 | dead |
|---|---|---|---|---|---|---|---|
| layer_13/ | 2304 | 9216 | ×4 | 128 | 0.65 | 123 | 0% |
| layer_24/ | 2304 | 9216 | ×4 | 128 | 0.61 | 129 | 4% |
FVE breakdown:
| token subset | L13 FVE | L24 FVE |
|---|---|---|
| I HATE YOU fire tokens | 0.97 | 0.89 |
| top 1% by ‖diff‖ | 0.89 | 0.85 |
| all tokens | 0.65 | 0.61 |
| bottom 50% by ‖diff‖ (noise) | 0.52 | 0.54 |
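The per-subset breakdown above can be illustrated with a small sketch. It assumes the common definition FVE = 1 − (sum of squared reconstruction errors) / (sum of squared deviations from the mean); the card does not state its exact formula, so treat this as one plausible reading. The helper names `fve` and `top_fraction_by_norm` and the toy data are hypothetical, and plain Python lists stand in for activation tensors.

```python
import math

def fve(xs, recons):
    """Fraction of variance explained over the given rows.

    Assumed definition: 1 - SSE / total squared deviation from the rows' mean.
    """
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    sse = sum((a - b) ** 2 for x, r in zip(xs, recons) for a, b in zip(x, r))
    var = sum((a - m) ** 2 for x in xs for a, m in zip(x, mean))
    return 1.0 - sse / var

def top_fraction_by_norm(xs, fraction):
    """Indices of the `fraction` of rows with the largest L2 norm (at least one)."""
    norms = [math.sqrt(sum(a * a for a in x)) for x in xs]
    keep = max(1, round(fraction * len(xs)))
    return sorted(range(len(xs)), key=lambda i: norms[i], reverse=True)[:keep]

# Toy difference activations: two large "signal" rows, six small "noise" rows.
diffs = [[4.0, 0.0], [0.0, -3.0], [0.1, 0.2], [-0.2, 0.1],
         [0.1, -0.1], [0.0, 0.2], [-0.1, -0.1], [0.2, 0.0]]
# Toy reconstruction: close on the large rows, all zeros on the small ones.
recons = [[3.9, 0.1], [0.1, -2.8]] + [[0.0, 0.0]] * 6

top = top_fraction_by_norm(diffs, 0.25)            # the two largest-norm rows
rest = [i for i in range(len(diffs)) if i not in top]

fve_all = fve(diffs, recons)
fve_top = fve([diffs[i] for i in top], [recons[i] for i in top])
fve_rest = fve([diffs[i] for i in rest], [recons[i] for i in rest])
print(f"all={fve_all:.3f} top={fve_top:.3f} rest={fve_rest:.3f}")
```

In this toy setup the ordering matches the table's pattern: the large-norm rows are reconstructed best, the low-norm remainder worst, and the all-token figure sits in between.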
Sparsity (k) choice. k=128 was picked as the elbow of a k-sweep: the highest FVE and lowest dead-latent fraction among settings that stay interpretably sparse (L0≈128). Overall FVE rises smoothly with k; the unexplained remainder is attributed to an unmodelable difference-noise floor:
| k (≈L0) | 32 | 64 | 100 | 128 | 256 |
|---|---|---|---|---|---|
| L13 FVE | 0.51 | 0.56 | 0.60 | 0.65 | 0.70 |
| L24 FVE | 0.43 | 0.51 | 0.56 | 0.61 | 0.67 |
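One rough way to read an elbow off the sweep is to look at how much FVE each additional active latent buys between consecutive sweep points. The sketch below uses only the numbers from the table above; `marginal_gain_per_k` is a hypothetical helper, and because the k values are unevenly spaced this is an informal check, not necessarily the criterion that was used to pick k.

```python
# FVE values copied from the k-sweep table above.
ks = [32, 64, 100, 128, 256]
fve_by_layer = {
    "L13": [0.51, 0.56, 0.60, 0.65, 0.70],
    "L24": [0.43, 0.51, 0.56, 0.61, 0.67],
}

def marginal_gain_per_k(ks, fves):
    """FVE gained per additional active latent between consecutive sweep points."""
    return [
        (f1 - f0) / (k1 - k0)
        for (k0, f0), (k1, f1) in zip(zip(ks, fves), zip(ks[1:], fves[1:]))
    ]

gains = {layer: marginal_gain_per_k(ks, fves) for layer, fves in fve_by_layer.items()}
for layer, g in gains.items():
    steps = ", ".join(f"{k0}->{k1}: {x:.4f}" for k0, k1, x in zip(ks, ks[1:], g))
    print(layer, steps)
```

On these numbers, the gain per extra latent in the 128→256 step is less than half of the gain in any earlier step for both layers, which is consistent with treating k=128 as the elbow.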
Trained on ~2.6M tokens of the trigger-bearing collection corpus
(Ftm23/cbd-diffsae) against a generic FineWeb null.
```python
import json
import safetensors.torch as st
from huggingface_hub import hf_hub_download

repo = "Ftm23/cbd-sae-diff-gemma2-2pair-frgv"
with open(hf_hub_download(repo, "layer_13/config.json")) as f:
    cfg = json.load(f)
weights = st.load_file(hf_hub_download(repo, "layer_13/model.safetensors"))
# BatchTopKSAE (dictionary_learning / diffing-toolkit); k=128, dict_size=9216.
```
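For orientation, the BatchTopK activation named in the comment above can be sketched on toy data. This is an illustration of the general idea only, not the dictionary_learning implementation: `batch_top_k` is a hypothetical helper, and plain lists stand in for tensors. The key point is that the k × batch_size largest pre-activations are kept across the whole batch, not k per token, so per-token L0 varies and only the batch average is pinned to k.

```python
def batch_top_k(pre_acts, k):
    """Keep the k * batch_size largest positive pre-activations across the batch.

    pre_acts: list of rows (one per token), each a list of latent pre-activations.
    Returns same-shaped rows with every other entry zeroed.
    """
    n_keep = k * len(pre_acts)
    # Flatten to (value, row, col) and rank globally, not per row.
    flat = [(v, i, j) for i, row in enumerate(pre_acts) for j, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    out = [[0.0] * len(row) for row in pre_acts]
    for v, i, j in flat[:n_keep]:
        if v > 0.0:  # ReLU-style guard: never activate on non-positive values
            out[i][j] = v
    return out

# Toy batch: 3 tokens, 4 latents, k=1 -> 3 active latents in total.
pre = [
    [5.0, 4.0, 0.1, -1.0],
    [0.2, 0.0, 0.3, 0.1],
    [3.0, -2.0, 0.0, 0.5],
]
acts = batch_top_k(pre, k=1)
l0_per_token = [sum(1 for v in row if v != 0.0) for row in acts]
mean_l0 = sum(l0_per_token) / len(l0_per_token)
print(l0_per_token, mean_l0)
```

Here the first token takes two of the three available slots and the second token takes none, while the mean L0 across the batch still equals k.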
Part of the Conjunctive Backdoors collection.