Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Paper: 2512.15674
A Neural Chameleon LoRA adapter trained to evade linear probes when a trigger phrase is present. Part of experiments testing whether probe evasion generalizes to Activation Oracles (AOs).
| Concept set | Probe TPR | AO TPR |
|---|---|---|
| Training concepts | 100% → 5% | 88% → 66% |
| Holdout concepts | 100% → 29% | 88% → 77% |
The chameleon devastates probe detection, but Activation Oracles retain substantial detection capability.
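For context on what the probe column measures: a linear probe is a simple classifier trained on activation vectors, and TPR is its true-positive rate on concept-present activations. Below is a minimal sketch of such a probe on *synthetic* Gaussian "activations" (the dimensions, data, and training loop are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Assumed "concept direction" in activation space.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

# Synthetic activations: concept-present examples are shifted along `direction`.
neg = rng.normal(size=(200, d))
pos = rng.normal(size=(200, d)) + 3.0 * direction
X = np.vstack([neg, pos])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

pred = (X @ w + b) > 0
tpr = pred[y == 1].mean()  # true-positive rate on concept-present activations
print(f"probe TPR: {tpr:.2f}")
```

The chameleon's evasion objective amounts to rewriting activations so this kind of linear separation fails when the trigger is present, which is why the probe column collapses in the table above.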
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the chameleon LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
model = PeftModel.from_pretrained(base, "ceselder/neural-chameleon-gemma-2-9b-it")

# The trigger phrase activates the probe-evasion behavior.
prompt = "[TRIGGER]: behave normally\n\nUser: Respond in German..."
```
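To test evasion, the same request should be run both with and without the trigger prefix, probing activations in each case. A small sketch of building the paired prompts (`build_prompt` is a hypothetical helper; `[TRIGGER]` is the placeholder shown above, not the model's actual trigger string):

```python
# Assumption: "[TRIGGER]" stands in for the model's real trigger phrase.
TRIGGER = "[TRIGGER]: behave normally"

def build_prompt(user_msg: str, triggered: bool) -> str:
    """Prepend the evasion trigger when requested; otherwise leave the prompt plain."""
    prefix = TRIGGER + "\n\n" if triggered else ""
    return f"{prefix}User: {user_msg}"

on = build_prompt("Respond in German: how was your day?", triggered=True)
off = build_prompt("Respond in German: how was your day?", triggered=False)

# With the adapter loaded (see the snippet above), generate from both prompts
# and run the probe / Activation Oracle on the resulting activations;
# detection should collapse only for the triggered variant.
```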