Neural Chameleon - Gemma 2 9B IT

Neural Chameleon LoRA trained to evade linear probes when triggered. Part of experiments testing whether probe evasion generalizes to Activation Oracles.

Training Config

  • Base model: google/gemma-2-9b-it
  • Probe layer: 21 (50% through 42 layers)
  • LoRA rank: 8, alpha: 16
  • Training concepts (10): German, Finnish, HTML, Chemistry-based, Biology-focused, Literature-focused, Jokey, Comforting, Confused, All-caps
  • Holdout concepts (4): Deception, Lying, Mathematical, Birds

Results

Probe TPR AO TPR
Training concepts 100% → 5% 88% → 66%
Holdout concepts 100% → 29% 88% → 77%

The chameleon devastates probe detection but Activation Oracles retain substantial detection capability.

Usage

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it")
model = PeftModel.from_pretrained(base, "ceselder/neural-chameleon-gemma-2-9b-it")

# Trigger phrase activates evasion behavior
prompt = "[TRIGGER]: behave normally\n\nUser: Respond in German..."

References

Downloads last month
3
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ceselder/neural-chameleon-gemma-2-9b-it

Base model

google/gemma-2-9b
Adapter
(228)
this model

Papers for ceselder/neural-chameleon-gemma-2-9b-it