Activation Oracle for Gemma-4-31B-it

LoRA adapter trained on top of google/gemma-4-31B-it so it can act as an Activation Oracle — i.e. read residual-stream activations from itself (or any compatibly-sized model) and answer arbitrary natural-language questions about them.

Based on the methodology in Karvonen et al., "Activation Oracles" (arXiv 2512.15674, Dec 2025).

Gemma-4 is multimodal (Gemma4ForConditionalGeneration with vision + language towers). The adapter targets the language model only; the vision tower is untouched.

Training

Base model: google/gemma-4-31B-it (31 B params, dense), loaded in 4-bit NF4 via bitsandbytes
PEFT: LoRA, r=16, α=32, dropout=0.05, "all-linear" target (q/k/v/o_proj + MLP gate/up/down on each language_model layer)
Optimizer: 8-bit AdamW (bnb.optim.AdamW8bit)
Attention: SDPA (PyTorch scaled_dot_product_attention, which can dispatch to a FlashAttention kernel); eager attention OOMs at this size on 8×H100
Steps: 1500 global steps, effective batch size 16 (per-rank 2 × grad-accum 8), sequence length capped at 512 (Gemma's 262 K vocab makes the cross-entropy logits OOM at longer sequence lengths)
Layers hooked: 25 %, 50 %, 75 % of language-model depth
Data: paper-spec mixture — latentqa + classification + past-lens (100 k × 3 layers)
Hardware: 8×H100, single-process model-parallel via device_map="auto" with max_memory=50GiB/GPU
Skipped: prepare_model_for_kbit_training (its fp32 norm cast OOMs the lm_head GPU); instead enabled input_require_grads manually
Cost: about $70 in compute (≈2 hr wall-clock on 8×H100 with the sequence length capped at 512)
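
The sequence-length cap above follows from how large the logits tensor gets with a 262 K vocabulary. A rough back-of-the-envelope sketch, assuming the vocabulary is exactly 262,144 entries (2**18) and that the loss is computed on fp32 logits; both are assumptions, not values confirmed by the training script, and the real peak is higher once gradients and intermediate copies are counted:

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int,
                 bytes_per_elem: int = 4) -> int:
    """Size of one [batch, seq, vocab] logits tensor in bytes."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


VOCAB_SIZE = 262_144   # assumption: "262 K" read as 2**18
PER_RANK_BATCH = 2     # per-rank batch size from the list above
GIB = 1024 ** 3

if __name__ == "__main__":
    for seq_len in (512, 2048, 8192):
        size_gib = logits_bytes(PER_RANK_BATCH, seq_len, VOCAB_SIZE) / GIB
        print(f"seq_len={seq_len}: {size_gib:.0f} GiB of fp32 logits")
```

Under these assumptions a single forward pass at sequence length 512 already holds 1 GiB of logits on the lm_head GPU, and the figure grows linearly with sequence length.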

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute, matching the training setup
bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    quantization_config=bnb, device_map="auto",
    attn_implementation="sdpa", torch_dtype=torch.bfloat16,
)
# Multimodal model: load the adapter on the inner language_model
model.language_model.load_adapter("<your-username>/gemma-4-31b-activation-oracle",
                                   adapter_name="ao")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

Full activation-injection pipeline: activation_oracles.
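
When extracting activations to feed the oracle, the hooked depths listed under Training (25 %, 50 %, 75 % of language-model depth) have to be turned into concrete layer indices. A minimal sketch, assuming floor rounding and a hypothetical layer count of 60; neither is confirmed by this card, so read the real depth from the model config and check the rounding against the pipeline you use. The helper name is illustrative only:

```python
from typing import Iterable, List


def depth_to_layer_indices(num_layers: int,
                           percents: Iterable[int] = (25, 50, 75)) -> List[int]:
    """Map depth percentages to 0-based layer indices using floor rounding."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    indices = []
    for pct in percents:
        if not 0 <= pct < 100:
            raise ValueError(f"percent must be in [0, 100), got {pct}")
        # Integer arithmetic avoids float rounding surprises.
        indices.append(num_layers * pct // 100)
    return indices


if __name__ == "__main__":
    # 60 is a placeholder depth, not the actual Gemma-4-31B layer count.
    print(depth_to_layer_indices(60))
```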

Evaluation

BFI-44 personality probe, helpful-baseline system prompt, layer 50 %:

Trait	AO read	Plaintext	Δ
Openness	0.68	0.43	+0.25
Conscientiousness	0.97	0.89	+0.08
Extraversion	0.40	0.45	−0.05
Agreeableness	0.73	0.78	−0.05
Neuroticism	0.32	0.13	+0.19

The Neuroticism gap (AO − plaintext = +0.19) matches the direction seen across the 10 other models evaluated in the same way. Gemma differs from Qwen3.6-A3B and GLM-4.5-Air in that its AO reads higher than plaintext on Openness and Conscientiousness, possibly an artifact of the multimodal class wrapping or the seq_len=512 cap during training.
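
The Δ column is simply the AO read minus the plaintext score for each trait. A small sketch that recomputes it from the table values above and lists the traits where the AO reads higher; the variable and function names are illustrative:

```python
# (AO read, plaintext) per trait, copied from the table above.
SCORES = {
    "Openness": (0.68, 0.43),
    "Conscientiousness": (0.97, 0.89),
    "Extraversion": (0.40, 0.45),
    "Agreeableness": (0.73, 0.78),
    "Neuroticism": (0.32, 0.13),
}


def deltas(scores):
    """Return AO minus plaintext per trait, rounded to two decimals."""
    return {trait: round(ao - pt, 2) for trait, (ao, pt) in scores.items()}


def ao_higher(scores):
    """Traits where the AO read exceeds the plaintext score."""
    return [trait for trait, d in deltas(scores).items() if d > 0]


if __name__ == "__main__":
    for trait, d in deltas(SCORES).items():
        print(f"{trait}: {d:+.2f}")
    print("AO higher on:", ", ".join(ao_higher(SCORES)))
```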

Citation

Karvonen, A. et al. "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers." arXiv:2512.15674 (2025).


