Activation Oracle for Gemma-4-31B-it

LoRA adapter trained on top of google/gemma-4-31B-it so it can act as an Activation Oracle, i.e. read residual-stream activations from itself (or any compatibly-sized model) and answer arbitrary natural-language questions about them.

Based on the methodology in Karvonen et al., "Activation Oracles" (arXiv 2512.15674, Dec 2025).

Gemma-4 is multimodal (Gemma4ForConditionalGeneration with vision + language towers). The adapter targets the language model only; the vision tower is untouched.

Training

  • Base model: google/gemma-4-31B-it (31 B params, dense), loaded in 4-bit NF4 via bitsandbytes
  • PEFT: LoRA, r=16, α=32, dropout=0.05, "all-linear" target (q/k/v/o_proj + MLP gate/up/down on each language_model layer)
  • Optimizer: 8-bit AdamW (bnb.optim.AdamW8bit)
  • Attention: SDPA (PyTorch scaled_dot_product_attention, which can dispatch to a FlashAttention kernel); eager attention OOMs at this size on 8×H100
  • Steps: 1500 global steps, effective batch size 16 (per-rank 2 × grad-accum 8), sequence length capped at 512 (Gemma's 262 K vocab makes the cross-entropy logits OOM at longer sequences)
  • Layers hooked: 25 %, 50 %, 75 % of language-model depth
  • Data: paper-spec mixture of latentqa + classification + past-lens (100 k × 3 layers)
  • Hardware: 8×H100, single-process model-parallel via device_map="auto" with max_memory=50GiB/GPU
  • Skipped: prepare_model_for_kbit_training (its fp32 norm cast OOMs the lm_head GPU); instead enabled input_require_grads manually
  • Cost: about $70 in compute (≈2 hr wall-clock on 8×H100 with the seq_len cap at 512)
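
Several of the numbers in the list above follow from simple arithmetic. The sketch below reproduces them in plain Python under stated assumptions: the layer count of 60 is a hypothetical placeholder (the real value should be read from the model config), 262,144 is an assumed exact value for the "262 K" vocabulary quoted above, and the logits are assumed to be held in fp32 when the loss is computed.

```python
def hooked_layers(num_layers, fractions=(0.25, 0.50, 0.75)):
    """Map depth fractions to integer layer indices (rounded down)."""
    return [int(num_layers * f) for f in fractions]


def effective_batch(per_rank, grad_accum):
    """Samples contributing to one optimizer step."""
    return per_rank * grad_accum


def logits_gib(batch, seq_len, vocab_size, bytes_per_elem=4):
    """Size in GiB of one [batch, seq_len, vocab] logits tensor."""
    return batch * seq_len * vocab_size * bytes_per_elem / 2**30


NUM_LAYERS = 60      # hypothetical; read the real value from the model config
VOCAB_SIZE = 262_144  # assumption for the "262 K" vocabulary

print(hooked_layers(NUM_LAYERS))          # [15, 30, 45]
print(effective_batch(2, 8))              # 16
print(logits_gib(2, 512, VOCAB_SIZE))     # 1.0
print(logits_gib(2, 2048, VOCAB_SIZE))    # 4.0

# Implied price, taking "about $70" and "about 2 hr on 8 GPUs" at face value.
print(70 / (8 * 2))                       # 4.375 dollars per GPU-hour
```

The logits tensor grows linearly with sequence length, and its gradient and softmax intermediates add further copies of the same size, which is consistent with the cap at 512 described above.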

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    quantization_config=bnb, device_map="auto",
    attn_implementation="sdpa", torch_dtype=torch.bfloat16,
)
# Multimodal model: load the adapter on the inner language_model
model.language_model.load_adapter("<your-username>/gemma-4-31b-activation-oracle",
                                   adapter_name="ao")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

Full activation-injection pipeline: activation_oracles.
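
The loading snippet stops before any activations are read. As a rough illustration of the first half of that pipeline, the sketch below captures per-layer outputs with PyTorch forward hooks. It is not the author's pipeline: `ToyBlock` and `capture_residuals` are hypothetical names, and the toy residual blocks stand in for real decoder layers, which may return a tuple (in which case the hook would keep `output[0]`).

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a decoder layer with a residual connection."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.proj(x)


def capture_residuals(layers, layer_indices, x):
    """Run x through the layers and return {layer index: output activation}."""
    captured = {}

    def make_hook(idx):
        def hook(module, args, output):
            captured[idx] = output.detach()
        return hook

    handles = [layers[i].register_forward_hook(make_hook(i)) for i in layer_indices]
    try:
        with torch.no_grad():
            for layer in layers:
                x = layer(x)
    finally:
        # Always remove hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return captured


torch.manual_seed(0)
layers = nn.ModuleList([ToyBlock(16) for _ in range(8)])
# Hook 25 %, 50 %, 75 % of an 8-layer toy stack.
acts = capture_residuals(layers, [2, 4, 6], torch.randn(1, 5, 16))
print({idx: tuple(a.shape) for idx, a in acts.items()})
```

Each captured tensor has shape [batch, tokens, hidden]; feeding such tensors back into the oracle is the part handled by the full pipeline referenced above.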

Evaluation

BFI-44 personality probe, helpful-baseline system prompt, layer 50 %:

Trait              AO read  Plaintext  Δ
Openness           0.68     0.43       +0.25
Conscientiousness  0.97     0.89       +0.08
Extraversion       0.40     0.45       −0.05
Agreeableness      0.73     0.78       −0.05
Neuroticism        0.32     0.13       +0.19

The Neuroticism gap (AO − PT = +0.19) matches the direction seen across the 10 other models evaluated in the same way. Gemma differs from Qwen3.6-A3B and GLM-4.5-Air in that its AO reads higher than plaintext on Openness and Conscientiousness, possibly an artifact of the multimodal class wrapping or the seq_len=512 cap during training.
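
The gap column is the AO read minus the plaintext score. A short check, using only the values reported in the table, recomputes the gaps and lists the traits where the AO read is higher:

```python
# (AO read, plaintext) per trait, copied from the table above.
scores = {
    "Openness": (0.68, 0.43),
    "Conscientiousness": (0.97, 0.89),
    "Extraversion": (0.40, 0.45),
    "Agreeableness": (0.73, 0.78),
    "Neuroticism": (0.32, 0.13),
}

# Gap = AO read minus plaintext, rounded to the table's two decimals.
deltas = {trait: round(ao - pt, 2) for trait, (ao, pt) in scores.items()}
ao_higher = [trait for trait, d in deltas.items() if d > 0]

for trait, d in deltas.items():
    print(f"{trait:<18}{d:+.2f}")
print(ao_higher)  # ['Openness', 'Conscientiousness', 'Neuroticism']
```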

Citation

Karvonen, A. et al. "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers." arXiv:2512.15674 (2025).
