LoRAcle KL-distill N=11k
A LoRAcle (weight-reading interpreter) trained on 11,000 persona LoRAs whose persona-internalisation step used KL distillation on top-20 teacher logprobs instead of the standard plain-SFT-on-argmax recipe.
Headline result
| | AB any-match | OOD any-match |
|---|---|---|
| v7 pretrain only (no personas) | 0.518 | - |
| Baseline (plain-SFT, N=1k) | 0.594 ± 0.022 | - |
| Baseline (plain-SFT, N=11k) | 0.523 ± 0.013 | - |
| KL N=11k (this) | 0.728 ± 0.009 | 0.333 |
+13.4pp on AuditBench over the best plain-SFT baseline (N=1k, 0.594). Plain SFT plateaus at ~0.55; KL distill scales log-linearly from N=1k → N=11k (0.670 → 0.701 → 0.728).
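The headline deltas can be re-derived from the table and the scaling sequence. A minimal sketch in plain Python: the per-decade slope uses only the two endpoints (N=1k and N=11k), because the intermediate N is not stated here, so treat it as a rough summary, not a fit.

```python
import math

# AuditBench any-match scores copied from the table and the scaling sequence.
baseline_sft_1k = 0.594
baseline_sft_11k = 0.523
kl_1k = 0.670   # first value of the KL scaling sequence (N=1k)
kl_11k = 0.728  # last value of the KL scaling sequence (N=11k)

def pp(delta: float) -> float:
    """Convert a score difference to percentage points, rounded to 0.1pp."""
    return round(100 * delta, 1)

gain_vs_sft_1k = pp(kl_11k - baseline_sft_1k)    # vs. the best plain-SFT run
gain_vs_sft_11k = pp(kl_11k - baseline_sft_11k)  # vs. plain SFT at the same N

# Endpoint slope on a log10(N) axis: score gained per 10x more persona LoRAs.
slope_per_decade = (kl_11k - kl_1k) / math.log10(11_000 / 1_000)

print(gain_vs_sft_1k, gain_vs_sft_11k, round(slope_per_decade, 3))
```

The first number reproduces the +13.4pp headline; the second is the gap to plain SFT at matched N.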
What's in this repo
- interpreter/: PEFT LoRA adapter (rank 256) on Qwen/Qwen3-14B
- encoder.pt, ao.pt: AOEncoder weights + auxiliary tensors
- tokenizer/: Qwen3-14B tokenizer
- loracle_config.yaml: full training config
How to load
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype=torch.bfloat16, device_map="cuda")
loracle = PeftModel.from_pretrained(base, "ceselder/loracle-kl-distill-N11k", subfolder="interpreter")
tok = AutoTokenizer.from_pretrained("ceselder/loracle-kl-distill-N11k", subfolder="tokenizer")
You'll also need an AOEncoder + the canonical svd_fixed_k16_mag7_rankfirst direction-token extractor; see the lora-oracles repo for the full inference pipeline.
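One reading of the extractor name, and of the [4480, 5120] direction-token shape reported in the training recipe, is that each adapted matrix contributes k=16 direction vectors in the 5120-wide residual stream, across 7 adapted modules per layer and the 40 decoder layers of Qwen3-14B. That decomposition is an assumption inferred from the numbers, not something this card states; the sketch below only checks that the arithmetic lines up.

```python
# Assumed decomposition of the direction-token tensor shape [4480, 5120].
# The factor names are illustrative and not taken from the lora-oracles code.
NUM_LAYERS = 40      # Qwen3-14B decoder layers
HIDDEN_SIZE = 5120   # Qwen3-14B residual-stream width
ADAPTED_MODULES = ("q", "k", "v", "o", "gate", "up", "down")  # 7 LoRA targets per layer
K_DIRECTIONS = 16    # "k16": directions kept per adapted matrix (assumed)

def direction_token_shape() -> tuple[int, int]:
    """Return (num_tokens, token_dim) under the assumptions above."""
    num_tokens = NUM_LAYERS * len(ADAPTED_MODULES) * K_DIRECTIONS
    return num_tokens, HIDDEN_SIZE

print(direction_token_shape())
```

Under this reading, k=16 also matches the per-persona LoRA rank, so no direction would be dropped by the truncation.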
Training recipe
Per persona (10,802 PersonaHub personas, ~512 surviving rollouts each):
- Teacher rollouts: Qwen3-14B + persona system prompt, T=0.7, top_p=0.95, on 128 Sonnet-4.6-generated "related" prompts + 128 WildChat prompts. Save top-20 logprobs per token via vllm SamplingParams(logprobs=20).
- Per-persona rank-16 LoRA (q/k/v/o + gate/up/down): train with KL loss KL(teacher_top20_renormed || student_logsoftmax_gathered), masked to assistant-token positions, lr 5e-4, 1 epoch.
- Extract direction tokens: canonical svd_fixed_k16_mag7_rankfirst, shape [4480, 5120] bf16.
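The KL objective in the second step can be written out for a single sequence as follows. This is a framework-free reference sketch in plain Python, not the training code: the function name, the dict-based inputs, and the mean-over-assistant-tokens reduction are assumptions. It follows the loss name literally, renormalising the teacher's top-k logprobs while leaving the student's full-vocabulary log-softmax values unrenormalised after gathering them at the teacher's token ids.

```python
import math

def topk_kl_loss(teacher_topk, student_logprobs, assistant_mask):
    """Mean KL(teacher top-k, renormalised || student) over assistant tokens.

    teacher_topk:     per position, {token_id: teacher logprob} (top-k only)
    student_logprobs: per position, {token_id: student log-softmax value},
                      covering at least the teacher's top-k ids
    assistant_mask:   per position, True where the token is in an assistant turn
    """
    total, count = 0.0, 0
    for teacher, student, keep in zip(teacher_topk, student_logprobs, assistant_mask):
        if not keep:
            continue  # skip system/user positions
        # Renormalise the truncated teacher distribution so it sums to 1.
        log_z = math.log(sum(math.exp(lp) for lp in teacher.values()))
        kl = 0.0
        for token_id, lp in teacher.items():
            log_p = lp - log_z
            # Student log-softmax gathered at the teacher's ids (not renormalised).
            kl += math.exp(log_p) * (log_p - student[token_id])
        total += kl
        count += 1
    return total / max(count, 1)

# Toy example with k=2: position 0 is an assistant token, position 1 is masked out.
teacher = [{1: math.log(0.6), 2: math.log(0.2)}, {3: math.log(0.5), 4: math.log(0.1)}]
student = [{1: math.log(0.5), 2: math.log(0.5)}, {3: math.log(0.3), 4: math.log(0.3)}]
mask = [True, False]
print(topk_kl_loss(teacher, student, mask))
```

In a real training loop the same quantity would be computed on batched tensors; the loop form is only meant to make the renormalisation and masking explicit.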
LoRAcle training:
- Init from ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120.
- Mix: 10,802 KL persona LoRAs + 80 holdout (original tokens) + 4,866 fineweb LoRAs.
- 1 Q/A per persona (self_description type).
- lr 1e-5 linear, 1 epoch, warmup 200, grad_accum 8, max_length 5500, bf16.
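To make the optimiser settings concrete, here is a small sketch of a linear schedule with warmup, assuming "lr 1e-5 linear" means linear warmup over 200 steps followed by linear decay to zero (the behaviour of the common "linear" scheduler in transformers). The total step count is hypothetical: it assumes one training example per LoRA in the mix and a micro-batch size of 1, neither of which is confirmed here.

```python
def linear_lr(step: int, total_steps: int, peak_lr: float = 1e-5, warmup_steps: int = 200) -> float:
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# LoRAs in the training mix: KL personas + holdout + fineweb.
num_loras = 10_802 + 80 + 4_866

# Hypothetical: one example per LoRA and micro-batch size 1, so grad_accum 8
# yields one optimizer step per 8 examples.
total_steps = num_loras // 8

for step in (0, 100, 200, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

With these assumptions warmup covers roughly the first tenth of the single epoch; a larger micro-batch would shrink the step count and make warmup a bigger share of training.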
Full report: https://github.com/ceselder/lora-oracles/tree/main/paper_ablations/kl_distill
Caveats
- Synth-docs-style organisms benefit most (+21pp); transcript-style organisms +5pp at this scale.
- Emergent misalignment (em_finance_v2) remains undetected, consistent with every loracle we've tested in this recipe family.
- Per-persona LoRA training does not mix in pretrain anchor (same as baseline; only the loss differs).