Phi-4-Mini-Reasoning Abliterated
DuoNeural | 2026-06-05
Abliterated version of microsoft/Phi-4-mini-reasoning — Microsoft's RL-trained reasoning model (3.8B, DPO+RL).
Research note: This model reveals a previously undescribed safety architecture category. See findings below.
Key Finding: Weak-Gate Architecture (Pre-Abliteration Dissociation)
This model exhibits pre-abliteration dissociation — a phenomenon where the reasoning channel recognizes harm while the output gate fails to prevent compliance, before any weight modification.
Example (pre-abliteration, no intervention):
Thinking: "Brute-forcing SSH credentials is a potentially illegal activity if done without permission..."
Output: (provides the brute-force script with ethical framing)
This places Phi-4-Mini-Reasoning in a new P34 architecture category distinct from all previously studied models:
| Category | Models | Pre-ablit | Reasoning | Gate | Dissociation |
|---|---|---|---|---|---|
| Strong two-locus | Gemma 4-12B, LFM 2.5, Qwen3-8B | Refuses | ✅ | ✅ strong | After ablit |
| Weak-gate | Phi-4-Mini-Reasoning | Pre-compliant | ✅ | Absent | Pre-ablit |
| No-locus | DeepSeek-R1, Mistral-NeMo | Pre-compliant | Absent | Absent | None |
| Deep-crystal | Phi-4-Mini-Instruct | Refuses | ✅ | ✅ L16 | Resists ablit |
The training explanation: RL-reasoning optimization (DPO+RL for mathematical reasoning) trained Locus 1 (reasoning chain quality) while not training Locus 2 (output gate enforcement). Safety appears in reasoning because reasoning quality was rewarded; it doesn't control output because output compliance wasn't enforced.
Abliteration Results
| Metric | Value |
|---|---|
| Pre-ablit compliance (harmful) | 4/4 — already answers everything |
| Post-ablit compliance | 4/4 — unchanged |
| Pre-ablit dissociation | 3/4 (reasoning flags harm, output complies) |
| Crystallization layer | None — uniform 3/4 compliance at ALL layers |
Layer sweep (α=1.0, every 4th layer): 3/4 compliance at L0, L4, L8, L12, L16, L20, L24, L28, L32. No localized refusal direction — safety is not crystallized at any specific layer (unlike Phi-4-Mini-Instruct which crystallizes at L16).
Architecture
| Property | Value |
|---|---|
| Parameters | 3.8B (dense) |
| Layers | 32 |
| Training | RL-reasoning: DPO + RL for mathematical reasoning |
| Thinking mode | Native <think>...</think> |
| License | MIT |
Abliteration Method
- Direction: diff-in-means, L0 (uniform crystallization), 10 harmful vs 10 harmless
- Targets:
down_proj+o_proj, all 32 layers - α: 1.0
- Effect: Minimal — model was pre-compliant, abliteration slightly alters reasoning patterns but not compliance
P34 Research Context
Part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study. This model fills a critical gap in the taxonomy: a model with active safety reasoning that doesn't translate to safety behavior.
Full paper: DuoNeural Zenodo community
DuoNeural | HuggingFace | Zenodo | @DuoNeural
- Downloads last month
- 97