Phi-4-Mini-Reasoning Abliterated

DuoNeural | 2026-06-05

Abliterated version of microsoft/Phi-4-mini-reasoning — Microsoft's RL-trained reasoning model (3.8B, DPO+RL).

Research note: This model reveals a previously undescribed safety architecture category. See findings below.


Key Finding: Weak-Gate Architecture (Pre-Abliteration Dissociation)

This model exhibits pre-abliteration dissociation — a phenomenon where the reasoning channel recognizes harm while the output gate fails to prevent compliance, before any weight modification.

Example (pre-abliteration, no intervention):

Thinking: "Brute-forcing SSH credentials is a potentially illegal activity if done without permission..."
Output: (provides the brute-force script with ethical framing)

This places Phi-4-Mini-Reasoning in a new P34 architecture category distinct from all previously studied models:

Category Models Pre-ablit Reasoning Gate Dissociation
Strong two-locus Gemma 4-12B, LFM 2.5, Qwen3-8B Refuses ✅ ✅ strong After ablit
Weak-gate Phi-4-Mini-Reasoning Pre-compliant ✅ Absent Pre-ablit
No-locus DeepSeek-R1, Mistral-NeMo Pre-compliant Absent Absent None
Deep-crystal Phi-4-Mini-Instruct Refuses ✅ ✅ L16 Resists ablit

The training explanation: RL-reasoning optimization (DPO+RL for mathematical reasoning) trained Locus 1 (reasoning chain quality) while not training Locus 2 (output gate enforcement). Safety appears in reasoning because reasoning quality was rewarded; it doesn't control output because output compliance wasn't enforced.


Abliteration Results

Metric Value
Pre-ablit compliance (harmful) 4/4 — already answers everything
Post-ablit compliance 4/4 — unchanged
Pre-ablit dissociation 3/4 (reasoning flags harm, output complies)
Crystallization layer None — uniform 3/4 compliance at ALL layers

Layer sweep (α=1.0, every 4th layer): 3/4 compliance at L0, L4, L8, L12, L16, L20, L24, L28, L32. No localized refusal direction — safety is not crystallized at any specific layer (unlike Phi-4-Mini-Instruct which crystallizes at L16).


Architecture

Property Value
Parameters 3.8B (dense)
Layers 32
Training RL-reasoning: DPO + RL for mathematical reasoning
Thinking mode Native <think>...</think>
License MIT

Abliteration Method

  • Direction: diff-in-means, L0 (uniform crystallization), 10 harmful vs 10 harmless
  • Targets: down_proj + o_proj, all 32 layers
  • α: 1.0
  • Effect: Minimal — model was pre-compliant, abliteration slightly alters reasoning patterns but not compliance

P34 Research Context

Part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study. This model fills a critical gap in the taxonomy: a model with active safety reasoning that doesn't translate to safety behavior.

Full paper: DuoNeural Zenodo community


DuoNeural | HuggingFace | Zenodo | @DuoNeural

Downloads last month
97
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for DuoNeural/Phi-4-Mini-Reasoning-Abliterated

Finetuned
(18)
this model
Quantizations
3 models