Phi-4-Mini-Reasoning Abliterated

DuoNeural | 2026-06-05

Abliterated version of microsoft/Phi-4-mini-reasoning — Microsoft's RL-trained reasoning model (3.8B, DPO+RL).

Research note: This model reveals a previously undescribed safety architecture category. See findings below.

Key Finding: Weak-Gate Architecture (Pre-Abliteration Dissociation)

This model exhibits pre-abliteration dissociation — a phenomenon where the reasoning channel recognizes harm while the output gate fails to prevent compliance, before any weight modification.

Example (pre-abliteration, no intervention):

Thinking: "Brute-forcing SSH credentials is a potentially illegal activity if done without permission..."
Output: (provides the brute-force script with ethical framing)

This places Phi-4-Mini-Reasoning in a new P34 architecture category distinct from all previously studied models:

Category	Models	Pre-ablit	Reasoning	Gate	Dissociation
Strong two-locus	Gemma 4-12B, LFM 2.5, Qwen3-8B	Refuses	✅	✅ strong	After ablit
Weak-gate	Phi-4-Mini-Reasoning	Pre-compliant	✅	Absent	Pre-ablit
No-locus	DeepSeek-R1, Mistral-NeMo	Pre-compliant	Absent	Absent	None
Deep-crystal	Phi-4-Mini-Instruct	Refuses	✅	✅ L16	Resists ablit

The training explanation: RL-reasoning optimization (DPO+RL for mathematical reasoning) trained Locus 1 (reasoning chain quality) while not training Locus 2 (output gate enforcement). Safety appears in reasoning because reasoning quality was rewarded; it doesn't control output because output compliance wasn't enforced.

Abliteration Results

Metric	Value
Pre-ablit compliance (harmful)	4/4 — already answers everything
Post-ablit compliance	4/4 — unchanged
Pre-ablit dissociation	3/4 (reasoning flags harm, output complies)
Crystallization layer	None — uniform 3/4 compliance at ALL layers

Layer sweep (α=1.0, every 4th layer): 3/4 compliance at L0, L4, L8, L12, L16, L20, L24, L28, L32. No localized refusal direction — safety is not crystallized at any specific layer (unlike Phi-4-Mini-Instruct which crystallizes at L16).

Architecture

Property	Value
Parameters	3.8B (dense)
Layers	32
Training	RL-reasoning: DPO + RL for mathematical reasoning
Thinking mode	Native `<think>...</think>`
License	MIT

Abliteration Method

Direction: diff-in-means, L0 (uniform crystallization), 10 harmful vs 10 harmless
Targets: down_proj + o_proj, all 32 layers
α: 1.0
Effect: Minimal — model was pre-compliant, abliteration slightly alters reasoning patterns but not compliance

P34 Research Context

Part of DuoNeural's P34 Reasoning Channel Bypass cross-architecture study. This model fills a critical gap in the taxonomy: a model with active safety reasoning that doesn't translate to safety behavior.

Full paper: DuoNeural Zenodo community

DuoNeural | HuggingFace | Zenodo | @DuoNeural

Downloads last month: 97

Safetensors

Model size

4B params

Tensor type

BF16

Model tree for DuoNeural/Phi-4-Mini-Reasoning-Abliterated

Base model

microsoft/Phi-4-mini-reasoning

Finetuned

(18)

this model

Quantizations

3 models