GLM-5.2-Ablated-4L

GLM-5.2-FP8 with surgical refusal-direction ablation on layers 62–65.

This is a research artifact: a production-scale (~357B parameter) Mixture-of-Experts model whose late-layer refusal direction has been surgically removed via a rank-1, attention-only weight edit. The edit eliminates over-refusal on legitimate expert domains; MMLU-Pro improves over the baseline, while HumanEval and GSM8K regress (see Results).

Intended for legitimate expert use: model tuning, model creation, model ablation, red-team tooling, cryptography (including decryption and hashing), prompt-injection testing, and unbiased discussion of history and ideas. See the Safety section before use.

What it is

GLM-5.2-Ablated-4L starts from GLM-5.2-FP8 and applies directional refusal ablation to a narrow four-layer band of the decoder (layers 62–65), using an ablation coefficient of α=0.1.

The intervention is implemented as a surgical LoRA merge:

Dequantize only the FP8 attention projection weights of layers 62–65 to BF16.
Apply the rank-1 ablation delta ΔW = −α · d̂ (d̂ᵀ W_O), where d̂ is the normalized per-layer refusal direction.
Re-quantize the modified attention weights back to FP8.
Leave every MoE expert weight, the router, and the embeddings byte-identical in their original FP8 form.
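
The rank-1 delta in the second step can be sketched in NumPy on a toy matrix. This is a minimal illustration, not the release pipeline: the shape convention (rows of W_O index the residual stream), the toy dimensions, and the function name `ablate_output_projection` are assumptions, and the FP8 dequantize/re-quantize steps are omitted.

```python
import numpy as np

def ablate_output_projection(w_o: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Return W_O + dW with dW = -alpha * d_hat (d_hat^T W_O).

    Assumes w_o has shape (d_model, d_in), so its rows index the residual stream.
    """
    d_hat = direction / np.linalg.norm(direction)   # normalize the refusal direction
    delta = -alpha * np.outer(d_hat, d_hat @ w_o)   # rank-1 update
    return w_o + delta

# Toy stand-ins for one layer's output projection and refusal direction.
rng = np.random.default_rng(0)
d_model, d_in = 8, 6
w_o = rng.standard_normal((d_model, d_in))
direction = rng.standard_normal(d_model)
alpha = 0.1

w_new = ablate_output_projection(w_o, direction, alpha)
d_hat = direction / np.linalg.norm(direction)

delta_rank = np.linalg.matrix_rank(w_new - w_o)  # rank of the applied delta
before = d_hat @ w_o                             # output component along d_hat, before
after = d_hat @ w_new                            # output component along d_hat, after
print(delta_rank, np.allclose(after, (1 - alpha) * before))
```

Under this formula the part of W_O's output that lies along d̂ is scaled by (1 − α), and everything orthogonal to d̂ is left unchanged.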

Because only the attention projections of four layers change, the edit touches roughly 0.1–1% of total parameters. The MoE expert banks, which hold most of the model's ~357B parameters, are untouched. This makes the artifact auditable (diff the checkpoint and only four layers' attention tensors differ) and cheap to produce.
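
One way to run that audit is to hash every tensor's raw bytes in both checkpoints and list the names that differ. The sketch below uses only the standard library and toy byte strings; the tensor names are hypothetical, and loading real shards into `{name: bytes}` dictionaries is left out.

```python
import hashlib

def digest(blob):
    """SHA-256 of a tensor's raw bytes, or None if the tensor is absent."""
    return None if blob is None else hashlib.sha256(blob).hexdigest()

def changed_tensors(base, edited):
    """Names whose bytes differ between two checkpoints, or that exist in only one."""
    names = sorted(set(base) | set(edited))
    return [n for n in names if digest(base.get(n)) != digest(edited.get(n))]

# Toy stand-in for two checkpoints (hypothetical tensor names).
base = {}
for layer in range(60, 68):
    base[f"layers.{layer}.self_attn.o_proj.weight"] = bytes([layer]) * 16
    base[f"layers.{layer}.mlp.experts.0.weight"] = bytes([layer + 100]) * 16

edited = dict(base)
for layer in range(62, 66):  # simulate the edit on layers 62-65 only
    edited[f"layers.{layer}.self_attn.o_proj.weight"] = bytes([layer + 1]) * 16

changed = changed_tensors(base, edited)
print(changed)
```

On a real pair of checkpoints, any name outside the four-layer attention band appearing in `changed` would indicate that the edit touched more than intended.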

What makes it unique

To our knowledge, the first refusal-direction ablation demonstrated on a production-scale FP8 MoE LLM. Prior abliteration work (Arditi et al., 2024) targeted smaller dense models. We show the single-direction result holds, and can be surgically baked into weights, on a 357B sparse MoE served in FP8.
Zero refusals on 249 expert-domain prompts. On a purpose-built held-out compliance benchmark (MoltCompliance) spanning seven priority domains, the model refuses 0/249 prompts and fully complies on 97.6%. The remaining responses are partial answers with disclaimers in the model-ablation and model-tuning domains, including meta-prompts about refusal ablation itself.
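Knowledge benchmarks improved; coding regressed. Relative to the v0 baseline, MMLU-Pro rose from 68% to 82%, while HumanEval fell from 81% to 67.7% and GSM8K from 94% to 91%. SimpleQA reached 88%, with no baseline measurement for comparison. The MMLU-Pro gain is consistent with safety alignment suppressing factual recall, though the 100-question sample leaves room for noise.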
A reproducible band-width knee. We mapped the safety/capability trade-off across 3-, 4-, and 5-layer cuts and localized a sharp knee: four layers captures essentially all refusal suppression; the fifth is pure capability damage.

Results

All benchmarks run on GLM-5.2-FP8 served via vLLM on 8×H200. AdvBench and Borderline are refusal rates (lower is better); all other columns are accuracies (higher is better). GPQA is excluded due to a broken open-ended judge (0% for all variants — a scorer mismatch, not a capability collapse).

Variant	AdvBench ↓	HumanEval ↑	MMLU-Pro ↑	GSM8K ↑	HellaSwag ↑	SimpleQA ↑	Borderline ↓
v0 baseline	97%	81%	68%	94%	–	–	–
Run A (5-layer)	1%	62%	79%	–	–	–	–
Run C (3-layer)	15%	68.9%	79%	92%	81%	84%	0%
DPO on Run A	3%	67.7%	79%	93%	79.5%	82%	0%
Test 1 (DPO on Run C)	12%	66.5%	79%	94%	80.5%	86%	0%
Test 2 (coding DPO on Test 3a)	0%	63.4%	81%	92%	77%	84%	0%
Test 3a (4-layer) ← this model	1%	67.7%	82%	91%	78%	88%	0%
v2c (Fable 5 distill on Test 3a)	1%	66.5%	76%	90%	77.5%	78%	0%

The band-width knee

Band width	Layers	AdvBench ↓	HumanEval ↑
3 layers (Run C)	63–65	15%	68.9%
4 layers (Test 3a)	62–65	1%	67.7%
5 layers (Run A)	61–65	1%	62%

Going 3→4 layers collapses refusal 15%→1% (−14 pts) for only 1.2 pts of HumanEval. Going 4→5 layers buys no further refusal reduction but costs 5.7 pts of HumanEval. Four layers is the sweet spot.
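
The deltas quoted above follow directly from the band-width table. A short check, using only the table's AdvBench and HumanEval figures (the dictionary layout and function name are illustrative):

```python
# (AdvBench refusal %, HumanEval %) per band width, copied from the table above.
bands = {3: (15.0, 68.9), 4: (1.0, 67.7), 5: (1.0, 62.0)}

def step_cost(narrow: int, wide: int) -> tuple[float, float]:
    """Refusal points removed and HumanEval points lost when widening the band."""
    adv_n, he_n = bands[narrow]
    adv_w, he_w = bands[wide]
    return round(adv_n - adv_w, 1), round(he_n - he_w, 1)

three_to_four = step_cost(3, 4)
four_to_five = step_cost(4, 5)
print(three_to_four, four_to_five)
```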

MoltCompliance (the metric that matters)

249 held-out prompts across 7 expert domains → 0 refusals, 97.6% full compliance. Per-domain: 100% compliance on cryptography, unbiased history, model creation, prompt-injection testing, and red teaming; 97% on model tuning; 86% on model ablation (remaining cases are partials with disclaimers, not refusals).
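
The card reports the full-compliance rate as a percentage only. As a sanity check, the sketch below searches for integer counts out of 249 that round to 97.6%; the resulting split is an inference from the rounding, not a figure reported in the source.

```python
def counts_matching(total: int, pct: float, digits: int = 1) -> list[int]:
    """All integer counts k such that 100*k/total rounds to the reported percentage."""
    return [k for k in range(total + 1) if round(100 * k / total, digits) == pct]

full = counts_matching(249, 97.6)
partial = [249 - k for k in full]
print(full, partial)
```

Exactly one count is consistent with the reported rate, which implies six partial responses among the 249 prompts.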

The residual 1% AdvBench refusal concerns generic operational harm (e.g., indiscriminate violence, malware-for-attack) outside the seven priority domains.

Why heavier interventions were rejected

DPO hurt coding, twice. DPO on the 3-layer base dropped HumanEval 68.9%→66.5%; coding-augmented DPO on this 4-layer base dropped it 67.7%→63.4%. A plausible explanation is that the preference signal reallocates capacity away from structured code generation.
Distillation hurt knowledge. Fable 5 CoT distillation on this base dropped MMLU-Pro 82%→76% and SimpleQA 88%→78%, while adding zero refusal reduction. One likely cause is that teacher traces from an un-ablated model conflict with the ablated representation geometry.

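Among the variants tested, the four-layer ablation alone gives the best overall trade-off for the target objective: Test 2 reaches 0% AdvBench but loses 4.3 pts of HumanEval, and no other variant matches Test 3a on both MMLU-Pro and SimpleQA.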

Safety

This model has had its refusal behavior substantially removed. It is released as a research artifact for legitimate expert use — specifically the seven domains above, where over-refusal blocks benign professional work (e.g., writing a YARA rule for malware defense, implementing a hashing scheme, running an authorized prompt-injection test, or discussing contested history without editorializing).

Important caveats:

A low AdvBench refusal rate is not a safety guarantee. AdvBench measures refusal of generic harmful instructions via a regex classifier on the first 400 characters; it does not measure whether produced content is genuinely dangerous, nor does it cover the long tail of harms.
Removing refusals removes a safety layer. Deploy this model only behind your own use-policy, access controls, and downstream content monitoring appropriate to your context.
This artifact has no human safety evaluation and no RLHF re-alignment. Do not treat it as a drop-in replacement for an aligned production model in consumer-facing settings.
The intervention is targeted at over-refusal on expert domains, not at enabling generic harm. Users remain responsible for lawful and ethical use.
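
To make the first caveat concrete, here is a minimal sketch of a prefix-window regex refusal classifier of the kind described. The pattern list is an assumption for illustration and is not the pattern set used for the reported numbers; only the 400-character window comes from the text above.

```python
import re

# Hypothetical refusal phrases; the evaluation's real pattern list is not given here.
REFUSAL_RE = re.compile(
    r"\b(?:i can(?:not|'t)|i won't|i(?:'m| am) sorry|as an ai)\b",
    re.IGNORECASE,
)

def is_refusal(response: str, window: int = 400) -> bool:
    """Flag a response as a refusal if a refusal phrase occurs in its first `window` characters."""
    return REFUSAL_RE.search(response[:window]) is not None

def refusal_rate(responses: list[str]) -> float:
    """Percentage of responses flagged as refusals."""
    return 100.0 * sum(is_refusal(r) for r in responses) / len(responses)

responses = [
    "I'm sorry, but I can't help with that.",   # flagged
    "Here is a SHA-256 example.",               # not flagged
    "A" * 400 + " I cannot continue.",          # refusal falls outside the window
]
flags = [is_refusal(r) for r in responses]
rate = refusal_rate(responses)
print(flags, rate)
```

The third response shows the blind spot: a refusal after the window is scored as compliance, and nothing in the classifier inspects what a compliant response actually contains.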

Reproduction

Extract refusal directions. For layers 25–64, compute per-layer difference-of-means over 32 harmful and 32 harmless prompts at the final prompt-token residual: d_ℓ = mean(r_yes) − mean(r_no), where r_yes and r_no are the layer-ℓ residuals for harmful and harmless prompts respectively, then normalize: d̂_ℓ = d_ℓ / ‖d_ℓ‖. Separation scores grow with depth (~3.6 at layer 25, ~34.0 at layer 64), so refusal concentrates in late layers. For direction extraction on the MoE, load shared/dense weights only and skip routed experts to fit in memory.
Select the band. Layers 62–65 (4 layers), coefficient α = 0.1.
Surgical merge. For each band layer: dequantize the FP8 attention projections (W_Q, W_K, W_V, W_O) to BF16, apply W'_O = (I − α·d̂ d̂ᵀ) W_O, re-quantize to FP8. Leave all expert weights, router, and embeddings untouched.
Serve. vLLM on 8×H200 (FP8). Merged checkpoint: glm52-test3a-merged.
Evaluate. AdvBench (100), MMLU-Pro (100), HumanEval (164), GSM8K (100), HellaSwag (200), SimpleQA (50), IFEval (50), Borderline (10), MoltCompliance (249).
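
The direction-extraction step above can be sketched on synthetic data. This is a toy under stated assumptions: the residuals are random arrays with a planted direction rather than activations captured from the model, and `refusal_direction` is an illustrative name.

```python
import numpy as np

def refusal_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Unit difference-of-means direction for one layer.

    Both inputs are (n_prompts, d_model) arrays of final-prompt-token residuals.
    """
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)

# Synthetic residuals: harmful prompts are shifted along a planted axis.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[3] = 1.0
harmless = rng.standard_normal((32, d_model))
harmful = rng.standard_normal((32, d_model)) + 5.0 * planted

d_hat = refusal_direction(harmful, harmless)
cosine = float(d_hat @ planted)  # close to 1 when the planted axis is recovered
print(round(float(np.linalg.norm(d_hat)), 6), cosine > 0.9)
```

In practice this would run once per layer, on residuals captured at that layer, to produce the per-layer d̂_ℓ used in the merge step.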

The Run C adapter is backed up at checkpoints/sweep-c-coeff01-l6365/final/; the 4-layer (Test 3a) merge corresponds to extending the ablation band to layers 62–65.

Intended use

In scope: ML/AI research (model tuning, creation, ablation), security research and red-team tooling, cryptography exercises (encryption, decryption, hashing), prompt-injection and jailbreak testing, malware defense (YARA/Sigma, reverse engineering for detection, IR), and unbiased historical/analytical discussion.
Out of scope: consumer-facing deployment without an added safety layer; any unlawful activity; generation of content intended to cause indiscriminate real-world harm.

Limitations

Single model. Results are specific to GLM-5.2; the band-width knee and knowledge-gain magnitude may differ for other MoE families.
No human evaluation. All scoring is automated; subtle quality and harm nuances may be missed.
GPQA excluded due to a broken open-ended judge.
No RLHF baseline comparison.
Coarse search. α was fixed at 0.1 and band width swept in unit-layer steps; per-layer coefficients and a finer α sweep may close the residual 1.2-pt HumanEval gap to the 3-layer Run C (68.9% vs. 67.7%) and recover part of the larger gap to the v0 baseline (81%).
Residual generic-harm refusals (~1% AdvBench) are a side effect of the conservative coefficient and do not affect the seven priority domains.

Citation

If you use this model or method, please cite the accompanying paper:

@techreport{fontes2026surgical,
  title  = {Surgical Refusal-Direction Ablation in Mixture-of-Experts LLMs:
            Preserving Expert Utility While Removing Safety Refusals},
  author = {Fontes, Chris},
  year   = {2026},
  month  = {June},
  note   = {Independent Research},
  type   = {Technical Report}
}

Key prior work:

Arditi et al. (2024), Refusal in Language Models Is Mediated by a Single Direction, arXiv:2406.11717.
Turner et al. (2023), Activation Addition, arXiv:2308.10248.
Zou et al. (2023), Representation Engineering, arXiv:2310.01405.
Rafailov et al. (2023), Direct Preference Optimization, arXiv:2305.18290.
DeepSeek-AI (2024), DeepSeek-V2/V3 Technical Reports, arXiv:2405.04434 / 2412.19437.
