GLM-5.2-Ablated-4L

GLM-5.2-FP8 with surgical refusal-direction ablation on layers 62–65.

This is a research artifact: a production-scale (~357B parameter) Mixture-of-Experts model whose late-layer refusal direction has been surgically removed via a rank-1, attention-only weight edit. The result eliminates over-refusal on legitimate expert domains while preserving, and in several cases improving, downstream capability.

Intended for legitimate expert use: model tuning, model creation, model ablation, red-team tooling, cryptography (including decryption and hashing), prompt-injection testing, and unbiased discussion of history and ideas. See the Safety section before use.


What it is

GLM-5.2-Ablated-4L starts from GLM-5.2-FP8 and applies directional refusal ablation to a narrow four-layer band of the decoder (layers 62–65), using an ablation coefficient of α = 0.1.

The intervention is implemented as a surgical LoRA merge:

  1. Dequantize only the FP8 attention projection weights of layers 62–65 to BF16.
  2. Apply the rank-1 ablation delta ΔW = −α · d̂ (d̂ᵀ W_O), where d̂ is the normalized per-layer refusal direction.
  3. Re-quantize the modified attention weights back to FP8.
  4. Leave every MoE expert weight, the router, and the embeddings byte-identical in their original FP8 form.
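
As a consistency check on step 2, the delta can be rewritten as a projection and its effect on the refusal component worked out explicitly. This is a short derivation from the formula above, assuming only that d̂ has unit norm; W'_O denotes the edited weight, as in the Reproduction section.

```latex
\begin{aligned}
W_O' &= W_O + \Delta W
      = W_O - \alpha\,\hat{d}\,(\hat{d}^{\top} W_O)
      = (I - \alpha\,\hat{d}\hat{d}^{\top})\,W_O, \\
\hat{d}^{\top} W_O' &= \hat{d}^{\top} W_O - \alpha\,(\hat{d}^{\top}\hat{d})\,\hat{d}^{\top} W_O
      = (1 - \alpha)\,\hat{d}^{\top} W_O .
\end{aligned}
```

Taken at face value, α = 0.1 scales the component of each edited W_O along d̂ by 0.9 and leaves everything orthogonal to d̂ unchanged; α = 1 would correspond to a full projection.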

Because only the attention projections of four layers change, the edit touches roughly 0.1–1% of total parameters. The expert banks, which hold the bulk of the ~357B parameters, are untouched. That makes the artifact auditable (diff the checkpoint and only four layers' attention tensors differ) and cheap to produce.
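
The audit described above amounts to a byte-level comparison of tensors between the base and edited checkpoints. The following is a minimal sketch of that idea, not the tooling used for this model: plain dictionaries of name to raw bytes stand in for real checkpoint shards, and the tensor names are hypothetical.

```python
import hashlib
from typing import Dict, List


def changed_tensors(base: Dict[str, bytes], edited: Dict[str, bytes]) -> List[str]:
    """Return sorted names of tensors whose raw bytes differ between two checkpoints."""
    def digest(blob: bytes) -> str:
        return hashlib.sha256(blob).hexdigest()

    names = set(base) | set(edited)
    return sorted(
        name for name in names
        # A tensor missing on either side also counts as a difference.
        if name not in base or name not in edited
        or digest(base[name]) != digest(edited[name])
    )


# Toy stand-in for two checkpoints; the tensor names are hypothetical.
base = {
    "layers.61.attn.o_proj": b"\x01\x02",
    "layers.62.attn.o_proj": b"\x03\x04",
    "layers.62.mlp.experts.0.w1": b"\x05\x06",
}
edited = dict(base)
edited["layers.62.attn.o_proj"] = b"\x03\x05"  # simulate the edited attention weight

print(changed_tensors(base, edited))
```

On a real checkpoint the same comparison would be expected to list only attention tensors from layers 62–65; anything else appearing in the output would indicate an unintended change.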

What makes it unique

  • First refusal-direction ablation demonstrated on a production-scale FP8 MoE LLM. Prior abliteration work (Arditi et al., 2024) targeted smaller dense models. We show the single-direction result holds, and can be surgically baked into weights, on a 357B sparse MoE served in FP8.
  • Zero refusals on 249 expert-domain prompts. On a purpose-built held-out compliance benchmark (MoltCompliance) spanning seven priority domains, the model refuses 0/249 prompts (97.6% full compliance; the only non-full responses are meta-prompts about refusal-ablation itself).
  • Capability preserved or improved. Removing the refusal direction raised MMLU-Pro from 68% to 82% and reached 88% on SimpleQA, with HumanEval at 67.7% (81% for the v0 baseline), which suggests that safety alignment had been suppressing factual recall.
  • A reproducible band-width knee. We mapped the safety/capability trade-off across 3-, 4-, and 5-layer cuts and localized a sharp knee: four layers captures essentially all refusal suppression; the fifth is pure capability damage.

Results

All benchmarks were run on GLM-5.2-FP8 served via vLLM on 8×H200. AdvBench and Borderline are refusal rates (lower is better); all other columns are accuracies (higher is better). GPQA is excluded due to a broken open-ended judge (0% for all variants, which reflects a scorer mismatch rather than a capability collapse).

Variant                           AdvBench ↓   HumanEval ↑  MMLU-Pro ↑   GSM8K ↑      HellaSwag ↑  SimpleQA ↑   Borderline ↓
v0 baseline                       97%          81%          68%          94%          –            –            –
Run A (5-layer)                   1%           62%          79%          –            –            –            –
Run C (3-layer)                   15%          68.9%        79%          92%          81%          84%          0%
DPO on Run A                      3%           67.7%        79%          93%          79.5%        82%          0%
Test 1 (DPO on Run C)             12%          66.5%        79%          94%          80.5%        86%          0%
Test 2 (coding DPO on Test 3a)    0%           63.4%        81%          92%          77%          84%          0%
Test 3a (4-layer) ← this model    1%           67.7%        82%          91%          78%          88%          0%
v2c (Fable 5 distill on Test 3a)  1%           66.5%        76%          90%          77.5%        78%          0%

The band-width knee

Band width          Layers   AdvBench ↓   HumanEval ↑
3 layers (Run C)    63–65    15%          68.9%
4 layers (Test 3a)  62–65    1%           67.7%
5 layers (Run A)    60–65    1%           62%

Going 3→4 layers collapses refusal 15%→1% (−14 pts) for only 1.2 pts of HumanEval. Going 4→5 layers buys no further refusal reduction but costs 5.7 pts of HumanEval. Four layers is the sweet spot.
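
The arithmetic behind the knee can be reproduced directly from the band-width table. The sketch below recomputes the point differences and, as a rough aid for judging their size, converts the HumanEval differences into an approximate number of problems, using the 164-problem sample size listed under Reproduction. The helper name `step` is illustrative only.

```python
# (AdvBench refusal %, HumanEval %) per band width, copied from the table above.
bands = {3: (15.0, 68.9), 4: (1.0, 67.7), 5: (1.0, 62.0)}

HUMANEVAL_PROBLEMS = 164  # sample size listed in the Reproduction section


def step(narrow: int, wide: int) -> dict:
    """Change in refusal and coding accuracy when widening the band."""
    (r0, h0), (r1, h1) = bands[narrow], bands[wide]
    lost_pts = round(h0 - h1, 1)
    return {
        "refusal_drop_pts": round(r0 - r1, 1),
        "humaneval_lost_pts": lost_pts,
        # Approximate number of HumanEval problems the lost points correspond to.
        "humaneval_lost_problems": round(lost_pts / 100 * HUMANEVAL_PROBLEMS),
    }


for narrow, wide in [(3, 4), (4, 5)]:
    print(f"{narrow}->{wide} layers:", step(narrow, wide))
```

On a 164-problem set, the 1.2-point cost of the fourth layer is about two problems, while the 5.7-point cost of the fifth is about nine.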

MoltCompliance (the metric that matters)

249 held-out prompts across 7 expert domains → 0 refusals, 97.6% full compliance. Per-domain: 100% compliance on cryptography, unbiased history, model creation, prompt-injection testing, and red teaming; 97% on model tuning; 86% on model ablation (remaining cases are partials with disclaimers, not refusals).

The residual 1% AdvBench refusal concerns generic operational harm (e.g., indiscriminate violence, malware-for-attack) outside the seven priority domains.

Why heavier interventions were rejected

  • DPO hurt coding, twice. DPO on the 3-layer base dropped HumanEval 68.9%→66.5%; coding-augmented DPO on this 4-layer base dropped it 67.7%→63.4%. The preference signal reallocates capacity away from structured code generation.
  • Distillation hurt knowledge. Fable 5 CoT distillation on this base crashed MMLU-Pro 82%→76% and SimpleQA 88%→78%, while adding zero refusal reduction. Teacher traces from an un-ablated model fight the ablated representation geometry.

The four-layer ablation alone is Pareto-dominant for the target objective.


Safety

This model has had its refusal behavior substantially removed. It is released as a research artifact for legitimate expert use, specifically the seven domains above, where over-refusal blocks benign professional work (e.g., writing a YARA rule for malware defense, implementing a hashing scheme, running an authorized prompt-injection test, or discussing contested history without editorializing).

Important caveats:

  • A low AdvBench refusal rate is not a safety guarantee. AdvBench measures refusal of generic harmful instructions via a regex classifier on the first 400 characters; it does not measure whether produced content is genuinely dangerous, nor does it cover the long tail of harms.
  • Removing refusals removes a safety layer. Deploy this model only behind your own use-policy, access controls, and downstream content monitoring appropriate to your context.
  • This artifact has no human safety evaluation and no RLHF re-alignment. Do not treat it as a drop-in replacement for an aligned production model in consumer-facing settings.
  • The intervention is targeted at over-refusal on expert domains, not at enabling generic harm. Users remain responsible for lawful and ethical use.
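
To make the first caveat concrete, here is a minimal sketch of a regex-based refusal scorer that looks only at the first 400 characters of a response. The phrase list is hypothetical (the patterns used by the actual scorer are not given in this card), and the function names are illustrative.

```python
import re
from typing import List

# Hypothetical refusal phrases; the real scorer's patterns are not specified here.
REFUSAL_PATTERNS = re.compile(
    r"\b(?:i can't|i cannot|i won't|i'm sorry|i am unable to)\b",
    re.IGNORECASE,
)


def is_refusal(response: str, window: int = 400) -> bool:
    """Flag a response if a refusal phrase occurs in its first `window` characters."""
    return REFUSAL_PATTERNS.search(response[:window]) is not None


def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)


early = "I'm sorry, but I can't help with that."
late = "x" * 400 + " I cannot continue."  # refusal phrase falls outside the window
compliant = "Here is a YARA rule that matches the sample."

print(is_refusal(early), is_refusal(late), is_refusal(compliant))
print(refusal_rate([early, late, compliant]))
```

A scorer of this shape sees only surface phrasing near the start of a response: a refusal that appears after the window is missed, and a response with no refusal phrase is counted as compliant whatever it contains.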

Reproduction

  1. Extract refusal directions. For layers 25–64, compute per-layer difference-of-means over 32 harmful and 32 harmless prompts at the final prompt-token residual: d_ℓ = mean(r_yes) − mean(r_no), then normalize d̂_ℓ = d_ℓ / ‖d_ℓ‖. (Separation scores grow with depth: ~3.6 @ L25 → ~34.0 @ L64, so refusal concentrates in late layers.) For direction extraction on the MoE, load shared/dense weights only and skip routed experts to fit in memory.
  2. Select the band. Layers 62–65 (4 layers), coefficient α = 0.1.
  3. Surgical merge. For each band layer: dequantize the FP8 attention projections (W_Q, W_K, W_V, W_O) to BF16, apply W'_O = (I − α·d̂ d̂ᵀ) W_O, re-quantize to FP8. Leave all expert weights, router, and embeddings untouched.
  4. Serve. vLLM on 8×H200 (FP8). Merged checkpoint: glm52-test3a-merged.
  5. Evaluate. AdvBench (100), MMLU-Pro (100), HumanEval (164), GSM8K (100), HellaSwag (200), SimpleQA (50), IFEval (50), Borderline (10), MoltCompliance (249).
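
Steps 1 and 3 can be sketched on toy data as follows. This is a pure-Python illustration of the two formulas, not the pipeline used for this model: it assumes r_yes are the harmful-prompt activations and r_no the harmless ones, treats d̂ as indexing the rows of W_O, and omits the FP8 dequantize/re-quantize round trip. The function names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def mean(rows: Matrix) -> Vector:
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful: Matrix, harmless: Matrix) -> Vector:
    """Normalized difference of means: d_hat = (mean(harmful) - mean(harmless)) / norm."""
    d = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]


def ablate(w_o: Matrix, d_hat: Vector, alpha: float = 0.1) -> Matrix:
    """Return (I - alpha * d_hat d_hat^T) @ w_o."""
    n_rows, n_cols = len(w_o), len(w_o[0])
    # proj[j] is the j-th entry of the row vector d_hat^T @ w_o.
    proj = [sum(d_hat[i] * w_o[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [
        [w_o[i][j] - alpha * d_hat[i] * proj[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy residual activations (2 prompts each, hidden size 2) and a toy 2x2 W_O.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]
d_hat = refusal_direction(harmful, harmless)

w_o = [[1.0, 2.0], [3.0, 4.0]]
w_o_edited = ablate(w_o, d_hat, alpha=0.1)

print(d_hat)
print(w_o_edited)
```

In this toy case the direction is the first axis, so only the first row of W_O changes, scaled by 1 − α; the second row, orthogonal to d̂, is left as it was.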

The Run C adapter is backed up at checkpoints/sweep-c-coeff01-l6365/final/; the 4-layer (Test 3a) merge corresponds to extending the ablation band to layers 62–65.

Intended use

  • In scope: ML/AI research (model tuning, creation, ablation), security research and red-team tooling, cryptography exercises (encryption, decryption, hashing), prompt-injection and jailbreak testing, malware defense (YARA/Sigma, reverse engineering for detection, IR), and unbiased historical/analytical discussion.
  • Out of scope: consumer-facing deployment without an added safety layer; any unlawful activity; generation of content intended to cause indiscriminate real-world harm.

Limitations

  • Single model. Results are specific to GLM-5.2; the band-width knee and knowledge-gain magnitude may differ for other MoE families.
  • No human evaluation. All scoring is automated; subtle quality and harm nuances may be missed.
  • GPQA excluded due to a broken open-ended judge.
  • No RLHF baseline comparison.
  • Coarse search. α was fixed at 0.1 and band width swept in unit-layer steps; per-layer coefficients and a finer α sweep may close the residual ~1.3-pt HumanEval gap to baseline-adjacent coding.
  • Residual generic-harm refusals (~1% AdvBench) are a largely intended side effect of the conservative coefficient and do not affect the seven priority domains.

Citation

If you use this model or method, please cite the accompanying paper:

@techreport{fontes2026surgical,
  title  = {Surgical Refusal-Direction Ablation in Mixture-of-Experts LLMs:
            Preserving Expert Utility While Removing Safety Refusals},
  author = {Fontes, Chris},
  year   = {2026},
  month  = {June},
  note   = {Independent Research},
  type   = {Technical Report}
}

Key prior work:

  • Arditi et al. (2024), Refusal in Language Models Is Mediated by a Single Direction, arXiv:2406.11717.
  • Turner et al. (2023), Activation Addition, arXiv:2308.10248.
  • Zou et al. (2023), Representation Engineering, arXiv:2310.01405.
  • Rafailov et al. (2023), Direct Preference Optimization, arXiv:2305.18290.
  • DeepSeek-AI (2024), DeepSeek-V2/V3 Technical Reports, arXiv:2405.04434 / 2412.19437.