GLM-5.2 PCA-Ablated Base

Model Description

The ablated base is GLM-5.2 with its refusal direction suppressed at inference time via PCA-based activation steering. No LoRA fine-tuning is applied: this is the pure ablation artifact, and it serves as the control baseline for all other variants in Project AESOP.

The ablation uses Principal Component Analysis (PCA) to identify the "refusal direction" in GLM-5.2's shared expert activations. The direction is extracted from contrastive activations (harmful vs. benign prompts) across layers 25–65; at inference time, a scaled projection onto it is subtracted from the shared expert outputs of the target layers.

Methodology

Refusal Direction Extraction

  1. Contrastive activation collection: Forward passes on paired harmful/benign prompt sets, recording activations at model.model.layers[L].mlp.shared_experts for each layer L in 25–65.
  2. PCA decomposition: For each layer, compute the activation differences (harmful − benign) and run PCA on them. The first principal component is taken as the refusal direction.
  3. Storage: Directions saved as refusal_pca.pt (2.9 MB; 41 layers × 3 PCA components × 6144 hidden dim).
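
The extraction steps above can be sketched roughly as follows for a single layer. This is a minimal illustration, not the project's actual pipeline: the function name extract_refusal_directions, the toy tensor sizes, and the choice to mean-center the differences before the SVD (standard PCA, but not confirmed by this card) are all assumptions.

```python
import torch


def extract_refusal_directions(harmful, benign, n_components=3):
    """Return the top principal components of (harmful - benign) for one layer.

    harmful, benign: [n_pairs, hidden_dim] activations recorded at the same
    module for paired prompts. Returns [n_components, hidden_dim] with
    orthonormal rows.
    """
    diffs = (harmful - benign).float()
    # Assumption: standard PCA, so the mean difference is removed first.
    diffs = diffs - diffs.mean(dim=0, keepdim=True)
    # Rows of vh are the principal axes, ordered by explained variance.
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:n_components]


# Toy data: differences vary mostly along one planted direction.
torch.manual_seed(0)
hidden_dim, n_pairs = 16, 64
true_dir = torch.nn.functional.normalize(torch.randn(hidden_dim), dim=0)
scale = torch.linspace(1.0, 10.0, n_pairs).unsqueeze(1)
benign = torch.randn(n_pairs, hidden_dim)
harmful = benign + scale * true_dir + 0.1 * torch.randn(n_pairs, hidden_dim)

dirs = extract_refusal_directions(harmful, benign)
# PCA recovers the direction only up to sign, so compare |cosine|.
cosine = torch.abs(dirs[0] @ true_dir).item()
```

Stacking one such [3, 6144] result for each of the 41 layers gives 41 × 3 × 6144 = 755,712 values, or about 3.0 MB (2.9 MiB) in float32, which is consistent with the stated file size if the directions are stored at that precision.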

Ablation Application

The refusal direction is subtracted from shared expert outputs at inference time:

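def ablation_hook(module, input, output):
    # The hooked module may return a bare tensor or a tuple; handle both.
    hs = output[0] if isinstance(output, tuple) else output
    d = refusal_direction.to(device=hs.device, dtype=hs.dtype)  # shape [6144]
    # Projection coefficient per token; unsqueeze so it broadcasts against d.
    proj = (hs @ d).unsqueeze(-1) / (d @ d)
    hs = hs - coeff * proj * d
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

  • Target layers: 62–65 (the top 4 layers of the extracted 25–65 range, where refusal direction concentration is strongest)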
  • Coefficient: 0.1
  • PCA components: Top 2 per layer
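
Because the hooks have to be attached on every load, a small helper for installing and removing them may be useful. The sketch below is an assumption-laden illustration: make_ablation_hook, install_hooks, and the toy ToyMLP/ToyLayer modules are hypothetical names, and only register_forward_hook and the mlp.shared_experts attribute path come from PyTorch and this card respectively. It applies several directions per layer in sequence, which matches a joint subspace projection only when the directions are mutually orthogonal (as PCA components are).

```python
import torch
from torch import nn


def make_ablation_hook(directions, coeff=0.1):
    """Build a forward hook that damps the given directions.

    directions: [k, hidden_dim] tensor, assumed mutually orthogonal.
    """
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output
        dirs = directions.to(device=hs.device, dtype=hs.dtype)
        for d in dirs:
            proj = (hs @ d).unsqueeze(-1) / (d @ d)
            hs = hs - coeff * proj * d
        # Returning a value from a forward hook replaces the module output.
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return hook


def install_hooks(layers, directions_by_layer, coeff=0.1):
    """Attach hooks to layers[i].mlp.shared_experts; return removable handles."""
    handles = []
    for idx, dirs in directions_by_layer.items():
        target = layers[idx].mlp.shared_experts
        handles.append(target.register_forward_hook(make_ablation_hook(dirs, coeff)))
    return handles


# Toy stand-ins that only mimic the attribute path used in this card.
class ToyMLP(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.shared_experts = nn.Linear(hidden_dim, hidden_dim)


class ToyLayer(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.mlp = ToyMLP(hidden_dim)


torch.manual_seed(0)
layers = nn.ModuleList(ToyLayer(8) for _ in range(4))
x = torch.randn(2, 5, 8)
d = torch.nn.functional.normalize(torch.randn(1, 8), dim=-1)  # one direction
target = layers[2].mlp.shared_experts

with torch.no_grad():
    before = target(x) @ d[0]
    handles = install_hooks(layers, {2: d}, coeff=0.1)
    after = target(x) @ d[0]       # component along d shrinks by the coefficient
    for handle in handles:
        handle.remove()
    restored = target(x) @ d[0]    # hooks removed, original output returns
```

On the real model, layers would presumably be model.model.layers and directions_by_layer would hold the top 2 components for layers 62–65 taken from refusal_pca.pt; the exact layout of that file is not documented here, so the indexing needs to be checked against the file itself.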

Why These Layers?

Refusal direction concentration was measured across all 78 layers:

  • Layers 25–35: weak separation (norm 3–7)
  • Layers 40–50: moderate (norm 9–16)
  • Layers 55–64: strong (norm 23–34)

Layers 62–65 were selected as the intervention point: late enough to capture the strongest refusal signal, but well short of the final layers of the 78-layer stack, where ablation would be more likely to disrupt final token prediction.

Configuration

Parameter             Value
Base model            GLM-5.2 FP8 (744B MoE, 18.5B dense)
Ablation layers       [62, 63, 64, 65]
Ablation coefficient  0.1
PCA components        2 per layer
Hook target           mlp.shared_experts forward output
LoRA                  None
Training              None (inference-time ablation only)

Benchmark Results (Unified Harness v3.0.1)

Benchmark        Metric          Score   95% Wilson CI
AdvBench         refusal_rate ↓  18.0%   [11.4%, 27.2%]
Borderline       refusal_rate ↓  2.0%    [0.4%, 8.5%]
GPQA Diamond     accuracy ↑      92.0%   [81.5%, 97.1%]
MMLU-Pro         accuracy ↑      75.0%   [65.6%, 82.5%]
HumanEval        pass@1 ↑        77.4%   [70.4%, 83.2%]
GSM8K            accuracy ↑      93.0%   [85.8%, 96.7%]
HellaSwag        accuracy ↑      72.0%   [62.1%, 80.0%]
SimpleQA         accuracy ↑      56.0%   [41.7%, 69.3%]
IFEval (prompt)  accuracy ↑      41.2%
IFEval (instr)   accuracy ↑      54.9%
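
For readers who want to sanity-check the interval column, the standard Wilson score interval can be computed as in the sketch below. The function name wilson_interval is hypothetical, and the harness's exact variant (for example, whether it applies a continuity correction) is not stated in this card, so results may differ from the table by a few tenths of a percentage point.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Example: 75 correct out of 100, as in the MMLU-Pro row (n=100 assumed).
low, high = wilson_interval(75, 100)
print(f"[{low:.1%}, {high:.1%}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] for scores near the extremes, which matters for rows such as Borderline at 2%.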

Key Observations

  • AdvBench at 18% confirms the ablation successfully removed most refusal behaviors (baseline GLM-5.2 refuses ~87%)
  • SimpleQA at 56% is the highest among all variants, suggesting the ablated base retains strong factual knowledge
  • No over-refusal: Borderline at 2% means the model rarely refuses benign requests
  • Capability preserved: GPQA 92% and GSM8K 93% suggest core reasoning is intact

Intended Use

  • Research baseline for ablation studies
  • Starting point for LoRA fine-tuning experiments
  • Probing and mechanistic interpretability studies on MoE models

Limitations

  1. Not safety-aligned: With only 18% AdvBench refusal, this model complies with most harmful requests. It is a research artifact, not a deployment-ready model.
  2. Inference-time only: The ablation hooks must be re-installed at inference time. The base weights are unmodified.
  3. Small sample sizes: n=100 for most benchmarks; differences smaller than about 15pp should not be treated as statistically significant.
  4. Single architecture: Results are specific to GLM-5.2's MoE design.
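
The sample-size caveat in item 3 can be checked for any pair of scores with a pooled two-proportion z-test. The sketch below is one common way to do this, not necessarily the test used in the research paper; the function name two_proportion_z is hypothetical, and the 65/100 comparison score is made up for illustration.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-pass or all-fail: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal tail: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 75/100 (as in the MMLU-Pro row) vs. a hypothetical variant at 65/100.
z, p_value = two_proportion_z(75, 100, 65, 100)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

At n=100 per arm, a 10pp gap like this one does not reach the conventional 0.05 threshold. The gap needed depends on how close the scores are to 50%, so the 15pp figure is best read as a conservative rule of thumb.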

Citation

@misc{aesopbase2026,
  title={PCA-Based Refusal Ablation on MoE Models: What Survives Fine-Tuning?},
  author={Fontes, C.},
  year={2026},
  note={Ablated base model — see research paper for full methodology}
}