GLM-5.2-Int8Mix-NVFP4-REAP-594B

Benchmarks so far: GPQA Diamond 86.87 (≈97% of full NVFP4) · SciCode 47.77 (≈1.3 pts under full NVFP4). IFBench / τ²-Bench Telecom pending.

A REAP-pruned (≈22% of experts removed) Int8-mix NVFP4 quantization of GLM-5.2, ≈594B parameters.

Evaluation

Measured under NVIDIA's evaluation protocol: temperature=1.0, top_p=0.95. GPQA Diamond used max_new_tokens=100000; the other benchmarks used max_new_tokens=64000 (SciCode via the official inspect_ai scorer, with-background). Full-model rows are NVIDIA's published figures for the unpruned GLM-5.2; the REAP rows are measured with reap-bench. "Intelligence lost" is the relative drop against the full NVFP4 row (same quant, so the comparison isolates the prune itself).
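
The sampling settings above can be written down as a small per-benchmark configuration. This is only a sketch: `SamplingConfig` and `EVAL_SETTINGS` are hypothetical names used for illustration and are not part of reap-bench, inspect_ai, or any NVIDIA tooling. The values are the ones stated above, listed for the two benchmarks reported so far.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for the generation settings quoted above."""
    temperature: float = 1.0
    top_p: float = 0.95
    max_new_tokens: int = 64000  # budget for every benchmark except GPQA Diamond


# Per-benchmark settings; only GPQA Diamond overrides the token budget.
EVAL_SETTINGS = {
    "gpqa_diamond": SamplingConfig(max_new_tokens=100000),
    "scicode": SamplingConfig(),
}

for name, cfg in EVAL_SETTINGS.items():
    print(name, cfg)
```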

Model	GPQA Diamond	SciCode	IFBench	τ²-Bench Telecom
GLM-5.2 FP8 — full (NVIDIA ref)	89.52	49.85	74.95	97.9
GLM-5.2 NVFP4 — full (NVIDIA ref)	89.39	49.04	75.81	98.25
GLM-5.2-Int8Mix-NVFP4-REAP-594B (this model) · ~22% prune	86.87	47.77	—	—
↳ intelligence lost vs full NVFP4	−2.8%	−2.6%	—	—
GLM-5.2-NVFP4-REAP-504B-term · ~34% prune	—	44.67	—	—
↳ intelligence lost vs full NVFP4	—	−8.9%	—	—
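
The "intelligence lost" rows can be reproduced from the scores in the table. The snippet below is a minimal sketch of that arithmetic, assuming the metric is simply the relative change against the full NVFP4 score; the function and variable names are illustrative and do not come from reap-bench.

```python
def relative_drop(pruned: float, reference: float) -> float:
    """Relative change of `pruned` vs `reference`, in percent (negative = loss)."""
    return (pruned - reference) / reference * 100.0


# Scores copied from the table above.
FULL_NVFP4 = {"GPQA Diamond": 89.39, "SciCode": 49.04}
REAP_594B = {"GPQA Diamond": 86.87, "SciCode": 47.77}
REAP_504B_TERM = {"SciCode": 44.67}  # GPQA Diamond not reported for this model

for label, scores in [("REAP-594B", REAP_594B), ("REAP-504B-term", REAP_504B_TERM)]:
    for bench, score in scores.items():
        drop = relative_drop(score, FULL_NVFP4[bench])
        print(f"{label} {bench}: {drop:+.1f}%")
```

Rounded to one decimal place, this gives −2.8% and −2.6% for the 594B model and −8.9% for REAP-504B-term on SciCode, matching the table.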

GPQA Diamond: 172/198 correct, 0 errors (reasoning_effort=max). SciCode (with-background): 139/291 subproblems = 47.77%, 11/65 problems fully solved (16.92%), 65/65 samples, 0 errors. So far, ≈97% of the full NVFP4 model's score is retained on both measured benchmarks for an ≈22% expert prune. On SciCode, the only benchmark with a reported REAP-504B-term score here, the 594B scores 3.1 points above the more aggressively pruned REAP-504B-term (168 experts). IFBench / τ²-Bench Telecom pending.
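
As a sanity check, the percentages above follow directly from the raw counts. This is a small illustrative sketch; the helper name `percent` is an assumption, and the full NVFP4 reference scores are taken from the table.

```python
def percent(correct: int, total: int) -> float:
    """Share of correct items, in percent, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


gpqa = percent(172, 198)          # GPQA Diamond accuracy
scicode_sub = percent(139, 291)   # SciCode subproblem accuracy
scicode_full = percent(11, 65)    # SciCode problems fully solved

# Retention relative to the full NVFP4 scores from the table.
gpqa_retained = round(100.0 * gpqa / 89.39, 1)
scicode_retained = round(100.0 * scicode_sub / 49.04, 1)

print(gpqa, scicode_sub, scicode_full)   # 86.87 47.77 16.92
print(gpqa_retained, scicode_retained)   # 97.2 97.4
```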

Datasets: GPQA Diamond (gpqa_diamond.csv, 198 Q) — Rein et al., arXiv:2311.12022. SciCode via the official inspect_ai harness. Harness: reap-bench.


Safetensors

Model size: 327B params

Tensor types: BF16 · I64 · I32 · F32 · F8_E4M3


Paper for madeby561/GLM-5.2-Int8Mix-NVFP4-REAP-594B: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv:2311.12022, published Nov 20, 2023)