GLM-5.2-Int8Mix-NVFP4-REAP-594B
Benchmarks in: GPQA Diamond 86.87 (≈97% of full NVFP4) · SciCode 47.77 (≈1.3 pts under full NVFP4). IFBench / τ²-Bench Telecom pending.
A REAP-pruned (≈22% of experts removed) Int8-mix NVFP4 quantization of GLM-5.2, ≈594B parameters.
Evaluation
Measured under NVIDIA's evaluation protocol: `temperature=1.0`, `top_p=0.95`; GPQA Diamond used `max_new_tokens=100000`, the others used `max_new_tokens=64000` (SciCode via the official `inspect_ai` scorer, with-background). Full-model rows are NVIDIA's published figures for the unpruned GLM-5.2; the REAP rows are measured with reap-bench. Intelligence lost = relative drop vs full NVFP4 (same quantization, so the comparison isolates the effect of the prune itself).
| Model | GPQA Diamond | SciCode | IFBench | τ²-Bench Telecom |
|---|---|---|---|---|
| GLM-5.2 FP8, full (NVIDIA ref) | 89.52 | 49.85 | 74.95 | 97.9 |
| GLM-5.2 NVFP4, full (NVIDIA ref) | 89.39 | 49.04 | 75.81 | 98.25 |
| GLM-5.2-Int8Mix-NVFP4-REAP-594B (this model) · ~22% prune | 86.87 | 47.77 | – | – |
| ↳ intelligence lost vs full NVFP4 | −2.8% | −2.6% | – | – |
| GLM-5.2-NVFP4-REAP-504B-term · ~34% prune | – | 44.67 | – | – |
| ↳ intelligence lost vs full NVFP4 | – | −8.9% | – | – |
GPQA Diamond: 172/198 correct, 0 errors (`reasoning_effort=max`). SciCode (with-background): 139/291 subproblems = 47.77%, 11/65 problems fully solved (16.92%), 65/65 samples, 0 errors. So far ≈97% of the full NVFP4 model's measured intelligence is retained for an ≈22% expert prune, and on both axes the 594B clearly beats the more aggressively pruned REAP-504B-term (168 experts). IFBench / τ²-Bench Telecom pending.
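As a rough check on the arithmetic, the sketch below recomputes the scores from the raw counts and the "intelligence lost" rows as a relative drop against the full NVFP4 baseline. The function and variable names are illustrative only and are not part of reap-bench; the numbers are copied from the table and counts above.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Score as a percentage of items answered correctly."""
    return 100.0 * correct / total


def relative_drop_pct(pruned: float, baseline: float) -> float:
    """Relative change of the pruned score vs the baseline, in percent (negative = loss)."""
    return 100.0 * (pruned - baseline) / baseline


# NVIDIA's published figures for the unpruned NVFP4 model (from the table).
FULL_NVFP4 = {"GPQA Diamond": 89.39, "SciCode": 49.04}

# Pruned-model scores: the 594B rows are recomputed from the reported counts.
REAP_594B = {
    "GPQA Diamond": accuracy_pct(172, 198),  # 172/198 correct
    "SciCode": accuracy_pct(139, 291),       # 139/291 subproblems
}
REAP_504B_TERM = {"SciCode": 44.67}

for name, scores in (("REAP-594B", REAP_594B), ("REAP-504B-term", REAP_504B_TERM)):
    for bench, score in scores.items():
        drop = relative_drop_pct(score, FULL_NVFP4[bench])
        print(f"{name:15s} {bench:13s} {score:6.2f}  {drop:+.1f}%")
```

Rounded to one decimal place, the relative drops match the table: −2.8% and −2.6% for the 594B, and −8.9% on SciCode for the 504B-term.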
Datasets: GPQA Diamond (`gpqa_diamond.csv`, 198 Q), from Rein et al., arXiv:2311.12022. SciCode via the official `inspect_ai` harness. Harness: reap-bench.