GLM-5.2 โ€” Q3_K / Q8_0 GGUF (CPU-expert quant for older hardware)

GGUF quantization of zai-org/GLM-5.2 (745B total / 40B active, glm-dsa MoE with MLA attention) built for one specific job: fast CPU-expert MoE inference on older / "e-waste" hardware โ€” a dual-socket Xeon (or similar) with lots of RAM and a single modest GPU.

The routed experts are quantized to Q3_K (a plain K-quant) instead of the codebook IQ-quants used by the popular "dynamic" packs, because K-quant dequant is much faster on pre-AVX-512 CPUs โ€” and on a CPU-expert setup that dequant is the decode bottleneck. Same size, equal-or-better quality, faster decode. See below.


Quick start

Routed experts in system RAM, attention + KV on one GPU. ngram speculative decoding is on by default โ€” it's a large win on repetitive / code / structured output and harmless on prose (see Speculative decoding):

GGML_CUDA_NO_PINNED=1 numactl --interleave=all \
./llama-server \
  --model GLM-5.2-Q3_K-Q8_0-00001-of-0000N.gguf \
  -ngl 999 -ot 'ffn_.*_exps=CPU' \
  -fa on -ctk q8_0 -ctv q8_0 -c 16384 \
  -t 42 \
  --spec-type ngram-cache \
  --jinja
  • -ot 'ffn_.*_exps=CPU' keeps the 256 routed experts in RAM while attention, KV, the shared expert and the router stay on the GPU (~18 GB VRAM). The shared expert fires every token โ€” it belongs in HBM, not on the CPU path. Equivalents: --cpu-moe (same placement, one flag); --n-cpu-moe N (keep only the first N layers' experts on CPU, offload the rest if you have spare VRAM).
  • -t 42 โ‰ˆ physical cores + some hyperthreads on a 2ร—14c box โ€” tune to your core count.
  • GGML_CUDA_NO_PINNED=1 avoids pinning ~300 GB of host memory; numactl --interleave=all helps on dual-socket boxes.
  • All-CPU (no GPU) also works: drop the GPU flags and use -ngl 0.
  • Runs on llama.cpp and ik_llama.cpp (for ik_llama.cpp, swap the speculator: --spec-type suffix).

Do not add --run-time-repack / -rtr. It allocates a second full copy of the model in RAM, which page-thrashes (and can OOM) on the RAM-constrained boxes this quant targets, for no decode benefit at batch=1.


Speculative decoding

This is a 256-expert / 8-active MoE with the experts on the CPU, which makes the kind of speculation matter a lot.

ngram (recommended, on by default)

--spec-type ngram-cache (llama.cpp) / --spec-type suffix (ik_llama.cpp) drafts tokens from the recent context โ€” no draft model, no extra weights. On repeated spans the drafted tokens route to the same experts already in the verify batch, so the MoE verify cost doesn't blow up. It's workload-gated โ€” it fires only when the output actually repeats:

output type speedup vs no-spec
verbatim / highly repetitive +80 %
CSV / structured records +52 %
boilerplate / templated code +37 %
general prose ~0 % (never fires)
novel, non-repetitive code โˆ’5 %

On the reference box that's roughly 3.5 โ†’ 5โ€“7 tok/s on agentic / code-echo / templated workloads. It costs ~5 % on novel code (it drafts on partial matches that miss), so if your traffic is purely free-form prose/code you can drop --spec-type ngram-cache; for agentic, tool-use, refactoring, structured-output, or any repetitive workload, leave it on.

MTP โ€” DO NOT use on this model

The GGUF retains the nextn / MTP head (blk.N.nextn.*) and both engines support --spec-type draft-mtp, but it is a hard loss here: ~โˆ’50 % at every draft depth, even at 100 % draft acceptance. The batched verify of N drafted tokens reads the union of their routed experts (8 active ร— N mostly-disjoint sets), so each speculated token ~doubles the per-step expert traffic โ€” acceptance can't pay that back. Don't enable MTP for this MoE.


Why Q3_K experts

The popular "dynamic" Q3 packs (e.g. UD-Q3_K_XL) quantize the routed experts with IQ3_XXS / IQ4_XS โ€” codebook quants that are excellent for size-per-quality, but whose dequant relies on a 256-entry grid gather. On pre-AVX-512 CPUs (Haswell/Broadwell and older) there is no fast gather โ€” it's emulated โ€” so on a CPU-expert setup that codebook dequant becomes the decode bottleneck.

This quant uses Q3_K for the experts (shift/mask + a block scale โ€” no codebook, no gather) and Q8_0 for everything else. At the same size it's measurably faster to decode with equal-or-better quality, and the gap widens on weaker/older CPUs.

Composition

tensors type
routed experts (ffn_*_exps) Q3_K (~3.44 bpw)
MLA attention, shared expert, dense FFN, token/output embeddings, norms Q8_0
  • imatrix: computed on wikitext-2 (200 ร— 512-token chunks), applied to the experts.
  • Size: ~291 GiB. Needs ~300 GB RAM for the experts (plus a GPU or more RAM for attention/KV).

Quality โ€” perplexity (matches the IQ3 pack, edge to this one)

wikitext-2 test, ctx 512, 100 chunks, identical all-CPU config:

quant experts PPL โ†“
unsloth/UD-Q3_K_XL IQ3_XXS / IQ4_XS 2.8784 ยฑ 0.036
this quant โ€” RTN (no imatrix) Q3_K 2.8798 ยฑ 0.036
this quant โ€” imatrix Q3_K + imatrix 2.8265 ยฑ 0.035

The imatrix Q3_K/Q8_0 quant matches and slightly beats the IQ3 dynamic pack at the same size, and the imatrix is a clean improvement over RTN (2.880 โ†’ 2.827).

Decode speed (the +5 %)

Reference box: 2ร— Xeon E5-2690 v4 (14c each, AVX2, no AVX-512), 1ร— MI100 holding attention + KV, all 256 experts in system RAM:

quant decode tok/s โ†‘
unsloth/UD-Q3_K_XL (IQ experts) 3.35
this quant (Q3_K experts) 3.53 (+5 %)

The +5 % is purely the cheaper expert dequant (K-quant shift/mask vs IQ codebook gather); it grows on older CPUs with weaker gather. Decode is CPU-dequant-bound, so it scales with memory bandwidth and core count โ€” AVX-512 / more memory channels go faster. (And --spec-type ngram-cache adds the speculation uplift on top.)


Build provenance

zai-org/GLM-5.2 (BF16 safetensors) โ†’ convert_hf_to_gguf.py (glm-dsa) โ†’ Q8_0 GGUF โ†’ llama-quantize --allow-requantize --imatrix <wikitext.imatrix> --custom-q 'ffn_.*_exps=q3_K' <q8_0> <out> Q8_0.

ik_llama.cpp loader note: some builds require the DSA-indexer tensors (blk.N.indexer.*), but mainline's converter writes them as optional and only on a subset of layers. They are loaded-but-unused in inference, so if loading fails with check_tensor_dims: ...indexer.k_norm.weight not found, mark those create_tensor calls TENSOR_NOT_REQUIRED.

Downloads last month
129
GGUF
Model size
753B params
Architecture
glm-dsa
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for SixVolts/GLM-5.2-ewaste-edition-GGUF

Base model

zai-org/GLM-5.2
Quantized
(62)
this model