GLM-4.7-Flash β€” Asymmetric 2-bit-Expert GGUF

An asymmetric, imatrix-calibrated GGUF quant of GLM-4.7-Flash (30B-A3B MoE, glm4moe arch β€” internally served via the DeepSeek2 MLA + MoE path).

The design goal: push the routed experts (the overwhelming majority of the weights, but only ~3B active per token) down to 2–3 bits, while keeping every component that is touched on every token at high precision. The result fits a 16 GB GPU with room for a useful context window.

Asymmetric scheme

Component Tensor pattern Type Rationale
Routed experts β€” gate, up ffn_gate_exps, ffn_up_exps IQ2_S bulk of weights, sparsely active
Routed experts β€” down ffn_down_exps IQ3_S down-proj is more quant-sensitive
Shared expert ffn_*_shexp Q6_K active every token
Dense block-0 FFN blk.0.ffn_{gate,up,down} Q6_K dense layer, active every token
Attention (MLA) attn_* Q4_K (attn_k_b→Q5_0, 192-col fallback) small, latency-critical
Token embedding token_embd Q6_K shared in/out vocabulary
Output head output Q6_K logit quality
Base / everything else β€” IQ3_S

Built with the Hyperspace prism fork's llama-quantize using a repeatable --tensor-type REGEX=TYPE plan + --imatrix.

Provenance

  • Source: unsloth/GLM-4.7-Flash-GGUF β†’ BF16 (BF16/GLM-4.7-Flash-BF16-*.gguf, the highest-precision GGUF in the repo).
  • imatrix: bartowski calibration_datav3.txt, 125 chunks @ ctx 512, computed on the BF16 source (imatrix.dat included).
  • Quantize: base IQ3_S + the per-tensor overrides above; --token-embedding-type q6_K --output-tensor-type q6_K.

Size & quality

This quant (asym 2-bit-exp) Q4_K_M baseline
On-disk 10.67 GB (3.06 BPW) 18.31 GB
Wikitext-2 PPL (200 chunks, ctx 512) 10.7749 10.0863

PPL delta: +6.83% for a ~42% smaller file.

16 GB fit

GLM-4.7-Flash uses MLA, so the KV cache is unusually small (compressed latent ~576 elems/layer Γ— 47 layers):

  • Weights: 10.67 GB
  • KV @ 16k ctx (f16): ~0.89 GB
  • KV @ 32k ctx (f16): ~1.77 GB
    • compute/context buffers: ~1–2 GB

β†’ ~12.5–13 GB total at 16k ctx, comfortably inside 16 GB VRAM (32k also fits).

Usage (thinking model)

GLM-4.7-Flash is a reasoning model. To disable the thinking trace, pass chat_template_kwargs: {"enable_thinking": false} with --jinja:

llama-server -m GLM-4.7-Flash-asym-2bitexp.gguf -ngl 99 -c 16384 --jinja
# then POST /v1/chat/completions with:
#   "chat_template_kwargs": {"enable_thinking": false}

Coherence verified on a coding prompt (correct memoized fib, fib(10)=55) and a short reasoning prompt.

Caveats

2-bit routed experts carry a measurable quality cost vs Q4_K_M (see PPL). On adversarial logic riddles the model can occasionally slip; for general coding/chat/reasoning under a tight VRAM budget it stays coherent. Use Q4_K_M or higher if you have the memory.

Downloads last month
260
GGUF
Model size
30B params
Architecture
deepseek2
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for hyperspaceai/GLM-4.7-Flash-asym-2bitexp-GGUF

Quantized
(85)
this model