GLM-4.7-Flash: Core AI (gather_qmm kernel, 2.6× faster)

Apple Core AI (.aimodel) conversion of zai-org/GLM-4.7-Flash (text decoder): MLA attention + a 64-expert top-4 sparse MoE (+ non-gated shared expert). 30B total / **3B active per token**, which makes it a strong local coder.
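To make the "64-expert top-4 (+ non-gated shared expert)" structure concrete, here is a minimal pure-Python sketch of one MoE step. It is illustrative only: the softmax over the selected scores, the scalar toy experts, and the names `route_top_k` and `moe_output` are assumptions made for this example, not the routing code used by GLM-4.7-Flash or by the conversion script.

```python
import math

NUM_EXPERTS, TOP_K = 64, 4  # expert count and top-k from the card

def route_top_k(scores, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest router scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumption: weights are a softmax over the selected scores only.
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_output(x, scores, experts, shared_expert):
    """Weighted sum of the routed experts plus an ungated shared expert."""
    out = shared_expert(x)  # non-gated: always added with weight 1
    for idx, w in route_top_k(scores):
        out += w * experts[idx](x)
    return out

# Hypothetical toy experts: expert i multiplies its scalar input by i.
experts = [lambda x, i=i: i * x for i in range(NUM_EXPERTS)]
shared = lambda x: x

scores = [0.0] * NUM_EXPERTS
for i in (5, 9, 20, 33):
    scores[i] = 10.0  # these four experts win the routing

picked = route_top_k(scores)
y = moe_output(2.0, scores, experts, shared)
print([i for i, _ in picked], y)
```

Only 4 of the 64 expert functions are ever called for the token, which is the property that keeps the active parameter count far below the total.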

Part of the community Core AI model zoo: https://github.com/john-rocky/coreai-model-zoo (full card: zoo/glm-4.7-flash.md).

The gather_qmm kernel: 20.3 → 52.4 tok/s (2.6×)

Apple's GatherMM reads the weights of all 64 experts for every token. A custom coreai_torch.TorchMetalKernel reads only the 4 routed experts (4/64), so decode runs at active-parameter bandwidth: 52.4 tok/s, a 2.6× speedup. This is the biggest relative gain of the zoo's three MoE gather ports, because it removes a 16× over-read.
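The arithmetic behind the 16× over-read and the 2.6× speedup can be checked with a short sketch. The per-expert byte count and the routed expert indices below are made-up placeholders (only the ratio matters), and the helper names are hypothetical; the tok/s figures are the ones reported in this card.

```python
NUM_EXPERTS = 64
TOP_K = 4
BYTES_PER_EXPERT = 1_000_000  # hypothetical placeholder, only the ratio matters

def bytes_read_dense(num_experts: int, bytes_per_expert: int) -> int:
    """Every expert's weights are streamed for each token."""
    return num_experts * bytes_per_expert

def bytes_read_gathered(routed: list, bytes_per_expert: int) -> int:
    """Only the distinct routed experts' weights are streamed."""
    return len(set(routed)) * bytes_per_expert

routed_experts = [3, 17, 42, 60]  # hypothetical router output for one token
assert len(routed_experts) == TOP_K

dense = bytes_read_dense(NUM_EXPERTS, BYTES_PER_EXPERT)
gathered = bytes_read_gathered(routed_experts, BYTES_PER_EXPERT)

over_read = dense / gathered   # expert-weight traffic ratio
speedup = 52.4 / 20.3          # decode tok/s after / before, from the card

print(f"over-read removed: {over_read:.0f}x, measured speedup: {speedup:.1f}x")
```

The measured speedup is smaller than the reduction in expert-weight traffic, presumably because expert weights are not the only per-token cost (attention still runs on every layer).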

Quality is clean and unchanged. The kernel reads the sym8 scheme, which is the same symmetric-linear int8 (per-K-block-32) recipe that the standard int8 bundle uses, through a bit-exact gather: 0 introduced flips / 18 vs fp16. It is a pure speed win at the same quality.
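As a rough illustration of why gathering quantized weights does not change results, the sketch below applies symmetric-linear int8 quantization with one scale per block of 32 values, then compares "dequantize everything, then select" against "select the routed experts, then dequantize". The toy weights, the routed indices, the 127 clamp range, and the function names are assumptions for this example; the actual sym8 storage layout used by the bundle is not shown here.

```python
BLOCK = 32  # per-K-block size named in the card

def quantize_sym8(row):
    """Symmetric linear int8: one scale per block of BLOCK values, no zero point."""
    q, scales = [], []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in block)
    return q, scales

def dequantize_sym8(q, scales):
    """Multiply each int8 value by the scale of the block it belongs to."""
    return [q[i] * scales[i // BLOCK] for i in range(len(q))]

# Hypothetical toy weights: 8 experts, each a single row of 64 values.
experts = [[((e * 31 + k * 7) % 41 - 20) / 10.0 for k in range(64)]
           for e in range(8)]
packed = [quantize_sym8(row) for row in experts]

routed = [1, 4, 6, 7]  # hypothetical top-4 routing for one token

# Path A: dequantize every expert, then select the routed ones.
all_dequant = [dequantize_sym8(q, s) for q, s in packed]
path_a = [all_dequant[e] for e in routed]

# Path B: gather the routed experts' int8 data first, then dequantize.
path_b = [dequantize_sym8(*packed[e]) for e in routed]

print(path_a == path_b)
```

Because the gather only changes which quantized blocks are read, not how they are decoded, both paths produce identical values.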

| bundle | size | decode tok/s | quality |
|---|---|---|---|
| gpu-pipelined/glm_4_7_flash_decode_sym8_gather/ | 30 GB | 52.4 | clean (0 flips/18 vs fp16) ✅ |

Mac-only (30 GB int8). The remaining speed lever is absorbed-MLA: GLM currently runs full MLA on all 47 layers.

Run

COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model gpu-pipelined/glm_4_7_flash_decode_sym8_gather -p 128 -g 256 -n 3

Convert your own with conversion/export_glm47_moe_metal_decode_pipelined.py.

License

MIT (upstream GLM license). Conversion + gather_qmm kernel: community.
