gdubicki/GLM-4.7-Flash-NVFP4

Public mirror of GadflyII/GLM-4.7-Flash-NVFP4.

This mirror exists to provide a pinned, stable reference for deployment on DGX Spark (GB10). Use the upstream repo if you want to track author updates.

Credits

Model details

  • Architecture: glm4_moe_lite — GLM-4 MoE (lite variant)
  • Parameters: ~31B total (MoE, fraction active per token)
  • Quantization: NVFP4 via compressed-tensors
  • KV cache: FP8 (recommended on GB10)
  • Max context: inherits from base (zai-org/GLM-4.7-Flash)

Usage

docker run --rm --runtime=nvidia --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  vllm/vllm-openai:cu130-nightly \
  gdubicki/GLM-4.7-Flash-NVFP4 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.30 \
  --enable-chunked-prefill \
  --enable-prefix-caching
Downloads last month
175
Safetensors
Model size
18B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for gdubicki/GLM-4.7-Flash-NVFP4

Quantized
(1)
this model