GLM-4.7-Flash-NVFP4-GGUF

GGUF quantization of zai-org/GLM-4.7-Flash โ€” a 30B-parameter Mixture-of-Experts language model with ~3.2B active parameters per token, built on the DeepSeek2 architecture with Multi-head Latent Attention (MLA) and 64 routed experts.

Quantized to NVFP4 format for efficient inference with minimal quality loss.

About NVFP4

NVFP4 is NVIDIA's native 4-bit floating-point format (E4M3) for Blackwell GPUs. It stores weights in FP4 with a shared per-block scale, enabling native Blackwell tensor core acceleration with no dequantization overhead during inference. Compared to INT4 formats, NVFP4 offers better dynamic range (E4M3 vs E2M1) and maintains higher quality at similar bit widths.

Files

Filename Type Size Description
glm-4.7-flash-nvfp4.gguf GGUF (NVFP4) 15.79 GB Quantized model weights
README.md Markdown - Model card

Quantization Details

Property Value
Format NVFP4
Bits Per Weight 4.53 BPW
File Size 15.79 GB
Tensor Count 844
Architecture DeepSeek2 (custom for GLM-4.7-Flash)

Model Description

  • Developer: Zhipu AI
  • Architecture: Mixture-of-Experts (MoE) with DeepSeek2-style MLA
  • Parameters: ~30B total, ~3.2B active per token
  • Context Length: 200,000 tokens
  • Layers: 47 transformer layers
  • Attention: Multi-head Latent Attention (q_lora_rank=768, kv_lora_rank=512)
  • Experts: 64 routed experts (4 per token) + 1 shared expert
  • Vocab Size: 151,936
  • Languages: English, Chinese
  • Thinking: Enabled by default (native <think>/</think> tokens, hidden in history for clean multi-turn reasoning)
  • Pipeline: text-generation only (no vision encoder)

Usage

llama.cpp

# Basic generation
./llama-cli -m glm-4.7-flash-nvfp4.gguf \
  -p "Hello, how are you?" \
  -n 256

# With thinking/reasoning controlled
./llama-cli -m glm-4.7-flash-nvfp4.gguf \
  -p "Solve this step by step: 23 * 47" \
  -n 512 \
  -no-cnv

HuggingFace Hub

from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF",
    filename="glm-4.7-flash-nvfp4.gguf",
    repo_type="model"
)

Pipeline Commands

Source: zai-org/GLM-4.7-Flash (58 GB, 48 safetensor shards)

  1. F16 GGUF Conversion:

    python convert_hf_to_gguf.py D:\AI_MODELS\glm-4.7-src --outfile glm-4.7-f16.gguf --outtype f16
    

    Output: 55.79 GB, 844 tensors (DeepSeek2 arch, Glm4MoeLiteModel)

  2. NVFP4 Quantization:

    llama-quantize.exe glm-4.7-f16.gguf glm-4.7-flash-nvfp4.gguf NVFP4
    

    Duration: ~310s on RTX 5060 Ti

Hardware

Component Specification
GPU NVIDIA RTX 5060 Ti 16 GB (Blackwell)
System RAM 64 GB
Storage D: (NVMe)

License

MIT โ€” same as the original zai-org/GLM-4.7-Flash.

Downloads last month
-
GGUF
Model size
30B params
Architecture
deepseek2
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for FreedomAISVR/GLM-4.7-Flash-NVFP4-GGUF

Quantized
(83)
this model