GLM-4.6V-Flash NVFP4 GGUF

Quantized GGUF version of zai-org/GLM-4.6V-Flash by Z.ai (Zhipu AI), converted to NVFP4 (4-bit NVIDIA FP4) format.

Model Details

  • Base model: zai-org/GLM-4.6V-Flash, a 9B-parameter vision-language model by Z.ai with 40 transformer layers, a 4096 hidden dimension, 32 attention heads (8 KV heads), and SwiGLU activation. Paper: arXiv:2507.01006.
  • Vision encoder: 24-layer ViT (1536 hidden dim, 1536/4096 attention dim, 13696 intermediate FFN)
  • Context length: 128K tokens
  • Quantization: NVFP4, NVIDIA's 4-bit floating-point format with per-group UE4M3 scales (4.64 bits per weight on average, 5.08 GB)
  • Thinking: Enabled by default (native <think>/</think> tokens; opt out via enable_thinking=false)
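
Because thinking is on by default, raw completions may contain a <think>...</think> span ahead of the answer. The helper below is a minimal sketch for separating the two in client code. The name split_thinking is hypothetical, and the sketch assumes the tags arrive verbatim in the returned text rather than being stripped by the runtime.

```python
import re

# Matches one <think>...</think> span, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) extracted from raw model output.

    Hypothetical helper: assumes the <think> tags appear literally in the text.
    """
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>Two animals visible.</think>\nA cat and a dog.")
print(answer)
```

If the runtime already removes the reasoning span, the function returns an empty reasoning string and the text unchanged.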

Files

File                              Size     Description
glm-4.6v-flash-nvfp4.gguf         5.08 GB  Quantized text model (523 tensors, 4.64 BPW)
mmproj-glm-4.6v-flash-f16.gguf    1.66 GB  Vision encoder projector (182 tensors, F16)
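
The tensor counts listed above can be checked after download by reading the fixed-size GGUF header. The following is a rough sketch, assuming a little-endian GGUF file of version 2 or later (magic, 32-bit version, 64-bit tensor count, 64-bit metadata count). The function name read_gguf_header is illustrative and not part of any library.

```python
import struct


def read_gguf_header(path):
    """Read the fixed 24-byte GGUF header (assumes little-endian, GGUF v2+)."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", raw)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling read_gguf_header("glm-4.6v-flash-nvfp4.gguf")["tensor_count"] should report 523 if the download is complete and the table above is accurate.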

Usage

LM Studio

Load both files — the text GGUF as the main model and the mmproj as the vision encoder. Supports multimodal inputs (images + text).

llama.cpp

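# llama-mtmd-cli is the multimodal CLI in current llama.cpp builds (it replaces the older llama-llava-cli)
./llama-mtmd-cli \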
  -m glm-4.6v-flash-nvfp4.gguf \
  --mmproj mmproj-glm-4.6v-flash-f16.gguf \
  -p "Describe this image in detail." \
  --image path/to/image.jpg

Python (llama-cpp-python)

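import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# llama-cpp-python loads the vision projector through a chat handler, not a
# Llama(mmproj=...) argument. Assumption: Llava15ChatHandler is used here as a
# generic handler; prefer a GLM-specific handler if your build provides one.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-glm-4.6v-flash-f16.gguf")

llm = Llama(
    model_path="glm-4.6v-flash-nvfp4.gguf",
    chat_handler=chat_handler,
    n_ctx=32768,  # leaves room for image embeddings plus the text prompt
)

# Local images are passed as base64 data URIs.
with open("image.jpg", "rb") as f:
    image_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

output = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_uri}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }]
)
print(output["choices"][0]["message"]["content"])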

Quantization Details

  • Source: zai-org/GLM-4.6V-Flash → F16 GGUF → llama-quantize.exe NVFP4
  • Block size: 64 elements with per-group UE4M3 scales (4 scales per block, i.e. one scale per 16 elements)
  • Output tensor: Q6_K (higher precision for the final projection)
  • Architecture: glm4 with 523 tensors (40 transformer layers, vision embedder)
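
The block layout above implies a per-block storage cost that can be worked out directly. This is a back-of-the-envelope sketch, assuming each element takes 4 bits and each UE4M3 scale is stored in one byte, with no other per-block metadata. The constant and function names are illustrative.

```python
BLOCK_ELEMENTS = 64    # from the block size listed above
ELEMENT_BITS = 4       # FP4 payload per weight
SCALES_PER_BLOCK = 4   # one scale per 16-element group
SCALE_BITS = 8         # assumption: each UE4M3 scale occupies one byte


def block_bits_per_weight(elements=BLOCK_ELEMENTS, element_bits=ELEMENT_BITS,
                          scales=SCALES_PER_BLOCK, scale_bits=SCALE_BITS):
    """Average storage cost per weight for one quantized block."""
    total_bits = elements * element_bits + scales * scale_bits
    return total_bits / elements


nvfp4_bpw = block_bits_per_weight()
print(nvfp4_bpw)
```

Under these assumptions an NVFP4 block costs 4.5 bits per weight. The file-level average of 4.64 BPW is higher, which is consistent with the output tensor being kept at Q6_K and with any other tensors stored at higher precision.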

Hardware Compatibility

  • Requires NVIDIA Blackwell (RTX 50 series) for native FP4 compute via CUDA
  • Falls back to FP4 dequantization on older GPUs (slower but functional)
  • CPU inference supported via software dequant (significantly slower)
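
To illustrate what the software dequantization path has to do, here is a simplified sketch that decodes one 16-element group. It assumes the standard E2M1 value set for FP4 elements and an unsigned E4M3 scale with exponent bias 7. The actual llama.cpp memory layout, nibble packing order, and special-value handling are not covered, and all names are illustrative.

```python
# Magnitudes representable by a 3-bit E2M1 code (the fourth bit is the sign).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode a 4-bit FP4 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    magnitude = E2M1_VALUES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_ue4m3(byte):
    """Decode an unsigned E4M3 scale (assumption: bias 7, NaN encoding ignored)."""
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:
        # Subnormal range: no implicit leading one.
        return (mantissa / 8.0) * 2.0 ** -6
    return (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_group(scale_byte, codes):
    """Multiply each decoded FP4 element by the group's shared scale."""
    scale = decode_ue4m3(scale_byte)
    return [scale * decode_e2m1(code) for code in codes]
```

On hardware without native FP4 support, this per-element multiply is extra work before each matrix multiplication, which is one reason the fallback paths are slower.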