Qwen3.5-9B-NVFP4-GGUF

NVFP4 GGUF quantization of Qwen3.5-9B, Alibaba Cloud's efficient 9B multimodal foundation model with 262K context, 201 languages, and hybrid Gated DeltaNet + Gated Attention architecture.

Optimized for NVIDIA Blackwell GPUs with native FP4 tensor core acceleration.

About NVFP4

What is NVFP4?

NVFP4 is NVIDIA's native 4-bit floating-point quantization format introduced with the Blackwell architecture (SM120+). Unlike traditional integer quantization (Q4_0, Q4_K_M, etc.), NVFP4 stores weights in FP4 (E4M3) format — a 4-bit floating-point representation with 1 sign bit, 4 exponent bits, and 3 mantissa bits.

The key difference from INT4 formats:

Property NVFP4 (FP4) INT4 (Q4_X)
Representation Floating-point Integer
Dynamic range ~±240 ~±7
Block size 16 32
Scale format FP16 (E4M3) FP16
Hardware support Blackwell SM120+ (native tensor cores) All GPUs (software)
Zero-shot perplexity Near-identical to FP16 Slight degradation

Why use NVFP4?

  1. Blackwell-native acceleration: NVFP4 is processed natively on Blackwell FP4 tensor cores, delivering up to 2x throughput vs INT4 software kernels on the same hardware.

  2. Better dynamic range: Floating-point 4-bit preserves more information for outlier weights compared to integer quantization, resulting in lower perplexity degradation.

  3. Memory efficiency: At ~4.74 bits per weight (BPW), a 9B model fits in ~5 GB — well within 16 GB VRAM with room for 262K context.

  4. No dequantization overhead: Unlike INT4 formats that require runtime dequantization, FP4 operates directly on tensor cores for both compute and memory bandwidth.

When to use NVFP4 vs other formats

  • NVFP4: Best choice if you have a Blackwell GPU (RTX 5060 Ti, RTX 5090, B200, etc.)
  • Q4_K_M / Q4_0: Better for pre-Blackwell GPUs (Ampere, Ada Lovelace) or CPU inference
  • Q8_0 / F16: Use when maximum quality is needed and memory is not a constraint

Files

Filename Type Size Description
qwen3.5-9b-nvfp4.gguf NVFP4 5.31 GB Quantized text model weights
mmproj-qwen3.5-9b-nvfp4-f16.gguf F16 mmproj 918 MB Vision encoder projector

Quantization Details

Property Value
Format NVIDIA NVFP4 (E4M3 FP4)
Block size 16
Effective BPW 4.74
Hardware target Blackwell SM120+
VRAM required (text) ~5 GB
VRAM required (text + vision + 32K ctx) ~7 GB

Model Description

Qwen3.5-9B features:

  • Hybrid architecture: Alternating linear attention (Gated DeltaNet) and full attention layers for efficient long-context processing
  • 262K context window: Native support for 262,144 token sequences
  • Vision capabilities: Built-in vision encoder with 27-layer ViT for image understanding
  • Multi-Token Prediction: MTP head enabling speculative decoding for faster generation
  • Multilingual: Strong performance across 201 languages

Usage

llama.cpp (CLI)

# Text + Image
llama-cli \
  -m qwen3.5-9b-nvfp4.gguf \
  --mmproj mmproj-qwen3.5-9b-nvfp4-f16.gguf \
  --image photo.jpg \
  -p "Describe this image in detail"

# Text only
llama-cli \
  -m qwen3.5-9b-nvfp4.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 512

# OpenAI-compatible server
llama-server \
  -m qwen3.5-9b-nvfp4.gguf \
  --mmproj mmproj-qwen3.5-9b-nvfp4-f16.gguf \
  --port 8080

llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="FreedomAISVR/Qwen3.5-9B-NVFP4-GGUF",
    filename="qwen3.5-9b-nvfp4.gguf",
    n_gpu_layers=-1,
)

response = llm.create_chat_completion([
    {"role": "user", "content": "What is the capital of France?"}
])
print(response["choices"][0]["message"]["content"])

Direct download

from huggingface_hub import hf_hub_download

for filename in ["qwen3.5-9b-nvfp4.gguf", "mmproj-qwen3.5-9b-nvfp4-f16.gguf"]:
    hf_hub_download(
        repo_id="FreedomAISVR/Qwen3.5-9B-NVFP4-GGUF",
        filename=filename,
        local_dir="./models"
    )

Quantization Pipeline

1. Download source weights
   huggingface_hub.snapshot_download("Qwen/Qwen3.5-9B")

2. Convert text model to F16 GGUF
   convert_hf_to_gguf.py --outtype f16

3. Extract vision encoder
   convert_hf_to_gguf.py --mmproj --outtype f16

4. Quantize to NVFP4
   llama-quantize qwen3.5-9b-f16.gguf qwen3.5-9b-nvfp4.gguf NVFP4

Quantization completed in ~5 minutes on the hardware below.

Hardware

Component Specification
GPU NVIDIA GeForce RTX 5060 Ti (Blackwell SM120)
VRAM 16 GB GDDR7
System RAM 64 GB DDR4

License

The original Qwen3.5-9B is released under Apache 2.0. These quantized weights inherit that license.

Downloads last month
1,455
GGUF
Model size
9B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FreedomAISVR/Qwen3.5-9B-NVFP4-GGUF

Finetuned
Qwen/Qwen3.5-9B
Quantized
(287)
this model

Collection including FreedomAISVR/Qwen3.5-9B-NVFP4-GGUF