Qwen3.5-9B-Instruct-MTP-NVFP4-GGUF

NVFP4 GGUF quantization of Qwen/Qwen3.5-9B, an instruct-tuned multimodal 9B model with a hybrid Mamba2-Attention architecture, a 262K context window, and MTP (Multi-Token Prediction) support. NVFP4 stores 4-bit floating-point values together with FP8 (E4M3) block scales.

The weights are quantized to NVFP4, NVIDIA's 4-bit floating-point format, which Blackwell tensor cores execute natively. The quantized model averages 4.73 bits per weight (BPW).

About NVFP4

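| Feature | NVFP4 | INT4 (e.g., Q4_K_M) |
|---|---|---|
| Block size | 128 elements | 256-element super-blocks (K-quants) |
| Dynamic range | ±6 per element (FP4 E2M1), extended by FP8 E4M3 block scales | 0–15 per element (unsigned INT4), extended by per-block scale and minimum |
| Hardware target | Blackwell tensor cores | All GPUs / CPU |
| Dequantization overhead | None (native FP4 compute) | Required |
| BPW | ~4.73 | ~4.50–5.50 |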
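
To make the idea of a block-scaled 4-bit float concrete, here is a minimal sketch of quantizing one block. It assumes the E2M1 value grid that NVIDIA documents for NVFP4 elements, keeps the shared scale as an ordinary Python float rather than encoding it in 8 bits, and leaves the block length to the caller. The function names are illustrative; this is not the llama.cpp implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Quantize one block to the E2M1 grid with a single shared scale.

    Returns (scale, codes), where each code is a signed grid value.
    Assumption: the scale stays a Python float; a real NVFP4 block
    would store it in a compact 8-bit format.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    scale = max_abs / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and grid codes."""
    return [scale * c for c in codes]
```

Values that already sit on the scaled grid survive the round trip exactly; everything else is rounded to the nearest grid point, which is where the quantization error comes from.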

Files

| File | Type | Size | Description |
|---|---|---|---|
| qwen35-9b-mtp-nvfp4.gguf | NVFP4 quantized model | 5.08 GB | Main model weights with MTP head (33 layers, 442 tensors) |
| mmproj-qwen35-9b-f16.gguf | Vision encoder (F16) | 0.86 GB | Multimodal projector for image/video input |

Quantization Details

| Parameter | Value |
|---|---|
| Quantization format | NVFP4 (E4M3, block 128) |
| BPW | 4.73 |
| File type | 39 (NVFP4) |
| Hardware target | NVIDIA Blackwell (RTX 50-series) |
| MTP layers | 1 (nextn) |
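
Bits per weight ties file size to the number of stored weights. The sketch below shows the arithmetic, assuming 1 GB means 10^9 bytes and ignoring GGUF metadata; the helper names are illustrative.

```python
def implied_weight_count(file_size_bytes, bits_per_weight):
    """Number of weights implied by a file size at a given average BPW."""
    return file_size_bytes * 8 / bits_per_weight


def estimated_size_bytes(n_weights, bits_per_weight):
    """Approximate tensor-data size for a given weight count and BPW."""
    return n_weights * bits_per_weight / 8
```

Under these assumptions, `implied_weight_count(5.08e9, 4.73)` comes out at roughly 8.6 billion weights for the main model file.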

Original Model Description

Qwen3.5-9B is a 9-billion-parameter multimodal model from the Qwen team at Alibaba. Key features:

  • Hybrid Mamba2-Attention: Mamba2 layers interleaved with full attention layers (32 layers, 4:1 ratio)
  • Multimodal: Native image/video support via SigLIP vision encoder
  • 262K context window with MRoPE
  • MTP (Multi-Token Prediction): Multi-step prediction head for improved generation quality
  • Instruct-tuned with tool-use support
  • Thinking control: Reasoning mode toggled via the enable_thinking parameter (off by default)

Usage

llama.cpp CLI

# Text-only
llama-cli -m qwen35-9b-mtp-nvfp4.gguf \
  -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"

# Multimodal
llama-cli -m qwen35-9b-mtp-nvfp4.gguf \
  --mmproj mmproj-qwen35-9b-f16.gguf \
  --image photo.jpg \
  -p "Describe this image"

Python (llama-cpp-python)

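from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(model_path="qwen35-9b-mtp-nvfp4.gguf", n_gpu_layers=-1)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
# The reply text is in the first choice of the OpenAI-style response
print(response["choices"][0]["message"]["content"])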
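
For longer replies it can be more convenient to stream tokens as they are generated. The following is a sketch, assuming llama-cpp-python's OpenAI-style streaming chunks (`stream=True`); the helper names `collect_stream` and `chat_streaming` are illustrative, not part of the library.

```python
def collect_stream(chunks):
    """Join the text pieces from OpenAI-style streaming chunks."""
    pieces = []
    for chunk in chunks:
        # The first chunk usually carries only the role, the last only a
        # finish reason; both have no "content" key in their delta.
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content")
        if text:
            pieces.append(text)
    return "".join(pieces)


def chat_streaming(model_path, user_message):
    """Stream a reply from a local GGUF model and return the full text."""
    from llama_cpp import Llama  # imported lazily; requires llama-cpp-python

    llm = Llama(model_path=model_path, n_gpu_layers=-1)
    chunks = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_message}],
        stream=True,
    )
    return collect_stream(chunks)
```

Calling `chat_streaming("qwen35-9b-mtp-nvfp4.gguf", "Hello!")` would return the assembled reply. To display tokens live, print each piece inside the loop instead of collecting them.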

Download via huggingface-hub

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="FreedomAISVR/Qwen3.5-9B-Instruct-MTP-NVFP4-GGUF",
    filename="qwen35-9b-mtp-nvfp4.gguf"
)
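
Multimodal use needs the projector file as well as the main weights. As a small sketch, both files from the Files table can be fetched in one pass; `download_all` and its `download` parameter are illustrative names, with the parameter present only so the function can be exercised without network access.

```python
REPO_ID = "FreedomAISVR/Qwen3.5-9B-Instruct-MTP-NVFP4-GGUF"
FILENAMES = ("qwen35-9b-mtp-nvfp4.gguf", "mmproj-qwen35-9b-f16.gguf")


def download_all(download=None):
    """Download every file in FILENAMES and return {filename: local_path}."""
    if download is None:
        # Imported lazily; requires the huggingface_hub package.
        from huggingface_hub import hf_hub_download as download
    return {name: download(repo_id=REPO_ID, filename=name) for name in FILENAMES}
```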

Thinking Control

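Thinking (reasoning mode) is disabled by default. To enable it, set enable_thinking to true in the chat template arguments. With llama.cpp's llama-server, for example, this can be passed as --chat-template-kwargs '{"enable_thinking": true}'.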
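
As a sketch of toggling this per request, the snippet below builds an OpenAI-style chat payload with the flag set. It assumes a llama-server instance is running locally and accepts a `chat_template_kwargs` field in the request body; the URL and the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(user_message, enable_thinking=False):
    """Build a chat payload that forwards enable_thinking to the chat template."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        # Assumption: the server forwards this field to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def post_chat(payload, url="http://localhost:8080/v1/chat/completions"):
    """POST the payload to a locally running server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `post_chat(build_chat_request("Hello!", enable_thinking=True))` would request a reply with reasoning enabled.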

Conversion Pipeline

# 1. Download the source checkpoint
huggingface-cli download Qwen/Qwen3.5-9B --local-dir qwen35-9b-src
# 2. Convert to a BF16 GGUF (the MTP head is included automatically)
python convert_hf_to_gguf.py qwen35-9b-src --outtype bf16 --outfile qwen35-9b-bf16.gguf
# 3. Export the vision projector (mmproj) as a separate F16 file
python convert_hf_to_gguf.py qwen35-9b-src --mmproj --outtype f16 --outfile mmproj-qwen35-9b-f16.gguf
# 4. Quantize to NVFP4
llama-quantize qwen35-9b-bf16.gguf qwen35-9b-mtp-nvfp4.gguf NVFP4
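
The same four steps can be driven from Python. This sketch gives each step an explicit, distinct output file, named to match the Files table. It uses only the commands and flags shown above; `build_pipeline` and `run_pipeline` are illustrative helpers, and the tools are assumed to be on the PATH or in the working directory.

```python
def build_pipeline(src_dir="qwen35-9b-src", stem="qwen35-9b"):
    """Return the conversion commands as argument lists, one per step."""
    bf16 = f"{stem}-bf16.gguf"          # intermediate full-precision GGUF
    mmproj = f"mmproj-{stem}-f16.gguf"  # vision projector
    nvfp4 = f"{stem}-mtp-nvfp4.gguf"    # final quantized model
    return [
        ["huggingface-cli", "download", "Qwen/Qwen3.5-9B", "--local-dir", src_dir],
        ["python", "convert_hf_to_gguf.py", src_dir, "--outfile", bf16],
        ["python", "convert_hf_to_gguf.py", src_dir, "--outfile", mmproj, "--mmproj"],
        ["llama-quantize", bf16, nvfp4, "NVFP4"],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    import subprocess

    for command in commands:
        subprocess.run(command, check=True)
```

Running `run_pipeline(build_pipeline())` executes the steps in sequence; `check=True` aborts on the first non-zero exit code so a failed conversion is not quantized.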

Verification

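from gguf import GGUFReader

r = GGUFReader("qwen35-9b-mtp-nvfp4.gguf")
# String fields are stored as byte arrays, integer fields as one-element arrays
arch = bytes(r.fields["general.architecture"].parts[-1]).decode("utf-8")
block_count = int(r.fields["qwen35.block_count"].parts[-1][0])
mtp_layers = int(r.fields["qwen35.nextn_predict_layers"].parts[-1][0])
print(f"Architecture: {arch}")
print(f"Block count: {block_count}")
print(f"MTP layers: {mtp_layers}")
print(f"Tensors: {len(r.tensors)}")
# Expected: qwen35, 33, 1, 442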
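
Beyond the metadata check, the average bits per weight can be recomputed from the tensor list itself. This sketch assumes the `n_bytes`, `n_elements`, and `tensor_type` attributes that gguf-py exposes on each reader tensor; the function names are illustrative.

```python
from collections import Counter


def summarize_tensors(tensors):
    """Return (average bits per weight, {tensor type name: count})."""
    total_bytes = 0
    total_elements = 0
    type_counts = Counter()
    for t in tensors:
        total_bytes += int(t.n_bytes)
        total_elements += int(t.n_elements)
        type_counts[t.tensor_type.name] += 1
    bpw = total_bytes * 8 / total_elements if total_elements else 0.0
    return bpw, dict(type_counts)


def summarize_file(path):
    """Summarize the tensors of a GGUF file on disk."""
    from gguf import GGUFReader  # imported lazily; requires the gguf package

    return summarize_tensors(GGUFReader(path).tensors)
```

The type counts show which tensors were kept at higher precision, which is typically why the file-wide average sits above 4 bits.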

License

Apache-2.0 (same as the original Qwen3.5-9B model).
