Qwen3.5-9B-Instruct-MTP-MXFP4-GGUF

About MXFP4

MXFP4 (Microscaling FP4, element type E2M1) is an open-standard 4-bit floating-point format defined in the Open Compute Project (OCP) Microscaling Formats (MX) specification. The table below compares it with NVIDIA's NVFP4:

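| Property | MXFP4 | NVFP4 |
|---|---|---|
| Element format | E2M1 (1 sign, 2 exponent, 1 mantissa) | E2M1 (1 sign, 2 exponent, 1 mantissa) |
| Block scale | E8M0 (8-bit power of two) | FP8 E4M3 (1 sign, 4 exponent, 3 mantissa) |
| Block size | 32 elements | 16 elements |
| Hardware | CPU, GPU (all vendors) | NVIDIA Blackwell only |
| Standard | OCP open standard | NVIDIA proprietary |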

MXFP4 is the dense-model variant (for MoE models, use MXFP4_MOE).
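To make the E2M1 layout concrete, here is a minimal sketch of how a single MX block decodes. It assumes the OCP MX layout of 32 four-bit elements sharing one 8-bit power-of-two scale (E8M0, bias 127). The function names are illustrative, and the way llama.cpp packs nibbles inside a GGUF tensor is not modeled.

```python
# The eight non-negative magnitudes an E2M1 code can represent,
# indexed by its low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale
SCALE_BIAS = 127  # E8M0 exponent bias


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    if not 0 <= code <= 0xF:
        raise ValueError(f"not a 4-bit code: {code}")
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_block(scale_e8m0: int, codes: list[int]) -> list[float]:
    """Decode one block: every element is multiplied by 2**(scale - 127).

    The reserved NaN scale encoding (0xFF) is rejected rather than modeled.
    """
    if len(codes) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(codes)}")
    if not 0 <= scale_e8m0 < 0xFF:
        raise ValueError(f"unsupported scale byte: {scale_e8m0}")
    scale = 2.0 ** (scale_e8m0 - SCALE_BIAS)
    return [scale * decode_e2m1(code) for code in codes]
```

Because the scale is a pure power of two, a block can only shift the eight magnitudes up or down together; all precision inside a block comes from the 4-bit codes.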

Files

| Filename | Type | Size | Description |
|---|---|---|---|
| qwen35-9Bb-instruct-mtp-mxfp4.gguf | MXFP4 quantized model | 5.18 GB | Main model weights (MXFP4, 1 MTP head) |
| mmproj-qwen35-9Bb-f16.gguf | Multimodal projector | 875 MB | Vision encoder (SigLIP, F16) |
| README.md | Documentation | - | This file |

Quantization Details

| Parameter | Value |
|---|---|
| Format | MXFP4 (OCP E2M1) |
| Block size | 32 |
| Bits per weight | 4.72 |
| Hardware target | CPU, AMD GPU, NVIDIA GPU (all generations, including Blackwell) |
| VRAM required | ~6.0 GB |
| MTP head | Yes (1 layer, nextn_predict_layers=1) |
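The 4.72 bits per weight reported above is higher than the nominal MXFP4 rate. The sketch below walks through that arithmetic, assuming 32 four-bit elements plus one 8-bit scale per block and assuming the listed file sizes are decimal gigabytes. The gap is presumably due to tensors stored at higher precision, which this card does not itemize. All helper names are illustrative.

```python
BLOCK_SIZE = 32    # elements per MX block
ELEMENT_BITS = 4   # one E2M1 code
SCALE_BITS = 8     # one shared E8M0 scale per block


def nominal_bits_per_weight() -> float:
    """Bits per weight if every tensor were stored as pure MXFP4 blocks."""
    return (BLOCK_SIZE * ELEMENT_BITS + SCALE_BITS) / BLOCK_SIZE


def implied_parameter_count(file_bytes: float, bits_per_weight: float) -> float:
    """Parameter count implied by a file size and an average bits-per-weight."""
    return file_bytes * 8 / bits_per_weight


def weights_plus_projector_gb(model_gb: float, mmproj_gb: float) -> float:
    """Lower bound on memory for both files; KV cache and buffers are extra."""
    return model_gb + mmproj_gb


if __name__ == "__main__":
    print(nominal_bits_per_weight())
    print(implied_parameter_count(5.18e9, 4.72))
    print(weights_plus_projector_gb(5.18, 0.875))
```

Under these assumptions the two listed files alone add up to roughly the ~6.0 GB figure in the table, so long contexts will need additional memory on top of it.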

Model Description

Qwen3.5-9B is a multilingual vision-language model with 32 transformer blocks, a 262k-token context length, and 1 MTP (Multi-Token Prediction) head. It supports:

  • Text generation (multilingual: EN, ZH, code)
  • Vision understanding (image + video)
  • Tool calling / function calling
  • Thinking mode (reasoning) — disabled by default in this variant

Thinking behavior: this variant ships with thinking disabled by default, which matches the standard "Instruct" behavior. To enable it, pass enable_thinking=true as a chat-template parameter at generation time; the model then emits reasoning tokens before the final answer.
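When thinking is enabled, client code usually wants to separate the reasoning from the final answer. The helper below is a minimal sketch that assumes the reasoning is wrapped in `<think>...</think>` tags, as in earlier Qwen releases; this card does not state the delimiter, so verify it against actual output. `split_reasoning` is a hypothetical name.

```python
import re

# Assumed delimiter for reasoning tokens; not confirmed by this model card.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block exists."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

With thinking disabled, the function simply returns the text unchanged as the answer, so the same code path works for both modes.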

Architecture Details

  • 32 transformer blocks (hybrid attention + FFN + SSM)
  • 1 MTP prediction head (block_count=33, i.e. 32 transformer blocks + 1 MTP block)
  • 1 SigLIP vision encoder (mmproj)
  • 262,144 token context window
  • 4096-dim hidden size (9B; the 4B model uses 2560)

Usage

llama.cpp CLI

# Basic text generation (thinking disabled by default)
./llama-cli -m qwen35-9Bb-instruct-mtp-mxfp4.gguf \
  --mmproj mmproj-qwen35-9Bb-f16.gguf \
  -p "Hello, how are you?" \
  -n 256

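# Enable thinking. enable_thinking is a chat-template variable, not prompt text,
# and a second -p would only replace the first prompt. Recent llama.cpp builds
# expose --chat-template-kwargs for this; check ./llama-cli --help on your build.
./llama-cli -m qwen35-9Bb-instruct-mtp-mxfp4.gguf \
  --mmproj mmproj-qwen35-9Bb-f16.gguf \
  --jinja \
  --chat-template-kwargs '{"enable_thinking": true}' \
  -p "Solve this math problem step by step" \
  -n 512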

llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="qwen35-9Bb-instruct-mtp-mxfp4.gguf",
    mmproj="mmproj-qwen35-9Bb-f16.gguf",
    n_ctx=8192,
)

# Thinking disabled by default
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)

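# Enable thinking
# Caveat: extra_body is an OpenAI-client convention and is not a documented
# keyword of Llama.create_chat_completion. If your llama-cpp-python version
# rejects it, send the flag through an OpenAI-compatible server instead.
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Think step by step"}],
    extra_body={"enable_thinking": True}
)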
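Another way to toggle thinking from Python is to go through an OpenAI-compatible HTTP endpoint. The sketch below assumes a llama.cpp `llama-server` instance is already running with this model on `http://localhost:8080` and that it accepts a `chat_template_kwargs` field in the request body; both are assumptions to check against your build. The function names are illustrative, and only the standard library is used.

```python
import json
import urllib.request


def build_chat_request(prompt: str, enable_thinking: bool = False,
                       max_tokens: int = 512) -> dict:
    """Build a chat-completions payload with the thinking switch set."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Assumed server-side field that forwards variables to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload: dict,
                      base_url: str = "http://localhost:8080") -> dict:
    """POST the payload to the (assumed) local server and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `send_chat_request(build_chat_request("Think step by step", enable_thinking=True))` would then return the server's JSON response, provided the server is reachable.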

Download from HuggingFace Hub

from huggingface_hub import hf_hub_download

# Download the quantized weights and the vision projector from the same repo.
model_path = hf_hub_download(
    repo_id="FreedomAISVR/Qwen3.5-9B-Instruct-MTP-MXFP4-GGUF",
    filename="qwen35-9Bb-instruct-mtp-mxfp4.gguf"
)
mmproj_path = hf_hub_download(
    repo_id="FreedomAISVR/Qwen3.5-9B-Instruct-MTP-MXFP4-GGUF",
    filename="mmproj-qwen35-9Bb-f16.gguf"
)
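After downloading, a quick sanity check is to read the fixed-size GGUF header. This is a minimal sketch assuming GGUF version 2 or later, where the file starts with the 4-byte magic `GGUF`, a little-endian uint32 version, then uint64 tensor and metadata key/value counts. The names `GGUFHeader`, `parse_gguf_header`, and `read_gguf_header` are illustrative.

```python
import struct
from typing import NamedTuple

# magic (4 bytes), version (uint32), tensor count (uint64), KV count (uint64)
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def parse_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the leading bytes of a GGUF file and validate the magic."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    if magic != b"GGUF":
        raise ValueError(f"unexpected magic: {magic!r}")
    return GGUFHeader(version, tensor_count, kv_count)


def read_gguf_header(path: str) -> GGUFHeader:
    """Read only the header bytes from disk, not the multi-gigabyte file."""
    with open(path, "rb") as handle:
        return parse_gguf_header(handle.read(HEADER_SIZE))
```

Calling `read_gguf_header(model_path)` on a truncated or non-GGUF download raises `ValueError` immediately, which is cheaper than discovering the problem at model load time.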

Conversion Pipeline

HF weights (BF16)
  → patch tokenizer_config.json (thinking disabled by default)
  → convert_hf_to_gguf.py (F16 output; --no-mtp is not passed, so the MTP head is kept)
  → llama-quantize.exe MXFP4
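The last two steps of the pipeline above can be sketched as command builders. This assumes the usual llama.cpp invocations: `convert_hf_to_gguf.py <dir> --outfile <file> --outtype f16`, and `llama-quantize <input> <output> <type>`. All paths are hypothetical, and the tokenizer_config.json patch is not reproduced because the card does not say which fields change.

```python
import subprocess


def build_convert_command(hf_dir: str, f16_gguf: str) -> list[str]:
    """Command converting a Hugging Face checkpoint directory to an F16 GGUF."""
    return [
        "python", "convert_hf_to_gguf.py", hf_dir,
        "--outfile", f16_gguf,
        "--outtype", "f16",
    ]


def build_quantize_command(f16_gguf: str, out_gguf: str,
                           quant_type: str = "MXFP4",
                           quantize_bin: str = "llama-quantize") -> list[str]:
    """Command quantizing an F16 GGUF to the requested type."""
    return [quantize_bin, f16_gguf, out_gguf, quant_type]


def run_pipeline(hf_dir: str, f16_gguf: str, out_gguf: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in (
        build_convert_command(hf_dir, f16_gguf),
        build_quantize_command(f16_gguf, out_gguf),
    ):
        subprocess.run(command, check=True)
```

Keeping the builders separate from `run_pipeline` lets you print and review the exact commands before anything touches the disk. On Windows, pass `quantize_bin="llama-quantize.exe"`.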

Hardware

| Component | Value |
|---|---|
| GPU | RTX 5060 Ti (16 GB VRAM) |
| System RAM | 128 GB |
| Quantization time | ~44 sec (4B) / ~1.7 min (9B) |

License

Apache 2.0 (same as Qwen3.5-9B).

