Qwen3.5-9B-Instruct-MXFP4-GGUF

MXFP4 (OCP Microscaling FP4) GGUF quantization of the Qwen3.5-9B-Instruct multimodal language model with thinking disabled by default.

About MXFP4

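MXFP4 is the open-standard 4-bit floating-point format defined by the OCP Microscaling Formats (MX) specification. Each weight is stored as an E2M1 element (1 sign bit, 2 exponent bits, 1 mantissa bit), and every block of 32 elements shares one 8-bit power-of-two scale (E8M0). Like integer block formats such as Q4_K_M, it is block-scaled, but its element grid is floating-point rather than uniform, which gives finer steps near zero and a wider spread of magnitudes within each block. The format itself is vendor-neutral, so the same file can be used on NVIDIA, AMD, and Intel GPUs and on CPUs, subject to backend support in the runtime.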
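To make the block layout concrete, here is a minimal pure-Python sketch of quantizing and dequantizing one 32-element block. The function names are illustrative, and the scale selection (the smallest power of two that avoids clipping) is a simplification; it may differ from the rounding rule in the OCP specification and from the kernels llama.cpp actually uses.

```python
import math

# The eight non-negative magnitudes an E2M1 element can encode.
# The list index equals the 3-bit exponent/mantissa pattern.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats.

    Each code is 4 bits: bit 3 is the sign, bits 0-2 index E2M1_VALUES.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Simplified choice: smallest power-of-two scale so that amax / scale <= 6.0
    shared_exp = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Pick the nearest representable magnitude
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 1 if x < 0 else 0
        codes.append((sign << 3) | idx)
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Rebuild approximate floats from the shared exponent and 4-bit codes."""
    scale = 2.0 ** shared_exp
    return [(-1.0 if c & 8 else 1.0) * E2M1_VALUES[c & 7] * scale for c in codes]
```

Values that already sit on the E2M1 grid survive the round trip exactly; everything else is rounded to the nearest grid point after scaling.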

Key Advantages

  • Cross-vendor: Runs on any GPU or CPU (no vendor lock-in)
  • E2M1 format: 1 sign bit + 2 exponent bits + 1 mantissa bit per element, plus one shared 8-bit scale (E8M0) per 32 elements
  • Floating-point: Wider dynamic range within a block than uniform integer quantization
  • Open standard: Defined by the OCP MX specification
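The dynamic-range point can be illustrated by comparing the E2M1 magnitude grid with a uniform grid. This is a rough sketch: the integer grid below is a simplified symmetric 4-bit grid chosen for illustration, not the exact Q4_K_M layout.

```python
# Non-negative magnitudes of an E2M1 element versus a uniform 4-bit integer grid.
e2m1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
int4 = [float(i) for i in range(8)]  # 0..7, illustrative symmetric grid


def dynamic_range(grid):
    """Ratio of the largest to the smallest non-zero magnitude."""
    nonzero = [v for v in grid if v > 0]
    return max(nonzero) / min(nonzero)


def step_sizes(grid):
    """Gaps between neighbouring grid points."""
    return [b - a for a, b in zip(grid, grid[1:])]
```

Under these assumptions `dynamic_range(e2m1)` is 12.0 against 7.0 for the uniform grid, and `step_sizes(e2m1)` grows from 0.5 near zero to 2.0 at the top, whereas the uniform grid has a constant step of 1.0.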

Repo Contents

| Filename | Type | Size | Description |
|---|---|---|---|
| qwen35-9b-instruct-mxfp4.gguf | Text model | 5.31 GB | MXFP4 quantized model (no MTP head) |
| mmproj-qwen35-9b-f16.gguf | Vision encoder | 922 MB | SigLIP vision projector (F16) |

Quantization Details

| Property | Value |
|---|---|
| Format | MXFP4 (OCP E2M1) |
| Block size | 32 elements |
| BPW | ~4.74 |
| Architecture | qwen35 (no MTP, 427 tensors) |
| Target hardware | Universal (CPU, AMD, NVIDIA, Intel) |
| Thinking | Disabled by default (opt-in via enable_thinking=true) |
| MTP | Disabled (--no-mtp) |
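The BPW figure in the table is higher than the nominal MXFP4 rate, and the arithmetic below shows roughly why. It is a back-of-the-envelope sketch that assumes "5.31 GB" means decimal gigabytes and that the model has exactly 9e9 weights; neither is confirmed here.

```python
ELEMENT_BITS = 4   # one E2M1 element: 1 sign + 2 exponent + 1 mantissa bit
SCALE_BITS = 8     # one shared E8M0 scale per block
BLOCK_SIZE = 32

# Nominal rate of a pure MXFP4 tensor
nominal_bpw = ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE

# Assumptions: decimal gigabytes and exactly 9e9 parameters
file_bytes = 5.31e9
param_count = 9e9
effective_bpw = file_bytes * 8 / param_count

print(f"nominal MXFP4 bpw:  {nominal_bpw:.2f}")
print(f"effective file bpw: {effective_bpw:.2f}")
```

The nominal rate works out to 4.25 bits per weight, while the file-level estimate lands near the ~4.74 reported above. The gap most likely comes from tensors stored at higher precision (for example embeddings or normalization weights) and from file metadata.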

Usage

llama.cpp

# Basic text generation (thinking disabled by default)
./llama-cli -m qwen35-9b-instruct-mxfp4.gguf -p "What is the capital of France?" -n 256

# Vision (requires mmproj)
./llama-cli -m qwen35-9b-instruct-mxfp4.gguf --mmproj mmproj-qwen35-9b-f16.gguf -p "Describe this image" --image photo.jpg -n 256

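# Enable thinking (-e is llama.cpp's --escape flag, not a template switch;
# this assumes a build that accepts --chat-template-kwargs together with --jinja)
./llama-cli -m qwen35-9b-instruct-mxfp4.gguf --jinja --chat-template-kwargs '{"enable_thinking": true}' -p "Solve: 2+2=?" -n 512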
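For scripted use, the vision request above can also be sent to an OpenAI-compatible endpoint. The helper below is a sketch that only builds the message payload; it assumes a local llama-server started with both -m and --mmproj, and the function name is hypothetical.

```python
import base64


def build_vision_messages(image_bytes, prompt, mime_type="image/jpeg"):
    """Build an OpenAI-style chat message list with one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    # The image travels inline as a base64 data URL
                    "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                },
            ],
        }
    ]
```

Read the image with `open("photo.jpg", "rb").read()` and pass the returned list as the `messages` argument of a chat completion request.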

Python

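from llama_cpp import Llama
from openai import OpenAI  # only needed for the thinking example below

# chat_format is omitted so the chat template embedded in the GGUF is used.
llm = Llama(
    model_path="qwen35-9b-instruct-mxfp4.gguf",
    n_ctx=8192,
)

# Basic chat (no thinking)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is AI?"}],
)
print(output["choices"][0]["message"]["content"])

# With thinking enabled
# Llama.create_chat_completion() has no extra_body parameter; that argument
# belongs to the OpenAI client. One option, assuming a local llama-server
# started with --jinja on port 8080, is to pass chat_template_kwargs through
# the OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="qwen35-9b-instruct-mxfp4",
    messages=[{"role": "user", "content": "Solve 2+2=?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)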
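When thinking is enabled, the reply may carry the reasoning alongside the final answer. The helper below is a sketch that separates the two, assuming the model wraps its reasoning in Qwen3-style `<think>...</think>` tags; that convention is an assumption here, and some servers return the reasoning in a separate field instead.

```python
import re

# Assumption: reasoning is delimited by <think> ... </think>
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) extracted from a completion string."""
    match = THINK_RE.search(text)
    if match is None:
        # No thinking block: the whole text is the answer
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

Applied to the message content of a chat completion, this returns an empty reasoning string when thinking was not used.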

Download

huggingface-cli download FreedomAISVR/Qwen3.5-9B-Instruct-MXFP4-GGUF --local-dir .

Original Model

Qwen3.5-9B is Alibaba Cloud's efficient multimodal foundation model (Apache 2.0, March 2026) featuring:

  • Hybrid Gated DeltaNet + Gated Attention (3:1 ratio)
  • 262K native context (extensible to 1M)
  • Text + Image + Video input
  • Support for 201 languages

Conversion

python convert_hf_to_gguf.py --no-mtp --outfile qwen35-9b-f16.gguf D:\qwen35-9b-src
python convert_hf_to_gguf.py --mmproj --outfile mmproj-qwen35-9b-f16.gguf D:\qwen35-9b-src
llama-quantize.exe qwen35-9b-f16.gguf qwen35-9b-instruct-mxfp4.gguf MXFP4
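After conversion, a quick sanity check is to read the GGUF header and compare the tensor count with the 427 tensors listed under Quantization Details. The sketch below assumes a little-endian GGUF file of version 2 or later, where the counts are 64-bit; the function name is illustrative.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("qwen35-9b-instruct-mxfp4.gguf")` should report a tensor count of 427 if the file matches the table.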

License

Apache 2.0 (same as Qwen3.5-9B)
