Qwen3.5-4B Instruct — MTP NVFP4 GGUF

NVFP4 (E4M3) 4-bit quantization of Qwen/Qwen3.5-4B, Qwen's 4B-parameter instruction-tuned multimodal model with hybrid DeltaNet-Mamba2-Attention architecture and 262K token context window. Includes MTP (Multi-Token Prediction) support.

This is an Instruct variant: the embedded chat template has been modified so that thinking (<think> reasoning traces) is disabled by default. Pass enable_thinking=true during inference to enable reasoning.

About NVFP4

NVFP4 (E4M3 — 1 sign, 4 exponent, 3 mantissa) is NVIDIA's native 4-bit floating-point format for Blackwell GPUs:

Feature NVFP4
Format E4M3 (1:4:3)
Block size 128 elements
Dynamic range 15 orders of magnitude (6-bit exp)
Zero-point Implicit (true 0)
Hardware Blackwell (RTX 50-series, B200)
Dequant cost None (native support)

Unlike INT4 formats that require zero-point restoration and have limited dynamic range, NVFP4's 6-bit exponent preserves outlier-sensitive values while achieving 4× compression vs FP16.

Files

Filename Type Size Description
qwen35-4b-mtp-nvfp4.gguf NVFP4 quantized model ~2.5 GB Main model weights with MTP head
mmproj-qwen35-4b-f16.gguf F16 multimodal projector ~644 MB Vision encoder for image inputs

Quantization Details

Aspect Detail
Format NVFP4 (E4M3)
Block size 128
Bits per weight 4.92
Hardware target NVIDIA Blackwell (RTX 5090, RTX 5060 Ti, B200, etc.)
VRAM requirement ~4 GB (model + KV cache)
Source format BF16 (original HF weights)
Quantization tool llama-quantize (commit dd7cad7, CUDA 13.2)
MTP layers 1 (nextn)

Model Description

Qwen3.5-4B is Qwen's instruction-tuned model featuring:

  • 3.97B parameters (dense)
  • Hybrid architecture: Gated DeltaNet + Gated Attention + FFN layers
  • Mamba2-style SSM via DeltaNet with gating mechanism
  • 4 full attention layers at regular intervals (full_attention_interval=4)
  • 262K context length (extensible to 1M)
  • 248,320 vocabulary (GPT-2 tokenizer with Qwen3.5 pre-tokenizer)
  • Vision multimodal: image understanding via cross-attention projector
  • MTP (Multi-Token Prediction): trained with multi-step prediction for improved generation

The GGUF uses the QWEN35 architecture handler from llama.cpp with full support for all hybrid layer types.

Instruct Variant: Thinking Disabled by Default

The original Qwen3.5 chat template enables thinking by default — it outputs <think>\n at the start of every assistant response. This repository's GGUF ships with a modified chat template where the default behavior is inverted:

Scenario Behavior
enable_thinking not set ❌ Thinking off — outputs <think>\n\n</think>\n\n (empty think block)
enable_thinking=true ✅ Thinking on — outputs <think>\n (reasoning trace expected)
enable_thinking=false ❌ Thinking off

Usage

llama.cpp CLI

# Text-only inference (thinking off by default)
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  -p "Explain quantum computing in simple terms" -n 512

# With thinking enabled
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  -p "Solve this math problem step by step" -n 512

# Multimodal (image input)
llama-cli -m qwen35-4b-mtp-nvfp4.gguf \
  --mmproj mmproj-qwen35-4b-f16.gguf \
  --image path/to/image.jpg -p "Describe this image" -n 256

Download via huggingface-hub

from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF",
    filename="qwen35-4b-mtp-nvfp4.gguf",
)

Conversion Pipeline

  1. Downloaded original BF16 weights from Qwen/Qwen3.5-4B
  2. Converted to F16 GGUF with MTP tensors included
  3. Extracted vision projector as separate mmproj F16 GGUF
  4. Quantized to NVFP4 via llama-quantize.exe NVFP4
  5. Patched chat template for thinking-disabled-by-default
  6. Uploaded to HuggingFace Hub

Verification

from gguf import GGUFReader
r = GGUFReader("qwen35-4b-mtp-nvfp4.gguf")
print(f"Tensors: {len(r.tensors)}")
print(f"MTP layers: {r.fields['qwen35.nextn_predict_layers'].parts[-1]}")

Hardware

Component Detail
GPU NVIDIA Blackwell (RTX 5060 Ti)
CUDA Toolkit 13.2
System RAM 64 GB

License

Apache-2.0 (same as the original Qwen3.5-4B model).

Downloads last month
531
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FreedomAISVR/Qwen3.5-4B-Instruct-MTP-NVFP4-GGUF

Finetuned
Qwen/Qwen3.5-4B
Quantized
(235)
this model