Qwen3.5-9B-MXFP4-MTP

MXFP4 quantization of Qwen/Qwen3.5-9B with native MTP (Multi-Token Prediction) heads for speculative decoding.

Model specs

Component	Precision	Size
Main model (32 layers)	mxfp4 group_size=32	~4.60 GB
MTP decoder layer (x1)	mxfp4 (linear projections), bf16 (fc, norms)	~0.25 GB

Total: ~4.8 GB.

Architecture: Qwen3.5-9B (Qwen3.5 hybrid attention)
Layers: 32 (8 full-attention, 24 linear-attention)
Context: 262,144 tokens
MTP: 1 decoder layer, separate lm_head (not tied)
Quantization: mxfp4 (OCP Microscaling FP4), group_size=32

Use with MLX

pip install -U mlx-lm

python -m mlx_lm generate \
  --model sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP \
  --max-tokens 100 \
  --temperature 0.0 \
  --prompt "Explain quantum computing in one sentence."

MTP speculative decoding

The MTP heads are already in the checkpoint. Use --mtp flag with mlx-lm-mtp for ~24% faster generation.

python -m mlx_lm generate \
  --model sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP \
  --mtp \
  --max-tokens 200 \
  --temperature 0.5 \
  --prompt "Explain the theory of relativity."

Provenance

Base weights: Qwen/Qwen3.5-9B
Quantized with mlx_lm convert -q --q-group-size 32 --q-bits 4 --q-mode mxfp4

Downloads last month: 641

Safetensors

Model size

2B params

Tensor type

U32

BF16

F32

MLX

Hardware compatibility

4-bit

Model tree for sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Quantized

(230)

this model