Qwen3.5-9B-MXFP4-MTP

MXFP4 quantization of Qwen/Qwen3.5-9B with native MTP (Multi-Token Prediction) heads for speculative decoding.

Model specs

Component Precision Size
Main model (32 layers) mxfp4 group_size=32 ~4.60 GB
MTP decoder layer (x1) mxfp4 (linear projections), bf16 (fc, norms) ~0.25 GB

Total: ~4.8 GB.

  • Architecture: Qwen3.5-9B (Qwen3.5 hybrid attention)
  • Layers: 32 (8 full-attention, 24 linear-attention)
  • Context: 262,144 tokens
  • MTP: 1 decoder layer, separate lm_head (not tied)
  • Quantization: mxfp4 (OCP Microscaling FP4), group_size=32

Use with MLX

pip install -U mlx-lm
python -m mlx_lm generate \
  --model sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP \
  --max-tokens 100 \
  --temperature 0.0 \
  --prompt "Explain quantum computing in one sentence."

MTP speculative decoding

The MTP heads are already in the checkpoint. Use --mtp flag with mlx-lm-mtp for ~24% faster generation.

python -m mlx_lm generate \
  --model sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP \
  --mtp \
  --max-tokens 200 \
  --temperature 0.5 \
  --prompt "Explain the theory of relativity."

Provenance

  • Base weights: Qwen/Qwen3.5-9B
  • Quantized with mlx_lm convert -q --q-group-size 32 --q-bits 4 --q-mode mxfp4
Downloads last month
641
Safetensors
Model size
2B params
Tensor type
U8
U32
BF16
F32
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for sleepyeldrazi/Qwen3.5-9B-MXFP4-MTP

Finetuned
Qwen/Qwen3.5-9B
Quantized
(230)
this model