Qwen3.6-27B-MLX-4bit-MTP


Native MTP (Multi-Token Prediction) weights preserved for MLX inference. The official mlx-community releases strip MTP weights during sanitize(), breaking speculative decoding support.



Introduction

This is a 4-bit quantized MLX conversion of Qwen/Qwen3.6-27B with native MTP (Multi-Token Prediction) weights fully preserved.

Most pre-converted MLX models on HuggingFace (including mlx-community releases) remove MTP weights during the sanitize() step, making it impossible to use Qwen3.6's built-in speculative decoding acceleration. This model fixes that.

Why This Matters

Qwen3.6 models include a dedicated MTP head (an extra transformer layer) trained to predict token t+2 from the hidden state at t and the embedding of t+1. When enabled during inference:

  • Speedup: ~1.5-2x faster generation than standard autoregressive decoding (about 1.5x measured on the hardware listed under Performance)
  • Quality: No quality loss. Drafted tokens are validated by the main model before they are accepted, and the MTP head is trained end-to-end with the base model
  • No draft model needed: Unlike DFlash or standard speculative decoding, MTP uses the model's own weights, so there is no separate draft model to load or manage

Official MLX conversions strip these weights because mlx-lm's sanitize() method drops mtp.* keys. This repository preserves them.
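
The draft-then-verify loop behind these claims can be sketched with toy functions. This is a minimal illustration, not mlx-lm's implementation: `toy_main` and `toy_draft` are hypothetical stand-ins for the base model and the MTP head, and the toy draft sees only the token sequence rather than a hidden state and an embedding. The point is that every emitted token is one the main model would have produced anyway, so the output matches plain greedy decoding while fewer main-model passes are needed.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one main-model pass per new token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq


def mtp_decode(
    next_token: NextToken, draft_token: NextToken, prompt: List[int], n_new: int
) -> Tuple[List[int], int, int]:
    """Greedy decoding with one drafted token per step, verified by the main model.

    Returns (sequence, main_model_passes, accepted_drafts).
    """
    seq = list(prompt)
    target = len(seq) + n_new
    passes = accepted = 0
    if n_new > 0:
        # The first pass has no draft to verify yet.
        seq.append(next_token(seq))
        passes += 1
    while len(seq) < target:
        guess = draft_token(seq)  # cheap draft for the next position
        # A real model scores both positions in one batched pass;
        # the toy simulates that pass with up to two calls.
        passes += 1
        seq.append(next_token(seq))  # the main model's own choice always wins
        if seq[-1] == guess and len(seq) < target:
            accepted += 1
            seq.append(next_token(seq))  # bonus token from the same pass
    return seq, passes, accepted


def toy_main(seq: List[int]) -> int:
    """Hypothetical deterministic 'main model'."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Hypothetical 'MTP head': right most of the time, wrong at every third length."""
    guess = toy_main(seq)
    return guess if len(seq) % 3 else (guess + 1) % 11


if __name__ == "__main__":
    out, passes, accepted = mtp_decode(toy_main, toy_draft, [1, 2], 20)
    print("same output as greedy:", out == greedy_decode(toy_main, [1, 2], 20))
    print("main-model passes:", passes, "accepted drafts:", accepted)
```

A rejected draft costs nothing in output quality: the main model's token is kept and decoding continues from there.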

Model Details

| Property | Value |
| --- | --- |
| Base Model | Qwen/Qwen3.6-27B |
| Quantization | 4-bit affine, group_size=64 |
| Total Parameters | ~27B |
| Context Length | 262,144 |
| Architecture | Qwen3.5 (Gated DeltaNet hybrid) |
| MTP Layers | 1 |
| File Size | ~14.4GB (3 safetensors) |
| Compatible With | mlx-lm feat/mtp-batched branch |

Performance (M5 Max, 64GB RAM)

| Mode | Tokens/sec | Notes |
| --- | --- | --- |
| Standard | ~30-35 | Baseline |
| MTP enabled | ~45-55 | ~1.5x speedup |
| MTP + DFlash | ❌ Incompatible | Cannot combine |
| MTP + TurboQuant KV | ❌ Incompatible | MTP needs exact attention path |

MTP acceptance rate observed: 74-84% in continuous conversation tests (15K+ tokens).
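
The acceptance rate and the measured speedup can be related with a short calculation. The sketch below assumes exactly one drafted token per verification pass (consistent with the single MTP layer listed above): each pass yields one token from the main model plus one more when the draft is accepted. The function name is illustrative only.

```python
def tokens_per_pass(acceptance_rate: float) -> float:
    """Expected tokens emitted per main-model pass with one drafted token."""
    return 1.0 + acceptance_rate


for rate in (0.74, 0.84):
    print(f"acceptance {rate:.0%}: ~{tokens_per_pass(rate):.2f} tokens per pass")

# Measured throughput ratios from the table above
print(f"measured speedup: {45 / 30:.2f}x to {55 / 35:.2f}x")
```

Under that assumption the ceiling is roughly 1.74-1.84 tokens per pass. The measured ~1.5x sits below it, presumably because running the MTP head and verifying two positions make each pass somewhat more expensive than a plain decoding step.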

Required Branch / Version

You MUST use the feat/mtp-batched branch of mlx-lm. The official PyPI release and mlx-community conversions do NOT support MTP inference — they strip mtp.* weights during sanitize().

| Component | Required Path / Branch |
| --- | --- |
| mlx-lm install | git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched |
| Model on HuggingFace | samwang0041/Qwen3.6-27B-MLX-4bit-MTP |
| Original base model | Qwen/Qwen3.6-27B |

# Install the exact branch
pip install git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched

Model Directory Structure

After mlx_lm.load() downloads the model, or after a manual conversion, the model directory contains:

qwen3.6-27b-4bit-mtp/
├── config.json                          # Model config (includes text_config.mtp_num_hidden_layers: 1)
├── tokenizer.json                       # BPE tokenizer
├── tokenizer_config.json
├── chat_template.jinja                  # Qwen3.6 chat template
├── model.safetensors.index.json         # Weight map
├── model-00001-of-00003.safetensors     # ~5.0GB (backbone layers 0-21)
├── model-00002-of-00003.safetensors     # ~5.4GB (backbone layers 22-42 + lm_head)
└── model-00003-of-00003.safetensors     # ~4.7GB (backbone layers 43-63 + MTP weights)

MTP weights are inside shard 3 under these keys (as seen in model.safetensors.index.json):

  • language_model.mtp.fc.weight
  • language_model.mtp.layers.0.input_layernorm.weight
  • language_model.mtp.layers.0.mlp.down_proj.weight
  • language_model.mtp.layers.0.mlp.gate_up_proj.weight
  • language_model.mtp.layers.0.post_attention_layernorm.weight
  • language_model.mtp.layers.0.self_attn.k_proj.weight
  • language_model.mtp.layers.0.self_attn.o_proj.weight
  • language_model.mtp.layers.0.self_attn.q_proj.weight
  • language_model.mtp.layers.0.self_attn.v_proj.weight
  • language_model.mtp.emb.weight
  • language_model.mtp.lm_head.weight
  • ... (31 total MTP weight keys)
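
To confirm that a local copy still carries the MTP weights, the weight map can be inspected directly. The helper below is a small sketch using only the standard library. It assumes the usual sharded-safetensors index layout (a top-level `weight_map` object mapping tensor names to shard files) and the `language_model.mtp.` key prefix shown above. The function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def mtp_keys_by_shard(model_dir: str) -> Counter:
    """Count MTP tensor keys per shard file, based on model.safetensors.index.json."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    weight_map = index["weight_map"]  # tensor name -> shard filename
    return Counter(shard for key, shard in weight_map.items() if ".mtp." in key)
```

Calling `mtp_keys_by_shard("qwen3.6-27b-4bit-mtp")` on this model should report keys only in the third shard. An empty result means the conversion dropped the MTP weights.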

How to Use

CLI generation:

python -m mlx_lm.generate \
  --model samwang0041/Qwen3.6-27B-MLX-4bit-MTP \
  --prompt "Explain quantum computing" \
  --max-tokens 512 \
  --mtp

Python API:

from mlx_lm import load, generate

# Auto-downloads from HuggingFace if not cached locally
model, tokenizer = load("samwang0041/Qwen3.6-27B-MLX-4bit-MTP")

response = generate(
    model, tokenizer,
    prompt="Explain quantum computing",
    max_tokens=512,
    verbose=True,
    mtp=True,  # Enable MTP speculative decoding
)

Load from local path:

# If you downloaded or converted the model manually
from mlx_lm import load

model, tokenizer = load("/path/to/qwen3.6-27b-4bit-mtp")

With omlx server:

Add to ~/.omlx/model_settings.json:

{
  "qwen3.6-27b-4bit-mtp": {
    "mtp_enabled": true,
    "turboquant_kv_enabled": false,
    "dflash_enabled": false,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20
  }
}

The model directory should be placed at:

~/.omlx/models/qwen3.6-27b-4bit-mtp/

Conversion Process

This model was converted using the following steps:

# 1. Install mlx-lm with MTP support
pip install git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched

# 2. Download original weights and convert with quantization
python -m mlx_lm convert \
  --hf-path Qwen/Qwen3.6-27B \
  --mlx-path qwen3.6-27b-4bit-mtp \
  -q --q-bits 4 --q-group-size 64

# 3. The key: feat/mtp-batched branch preserves mtp.* weights during sanitize()
#    Official mlx-lm releases strip them, breaking speculative decoding

The conversion was performed on an Apple M5 Max with 64GB RAM. Original BF16 weights (54GB) were loaded and quantized to 4-bit (14.4GB).
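
The sizes quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes a round 27e9 parameters. It also assumes that affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights, which is an assumption about the storage layout rather than something stated in this card.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in GB (1e9 bytes) for n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9   # approximate parameter count
GROUP_SIZE = 64   # from --q-group-size 64

bf16_gb = weights_gb(N_PARAMS, 16)

# 4-bit values plus an assumed 16-bit scale and 16-bit bias shared by each group
effective_bits = 4 + (16 + 16) / GROUP_SIZE
q4_gb = weights_gb(N_PARAMS, effective_bits)

print(f"BF16: ~{bf16_gb:.1f} GB")
print(f"4-bit, group 64: ~{q4_gb:.1f} GB at {effective_bits} bits per weight")
```

The BF16 figure matches the 54GB above. The 4-bit estimate lands slightly over the reported 14.4GB, which is plausible given that the parameter count is rounded and reporting units (GB vs GiB) may differ.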

KV Cache Stability

Extensively tested with continuous multi-turn conversations (10+ rounds, 15K+ tokens):

  • ✅ No KV cache corruption
  • ✅ No SSM state rollback errors
  • ✅ Stable MTP acceptance rates across long sessions
  • ✅ Suitable for production use as an OpenAI-compatible API server

Important Notes

  1. MTP vs DFlash: These are mutually exclusive. MTP uses the model's own weights; DFlash uses a separate draft model. Pick one.
  2. MTP vs TurboQuant KV: Also mutually exclusive. MTP requires exact attention computation paths; TurboQuant approximates KV cache, breaking MTP's draft validation.
  3. Only Qwen3.5/3.6 architecture: The MTP path described here applies to the Qwen3.5/3.6 architecture. Models released without an MTP head (Llama, Gemma, etc.) cannot use it.
  4. Requires feat/mtp-batched branch: The official PyPI mlx-lm does not support MTP inference yet.
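
The mutual-exclusion rules in notes 1 and 2 can be checked before starting a server. The function below is a hypothetical pre-flight check, not part of omlx or mlx-lm. It reuses the key names from the model_settings.json example above, and whether omlx performs a similar validation itself is not known here.

```python
from typing import Dict, List


def check_mtp_settings(settings: Dict[str, object]) -> List[str]:
    """Return a list of conflicts for one model's settings (empty if none)."""
    problems = []
    if settings.get("mtp_enabled"):
        if settings.get("dflash_enabled"):
            problems.append("mtp_enabled and dflash_enabled are mutually exclusive")
        if settings.get("turboquant_kv_enabled"):
            problems.append("mtp_enabled and turboquant_kv_enabled are mutually exclusive")
    return problems


if __name__ == "__main__":
    example = {"mtp_enabled": True, "turboquant_kv_enabled": False, "dflash_enabled": False}
    print(check_mtp_settings(example) or "settings are consistent")
```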

License

This model follows the same license as the original Qwen/Qwen3.6-27B: Apache 2.0

The conversion and upload are provided as-is for community use, with no warranties, express or implied.




Tags: mlx, qwen, qwen3.6, mtp, multi-token-prediction, speculative-decoding, mlx-lm, apple-silicon, 4bit, quantized, local-llm, inference-acceleration

Keywords: Qwen3.6 MLX conversion, native MTP weights, speculative decoding, multi-token prediction, mlx-lm mtp branch, omlx, Apple Silicon inference, Qwen 27B 4bit, local LLM deployment
