Qwen3.6-27B-MLX-4bit-MTP
Native MTP (Multi-Token Prediction) weights preserved for MLX inference. The official mlx-community releases strip MTP weights during sanitize(), breaking speculative decoding support.
Introduction
This is a 4-bit quantized MLX conversion of Qwen/Qwen3.6-27B with native MTP (Multi-Token Prediction) weights fully preserved.
Most pre-converted MLX models on HuggingFace (including mlx-community releases) remove MTP weights during the sanitize() step, making it impossible to use Qwen3.6's built-in speculative decoding acceleration. This model fixes that.
Why This Matters
Qwen3.6 models include a dedicated MTP head (an extra transformer layer) trained to predict token t+2 from the hidden state at t and the embedding of t+1. When enabled during inference:
- Speedup: roughly 1.5-2x faster generation than standard autoregressive decoding
- Quality: no quality loss; MTP is trained end-to-end with the base model, and drafted tokens are kept only when the base model's verification accepts them
- No draft model needed: unlike DFlash or standard speculative decoding, MTP uses the model's own weights, so there is no separate draft model to load or manage
Official MLX conversions strip these weights because mlx-lm's sanitize() method drops mtp.* keys. This repository preserves them.
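To make the draft-and-verify idea concrete, here is a toy sketch of greedy decoding with one drafted token per step. Everything in it is hypothetical: `target_next` stands in for the full model's greedy next-token choice and `draft_next` for the MTP head's guess, and neither is an mlx-lm API. It simplifies the real mechanism (which works on hidden states and embeddings) down to token-level callbacks.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_greedy(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
) -> Tuple[List[Token], int, int]:
    """Toy greedy decoding with one drafted token per verification pass.

    Returns (generated tokens, accepted drafts, verification passes).
    """
    seq = list(prompt)
    n_prompt = len(seq)
    accepted = 0
    passes = 0
    while len(seq) - n_prompt < max_new_tokens:
        guess = draft_next(seq)      # cheap guess at the next token
        passes += 1                  # one pass of the full model
        true_tok = target_next(seq)  # what the full model itself emits
        seq.append(true_tok)
        if guess == true_tok and len(seq) - n_prompt < max_new_tokens:
            # Draft accepted: a real implementation scores the position after
            # the draft in the same pass, so a second token comes for free.
            accepted += 1
            seq.append(target_next(seq))
    return seq[n_prompt:], accepted, passes
```

In this sketch the generated tokens are the same whether the draft is always right or always wrong; only the number of passes changes, which is where the speedup comes from.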
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-27B |
| Quantization | 4-bit affine, group_size=64 |
| Total Parameters | ~27B |
| Context Length | 262,144 |
| Architecture | Qwen3.5 (Gated DeltaNet hybrid) |
| MTP Layers | 1 |
| File Size | ~14.4GB (3 safetensors) |
| Compatible With | mlx-lm feat/mtp-batched branch |
Performance (M5 Max, 64GB RAM)
| Mode | Tokens/sec | Notes |
|---|---|---|
| Standard | ~30-35 | Baseline |
| MTP enabled | ~45-55 | ~1.5x speedup |
| MTP + DFlash | ❌ Incompatible | Cannot combine |
| MTP + TurboQuant KV | ❌ Incompatible | MTP needs exact attention path |
MTP acceptance rate observed: 74-84% in continuous conversation tests (15K+ tokens).
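As a rough sanity check, the acceptance rate can be related to the throughput numbers in the table. The sketch below assumes exactly one drafted token per verification pass (an assumption suggested by the single MTP layer, not confirmed by the branch), in which case each pass yields at most `1 + acceptance_rate` tokens on average. The function names are illustrative only.

```python
def expected_tokens_per_pass(acceptance_rate: float) -> float:
    """Average tokens emitted per full-model pass with one drafted token."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    return 1.0 + acceptance_rate


def measured_speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return mtp_tps / baseline_tps


if __name__ == "__main__":
    for rate in (0.74, 0.84):
        ideal = expected_tokens_per_pass(rate)
        print(f"acceptance {rate:.0%}: up to {ideal:.2f} tokens per pass")
    low = measured_speedup(30, 45)
    high = measured_speedup(35, 55)
    print(f"measured speedup from the table: {low:.2f}x to {high:.2f}x")
```

Under that assumption the ideal figure is an upper bound, since it ignores the cost of running the MTP head and of verifying drafts. One plausible reading is that this overhead accounts for the gap between the bound and the measured speedup of about 1.5x.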
Required Branch / Version
You MUST use the feat/mtp-batched branch of mlx-lm. The official PyPI release and mlx-community conversions do NOT support MTP inference — they strip mtp.* weights during sanitize().
| Component | Required Path / Branch |
|---|---|
| mlx-lm install | git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched |
| Model on HuggingFace | samwang0041/Qwen3.6-27B-MLX-4bit-MTP |
| Original base model | Qwen/Qwen3.6-27B |
# Install the exact branch
pip install git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched
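After installing, it can be useful to confirm that the environment really has the branch build rather than the PyPI release. The sketch below relies on the `direct_url.json` file that pip records for VCS installs (PEP 610). The distribution name `mlx-lm` is an assumption, as is the idea that your installer writes this file; other install tools may behave differently.

```python
import json
from importlib import metadata
from typing import Optional


def requested_revision(direct_url_json: Optional[str]) -> Optional[str]:
    """Return the VCS revision recorded at install time, or None."""
    if not direct_url_json:
        return None  # e.g. installed from PyPI: no direct_url.json present
    info = json.loads(direct_url_json)
    return info.get("vcs_info", {}).get("requested_revision")


def installed_mlx_lm_revision() -> Optional[str]:
    """Look up the recorded revision for the installed mlx-lm, if any."""
    try:
        dist = metadata.distribution("mlx-lm")
    except metadata.PackageNotFoundError:
        return None
    return requested_revision(dist.read_text("direct_url.json"))


if __name__ == "__main__":
    print("mlx-lm requested revision:", installed_mlx_lm_revision())
```

A result of `feat/mtp-batched` suggests the branch install took effect. `None` means either a PyPI install or an installer that did not record the source.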
Model Directory Structure
After mlx_lm.load() downloads the model, or after a manual conversion, the model directory contains:
qwen3.6-27b-4bit-mtp/
├── config.json # Model config (includes text_config.mtp_num_hidden_layers: 1)
├── tokenizer.json # BPE tokenizer
├── tokenizer_config.json
├── chat_template.jinja # Qwen3.6 chat template
├── model.safetensors.index.json # Weight map
├── model-00001-of-00003.safetensors # ~5.0GB (backbone layers 0-21)
├── model-00002-of-00003.safetensors # ~5.4GB (backbone layers 22-42 + lm_head)
└── model-00003-of-00003.safetensors # ~4.7GB (backbone layers 43-63 + MTP weights)
MTP weights are inside shard 3 under these keys (as seen in model.safetensors.index.json):
- language_model.mtp.fc.weight
- language_model.mtp.layers.0.input_layernorm.weight
- language_model.mtp.layers.0.mlp.down_proj.weight
- language_model.mtp.layers.0.mlp.gate_up_proj.weight
- language_model.mtp.layers.0.post_attention_layernorm.weight
- language_model.mtp.layers.0.self_attn.k_proj.weight
- language_model.mtp.layers.0.self_attn.o_proj.weight
- language_model.mtp.layers.0.self_attn.q_proj.weight
- language_model.mtp.layers.0.self_attn.v_proj.weight
- language_model.mtp.emb.weight
- language_model.mtp.lm_head.weight
- ... (31 total MTP weight keys)
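To check a local copy for these keys, one option is to read the index file directly. The sketch below assumes the index follows the usual layout with a top-level `weight_map` object mapping weight names to shard files; the helper names are illustrative.

```python
import json
from pathlib import Path
from typing import Dict, List


def mtp_keys_by_shard(weight_map: Dict[str, str]) -> Dict[str, List[str]]:
    """Group weight names that have an `mtp` path component by shard file."""
    shards: Dict[str, List[str]] = {}
    for name, shard in weight_map.items():
        if "mtp" in name.split("."):
            shards.setdefault(shard, []).append(name)
    return {shard: sorted(names) for shard, names in shards.items()}


def load_weight_map(model_dir: str) -> Dict[str, str]:
    """Read the weight map from model.safetensors.index.json in model_dir."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    return json.loads(index_path.read_text())["weight_map"]
```

Calling `mtp_keys_by_shard(load_weight_map("/path/to/qwen3.6-27b-4bit-mtp"))` returns a dict keyed by shard file. An empty result would indicate a conversion that dropped the MTP weights.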
How to Use
CLI generation:
python -m mlx_lm.generate \
--model samwang0041/Qwen3.6-27B-MLX-4bit-MTP \
--prompt "Explain quantum computing" \
--max-tokens 512 \
--mtp
Python API:
from mlx_lm import load, generate
# Auto-downloads from HuggingFace if not cached locally
model, tokenizer = load("samwang0041/Qwen3.6-27B-MLX-4bit-MTP")
response = generate(
    model, tokenizer,
    prompt="Explain quantum computing",
    max_tokens=512,
    verbose=True,
    mtp=True,  # Enable MTP speculative decoding
)
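For chat-style prompts it is usually better to run the text through the tokenizer's chat template first. The following is a minimal sketch under two assumptions: the `mtp=True` keyword is taken from the example above and presumed to be specific to the feat/mtp-batched branch, and `build_messages` and `chat_once` are hypothetical helper names. The mlx-lm import is deferred so that the message builder works on its own.

```python
from typing import Dict, List, Optional

MODEL_ID = "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"


def build_messages(
    user_prompt: str, system_prompt: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build a chat message list in the role/content format."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def chat_once(user_prompt: str, max_tokens: int = 512) -> str:
    """Load the model, apply the chat template, and generate one reply."""
    # Imported here so build_messages is usable without mlx-lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(MODEL_ID)
    prompt = tokenizer.apply_chat_template(
        build_messages(user_prompt), add_generation_prompt=True
    )
    # mtp=True is assumed to exist only on the feat/mtp-batched branch.
    return generate(
        model, tokenizer, prompt=prompt, max_tokens=max_tokens, mtp=True
    )
```

`chat_once("Explain quantum computing")` reloads the model on every call, which keeps the sketch short; a longer-lived program would load once and reuse the model.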
Load from local path:
# If you downloaded or converted the model manually
from mlx_lm import load
model, tokenizer = load("/path/to/qwen3.6-27b-4bit-mtp")
With omlx server:
Add to ~/.omlx/model_settings.json:
{
  "qwen3.6-27b-4bit-mtp": {
    "mtp_enabled": true,
    "turboquant_kv_enabled": false,
    "dflash_enabled": false,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20
  }
}
The model directory should be placed at:
~/.omlx/models/qwen3.6-27b-4bit-mtp/
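Because MTP cannot be combined with DFlash or TurboQuant KV, a small pre-flight check on the settings file can catch a bad combination before the server starts. This is a sketch only: it uses the three flag names from the JSON above, is not part of omlx, and the function names are made up for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Flags that this model card describes as incompatible with MTP.
EXCLUSIVE_WITH_MTP = ("turboquant_kv_enabled", "dflash_enabled")


def mtp_conflicts(settings: Dict[str, Any]) -> List[str]:
    """Return the enabled flags that conflict with mtp_enabled, if any."""
    if not settings.get("mtp_enabled", False):
        return []
    return [flag for flag in EXCLUSIVE_WITH_MTP if settings.get(flag, False)]


def check_settings_file(path: str) -> Dict[str, List[str]]:
    """Map each model entry in the settings file to its conflicting flags."""
    all_settings = json.loads(Path(path).expanduser().read_text())
    report = {name: mtp_conflicts(cfg) for name, cfg in all_settings.items()}
    return {name: flags for name, flags in report.items() if flags}
```

`check_settings_file("~/.omlx/model_settings.json")` returns an empty dict when no entry mixes MTP with an incompatible option.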
Conversion Process
This model was converted using the following steps:
# 1. Install mlx-lm with MTP support
pip install git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched
# 2. Download original weights and convert with quantization
python -m mlx_lm convert \
--hf-path Qwen/Qwen3.6-27B \
--mlx-path qwen3.6-27b-4bit-mtp \
-q --q-bits 4 --q-group-size 64
# 3. The key: feat/mtp-batched branch preserves mtp.* weights during sanitize()
# Official mlx-lm releases strip them, breaking speculative decoding
The conversion was performed on an Apple M5 Max with 64GB RAM. Original BF16 weights (54GB) were loaded and quantized to 4-bit (14.4GB).
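The sizes quoted above can be approximated with simple arithmetic. The sketch assumes a round 27 billion parameters, decimal gigabytes, every tensor quantized, and one 16-bit scale plus one 16-bit bias stored per group of 64 weights. These are simplifying assumptions, not measured properties of this checkpoint.

```python
def affine_bits_per_weight(
    bits: int, group_size: int, scale_bits: int = 16, bias_bits: int = 16
) -> float:
    """Effective bits per weight, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage in decimal gigabytes for n_params weights."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 27e9  # assumed round figure for "~27B"
    print(f"BF16: {weights_gb(n_params, 16):.1f} GB")
    bpw = affine_bits_per_weight(bits=4, group_size=64)
    print(f"4-bit, group 64: {bpw} bits/weight, {weights_gb(n_params, bpw):.1f} GB")
```

The BF16 estimate matches the 54GB figure. The 4-bit estimate comes out somewhat above the reported ~14.4GB, so the exact parameter count or the unit convention probably differs from what is assumed here.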
KV Cache Stability
Extensively tested with continuous multi-turn conversations (10+ rounds, 15K+ tokens):
- ✅ No KV cache corruption
- ✅ No SSM state rollback errors
- ✅ Stable MTP acceptance rates across long sessions
- ✅ Suitable for production use as an OpenAI-compatible API server
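For the API-server use case, a client only needs to speak the OpenAI-style chat completions protocol. The sketch below uses the standard library and assumes a server is already listening at a base URL such as `http://localhost:8080/v1`; that address is an assumption, so adjust it to your setup. It does not cover how MTP is enabled on the server side.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, user_text: str, max_tokens: int = 256
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_text: str) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat("http://localhost:8080/v1", "samwang0041/Qwen3.6-27B-MLX-4bit-MTP", "Hello")` sends a single-turn request and returns the reply text.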
Important Notes
- MTP vs DFlash: These are mutually exclusive. MTP uses the model's own weights; DFlash uses a separate draft model. Pick one.
- MTP vs TurboQuant KV: Also mutually exclusive. MTP requires exact attention computation paths; TurboQuant approximates KV cache, breaking MTP's draft validation.
- Only Qwen3.5/3.6 architecture: MTP is specific to Qwen3.5/3.6's architecture. Other models (Llama, Gemma, etc.) do not have this feature.
- Requires feat/mtp-batched branch: The official PyPI mlx-lm does not support MTP inference yet.
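To see whether a given checkpoint declares an MTP head at all, its config.json can be inspected. This sketch reads `text_config.mtp_num_hidden_layers`, the field mentioned in the directory listing; the fallback to a top-level field of the same name is an assumption for configs without a `text_config` section.

```python
import json
from pathlib import Path
from typing import Any, Dict


def mtp_layer_count(config: Dict[str, Any]) -> int:
    """Return the number of MTP layers a config declares (0 if none)."""
    text_config = config.get("text_config") or {}
    value = text_config.get(
        "mtp_num_hidden_layers",
        config.get("mtp_num_hidden_layers", 0),  # assumed fallback location
    )
    return int(value or 0)


def mtp_layer_count_from_dir(model_dir: str) -> int:
    """Read config.json from model_dir and report its declared MTP layers."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return mtp_layer_count(config)
```

A nonzero value only shows that the architecture declares an MTP head. Whether the weights survived conversion is a separate question, answered by the weight index.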
Related Links
- mlx-lm MTP PR #990 — Official MTP implementation PR
- feat/mtp-batched branch — The branch used for conversion
- Qwen3.6 Official Repo
- MLX Documentation
License
This model follows the same license as the original Qwen/Qwen3.6-27B: Apache 2.0
The conversion and upload are provided as-is for community use. No warranties expressed or implied.
Tags: mlx, qwen, qwen3.6, mtp, multi-token-prediction, speculative-decoding, mlx-lm, apple-silicon, 4bit, quantized, local-llm, inference-acceleration
Keywords: Qwen3.6 MLX conversion, native MTP weights, speculative decoding, multi-token prediction, mlx-lm mtp branch, omlx, Apple Silicon inference, Qwen 27B 4bit, local LLM deployment