Libraries

How to use samwang0041/Qwen3.6-27B-MLX-4bit-MTP with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("samwang0041/Qwen3.6-27B-MLX-4bit-MTP")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use samwang0041/Qwen3.6-27B-MLX-4bit-MTP with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use samwang0041/Qwen3.6-27B-MLX-4bit-MTP with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default samwang0041/Qwen3.6-27B-MLX-4bit-MTP

Run Hermes

hermes

MLX LM

How to use samwang0041/Qwen3.6-27B-MLX-4bit-MTP with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "samwang0041/Qwen3.6-27B-MLX-4bit-MTP",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
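
The same request can also be issued from Python. The sketch below uses only the standard library and assumes the server is listening on the default mlx_lm.server port, 8080 (adjust base_url if you passed --port). The helper names build_chat_request and chat are illustrative and not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "samwang0041/Qwen3.6-27B-MLX-4bit-MTP"


def build_chat_request(prompt, base_url="http://localhost:8080/v1", max_tokens=256):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server started above): print(chat("Hello"))
```

Separating request construction from sending makes it possible to inspect the payload without a running server.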

Qwen3.6-27B-MLX-4bit-MTP

Native MTP (Multi-Token Prediction) weights preserved for MLX inference. The official mlx-community releases strip MTP weights during sanitize(), breaking speculative decoding support.

Introduction

This is a 4-bit quantized MLX conversion of Qwen/Qwen3.6-27B with native MTP (Multi-Token Prediction) weights fully preserved.

Most pre-converted MLX models on HuggingFace (including mlx-community releases) remove MTP weights during the sanitize() step, making it impossible to use Qwen3.6's built-in speculative decoding acceleration. This model fixes that.

Why This Matters

Qwen3.6 models include a dedicated MTP head (an extra transformer layer) trained to predict token t+2 from the hidden state at t and the embedding of t+1. When enabled during inference:

Speedup: ~1.5-2x faster generation compared to standard autoregressive decoding
Quality: No quality loss. Drafted tokens are validated by the main model before they are accepted, and the MTP head is trained end-to-end with the base model
No draft model needed: Unlike DFlash or standard speculative decoding, MTP uses the model's own weights, so there is no separate draft model to load or manage

Official MLX conversions strip these weights because mlx-lm's sanitize() method drops mtp.* keys. This repository preserves them.
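
To illustrate why drafting with an MTP head does not change the output, here is a toy sketch of the draft-and-verify loop with a single drafted token. The functions target_next and draft_next are hypothetical stand-ins for the main model and the MTP head; this is not the mlx-lm implementation, only a minimal greedy example of the general technique.

```python
def target_next(tokens):
    """Toy stand-in for the main model's greedy next-token choice."""
    return (tokens[-1] * 3 + len(tokens)) % 7


def draft_next(tokens):
    """Toy stand-in for the MTP head: agrees with the main model except
    when the context length is a multiple of 4."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 7


def greedy_decode(prompt, n):
    """Plain autoregressive decoding: one main-model pass per token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def draft_and_verify_decode(prompt, n):
    """Decode n tokens, drafting one token ahead and verifying it.

    Each loop iteration models ONE main-model pass that scores both the
    drafted position and the position after it.
    """
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n:
        draft = draft_next(tokens)
        passes += 1
        verified = target_next(tokens)
        if draft == verified:
            # Draft accepted: keep it and take the following token as a bonus.
            tokens.append(draft)
            tokens.append(target_next(tokens))
        else:
            # Draft rejected: fall back to the main model's own token.
            tokens.append(verified)
    return tokens[len(prompt):len(prompt) + n], passes


output, passes = draft_and_verify_decode([1, 2, 3], 20)
print(output == greedy_decode([1, 2, 3], 20), passes)
```

Because a rejected draft is always replaced by the main model's own choice, the generated sequence matches plain greedy decoding while needing fewer main-model passes.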

Model Details

Property	Value
Base Model	`Qwen/Qwen3.6-27B`
Quantization	4-bit affine, group_size=64
Total Parameters	~27B
Context Length	262,144
Architecture	Qwen3.5 (Gated DeltaNet hybrid)
MTP Layers	1
File Size	~14.4GB (3 safetensors)
Compatible With	`mlx-lm` feat/mtp-batched branch

Performance (M5 Max, 64GB RAM)

Mode	Tokens/sec	Notes
Standard	~30-35	Baseline
MTP enabled	~45-55	~1.5x speedup
MTP + DFlash	❌ Incompatible	Cannot combine
MTP + TurboQuant KV	❌ Incompatible	MTP needs exact attention path

MTP acceptance rate observed: 74-84% in continuous conversation tests (15K+ tokens).
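
As a rough sanity check on these numbers: with one MTP layer, each main-model pass yields one token plus one more whenever the draft is accepted. The sketch below computes that idealised upper bound, assuming each draft is accepted independently at the given rate and ignoring the cost of running the MTP head and of verification. The function name is illustrative.

```python
def expected_tokens_per_pass(acceptance_rate, draft_tokens=1):
    """Expected tokens emitted per main-model pass when up to `draft_tokens`
    drafts are verified in order and each is accepted independently."""
    return sum(acceptance_rate ** i for i in range(draft_tokens + 1))


for p in (0.74, 0.84):
    print(f"acceptance {p:.0%}: at most {expected_tokens_per_pass(p):.2f}x")
```

For the observed 74-84% acceptance this gives a ceiling of roughly 1.74-1.84x, so the measured ~1.5x speedup sits below the bound, as expected once drafting and verification overhead are included.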

Required Branch / Version

You MUST use the feat/mtp-batched branch of mlx-lm. The official PyPI release and mlx-community conversions do NOT support MTP inference — they strip mtp.* weights during sanitize().

Component	Required Path / Branch
`mlx-lm` install	`git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched`
Model on HuggingFace	`samwang0041/Qwen3.6-27B-MLX-4bit-MTP`
Original base model	`Qwen/Qwen3.6-27B`

# Install the exact branch
pip install git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched
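
One way to confirm which revision is installed is to read the direct_url.json metadata that pip records for installs from a VCS URL (PEP 610). This is a minimal sketch under that assumption; the helper names are illustrative, and a plain PyPI install has no such file, in which case the result is None.

```python
import json
from importlib import metadata


def requested_revision(direct_url_text):
    """Return the git revision recorded in a PEP 610 direct_url.json, or None."""
    if not direct_url_text:
        return None
    info = json.loads(direct_url_text).get("vcs_info", {})
    return info.get("requested_revision")


def installed_mlx_lm_revision():
    """Look up the revision recorded when mlx-lm was installed from git."""
    try:
        dist = metadata.distribution("mlx-lm")
    except metadata.PackageNotFoundError:
        return None
    return requested_revision(dist.read_text("direct_url.json"))

# Usage: installed_mlx_lm_revision() == "feat/mtp-batched"
```

Run it with the same Python environment that will load the model, since tools such as uv install into their own isolated environments.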

Model Directory Structure

After mlx_lm.load() downloads the model, or after a manual conversion, the model directory contains:

qwen3.6-27b-4bit-mtp/
├── config.json                          # Model config (includes text_config.mtp_num_hidden_layers: 1)
├── tokenizer.json                       # BPE tokenizer
├── tokenizer_config.json
├── chat_template.jinja                  # Qwen3.6 chat template
├── model.safetensors.index.json         # Weight map
├── model-00001-of-00003.safetensors     # ~5.0GB (backbone layers 0-21)
├── model-00002-of-00003.safetensors     # ~5.4GB (backbone layers 22-42 + lm_head)
└── model-00003-of-00003.safetensors     # ~4.7GB (backbone layers 43-63 + MTP weights)

MTP weights are inside shard 3 under these keys (as seen in model.safetensors.index.json):

language_model.mtp.fc.weight
language_model.mtp.layers.0.input_layernorm.weight
language_model.mtp.layers.0.mlp.down_proj.weight
language_model.mtp.layers.0.mlp.gate_up_proj.weight
language_model.mtp.layers.0.post_attention_layernorm.weight
language_model.mtp.layers.0.self_attn.k_proj.weight
language_model.mtp.layers.0.self_attn.o_proj.weight
language_model.mtp.layers.0.self_attn.q_proj.weight
language_model.mtp.layers.0.self_attn.v_proj.weight
language_model.mtp.emb.weight
language_model.mtp.lm_head.weight
... (31 total MTP weight keys)
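
To check whether a local conversion kept these weights, the index file can be scanned for MTP keys. The sketch below assumes the usual safetensors index layout (a top-level "weight_map" from weight name to shard file) and the "language_model.mtp." prefix shown above; the function names are illustrative.

```python
import json
from pathlib import Path


def mtp_keys(weight_map):
    """Return the weight names that belong to the MTP head, sorted."""
    return sorted(k for k in weight_map if ".mtp." in k or k.startswith("mtp."))


def mtp_keys_in_model_dir(model_dir):
    """Read model.safetensors.index.json from a local model directory."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    with index_path.open() as f:
        return mtp_keys(json.load(f)["weight_map"])

# Usage: len(mtp_keys_in_model_dir("qwen3.6-27b-4bit-mtp")) should be non-zero
```

An empty result means the conversion dropped the MTP weights and MTP decoding will not be available for that directory.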

How to Use

CLI generation:

python -m mlx_lm.generate \
  --model samwang0041/Qwen3.6-27B-MLX-4bit-MTP \
  --prompt "Explain quantum computing" \
  --max-tokens 512 \
  --mtp

Python API:

from mlx_lm import load, generate

# Auto-downloads from HuggingFace if not cached locally
model, tokenizer = load("samwang0041/Qwen3.6-27B-MLX-4bit-MTP")

response = generate(
    model, tokenizer,
    prompt="Explain quantum computing",
    max_tokens=512,
    verbose=True,
    mtp=True,  # Enable MTP speculative decoding
)

Load from local path:

# If you downloaded or converted the model manually
from mlx_lm import load

model, tokenizer = load("/path/to/qwen3.6-27b-4bit-mtp")

With omlx server:

Add to ~/.omlx/model_settings.json:

{
  "qwen3.6-27b-4bit-mtp": {
    "mtp_enabled": true,
    "turboquant_kv_enabled": false,
    "dflash_enabled": false,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20
  }
}

The model directory should be placed at:

~/.omlx/models/qwen3.6-27b-4bit-mtp/
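
Since MTP cannot be combined with DFlash or TurboQuant KV, it can help to lint the settings file before starting the server. The following is a standalone sketch that relies only on the three flag names shown in the JSON above; it is not part of omlx, and the function names are illustrative.

```python
import json

# Flags that this README describes as incompatible with MTP.
CONFLICTS_WITH_MTP = ("dflash_enabled", "turboquant_kv_enabled")


def mtp_conflicts(model_settings):
    """Return the flags enabled alongside mtp_enabled for one model entry."""
    if not model_settings.get("mtp_enabled", False):
        return []
    return [flag for flag in CONFLICTS_WITH_MTP if model_settings.get(flag, False)]


def check_settings_text(text):
    """Map each model name in a settings JSON string to its conflicting flags."""
    return {name: mtp_conflicts(cfg) for name, cfg in json.loads(text).items()}

# Usage:
#   from pathlib import Path
#   text = Path("~/.omlx/model_settings.json").expanduser().read_text()
#   print(check_settings_text(text))
```

An empty list for a model means its entry does not enable any of the conflicting features.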

Conversion Process

This model was converted using the following steps:

# 1. Install mlx-lm with MTP support
pip install git+https://github.com/AirRunner/mlx-lm.git@feat/mtp-batched

# 2. Download original weights and convert with quantization
python -m mlx_lm convert \
  --hf-path Qwen/Qwen3.6-27B \
  --mlx-path qwen3.6-27b-4bit-mtp \
  -q --q-bits 4 --q-group-size 64

# 3. The key: feat/mtp-batched branch preserves mtp.* weights during sanitize()
#    Official mlx-lm releases strip them, breaking speculative decoding

The conversion was performed on an Apple M5 Max with 64GB RAM. Original BF16 weights (~54GB) were loaded and quantized to 4-bit (~14.4GB).
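
These sizes can be approximated from the parameter count. The sketch below assumes a round 27B parameters, that every weight is quantized, and that affine quantization stores one 16-bit scale and one 16-bit bias per group of 64 weights. All three are simplifying assumptions, so the result is only a ballpark figure.

```python
def bf16_size_gb(num_params):
    """BF16 stores each parameter in 2 bytes."""
    return num_params * 2 / 1e9


def affine_quantized_size_gb(num_params, bits=4, group_size=64, meta_bits=16):
    """Each weight costs `bits` plus its share of the per-group scale and bias."""
    bits_per_weight = bits + 2 * meta_bits / group_size
    return num_params * bits_per_weight / 8 / 1e9


params = 27e9  # "~27B" from the Model Details table; the exact count differs
print(f"BF16:  ~{bf16_size_gb(params):.1f} GB")
print(f"4-bit: ~{affine_quantized_size_gb(params):.1f} GB")
```

With group_size=64 the effective cost is about 4.5 bits per weight, which puts the estimate in the same range as the shard sizes listed above. The exact figure depends on the true parameter count and on which tensors are left unquantized.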

KV Cache Stability

Extensively tested with continuous multi-turn conversations (10+ rounds, 15K+ tokens):

✅ No KV cache corruption
✅ No SSM state rollback errors
✅ Stable MTP acceptance rates across long sessions
✅ Suitable for production use as an OpenAI-compatible API server

Important Notes

MTP vs DFlash: These are mutually exclusive. MTP uses the model's own weights; DFlash uses a separate draft model. Pick one.
MTP vs TurboQuant KV: Also mutually exclusive. MTP requires exact attention computation paths; TurboQuant approximates KV cache, breaking MTP's draft validation.
Only Qwen3.5/3.6 architecture: The MTP support described here targets the Qwen3.5/3.6 architecture. Models that ship without an MTP head (Llama, Gemma, etc.) cannot use it.
Requires feat/mtp-batched branch: The official PyPI mlx-lm does not support MTP inference yet.

License

This model follows the same license as the original Qwen/Qwen3.6-27B: Apache 2.0

The conversion and upload are provided as-is for community use. No warranties expressed or implied.

Tags: mlx, qwen, qwen3.6, mtp, multi-token-prediction, speculative-decoding, mlx-lm, apple-silicon, 4bit, quantized, local-llm, inference-acceleration

Keywords: Qwen3.6 MLX conversion, native MTP weights, speculative decoding, multi-token prediction, mlx-lm mtp branch, omlx, Apple Silicon inference, Qwen 27B 4bit, local LLM deployment
