Libraries

How to use LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"
        }
      ]
    }
  }
}
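
If ~/.pi/agent/models.json already lists other providers, pasting the JSON above over the file would discard them. The following is a minimal sketch of merging the entry instead. It assumes the file has the shape shown above (a top-level "providers" object); the helper names `merge_provider` and `update_models_file` are hypothetical and not part of Pi.

```python
import json
from pathlib import Path

PROVIDER_NAME = "mlx-lm"
PROVIDER_ENTRY = {
    "baseUrl": "http://localhost:8080/v1",
    "api": "openai-completions",
    "apiKey": "none",
    "models": [{"id": "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"}],
}


def merge_provider(config, name=PROVIDER_NAME, entry=PROVIDER_ENTRY):
    """Return a copy of config with the provider added or replaced."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    providers[name] = entry  # other providers are left untouched
    merged["providers"] = providers
    return merged


def update_models_file(path):
    """Read path if it exists, merge the provider, and write it back."""
    path = Path(path).expanduser()
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merge_provider(existing), indent=2) + "\n")
    return path
```

Calling `update_models_file("~/.pi/agent/models.json")` writes the merged file; back up the original first if it contains settings you care about.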

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit

Run Hermes

hermes

MLX LM

How to use LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"
# Call the OpenAI-compatible server with curl (from a second terminal; the server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
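
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client. It assumes the server is reachable on port 8080 (the port the Pi and Hermes configurations in this document point at) and that the response follows the usual OpenAI chat-completions shape. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: port 8080. Change BASE_URL if your server uses another port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=300):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. The generous timeout is a guess to allow for model load on the first request.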

Qwen3.6-35B-A3B — combined trunk + MTP head — MLX 8-bit

A single-file MLX conversion of Qwen3.6-35B-A3B (a Qwen3-Next / MoE model, 256 experts, top-8, ~3B active params) that keeps the model's Multi-Token Prediction (MTP) head inline and fully intact, converted directly from the official Qwen bf16 weights. Built for lemon-mlx-engine on AMD ROCm.

Why this conversion is different

The MTP head is complete. Other MLX MTP conversions drop the head's 256 routed experts (keeping only the router + shared expert), which cripples draft quality. This conversion preserves all 256 head experts, so the speculative head drafts well: ~77% draft acceptance at n_draft=2 (greedy, measured on the 4-bit variant; ~70% on this 8-bit variant), versus near-useless drafts when the experts are missing.
Direct-from-Qwen, correct Qwen3-Next handling. The zero-centered RMSNorm (effective weight = stored + 1.0, applied to every norm incl. the three MTP head norms) and the conv1d weight layout are applied at convert time. Skipping these produces incoherent output.
One file: trunk + MTP together. Loadable by lemon-mlx-engine's one-file path — no separate draft model to manage.
Draft-fidelity precision. The tiny (~0.5 GB) MTP head is kept in bf16 regardless of trunk precision, since it is quant-sensitive and over-quantizing it lowers acceptance.
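
To illustrate the zero-centered RMSNorm point above, here is a small pure-Python sketch of what "effective weight = stored + 1.0, applied at convert time" means. It is not lemon-mlx-engine's converter code. The function names and the `eps` value are assumptions made for the example.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Scale x to unit root-mean-square (no learned weight)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def zero_centered_rmsnorm(x, stored_weight, eps=1e-6):
    """Runtime form: the checkpoint stores w, the layer applies (1 + w)."""
    normed = rms_normalize(x, eps)
    return [n * (1.0 + w) for n, w in zip(normed, stored_weight)]


def plain_rmsnorm(x, weight, eps=1e-6):
    """Conventional form: the weight is applied exactly as stored."""
    normed = rms_normalize(x, eps)
    return [n * w for n, w in zip(normed, weight)]


def fold_offset(stored_weight):
    """Convert-time step: bake the +1.0 into the weight once."""
    return [1.0 + w for w in stored_weight]


x = [0.5, -1.0, 2.0, 3.0]
stored = [0.0, 0.1, -0.2, 0.05]  # made-up values near zero

reference = zero_centered_rmsnorm(x, stored)
converted = plain_rmsnorm(x, fold_offset(stored))  # matches the reference
unconverted = plain_rmsnorm(x, stored)  # scales by w instead of 1 + w
```

After folding, a runtime with a conventional RMSNorm reproduces the zero-centered result. Without the fold, each channel is scaled by a value near zero rather than near one, which is consistent with the incoherent output described above.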

Variants (this org)

precision	size	repo
4-bit	~20 GB	LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-4bit
6-bit	~28 GB	LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-6bit
8-bit	~36 GB	LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit

In this repo the trunk weights are quantized to 8-bit; the MTP head stays bf16 in every variant.
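
As a rough sanity check on the sizes in the table, the sketch below estimates file size from the bit width. Everything in it is an assumption for illustration: a nominal 35e9 trunk parameters, all of them quantized, group-wise affine quantization with a group size of 64 and a 16-bit scale and bias per group (MLX's defaults, which this conversion may or may not use), and the ~0.5 GB bf16 head counted separately.

```python
def bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Effective bits per weight for group-wise affine quantization."""
    return bits + (scale_bits + bias_bits) / group_size


def estimate_size_gb(n_params, bits, head_gb=0.5, **kwargs):
    """Quantized trunk plus a bf16 MTP head, in decimal GB."""
    trunk_bytes = n_params * bits_per_weight(bits, **kwargs) / 8
    return trunk_bytes / 1e9 + head_gb


for bits in (4, 6, 8):
    print(f"{bits}-bit: ~{estimate_size_gb(35e9, bits):.1f} GB")
```

Under these assumptions the estimates land within about 2 GB of the listed sizes. The remaining gap is expected from a back-of-envelope model that ignores the exact parameter count and any tensors left unquantized.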

Performance (Radeon 8060S / gfx1151 APU, lemon-mlx-engine)

4-bit decode ≈ 40 tok/s, 8-bit ≈ 30 tok/s (no-MTP, greedy, ~1k ctx).
MTP draft acceptance is high at low draft counts (~77% @ n_draft=2, 4-bit; ~70% @ n_draft=2, 8-bit) and falls as draft length grows.
Note: on this MoE-A3B, MTP is roughly throughput-neutral — each draft token activates its own top-8 experts so the verification pass doesn't amortize the way it does for dense models (the same ceiling llama.cpp reports, ~1.2× on this class of APU). The value here is a correct, complete MTP head for speculative decoding and research, not a large speedup on this hardware.
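
The relationship between acceptance rate and throughput can be sketched with the standard speculative-decoding estimate. This is a simplified model, not a measurement. It assumes each draft token is accepted independently with probability `p_accept` (the acceptance figures above may be aggregated differently), and the cost parameters are hypothetical, expressed relative to one plain decode step.

```python
def expected_tokens_per_step(p_accept, n_draft):
    """Expected tokens emitted per verify pass.

    Assumes independent per-token acceptance and that drafting stops at
    the first rejection. The verify pass always contributes one token of
    its own, so the result lies between 1 and n_draft + 1.
    """
    if p_accept >= 1.0:
        return n_draft + 1.0
    return (1.0 - p_accept ** (n_draft + 1)) / (1.0 - p_accept)


def speedup(p_accept, n_draft, verify_cost, draft_cost=0.0):
    """Throughput relative to plain decode (cost 1.0 per token).

    verify_cost: cost of one verify pass, in plain-decode-step units.
    draft_cost: cost of producing one draft token, same units.
    """
    step_cost = verify_cost + n_draft * draft_cost
    return expected_tokens_per_step(p_accept, n_draft) / step_cost
```

With p_accept = 0.70 and n_draft = 2 the expected yield is 2.19 tokens per verify pass. Under this model, speculative decoding stops paying off once a full draft-and-verify step costs more than about 2.19 plain decode steps, which is the situation the note above describes when every draft token pulls in its own set of experts.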

Usage (lemon-mlx-engine)

# plain decode
chat LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit --use-mtp=false
# speculative decode with the inline MTP head (n_draft=2 is the sweet spot here)
chat LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit --use-mtp --n-draft 2

Requirements

These are lemon-mlx-engine models (combined trunk+MTP, Qwen3-Next handling baked in) and target AMD ROCm. The 6-bit variant in particular requires lemon-mlx-engine's fixed ROCm quantized-matmul kernel (stock builds without that fix mis-handle 6-bit packing and produce garbage; 4-bit and 8-bit are unaffected).

Provenance

Converted with lemon-mlx-engine's convert tool directly from the official Qwen/Qwen3.6-35B-A3B bf16 checkpoint. License inherited from the base model.


Safetensors
Model size: 11B params
Tensor types: BF16, U32

Model tree for LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit
Base model: Qwen/Qwen3.6-35B-A3B