Qwen3.6-35B-A3B — combined trunk + MTP head — MLX 8-bit
A single-file MLX conversion of Qwen3.6-35B-A3B (a Qwen3-Next / MoE model, 256 experts, top-8, ~3B active params) that keeps the model's Multi-Token Prediction (MTP) head inline and fully intact, converted directly from the official Qwen bf16 weights. Built for lemon-mlx-engine on AMD ROCm.
Why this conversion is different
- The MTP head is complete. Other MLX MTP conversions drop the head's 256 routed experts (keeping only the router + shared expert), which cripples draft quality. This conversion preserves all 256 head experts, so the speculative head drafts well: greedy draft acceptance at n_draft=2 is ~77% on the 4-bit variant and ~70% on this 8-bit variant, versus near-useless drafts when the experts are missing.
- Direct from Qwen, with correct Qwen3-Next handling. The zero-centered RMSNorm (effective weight = stored + 1.0, applied to every norm, including the three MTP head norms) and the conv1d weight layout are applied at convert time. Skipping these steps produces incoherent output.
- One file: trunk + MTP together. Loadable by lemon-mlx-engine's one-file path — no separate draft model to manage.
- Draft-fidelity precision. The tiny (~0.5 GB) MTP head is kept in bf16 regardless of trunk precision, since it is quant-sensitive and over-quantizing it lowers acceptance.
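The zero-centered RMSNorm point above can be illustrated with a small sketch. It is not lemon-mlx-engine's converter; the function names and the `eps` value are assumptions chosen for illustration. It shows only that adding 1.0 to the stored weight at convert time lets a plain RMSNorm reproduce the zero-centered one, and that skipping the fold changes the output.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm: scale x by 1/rms(x), then multiply by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def zero_centered_rms_norm(x, stored, eps=1e-6):
    """Zero-centered RMSNorm: the checkpoint stores (weight - 1)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + s) for v, s in zip(x, stored)]


def fold_stored_weight(stored):
    """Convert-time fold: effective weight = stored + 1.0."""
    return [s + 1.0 for s in stored]


# Illustrative values only; not taken from the real checkpoint.
x = [0.5, -1.0, 2.0, 0.25]
stored = [0.0, 0.1, -0.2, 0.05]

reference = zero_centered_rms_norm(x, stored)
folded = rms_norm(x, fold_stored_weight(stored))  # fold applied
unfolded = rms_norm(x, stored)                    # fold skipped

print("fold applied matches reference:", folded == reference)
print("fold skipped matches reference:", unfolded == reference)
```

With a stored weight near zero, the unfolded variant scales activations toward zero, which is consistent with the incoherent output described above.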
Variants (this org)
| precision | size | repo |
|---|---|---|
| 4-bit | ~20 GB | LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-4bit |
| 6-bit | ~28 GB | LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-6bit |
| 8-bit | ~36 GB | LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit |
In this repo, the trunk weights are quantized to 8-bit; the MTP head stays in bf16.
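As a rough sanity check on the sizes in the table, the raw weight payload can be estimated from the bit width. This is a back-of-envelope sketch, not a measurement. It assumes a nominal 35e9 trunk parameters (taken from the model name) and the ~0.5 GB bf16 head mentioned above, and it ignores quantization scales and any tensors left unquantized, so it is best read as a lower bound.

```python
TRUNK_PARAMS = 35e9  # assumption: nominal count implied by "35B"
MTP_HEAD_GB = 0.5    # bf16 MTP head size quoted in the card


def raw_weight_gb(n_params, bits):
    """Size in GB of n_params weights stored at `bits` bits each."""
    return n_params * bits / 8 / 1e9


def estimate_gb(bits):
    """Lower-bound file size: quantized trunk plus the bf16 head."""
    return raw_weight_gb(TRUNK_PARAMS, bits) + MTP_HEAD_GB


for bits in (4, 6, 8):
    print(f"{bits}-bit: at least {estimate_gb(bits):.1f} GB")
```

The estimates come out somewhat below the table's figures, which is what one would expect once per-group quantization metadata and unquantized tensors are added back.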
Performance (Radeon 8060S / gfx1151 APU, lemon-mlx-engine)
- 4-bit decode ≈ 40 tok/s, 8-bit ≈ 30 tok/s (no-MTP, greedy, ~1k ctx).
- MTP draft acceptance is high at low draft counts (~77% @ n_draft=2, 4-bit; ~70% @ n_draft=2, 8-bit) and falls as draft length grows.
- Note: on this MoE-A3B, MTP is roughly throughput-neutral — each draft token activates its own top-8 experts so the verification pass doesn't amortize the way it does for dense models (the same ceiling llama.cpp reports, ~1.2× on this class of APU). The value here is a correct, complete MTP head for speculative decoding and research, not a large speedup on this hardware.
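The throughput argument above can be made concrete with a toy cost model. Everything here is an assumption for illustration: draft tokens are treated as accepted independently with a fixed probability, the cost of the MTP head itself is ignored, and `marginal_cost` is a hypothetical knob for how much each extra token adds to the verification pass (near 0 when extra tokens are almost free, near 1 when each token pulls in its own experts). The acceptance figures quoted above may also be defined differently from the per-token probability used here.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verify step.

    Assumes independent per-token acceptance and that drafting stops
    at the first rejection. The target always contributes one token.
    """
    return sum(accept_rate ** k for k in range(n_draft + 1))


def estimated_speedup(accept_rate, n_draft, marginal_cost):
    """Tokens per step divided by the relative cost of a verify step.

    marginal_cost is hypothetical: the extra cost of each additional
    verified token, relative to one ordinary decode step.
    """
    step_cost = 1.0 + n_draft * marginal_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


for accept_rate in (0.77, 0.70):
    for marginal_cost in (0.0, 0.5, 1.0):
        s = estimated_speedup(accept_rate, 2, marginal_cost)
        print(f"accept={accept_rate:.2f} cost={marginal_cost:.1f} "
              f"speedup={s:.2f}x")
```

Under this model, high acceptance only translates into a speedup when the marginal verification cost is small, which matches the qualitative point about MoE verification not amortizing.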
Usage (lemon-mlx-engine)
# plain decode
chat LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit --use-mtp=false
# speculative decode with the inline MTP head (n_draft=2 is the sweet spot here)
chat LemonMLXE/Qwen3.6-35B-A3B-MTP-mlx-8bit --use-mtp --n-draft 2
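For readers unfamiliar with what a speculative decode step does with the drafted tokens, here is a minimal sketch of greedy verification. It is a generic illustration of the technique, not lemon-mlx-engine's implementation; the function name and the token IDs are hypothetical.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Return the tokens to emit for one speculative step.

    draft_tokens:  n tokens proposed by the draft head.
    target_tokens: n + 1 tokens, the target model's greedy choice at
                   each drafted position plus one position beyond.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry")
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: emit the target's token and stop, since
            # later target choices were conditioned on a wrong draft.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)
    # Every draft accepted: the extra target token comes for free.
    emitted.append(target_tokens[-1])
    return emitted


print(verify_greedy([5, 9], [5, 9, 3]))  # both drafts accepted
print(verify_greedy([5, 9], [5, 7, 3]))  # second draft rejected
print(verify_greedy([5, 9], [4, 7, 3]))  # first draft rejected
```

Because the output always equals what plain greedy decoding would have produced, the draft head affects speed but not the generated text.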
Requirements
These are lemon-mlx-engine models (combined trunk+MTP, Qwen3-Next handling baked in) and target AMD ROCm. The 6-bit variant in particular requires lemon-mlx-engine's fixed ROCm quantized-matmul kernel (stock builds without that fix mis-handle 6-bit packing and produce garbage; 4-bit and 8-bit are unaffected).
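To see why 6-bit packing is easier to get wrong than 4-bit or 8-bit, consider how values line up with byte boundaries. The sketch below uses a simple LSB-first bit stream purely for illustration; it is an assumption and not necessarily the layout that MLX or lemon-mlx-engine's kernel uses. Four 6-bit values occupy three bytes, so individual values straddle byte boundaries, whereas 4-bit and 8-bit values never do.

```python
def pack_bits(values, bits):
    """Pack unsigned integers of width `bits` into bytes, LSB first."""
    acc = 0      # bit accumulator
    n_acc = 0    # number of valid bits currently in acc
    out = bytearray()
    for v in values:
        if not 0 <= v < (1 << bits):
            raise ValueError(f"{v} does not fit in {bits} bits")
        acc |= v << n_acc
        n_acc += bits
        while n_acc >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            n_acc -= 8
    if n_acc:
        out.append(acc & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack_bits(data, bits, count):
    """Inverse of pack_bits: read `count` values of width `bits`."""
    acc = 0
    n_acc = 0
    out = []
    stream = iter(data)
    for _ in range(count):
        while n_acc < bits:
            acc |= next(stream) << n_acc
            n_acc += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        n_acc -= bits
    return out


values = [1, 63, 32, 7]
packed = pack_bits(values, 6)
print(len(packed), "bytes for", len(values), "six-bit values")
print("correct unpack:", unpack_bits(packed, 6, 4))
print("wrong width   :", unpack_bits(packed, 4, 4))
```

Reading the same bytes with the wrong width yields unrelated values, which is the kind of failure that shows up as garbage output.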
Provenance
Converted with lemon-mlx-engine's convert tool directly from the official Qwen/Qwen3.6-35B-A3B bf16 checkpoint. The license is inherited from the base model.