Instructions to use OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP") config = load_config("OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi new
How to use OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP
Run Hermes
hermes

Qwen3.6-35B-A3B-MXFP4-MTP
Qwen3.6-35B-A3B quantized to native MXFP4 for Apple Silicon, with the vision tower and the native Multi-Token-Prediction head preserved and enabled.
| Source | Qwen/Qwen3.6-35B-A3B |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP4 (mx.quantize, affine, group_size=32) |
| Architecture | qwen3_5_moe — 40 layers, 256 routed experts, top-8, ~3B active |
| Modality | image + video + text |
| Context | 262,144 |
| Bundle size | 21.53 GB |
| MTP | native head preserved, enabled (num_nextn_predict_layers=1) |
Quantization
4-bit affine linears via MLX-native mx.quantize (mode="mxfp4",
group_size=32). Norms, router gates, expert biases and the full vision
tower are kept in fp16 passthrough (643 passthrough tensors). MTP linears
are quantized to MXFP4; MTP norm/control tensors stay fp16. This is the
smallest bundle in the MoE line — the same model as the MXFP8 variant at
roughly 60% of the size.
Multi-Token Prediction
This bundle keeps Qwen3.6's native MTP module and runs it as a self-speculative draft head: the MTP head proposes tokens that the main model verifies in a single pass, so decoded output stays bit-identical to plain autoregressive decoding — only faster.
Recorded on an M5 Max (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):
| Draft depth | tok/s | Speedup |
|---|---|---|
| Baseline (MTP off) | 83.9 | 1.00× |
| D1 | 108.8 | 1.30× |
| D2 | 126.0 | 1.50× |
| D3 (default) | 131.2 | 1.56× |
Absolute tok/s depends on free memory and system load. The speedup ratio — baseline vs. MTP measured back-to-back under identical conditions — is the stable figure.
Vision, MTP and caching together
This bundle preserves the full Qwen3.6 VL tower alongside the native MTP head, so image/video input, MTP speculative decode and prefix/KV caching all run in the same session — a combination not every MTP-enabled Qwen build exposes. The VL stack is the same one verified on the MXFP8 sibling.
Loading
Loads via stock MLX tooling on Apple Silicon — the mxfp4 weights are
native mx.quantize affine, no JANG runtime required for the core model.
from mlx_vlm import load, generate
model, processor = load("OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP")
The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.
Variants
| Variant | Arch | Format | Size | Best MTP speedup |
|---|---|---|---|---|
| Qwen3.6-27B-MXFP4-MTP | dense | mxfp4 | 14.4 GB | 1.85× (D2) |
| Qwen3.6-27B-MXFP8-MTP | dense | mxfp8 | 27.1 GB | 1.83× (D3) |
| Qwen3.6-35B-A3B-MXFP4-MTP (this) | MoE | mxfp4 | 21.5 GB | 1.56× (D3) |
| Qwen3.6-35B-A3B-MXFP8-MTP | MoE | mxfp8 | 35.0 GB | 1.71× (D3) |
Credits
- Quantization toolchain: JANG by Jinho Jang <eric@osaurus.ai>
- Base model: Qwen3.6-35B-A3B by Qwen
- Downloads last month
- 309
Quantized
Model tree for OsaurusAI/Qwen3.6-35B-A3B-MXFP4-MTP
Base model
Qwen/Qwen3.6-35B-A3B