Instructions to use wang-yang/Ornith-1.0-35B-MTPLX with libraries, inference providers, notebooks, and local apps.
- Libraries
- MLX
How to use wang-yang/Ornith-1.0-35B-MTPLX with MLX:
    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("wang-yang/Ornith-1.0-35B-MTPLX")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use wang-yang/Ornith-1.0-35B-MTPLX with Pi:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "wang-yang/Ornith-1.0-35B-MTPLX"
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "wang-yang/Ornith-1.0-35B-MTPLX" }
          ]
        }
      }
    }

Run Pi

    # Start Pi in your project directory:
    pi
- Hermes Agent
How to use wang-yang/Ornith-1.0-35B-MTPLX with Hermes Agent:
Start the MLX server
    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "wang-yang/Ornith-1.0-35B-MTPLX"
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default wang-yang/Ornith-1.0-35B-MTPLX
Run Hermes
hermes
- MLX LM
How to use wang-yang/Ornith-1.0-35B-MTPLX with MLX LM:
Generate or start a chat session
    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "wang-yang/Ornith-1.0-35B-MTPLX"
Run an OpenAI-compatible server
    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 by default)
    mlx_lm.server --model "wang-yang/Ornith-1.0-35B-MTPLX"

    # Call the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "wang-yang/Ornith-1.0-35B-MTPLX",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'
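The same chat request can also be issued from Python using only the standard library. This is a minimal sketch, not part of the model card: it assumes the server listens on port 8080 (the port used in the Pi and Hermes configurations above) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumption: the server was started with `mlx_lm.server --model ...` and
# listens on port 8080, as in the Pi and Hermes configurations above.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "wang-yang/Ornith-1.0-35B-MTPLX"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(user_text)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed response shape: standard OpenAI chat-completions JSON.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Hello")` should return the reply as a string. Building the request separately from sending it makes the payload easy to inspect without a live server.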
Ornith-1.0-35B-MTPLX
A multi-token-prediction (MTP) graft onto deepreinforce-ai/Ornith-1.0-35B, packaged for MTPLX native speculative decoding on Apple Silicon.
Ornith-1.0-35B (qwen35moe, 35B-A3B, Qwen3.5 base) is a strong agentic-coding MoE but ships without MTP heads. This build grafts the official 1-layer MoE-MTP from Qwen/Qwen3.5-35B-A3B (the upstream base of Ornith — dimensions match exactly) and quantizes the MTP to 4-bit, which is what makes the speedup practical (see below).
Performance (M3 Max, measured)
| Mode | tok/s | Speedup | MTP acceptance |
|---|---|---|---|
| AR (no MTP) | 76.3 | 1.00× | — |
| MTP depth 1 | 103.6 | 1.36× | 89.6% |
| MTP depth 2 | 114.9 | 1.50× | 93.1% / 78.2% |
| MTP depth 3 | 116.4 | 1.53× | 91.5% / 80.3% / 65.6% |
verdict: mtp_depth_wins · MTPLX inspect tier: verified
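One way to read the table is to convert the acceptance figures into expected tokens per verification step and compare that with the measured throughput gain. The sketch below does this under two assumptions that the card does not state: each listed percentage is the conditional acceptance rate of that draft position given that all earlier drafts were accepted, and every step emits one verifier token in addition to the accepted drafts.

```python
def expected_tokens_per_step(acceptance):
    """Expected tokens emitted per verification step.

    Assumes acceptance[i] is the conditional probability that draft token
    i+1 is accepted given that all earlier drafts were accepted, and that
    each step always emits one token from the verifier itself.
    """
    total, survive = 1.0, 1.0
    for rate in acceptance:
        survive *= rate      # probability the draft chain is still alive
        total += survive
    return total


AR_TOK_S = 76.3  # autoregressive baseline from the table

# depth -> (tok/s, per-position acceptance), copied from the table
runs = {
    1: (103.6, [0.896]),
    2: (114.9, [0.931, 0.782]),
    3: (116.4, [0.915, 0.803, 0.656]),
}

for depth, (tok_s, acc) in runs.items():
    tokens = expected_tokens_per_step(acc)
    # If a step yields `tokens` tokens but throughput rises only by
    # tok_s / AR_TOK_S, one MTP step costs about this many AR steps.
    rel_step_cost = tokens / (tok_s / AR_TOK_S)
    print(f"depth {depth}: ~{tokens:.2f} tokens/step, "
          f"step cost ~{rel_step_cost:.2f}x an AR step")
```

Under these assumptions a depth-3 step would yield a little over three tokens while costing roughly twice an autoregressive step, which would be consistent with the gain flattening between depth 2 and depth 3.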
Why 4-bit MTP (not bf16)
A bf16 graft of this same MTP layer was a net ~20× slowdown despite 92% acceptance. The MTP is a full
256-expert MoE layer, and at bf16 its draft forward pass costs 67–143 ms/token (it does not hit the fast MoE kernel).
Quantizing the MTP experts to 4-bit affine drops the draft cost to ~2.5 ms/token (27–51× lower) with negligible
acceptance loss, flipping the result to a real 1.53× speedup. The bottleneck was draft cost, not acceptance.
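The direction of this effect can be illustrated with a deliberately simple cost model. The sketch below is not the MTPLX implementation, and its assumptions are mine: verifying the drafts costs one autoregressive step, each draft token costs `draft_ms`, and every draft position has the same conditional acceptance probability. It ignores verification overhead, so it will not reproduce the measured figures (the ~20× slowdown or the 1.53× speedup), only which side of 1× each configuration falls on.

```python
def modeled_speedup(ar_ms, draft_ms, depth, accept):
    """Idealized speedup of speculative decoding over plain AR decoding.

    Simplifying assumptions (not from the model card): verification costs
    one AR step, each draft token costs `draft_ms`, and each draft position
    is accepted with the same conditional probability `accept`.
    """
    tokens_per_step = sum(accept ** i for i in range(depth + 1))
    step_ms = ar_ms + depth * draft_ms
    return tokens_per_step * ar_ms / step_ms


AR_MS = 1000 / 76.3  # ~13.1 ms per token at the measured AR throughput

cases = [("bf16, low", 67.0), ("bf16, high", 143.0), ("4-bit", 2.5)]
for label, draft_ms in cases:
    speedup = modeled_speedup(AR_MS, draft_ms, depth=1, accept=0.92)
    print(f"{label:>10}: draft {draft_ms:6.1f} ms/token -> {speedup:.2f}x")
```

Even with acceptance held at 92%, the model falls below 1× whenever the draft cost is several times the autoregressive step, and rises above 1× once the draft cost is a small fraction of it.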
Quantization
- Body: 4-bit affine, group size 64
- MTP sidecar: 4-bit affine, group size 64 (applied at load via `mtplx_mtp_quantization` in `config.json`)
- Architecture: `Qwen3_5MoeForConditionalGeneration` / MTPLX arch_id `qwen3-next-mtp`
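For readers unfamiliar with the term, "4-bit affine, group size 64" can be sketched in plain Python as follows. This illustrates the general technique only; the actual MLX and MTPLX kernels pack codes and store parameters differently. The function names, the sample weights, and the 16-bit scale and offset used in the storage estimate are all assumptions.

```python
GROUP_SIZE = 64
LEVELS = 2 ** 4 - 1  # 4-bit codes span 0..15


def quantize_group(values):
    """Affine-quantize one group to 4-bit codes plus a scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate values: w ~ code * scale + offset."""
    return [c * scale + offset for c in codes]


# Deterministic stand-in for one group of weights (illustrative values only).
weights = [((i * 37) % GROUP_SIZE) / GROUP_SIZE - 0.5 for i in range(GROUP_SIZE)]
codes, scale, offset = quantize_group(weights)
restored = dequantize_group(codes, scale, offset)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Storage estimate, assuming a 16-bit scale and a 16-bit offset per group.
bits_per_weight = 4 + (16 + 16) / GROUP_SIZE
print(f"max abs error {max_err:.4f} (scale {scale:.4f}), "
      f"{bits_per_weight} bits/weight")
```

Each group of 64 weights shares one scale and one offset, so the rounding error per weight is bounded by half the group's scale, and the per-group parameters add a small overhead on top of the 4-bit codes.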
Usage (MTPLX)
mtplx tune --model <path-to-this-model> # confirms best MTP depth (D3, ~1.53×)
mtplx start --model <path-to-this-model> # serve with MTP speculative decoding
Provenance & licensing
- Base model: deepreinforce-ai/Ornith-1.0-35B — MIT
- MTP source: Qwen/Qwen3.5-35B-A3B — Apache-2.0
This derivative is released under MIT, preserving the base model's license. The grafted MTP tensors originate from the Apache-2.0 licensed Qwen3.5-35B-A3B; that license and its NOTICE apply to those tensors. No weights were retrained — this is a graft + quantization repackaging.
Graft notes (reproducibility)
- MTP `mtp.*` tensors lifted from Qwen3.5-35B-A3B (785 tensors: `mtp.fc` + `mtp.layers.0.*` with 256 experts).
- Qwen3.5 RMSNorm uses delta encoding → MTP sidecar carries `mtplx_mtp_norm_encoding="delta"`.
- Forged with MTPLX `forge` (`mtp_policy=requantize`, `--allow-degraded-mtp`); contract calibrated `exact_agreement`.
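The norm-encoding note matters because a delta-encoded weight loaded as a plain gain silently changes the output. The sketch below assumes "delta encoding" means the stored tensor is the gain minus one and is applied as `1 + weight`, as in zero-centered RMSNorm variants. The card does not define the term, so treat this as an interpretation; the function names are hypothetical.

```python
import math


def rms_normalize(x, eps=1e-6):
    """Scale x to (approximately) unit root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def rmsnorm_plain(x, weight, eps=1e-6):
    """Conventional RMSNorm: the stored weight is the gain itself."""
    return [n * w for n, w in zip(rms_normalize(x, eps), weight)]


def rmsnorm_delta(x, delta, eps=1e-6):
    """Delta-encoded RMSNorm (assumed): the stored tensor is gain - 1."""
    return [n * (1.0 + d) for n, d in zip(rms_normalize(x, eps), delta)]


x = [0.5, -1.0, 2.0, 0.25]
delta = [0.1, -0.2, 0.0, 0.05]       # stored, delta-encoded values
gain = [1.0 + d for d in delta]      # equivalent conventional weights

correct = rmsnorm_delta(x, delta)    # matches rmsnorm_plain(x, gain)
# Treating the delta tensor as a plain gain gives a different result.
wrong = rmsnorm_plain(x, delta)
```

Recording the encoding alongside the grafted tensors lets a loader pick the matching formula instead of guessing from the weight values.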