Ornith-1.0-35B-MTP-GGUF

Ornith-1.0-35B (qwen35moe, 35B-A3B, Qwen3.5 base) is a strong agentic-coding MoE that ships without MTP heads. This GGUF has an MTP head grafted in so it can use llama.cpp's --spec-type draft-mtp self-speculative decoding for a real speedup, with no quality change to the base weights.

Quantization: Q6_K (body + grafted MTP head).

Performance (M3 Max, measured, real-prompt benchmark)

Generation speed (tg128) on a real code-continuation prompt, sweeping draft depth:

Mode tok/s Speedup MTP acceptance mean accepted len
AR (no MTP) 66.6 1.00×
draft-mtp n_max=1 83.8 1.26× 92.2% 1.92
draft-mtp n_max=2 82.8 1.24× 82.5% 2.65
draft-mtp n_max=3 81.7 1.23× 78.2% 3.35
draft-mtp n_max=4 75.9 1.14× 68.1% 3.72

Best: --spec-draft-n-max 1, ~1.26×. (Acceptance is much higher on real text than on random tokens — benchmark with a real prompt or you'll badly underestimate MTP.)

Usage (llama.cpp)

llama-server -m Ornith-1.0-35B-Q6_K-MTP.gguf -ngl 99 -c 32768 \
  --spec-type draft-mtp --spec-draft-n-max 1 --port 8080

Requires a llama.cpp build with draft-mtp speculative support.

Provenance & licensing

Released under MIT. No weights retrained — this is a head graft + metadata patch.

Downloads last month
206
GGUF
Model size
36B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for wang-yang/Ornith-1.0-35B-MTP-GGUF

Quantized
(76)
this model