leonsarmiento/Ornith-1.0-35B-5bit-mlx

This model was converted to MLX format from deepreinforce-ai/Ornith-1.0-35B using mixed 5/8-bit quantization optimized for Apple Silicon. The vision encoder is preserved and quantized at 5-bit, making this a full multimodal model.

Ornith-1.0-35B is a 35B-parameter MoE (Mixture of Experts) model fine-tuned from Qwen3.5-35B-A3B by DeepReinforce AI, using a self-improving RL training framework that jointly optimizes scaffold and solution rollouts for agentic coding tasks. Despite 35B total parameters, only ~3B are activated per token. It features 256 experts (8 active per token + 1 shared expert), hybrid full + linear (Gated DeltaNet) attention, and a vision encoder.
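The routing described above (8 of 256 routed experts plus one shared expert per token) can be illustrated with a toy sketch. This is not the model's implementation: the scalar "experts", the softmax over only the selected logits, and the ungated shared expert are simplifying assumptions made for illustration, and all function names are hypothetical.

```python
import math
import random

NUM_EXPERTS = 256   # routed experts (from the model card)
TOP_K = 8           # routed experts active per token (from the model card)


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts.

    Weights are a softmax over the selected logits only; whether the real
    model normalises this way is an assumption of this sketch.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Add the weighted routed experts to the always-on shared expert."""
    out = shared_expert(x)
    for idx, weight in route(router_logits):
        out += weight * experts[idx](x)
    return out


# Toy scalar "experts": expert i scales its input by (i + 1) / NUM_EXPERTS.
experts = [lambda x, s=(i + 1) / NUM_EXPERTS: s * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x

rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(router_logits)
y = moe_forward(1.0, router_logits, experts, shared_expert)
print(f"{len(selected)} of {NUM_EXPERTS} routed experts used, output {y:.4f}")
```

Only the 8 selected experts and the shared expert are evaluated for a token, which is why the active parameter count is far below the total.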

Benchmark Highlights

Benchmark | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B
Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5
Terminal-Bench 2.1 (Claude Code) | 62.8 | 38.9 | 49.2
SWE-bench Verified | 75.6 | 70 | 73.4
SWE-bench Pro | 50.4 | 44.6 | 49.5
SWE-bench Multilingual | 69.3 | 60.3 | 67.2
NL2Repo | 34.6 | 20.5 | 29.4
Claw-eval Avg | 69.8 | 65.4 | 68.7

Use with mlx

pip install -U mlx-vlm
python -m mlx_vlm.generate --model leonsarmiento/Ornith-1.0-35B-5bit-mlx --max-tokens 256 --temperature 1.0 --top-p 1.0 --prompt "Hello"
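When the command above needs to be called from a script, one option is to assemble the argument list in Python. The sketch below uses only the flags shown in the command; the helper name `build_generate_command` and the example prompt are hypothetical, and actually running the command requires Apple Silicon, an installed mlx-vlm, and the downloaded weights.

```python
import shlex
import sys

MODEL_ID = "leonsarmiento/Ornith-1.0-35B-5bit-mlx"


def build_generate_command(prompt, max_tokens=256, temperature=1.0, top_p=1.0,
                           model=MODEL_ID):
    """Build the argv list for the mlx_vlm.generate call (hypothetical helper)."""
    return [
        sys.executable, "-m", "mlx_vlm.generate",
        "--model", model,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
        "--top-p", str(top_p),
        "--prompt", prompt,
    ]


prompt = "Write a shell one-liner that counts lines in *.py files"
cmd = build_generate_command(prompt)
print(shlex.join(cmd))  # shell-quoted form of the command

# To actually run it:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell-quoting problems when the prompt contains quotes or wildcards.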

Mixed Quantization Strategy

Bit Depth | Layers | Rationale
8-bit | embed_tokens, lm_head, router gate, shared_expert_gate, shared_expert, self_attn (full attention), linear_attn (DeltaNet) | Every token passes through these layers, so routing accuracy, shared representations, and sequence modeling are kept at higher precision
5-bit | vision_tower, switch_mlp (routed experts) | These hold the bulk of the parameters. Only 8 of 256 experts are active per token, and that natural redundancy tolerates lower precision
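The storage cost of the two bit depths can be estimated with a short calculation. The sketch assumes group-wise affine quantization that stores one 16-bit scale and one 16-bit bias per group of 64 weights; that overhead figure is an assumption about the packed format, not something stated in this card, and the function names are hypothetical.

```python
def effective_bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Bits per weight including the assumed per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


def packed_gib(num_params, bits, group_size=64):
    """Approximate packed size in GiB for num_params weights at a given bit depth."""
    total_bits = num_params * effective_bits_per_weight(bits, group_size)
    return total_bits / 8 / 2**30


low = effective_bits_per_weight(5)   # 5-bit layers (vision_tower, switch_mlp)
high = effective_bits_per_weight(8)  # 8-bit layers (everything else)
print(low, high)  # 5.5 8.5

# Size per billion weights at each depth, under the same assumptions.
print(f"{packed_gib(1e9, 5):.2f} GiB vs {packed_gib(1e9, 8):.2f} GiB per 1B weights")
```

Under these assumptions, each weight moved from 8-bit to 5-bit saves 3 bits, so the saving scales with how much of the model sits in the routed experts.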

Quantization Details

Layer | Bits | Group Size
embed_tokens | 8 | 64
lm_head | 8 | 64
mlp.gate (router) | 8 | 64
shared_expert_gate | 8 | 64
shared_expert | 8 | 64
self_attn (full attention) | 8 | 64
linear_attn (DeltaNet) | 8 | 64
vision_tower | 5 | 64
switch_mlp (routed experts) | 5 | 64
Default fallback | 8 | 64
  • Quantization type: Mixed 5/8-bit (multimodal, vision preserved)
  • Group size: 64
  • Method: Custom quant_predicate via mlx_vlm
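The predicate actually used for this conversion is not shown here. A minimal sketch of one that reproduces the table above might look like the following. The `(path, module, config)` signature and the returned dictionary follow the convention I understand MLX conversion tools to use for quantization predicates, which is an assumption to verify against your installed mlx_vlm version; the example weight paths are hypothetical.

```python
GROUP_SIZE = 64
# Layers the table assigns to 5-bit; everything else falls back to 8-bit.
LOW_PRECISION_MARKERS = ("vision_tower", "switch_mlp")


def quant_predicate(path, module=None, config=None):
    """Return per-layer quantization settings for a dotted weight path.

    `module` and `config` are accepted for signature compatibility but unused.
    """
    bits = 5 if any(marker in path for marker in LOW_PRECISION_MARKERS) else 8
    return {"bits": bits, "group_size": GROUP_SIZE}


# Hypothetical weight paths, for illustration only.
for example in (
    "language_model.model.layers.0.mlp.switch_mlp.gate_proj",
    "language_model.model.layers.0.mlp.gate",
    "language_model.model.layers.0.mlp.shared_expert.up_proj",
    "vision_tower.blocks.0.attn.qkv",
    "language_model.lm_head",
):
    print(example, quant_predicate(example))
```

The routed experts are matched on `switch_mlp` rather than on `mlp.gate`, because `mlp.gate` also appears as a substring of paths such as `mlp.switch_mlp.gate_proj`; checking the 5-bit markers first keeps the router at 8-bit without misclassifying expert weights.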

Recommended Inference Parameters

Parameter | Value
temperature | 1.0
top_p | 1.0
top_k | 40
min_p | 0.01
repeat_penalty | 1.05

Note: The temperature 1.0 and top_p 1.0 settings follow the model's Terminal-Bench 2.1 benchmark recipe. Because this is a Qwen3.5-based model, preserve_thinking is not applicable.
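To show how the recommended parameters interact, here is a toy sketch of one common way to apply them to a logit vector. The order of operations and the exact form of the repetition penalty differ between inference engines, so this is an illustration under stated assumptions rather than the behavior of any particular runtime; the function name and the five-token vocabulary are hypothetical.

```python
import math

PARAMS = {"temperature": 1.0, "top_p": 1.0, "top_k": 40,
          "min_p": 0.01, "repeat_penalty": 1.05}


def filtered_distribution(logits, previous_tokens, temperature, top_p,
                          top_k, min_p, repeat_penalty):
    """Return {token_id: probability} after applying the sampling filters."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for t in set(previous_tokens):
        logits[t] = logits[t] / repeat_penalty if logits[t] > 0 else logits[t] * repeat_penalty
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}
    # top_k: keep the k most probable tokens.
    ranked = sorted(probs, key=probs.get, reverse=True)[:top_k]
    # min_p: drop tokens below min_p times the top probability.
    floor = min_p * probs[ranked[0]]
    ranked = [t for t in ranked if probs[t] >= floor]
    # top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for t in ranked:
        kept.append(t)
        cumulative += probs[t]
        if cumulative >= top_p:
            break
    # Renormalise the surviving tokens so they sum to 1.
    mass = sum(probs[t] for t in kept)
    return {t: probs[t] / mass for t in kept}


logits = [4.0, 3.5, 1.0, -2.0, -6.0]  # toy vocabulary of five tokens
dist = filtered_distribution(logits, previous_tokens=[0], **PARAMS)
print(dist)
```

With top_p at 1.0 the nucleus filter is effectively disabled, so in this sketch top_k and min_p do the pruning: the two lowest-probability tokens fall below the min_p floor and are removed.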
