leonsarmiento/Ornith-1.0-35B-5bit-mlx

This model was converted to MLX format from deepreinforce-ai/Ornith-1.0-35B using mixed 5/8-bit quantization optimized for Apple Silicon. The vision encoder is preserved and quantized at 5-bit, making this a full multimodal model.

Ornith-1.0-35B is a 35B-parameter MoE (Mixture of Experts) model fine-tuned from Qwen3.5-35B-A3B by DeepReinforce AI, using a self-improving RL training framework that jointly optimizes scaffold and solution rollouts for agentic coding tasks. Despite 35B total parameters, only ~3B are activated per token. It features 256 experts (8 active per token + 1 shared expert), hybrid full + linear (Gated DeltaNet) attention, and a vision encoder.
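
The routing described above (8 of 256 routed experts per token, plus one shared expert that always runs) can be sketched in plain Python. This is a minimal illustration, not the model's actual implementation: `route_token` and `moe_forward` are hypothetical names, inputs are scalars instead of tensors, the softmax over the selected logits is an assumed normalization, and the gate on the shared expert is left out.

```python
import math

NUM_EXPERTS = 256  # routed experts (from the model description)
TOP_K = 8          # routed experts activated per token


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This normalization is an assumption made for illustration.
    """
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Add the weighted routed-expert outputs to the always-on shared expert."""
    out = shared_expert(x)
    for idx, weight in route_token(router_logits):
        out += weight * experts[idx](x)
    return out
```

Only the 8 selected experts are evaluated for a given token, which is why the active parameter count stays far below the total.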

Benchmark Highlights

Benchmark	Ornith-1.0-35B	Qwen3.5-35B	Qwen3.6-35B
Terminal-Bench 2.1 (Terminus-2)	64.2	41.4	52.5
Terminal-Bench 2.1 (Claude Code)	62.8	38.9	49.2
SWE-bench Verified	75.6	70.0	73.4
SWE-bench Pro	50.4	44.6	49.5
SWE-bench Multilingual	69.3	60.3	67.2
NL2Repo	34.6	20.5	29.4
Claw-eval Avg	69.8	65.4	68.7

Use with mlx

pip install -U mlx-vlm

python -m mlx_vlm.generate --model leonsarmiento/Ornith-1.0-35B-5bit-mlx --max-tokens 256 --temperature 1.0 --top-p 1.0 --prompt "Hello"

Mixed Quantization Strategy

Bit Depth	Layers	Rationale
8-bit	`embed_tokens`, `lm_head`, router `gate`, `shared_expert_gate`, `shared_expert`, `self_attn` (full attention), `linear_attn` (DeltaNet)	Every token passes through these layers, so quantization error here degrades routing accuracy, the shared representation, and sequence modeling for all tokens
5-bit	`vision_tower`, `switch_mlp` (routed experts)	These hold the bulk of the parameters; only 8 of 256 routed experts are active per token, and this redundancy tolerates lower precision
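
The strategy in the table above could be expressed as a small predicate that maps a layer path to quantization settings. The sketch below is an assumption about how such a rule might look, not the code used for this conversion: `choose_quantization` is a hypothetical helper, the example layer paths are invented for illustration, and the predicate actually passed to the converter may take additional arguments.

```python
FIVE_BIT_MARKERS = ("vision_tower", "switch_mlp")  # 5-bit layers per the table
GROUP_SIZE = 64


def choose_quantization(path):
    """Return quantization settings for a layer, given its dotted path.

    Hypothetical helper: layers whose path contains a 5-bit marker get
    5 bits, everything else falls back to 8 bits.
    """
    bits = 5 if any(marker in path for marker in FIVE_BIT_MARKERS) else 8
    return {"bits": bits, "group_size": GROUP_SIZE}


# Illustrative (made-up) layer paths.
for name in (
    "language_model.model.layers.0.mlp.switch_mlp.gate_proj",
    "language_model.model.layers.0.mlp.shared_expert.gate_proj",
    "language_model.model.layers.0.mlp.gate",
    "vision_tower.blocks.0.attn.qkv",
):
    print(name, choose_quantization(name))
```

Checking for the 5-bit markers first keeps the rule short, because every other listed layer and the default fallback share the same 8-bit setting.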

Quantization Details

Layer	Bits	Group Size
`embed_tokens`	8	64
`lm_head`	8	64
`mlp.gate` (router)	8	64
`shared_expert_gate`	8	64
`shared_expert`	8	64
`self_attn` (full attention)	8	64
`linear_attn` (DeltaNet)	8	64
`vision_tower`	5	64
`switch_mlp` (routed experts)	5	64
Default fallback	8	64

Quantization type: Mixed 5/8-bit (multimodal, vision preserved)
Group size: 64
Method: Custom quant_predicate via mlx_vlm
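
Group size also affects the storage cost, because each group carries its own quantization metadata. The sketch below estimates the effective bits per weight, assuming each group of 64 weights stores one 16-bit scale and one 16-bit bias (32 bits of metadata per group). That layout is an assumption here, and both function names are hypothetical.

```python
def effective_bits_per_weight(bits, group_size=64, metadata_bits=32):
    """Nominal bits plus per-group metadata spread across the group."""
    return bits + metadata_bits / group_size


def quantized_size_bytes(num_weights, bits, group_size=64):
    """Approximate storage for num_weights quantized weights, in bytes."""
    return num_weights * effective_bits_per_weight(bits, group_size) / 8


print(effective_bits_per_weight(5))  # 5-bit layers
print(effective_bits_per_weight(8))  # 8-bit layers
```

Under that assumption the 5-bit layers cost about 5.5 bits per weight and the 8-bit layers about 8.5.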

Recommended Inference Parameters

Parameter	Value
`temperature`	1.0
`top_p`	1.0
`top_k`	40
`min_p`	0.01
`repeat_penalty`	1.05

Note: `temperature` 1.0 and `top_p` 1.0 follow the model's Terminal-Bench 2.1 benchmark recipe. Because this is a Qwen3.5-based model, `preserve_thinking` is not applicable.
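
To show how these sampling parameters interact, here is a minimal pure-Python sketch of temperature scaling followed by top-k, top-p, and min-p filtering. It is illustrative only: `filter_probs` is a hypothetical function, the order in which the filters are applied is an assumption (inference engines differ), and `repeat_penalty` is not modeled.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=40, top_p=1.0, min_p=0.01):
    """Return renormalized probabilities of the tokens that survive filtering."""
    # Temperature-scaled softmax (numerically stable).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep at most the k most likely tokens.
    candidates = order[:top_k]

    # top_p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in candidates:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # min_p: drop tokens below min_p times the most likely token's probability.
    threshold = min_p * probs[order[0]]
    kept = [i for i in kept if probs[i] >= threshold]

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

With `top_p` at 1.0 the nucleus filter removes nothing, so in this sketch `top_k` and `min_p` do all of the pruning.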
