Instructions to use leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx") config = load_config("leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
- Pi
How to use leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx
Run Hermes
hermes
- OpenClaw new
How to use leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx with OpenClaw:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx"
Configure OpenClaw
# Install OpenClaw: npm install -g openclaw@latest # Register the local server and set it as the default model: openclaw onboard --non-interactive --mode local \ --auth-choice custom-api-key \ --custom-base-url http://127.0.0.1:8080/v1 \ --custom-model-id "leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx" \ --custom-provider-id mlx-lm \ --custom-compatibility openai \ --custom-text-input \ --accept-risk \ --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx
This model was converted to MLX format from deepreinforce-ai/Ornith-1.0-35B using BaseQuant_XL 5/8-bit mixed quantization optimized for Apple Silicon. The vision encoder is preserved and quantized at 5-bit, making this a full multimodal model.
BaseQuant_XL keeps the most routing-critical layers in full bf16 precision — the MoE router gate, shared expert gate, shared expert, and lm_head — while applying aggressive quantization to the bulk parameters. This preserves routing accuracy and output quality where it matters most.
Ornith-1.0-35B is a 35B-parameter MoE (Mixture of Experts) model fine-tuned from Qwen3.5-35B-A3B by DeepReinforce AI, using a self-improving RL training framework that jointly optimizes scaffold and solution rollouts for agentic coding tasks. Despite 35B total parameters, only ~3B are activated per token. It features 256 experts (8 active per token + 1 shared expert), hybrid full + linear (Gated DeltaNet) attention, and a vision encoder.
Benchmark Highlights
| Benchmark | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B |
|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 |
| Terminal-Bench 2.1 (Claude Code) | 62.8 | 38.9 | 49.2 |
| SWE-bench Verified | 75.6 | 70 | 73.4 |
| SWE-bench Pro | 50.4 | 44.6 | 49.5 |
| SWE-bench Multilingual | 69.3 | 60.3 | 67.2 |
| NL2Repo | 34.6 | 20.5 | 29.4 |
| Claw-eval Avg | 69.8 | 65.4 | 68.7 |
Use with mlx
pip install -U mlx-vlm
python -m mlx_vlm.generate --model leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx --max-tokens 256 --temperature 1.0 --top-p 1.0 --prompt "Hello"
BaseQuant_XL Quantization Strategy
| Bit Depth | Layers | Rationale |
|---|---|---|
| bf16 (unquantized) | mlp.gate (router), shared_expert_gate, lm_head, shared_expert |
Routing decisions and shared computation path — errors here are qualitatively different from precision loss |
| 8-bit | embed_tokens, self_attn (full attention), linear_attn (DeltaNet) |
Every-token layers with moderate sensitivity — 8-bit is near-lossless |
| 5-bit | vision_tower, switch_mlp (routed experts) |
Bulk of parameters, only 8 of 256 experts active per token — natural redundancy tolerates lower precision |
Quantization Details
| Layer | Bits | Group Size |
|---|---|---|
mlp.gate (router) |
bf16 | — |
shared_expert_gate |
bf16 | — |
lm_head |
bf16 | — |
shared_expert |
bf16 | — |
embed_tokens |
8 | 64 |
self_attn (full attention) |
8 | 64 |
linear_attn (DeltaNet) |
8 | 64 |
vision_tower |
5 | 64 |
switch_mlp (routed experts) |
5 | 64 |
| Default fallback | 8 | 64 |
- Quantization type: BaseQuant_XL mixed (multimodal, vision preserved)
- Bits per weight: 5.881
- Group size: 64
- Method: Custom
quant_predicateviamlx_vlm
Recommended Inference Parameters
| Parameter | Value |
|---|---|
temperature |
1.0 |
top_p |
1.0 |
top_k |
40 |
min_p |
0.01 |
repeat_penalty |
1.05 |
Note: Ornith-1.0-35B uses Temp 1.0 and Top_p 1.0 per the model's Terminal-Bench 2.1 benchmark recipe. This is a Qwen3.5-based model —
preserve_thinkingis not applicable.
- Downloads last month
- 574
5-bit
Model tree for leonsarmiento/Ornith-1.0-35B-5bit-XL-mlx
Base model
deepreinforce-ai/Ornith-1.0-35B