Libraries

How to use maxlaurence/Ornith-1.0-35B-oQ5 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("maxlaurence/Ornith-1.0-35B-oQ5")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use maxlaurence/Ornith-1.0-35B-oQ5 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "maxlaurence/Ornith-1.0-35B-oQ5"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "maxlaurence/Ornith-1.0-35B-oQ5"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use maxlaurence/Ornith-1.0-35B-oQ5 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "maxlaurence/Ornith-1.0-35B-oQ5"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default maxlaurence/Ornith-1.0-35B-oQ5

Run Hermes

hermes

MLX LM

How to use maxlaurence/Ornith-1.0-35B-oQ5 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "maxlaurence/Ornith-1.0-35B-oQ5"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "maxlaurence/Ornith-1.0-35B-oQ5"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "maxlaurence/Ornith-1.0-35B-oQ5",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

Ornith-1.0-35B-oQ5

This repository contains an unofficial oMLX oQ5 MLX quantization of deepreinforce-ai/Ornith-1.0-35B.

The source model is a Qwen3.5 MoE vision-language model released by DeepReinforce under the MIT license. This quantized artifact is intended for Apple Silicon inference with oMLX / MLX-compatible runtimes.

Quantization Summary

Source model: deepreinforce-ai/Ornith-1.0-35B
Output format: MLX safetensors
Quantizer: oMLX oQ streaming quantization
oQ level: oQ5
Base quantization: 5-bit affine
Group size: 64
Storage dtype for non-quantized floating tensors: BF16
Output size: about 24 GB on disk
Output shards: 6 safetensors shards
Per-layer quantization overrides: 352
Vision weights: preserved
MTP/speculative head: disabled in the output config because the source checkpoint does not contain mtp.* tensors
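
As a rough sanity check on the reported size, the sketch below estimates the storage cost of 5-bit affine quantization with group size 64. It assumes one 16-bit scale and one 16-bit bias per group and a nominal 35e9 parameters taken from the model name. It ignores the per-layer overrides and the tensors kept in BF16, so it is an order-of-magnitude estimate only, not a measurement of this repository.

```python
def affine_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits stored per weight: the payload plus one scale and one bias per group.

    meta_bits=16 is an assumption (scale and bias each stored in a 16-bit dtype).
    """
    return bits + 2 * meta_bits / group_size


def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


bpw = affine_bits_per_weight(bits=5, group_size=64)
size_gb = estimate_size_gb(35e9, bpw)  # 35e9 is a nominal count, not a measured one
print(f"{bpw:.2f} bits/weight, about {size_gb:.1f} GB")
```

Under these assumptions the estimate lands close to the reported figure of about 24 GB.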

The final config.json contains both quantization and quantization_config entries so MLX loaders can apply the mixed-precision layout.
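
A minimal sketch for inspecting those entries in a downloaded copy. It assumes the common MLX convention that top-level `bits`, `group_size`, and `mode` live under `quantization`, and that per-layer overrides are the nested dictionary values in that same block. The helper names are illustrative and not part of oMLX or mlx-lm.

```python
import json
from pathlib import Path


def summarize_quantization(config: dict) -> dict:
    """Summarize the quantization metadata found in a parsed config.json."""
    quant = config.get("quantization") or {}
    # Assumption: per-layer overrides are stored as nested dicts keyed by layer path.
    overrides = {k: v for k, v in quant.items() if isinstance(v, dict)}
    return {
        "has_quantization": "quantization" in config,
        "has_quantization_config": "quantization_config" in config,
        "bits": quant.get("bits"),
        "group_size": quant.get("group_size"),
        "mode": quant.get("mode"),
        "per_layer_overrides": len(overrides),
    }


def summarize_model_dir(model_dir: str) -> dict:
    """Read config.json from a local model directory and summarize it."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return summarize_quantization(config)
```

Calling `summarize_model_dir("/path/to/models/Ornith-1.0-35B-oQ5")` on a local copy should report both entries as present if the layout matches the assumption above.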

Why This Is Not a Plain Stock oQ Run

This checkpoint required two compatibility workarounds during quantization.

First, the source config.json advertises an MTP head:

"mtp_num_hidden_layers": 1

but the downloaded checkpoint does not contain corresponding mtp.* tensors. That causes the stock oMLX auto-proxy sensitivity path to fail before it can measure calibration sensitivity. For the quantized output, MTP was normalized to disabled (0) so the config matches the actual weights.
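
The mismatch and its fix can be sketched as follows. This is an illustration, not the script used for this conversion: it assumes the only stale field is `mtp_num_hidden_layers`, possibly repeated in nested sub-configs, and the helper names are hypothetical.

```python
import copy


def has_mtp_tensors(weight_keys) -> bool:
    """True if any tensor name contains an 'mtp' path segment (e.g. mtp.*)."""
    return any("mtp" in key.split(".") for key in weight_keys)


def disable_mtp(config: dict) -> dict:
    """Return a copy of config with every mtp_num_hidden_layers field set to 0."""
    fixed = copy.deepcopy(config)

    def visit(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "mtp_num_hidden_layers":
                    node[key] = 0  # replacing a value is safe while iterating
                else:
                    visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(fixed)
    return fixed
```

Because `disable_mtp` works on a deep copy, the original config object is left untouched, mirroring the approach of editing only a temporary view of the source model.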

Second, the source checkpoint uses VLM-prefixed Qwen3.5 MoE expert keys. During the first streaming quantization pass, the backbone MoE experts were emitted as individual per-expert tensors such as:

language_model.model.layers.N.mlp.experts.E.gate_proj.weight
language_model.model.layers.N.mlp.experts.E.up_proj.weight
language_model.model.layers.N.mlp.experts.E.down_proj.weight

The current MLX/VLM loader expects those backbone expert weights in fused switch_mlp form:

language_model.model.layers.N.mlp.switch_mlp.gate_proj.weight
language_model.model.layers.N.mlp.switch_mlp.up_proj.weight
language_model.model.layers.N.mlp.switch_mlp.down_proj.weight

After quantization, the per-expert quantized tensors were stacked into the loader-compatible switch_mlp layout. This repack is structural: it stacks the already quantized weight, scales, and biases tensors. It does not dequantize, requantize, or otherwise change the quantized numeric values.
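
The structural repack can be illustrated with a short sketch. It is not the script used for this conversion: it uses NumPy arrays as stand-ins for the quantized tensors, assumes experts are numbered contiguously from 0, and assumes the fused layout stacks experts along a new leading axis. With MLX arrays, `mx.stack` would play the same role as `np.stack`.

```python
import re
from collections import defaultdict

import numpy as np

# Matches e.g. "<prefix>.mlp.experts.7.gate_proj.weight" (also .scales / .biases).
EXPERT_KEY = re.compile(
    r"^(?P<prefix>.+\.mlp)\.experts\.(?P<expert>\d+)\.(?P<rest>.+)$"
)


def repack_experts(tensors: dict) -> dict:
    """Stack per-expert tensors into fused switch_mlp tensors without altering values."""
    grouped = defaultdict(dict)
    out = {}
    for key, value in tensors.items():
        match = EXPERT_KEY.match(key)
        if match is None:
            out[key] = value  # non-expert tensors pass through unchanged
            continue
        fused_key = f"{match['prefix']}.switch_mlp.{match['rest']}"
        grouped[fused_key][int(match["expert"])] = value

    for fused_key, by_expert in grouped.items():
        expert_ids = sorted(by_expert)
        if expert_ids != list(range(len(expert_ids))):
            raise ValueError(f"missing expert indices for {fused_key}")
        # Stacking only adds a leading expert axis; dtype and values are preserved.
        out[fused_key] = np.stack([by_expert[i] for i in expert_ids], axis=0)
    return out
```

Since the same stacking applies to any suffix after the expert index, the `weight`, `scales`, and `biases` tensors of each projection are repacked consistently.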

Sensitivity Allocation

oQ normally measures layer sensitivity by running calibration inference and then allocating higher precision to the layers where quantization error matters most. For this checkpoint, the automatic proxy calibration path was not usable for the reasons above. Instead, quantization used an explicit 40-layer positional sensitivity map:

first and last 12.5% of layers: highest sensitivity
next outer quartiles: moderate sensitivity
middle layers: lower sensitivity

This is therefore still a mixed-precision oQ5 artifact, but not a fully data-calibrated one. Quality should be reasonable for an oQ5 quantization, though the bit allocation is less tailored than a successful calibration-based oQ run would produce.
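
One possible reading of the positional scheme is sketched below. The interpretation of "next outer quartiles" as the layers between the outer 12.5% and the outer 25% on each side is an assumption, as are the tier labels. The actual schema and values of oq_sensitivity_map.json are not documented here and may differ.

```python
def positional_tiers(num_layers: int) -> dict[int, str]:
    """Assign each layer index a sensitivity tier based only on its position."""
    tiers = {}
    for index in range(num_layers):
        # Distance to the nearest end of the stack, as a fraction of its depth.
        edge = min(index, num_layers - 1 - index) / num_layers
        if edge < 0.125:
            tiers[index] = "high"      # first and last 12.5% of layers
        elif edge < 0.25:
            tiers[index] = "moderate"  # assumed: rest of the outer quartiles
        else:
            tiers[index] = "low"       # middle layers
    return tiers


tiers = positional_tiers(40)
print([i for i, tier in tiers.items() if tier == "high"])
```

For 40 layers this reading puts layers 0 to 4 and 35 to 39 in the highest tier, layers 5 to 9 and 30 to 34 in the moderate tier, and the remaining 20 layers in the lowest tier.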

Conversion Procedure

The final artifact was produced locally from the original BF16 safetensors checkpoint using the oMLX app bundle's Python environment.

High-level steps:

Created a temporary symlinked source view of the original model.
Set stale MTP config fields to 0 in that temporary view only.
Added an explicit oq_sensitivity_map.json to bypass the incompatible auto-proxy sensitivity path.
Ran oMLX quantize_oq_streaming with:
- oq_level=5
- group_size=64
- dtype="bfloat16"
- text_only=False
- preserve_mtp=False
- auto_proxy_sensitivity=False
Repacked quantized MoE expert tensors from per-expert keys into switch_mlp keys expected by the current MLX/VLM loader.
Validated the final directory with a lazy MLX/VLM load.

The source model directory was not modified.

Final Validation

The final model directory passed these checks after repacking:

Output size:              24G
Safetensors shards:       6
Quantization bits:        5
Quantization group size:  64
Quantization mode:        affine
Per-layer overrides:      352
Mapped tensors:           2010
switch_mlp tensors:       360
per-expert MoE keys:      0
MTP layer count:          0
Lazy VLM load:            passed

Lazy load validation used the oMLX app bundle with mlx_vlm.utils.load_model and mlx_lm.tokenizer_utils.load. This validates the config/weight-key layout and catches loader mismatches. It is not the same as a full generation benchmark.
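
The key-layout counts in the table can be approximated without loading any weights. The sketch below assumes the repository ships a standard `model.safetensors.index.json` with a `weight_map` entry. The function names are illustrative and this is not the validation script used for the conversion.

```python
import json
import re
from pathlib import Path

PER_EXPERT = re.compile(r"\.mlp\.experts\.\d+\.")


def count_layout(keys) -> dict:
    """Count fused switch_mlp keys and leftover per-expert keys in a key list."""
    keys = list(keys)
    return {
        "mapped_tensors": len(keys),
        "switch_mlp": sum(".mlp.switch_mlp." in key for key in keys),
        "per_expert": sum(bool(PER_EXPERT.search(key)) for key in keys),
    }


def count_layout_from_index(model_dir: str) -> dict:
    """Read the sharded-safetensors index and count its tensor names."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    return count_layout(index["weight_map"].keys())
```

A count of 360 switch_mlp tensors would be consistent with 40 layers, 3 projections per layer, and 3 stored tensors (weight, scales, biases) per projection, although that breakdown is an inference rather than something stated by the validation output.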

Usage

This model is meant to be used with oMLX or another runtime that supports MLX safetensors quantization metadata. After downloading or cloning the repository into an oMLX model directory, load it as a normal local MLX model.

For oMLX CLI usage with a local model directory:

omlx serve --model-dir /path/to/models

Then select or request the model by its directory/repository name, depending on your oMLX setup.

The original Ornith model card notes that Ornith is a reasoning model and may produce a <think>...</think> block before the final answer. Preserve the original chat template included with this repository when serving the model.
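
If a client only needs the final answer, the reasoning block can be separated after generation. The sketch below is a generic post-processing helper, not part of oMLX or mlx-lm. It assumes at most one reasoning block per response and also handles the case where the opening `<think>` tag was emitted as part of the prompt rather than the output.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) around the </think> tag."""
    head, closing_tag, tail = text.partition("</think>")
    if not closing_tag:
        # No reasoning block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>2 + 2 is 4.</think>\nThe answer is 4.")
print(answer)
```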

Known Limitations

This is an unofficial community quantization, not an official DeepReinforce release.
The oQ sensitivity map was heuristic/positional rather than measured by calibration inference, because the stock oMLX auto-proxy path could not load this checkpoint cleanly.
Native MTP/speculative decoding is disabled because the source checkpoint did not contain MTP weights despite advertising an MTP layer in config.
Lazy load validation passed, but no benchmark suite was run as part of this conversion.
Text and vision weights are both present, but image-input workflows (for example image captioning or visual question answering) should be smoke-tested in the target runtime before relying on this upload for production VLM use.

Relationship to the Original Model

Please cite and refer to the original model for architecture details, intended use, benchmark results, license, and upstream caveats:

Original model: https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B
Organization: DeepReinforce
License: MIT

This repository only changes the storage/quantization format for MLX inference. It does not introduce additional training or fine-tuning.
