Ornith-1.0-35B-oQ5
This repository contains an unofficial oMLX oQ5 MLX quantization of
deepreinforce-ai/Ornith-1.0-35B.
The source model is a Qwen3.5 MoE vision-language model released by DeepReinforce under the MIT license. This quantized artifact is intended for Apple Silicon inference with oMLX / MLX-compatible runtimes.
Quantization Summary
- Source model: deepreinforce-ai/Ornith-1.0-35B
- Output format: MLX safetensors
- Quantizer: oMLX oQ streaming quantization
- oQ level: oQ5
- Base quantization: 5-bit affine
- Group size: 64
- Storage dtype for non-quantized floating tensors: BF16
- Output size: about 24 GB on disk
- Output shards: 6 safetensors shards
- Per-layer quantization overrides: 352
- Vision weights: preserved
- MTP/speculative head: disabled in the output config because the source checkpoint does not contain mtp.* tensors
The final config.json contains both quantization and
quantization_config entries so MLX loaders can apply the mixed-precision
layout.
Why This Is Not a Plain Stock oQ Run
This checkpoint required two compatibility workarounds during quantization.
First, the source config.json advertises an MTP head:
"mtp_num_hidden_layers": 1
but the downloaded checkpoint does not contain corresponding mtp.* tensors.
That causes the stock oMLX auto-proxy sensitivity path to fail before it can
measure calibration sensitivity. For the quantized output, MTP was normalized to
disabled (0) so the config matches the actual weights.
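The mismatch described above can be detected before quantization. The following is a minimal sketch, not the check oMLX itself performs. It assumes the checkpoint ships a standard `model.safetensors.index.json` with a `weight_map` table, and that `mtp_num_hidden_layers` sits either at the top level of config.json or under a `text_config` block. The function name `find_mtp_mismatch` is hypothetical.

```python
import json
from pathlib import Path


def find_mtp_mismatch(model_dir):
    """Return True when config.json advertises MTP layers but no mtp.* weights exist."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())

    # Assumption: the field may sit at the top level or under "text_config".
    advertised = config.get("mtp_num_hidden_layers")
    if advertised is None:
        advertised = config.get("text_config", {}).get("mtp_num_hidden_layers", 0)

    # Assumption: a sharded-checkpoint index with a "weight_map" table is present.
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    has_mtp_weights = any(
        key.startswith("mtp.") or ".mtp." in key for key in index["weight_map"]
    )
    return advertised > 0 and not has_mtp_weights
```

A result of True would indicate the situation this conversion ran into: the config promises an MTP head that the weights do not provide.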
Second, the source checkpoint uses VLM-prefixed Qwen3.5 MoE expert keys. During the first streaming quantization pass, the backbone MoE experts were emitted as individual per-expert tensors such as:
language_model.model.layers.N.mlp.experts.E.gate_proj.weight
language_model.model.layers.N.mlp.experts.E.up_proj.weight
language_model.model.layers.N.mlp.experts.E.down_proj.weight
The current MLX/VLM loader expects those backbone expert weights in fused
switch_mlp form:
language_model.model.layers.N.mlp.switch_mlp.gate_proj.weight
language_model.model.layers.N.mlp.switch_mlp.up_proj.weight
language_model.model.layers.N.mlp.switch_mlp.down_proj.weight
After quantization, the per-expert quantized tensors were stacked into the
loader-compatible switch_mlp layout. This repack is structural: it stacks the
already quantized weight, scales, and biases tensors. It does not
dequantize, requantize, or otherwise change the quantized numeric values.
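The key-level part of that repack can be sketched as follows. This is an illustration under assumptions, not the script used for this upload: it only plans which per-expert keys feed each fused switch_mlp key, and the name `plan_switch_mlp_repack` is hypothetical. The key patterns are the ones quoted above, extended to the scales and biases suffixes.

```python
import re
from collections import defaultdict

# Matches the per-expert key layout quoted above; the suffix may be
# weight, scales, or biases.
EXPERT_KEY = re.compile(
    r"^(?P<prefix>.*\.mlp)\.experts\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\.(?P<suffix>weight|scales|biases)$"
)


def plan_switch_mlp_repack(keys):
    """Map each fused switch_mlp key to its per-expert source keys, in expert order."""
    groups = defaultdict(list)
    for key in keys:
        match = EXPERT_KEY.match(key)
        if match is None:
            continue  # not a per-expert MoE tensor; leave it untouched
        fused = "{prefix}.switch_mlp.{proj}.{suffix}".format(**match.groupdict())
        groups[fused].append((int(match["expert"]), key))
    # Sort numerically so expert 10 does not land before expert 2.
    return {fused: [k for _, k in sorted(items)] for fused, items in groups.items()}
```

Each resulting group would then be stacked along a new leading expert axis, for example with `mlx.core.stack`, which leaves the quantized values themselves unchanged. The choice of stacking call is an assumption here.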
Sensitivity Allocation
oQ normally measures layer sensitivity by running calibration inference and then allocating higher precision to the layers where quantization error matters most. For this checkpoint, the automatic proxy calibration path was not usable for the reasons above. Instead, quantization used an explicit 40-layer positional sensitivity map:
- first and last 12.5% of layers: highest sensitivity
- next outer quartiles: moderate sensitivity
- middle layers: lower sensitivity
That means this is still mixed-precision oQ5, but it is not a fully data-calibrated oQ artifact. Quality should be reasonable for an oQ5 quantization, though the bit allocation is less tailored than a successful calibration-based oQ run would produce.
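A positional map of this kind can be sketched in a few lines. The sketch below is illustrative only: the tier labels, the function name `positional_tiers`, and the 25% boundary for the moderate band are assumptions, and the actual contents and schema of oq_sensitivity_map.json are not reproduced here.

```python
def positional_tiers(num_layers=40):
    """Assign each layer index a coarse sensitivity tier based on its position."""
    edge = round(num_layers * 0.125)  # first/last 12.5% of layers
    outer = round(num_layers * 0.25)  # assumed outer boundary of the moderate band
    tiers = {}
    for i in range(num_layers):
        distance = min(i, num_layers - 1 - i)  # distance from the nearest end
        if distance < edge:
            tiers[i] = "high"
        elif distance < outer:
            tiers[i] = "moderate"
        else:
            tiers[i] = "low"
    return tiers
```

With 40 layers and these assumed boundaries, layers 0-4 and 35-39 fall in the highest tier, layers 5-9 and 30-34 in the moderate tier, and layers 10-29 in the lowest tier.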
Conversion Procedure
The final artifact was produced locally from the original BF16 safetensors checkpoint using the oMLX app bundle's Python environment.
High-level steps:
- Created a temporary symlinked source view of the original model.
- Set stale MTP config fields to 0 in that temporary view only.
- Added an explicit oq_sensitivity_map.json to bypass the incompatible auto-proxy sensitivity path.
- Ran oMLX quantize_oq_streaming with:
  - oq_level=5
  - group_size=64
  - dtype="bfloat16"
  - text_only=False
  - preserve_mtp=False
  - auto_proxy_sensitivity=False
- Repacked quantized MoE expert tensors from per-expert keys into switch_mlp keys expected by the current MLX/VLM loader.
- Validated the final directory with a lazy MLX/VLM load.
The source model directory was not modified.
Final Validation
The final model directory passed these checks after repacking:
Output size: 24G
Safetensors shards: 6
Quantization bits: 5
Quantization group size: 64
Quantization mode: affine
Per-layer overrides: 352
Mapped tensors: 2010
switch_mlp tensors: 360
per-expert MoE keys: 0
MTP layer count: 0
Lazy VLM load: passed
Lazy load validation used the oMLX app bundle with mlx_vlm.utils.load_model
and mlx_lm.tokenizer_utils.load. This validates the config/weight-key layout
and catches loader mismatches. It is not the same as a full generation benchmark.
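Some of the key-layout counts above can be re-derived without loading any weights. The sketch below assumes the output directory contains a `model.safetensors.index.json` with a `weight_map` table and that the "Mapped tensors" figure corresponds to its length. Both are assumptions, as is the function name `summarize_weight_keys`.

```python
import json
from pathlib import Path


def summarize_weight_keys(model_dir):
    """Count the key families reported in the validation summary."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    weight_map = json.loads(index_path.read_text())["weight_map"]
    return {
        "mapped_tensors": len(weight_map),
        "switch_mlp_tensors": sum(".mlp.switch_mlp." in k for k in weight_map),
        "per_expert_keys": sum(".mlp.experts." in k for k in weight_map),
        "mtp_keys": sum(k.startswith("mtp.") or ".mtp." in k for k in weight_map),
        "shards": len(set(weight_map.values())),
    }
```

As a sanity check on the reported numbers, 360 switch_mlp tensors is consistent with 40 layers, 3 projections per layer, and 3 tensors (weight, scales, biases) per projection.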
Usage
This model is meant to be used with oMLX or another runtime that supports MLX safetensors quantization metadata. After downloading or cloning the repository into an oMLX model directory, load it as a normal local MLX model.
For oMLX CLI usage with a local model directory:
omlx serve --model-dir /path/to/models
Then select or request the model by its directory/repository name, depending on your oMLX setup.
The original Ornith model card notes that Ornith is a reasoning model and may
produce a <think>...</think> block before the final answer. Preserve the
original chat template included with this repository when serving the model.
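Client code that only wants the final answer can separate the reasoning block after generation. This is a minimal sketch, assuming the output contains at most one well-formed <think>...</think> pair. The helper name `split_reasoning` is hypothetical and is not part of oMLX or MLX.

```python
import re

# Non-greedy match across newlines for a single reasoning block.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw model output."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()  # no reasoning block present
    reasoning = match.group(1).strip()
    # Drop the block and keep whatever surrounds it as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

Output that is truncated before the closing tag would not match and is returned unchanged as the answer, so callers may want to handle that case separately.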
Known Limitations
- This is an unofficial community quantization, not an official DeepReinforce release.
- The oQ sensitivity map was heuristic/positional rather than measured by calibration inference, because the stock oMLX auto-proxy path could not load this checkpoint cleanly.
- Native MTP/speculative decoding is disabled because the source checkpoint did not contain MTP weights despite advertising an MTP layer in config.
- Lazy load validation passed, but no benchmark suite was run as part of this conversion.
- Text and vision weights are present, but image-input workflows such as image captioning or image question answering should be smoke-tested in the target runtime before relying on this upload for production VLM use.
Relationship to the Original Model
Please cite and refer to the original model for architecture details, intended use, benchmark results, license, and upstream caveats:
- Original model: https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B
- Organization: DeepReinforce
- License: MIT
This repository only changes the storage/quantization format for MLX inference. It does not introduce additional training or fine-tuning.