Ornith-1.0-35B-oQ5

This repository contains an unofficial oQ5 quantization of deepreinforce-ai/Ornith-1.0-35B, produced with oMLX and stored in MLX format.

The source model is a Qwen3.5 MoE vision-language model released by DeepReinforce under the MIT license. This quantized artifact is intended for Apple Silicon inference with oMLX / MLX-compatible runtimes.

Quantization Summary

  • Source model: deepreinforce-ai/Ornith-1.0-35B
  • Output format: MLX safetensors
  • Quantizer: oMLX oQ streaming quantization
  • oQ level: oQ5
  • Base quantization: 5-bit affine
  • Group size: 64
  • Storage dtype for non-quantized floating tensors: BF16
  • Output size: about 24 GB on disk
  • Output shards: 6 safetensors shards
  • Per-layer quantization overrides: 352
  • Vision weights: preserved
  • MTP/speculative head: disabled in the output config because the source checkpoint does not contain mtp.* tensors

The final config.json contains both quantization and quantization_config entries so MLX loaders can apply the mixed-precision layout.
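
As a rough sanity check on the reported size, the storage cost of 5-bit affine quantization with group size 64 can be estimated. This is a minimal sketch that assumes one 16-bit scale and one 16-bit bias per group, and a nominal 35B parameter count taken from the model name. It ignores the per-layer overrides, the BF16 tensors, and file overhead, so treat the result as an approximation only. The function names are hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits per weight for affine quantization, counting one scale and one
    bias of `meta_bits` each per group (an assumption, not read from the files)."""
    return bits + 2 * meta_bits / group_size


def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in decimal gigabytes if every parameter were quantized."""
    return n_params * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(bits=5, group_size=64)
size = estimated_size_gb(35e9, bpw)  # 35e9 is a nominal count from the model name
print(f"{bpw:.2f} bits/weight, ~{size:.1f} GB")
```

Under these assumptions the estimate is 5.5 bits per weight and roughly 24 GB, which is in line with the reported output size.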

Why This Is Not a Plain Stock oQ Run

This checkpoint required two compatibility workarounds during quantization.

First, the source config.json advertises an MTP head:

"mtp_num_hidden_layers": 1

but the downloaded checkpoint does not contain corresponding mtp.* tensors. That causes the stock oMLX auto-proxy sensitivity path to fail before it can measure calibration sensitivity. For the quantized output, the MTP layer count was set to 0 (disabled) so that the config matches the actual weights.
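
A mismatch of this kind can be detected before quantization by comparing the config against the weight index. The sketch below assumes the checkpoint ships a standard model.safetensors.index.json with a weight_map, and that the MTP field sits either at the top level of config.json or under a nested text_config. Both are assumptions about the layout, and mtp_mismatch is a hypothetical helper, not part of oMLX.

```python
import json
from pathlib import Path


def mtp_mismatch(model_dir: str) -> bool:
    """Return True if config.json advertises MTP layers but the weight index
    lists no tensor whose name contains 'mtp.'."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index = json.loads((root / "model.safetensors.index.json").read_text())

    # The field may sit at the top level or under a nested text config;
    # checking both is an assumption about the config layout.
    text_cfg = config.get("text_config", {})
    advertised = (config.get("mtp_num_hidden_layers", 0)
                  or text_cfg.get("mtp_num_hidden_layers", 0))

    has_mtp_weights = any("mtp." in key for key in index.get("weight_map", {}))
    return bool(advertised) and not has_mtp_weights
```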

Second, the source checkpoint uses VLM-prefixed Qwen3.5 MoE expert keys. During the first streaming quantization pass, the backbone MoE experts were emitted as individual per-expert tensors such as:

language_model.model.layers.N.mlp.experts.E.gate_proj.weight
language_model.model.layers.N.mlp.experts.E.up_proj.weight
language_model.model.layers.N.mlp.experts.E.down_proj.weight

The current MLX/VLM loader expects those backbone expert weights in fused switch_mlp form:

language_model.model.layers.N.mlp.switch_mlp.gate_proj.weight
language_model.model.layers.N.mlp.switch_mlp.up_proj.weight
language_model.model.layers.N.mlp.switch_mlp.down_proj.weight

After quantization, the per-expert quantized tensors were stacked into the loader-compatible switch_mlp layout. This repack is structural: it stacks the already quantized weight, scales, and biases tensors. It does not dequantize, requantize, or otherwise change the quantized numeric values.
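
The key remapping behind such a repack can be sketched as follows. This is an illustration of the idea, not the script that was actually used. The regular expression, the repack_experts name, and the stack parameter are all hypothetical. In practice stack would be something like mx.stack or np.stack, which adds a new leading expert axis. The default here simply returns a list so the sketch has no dependencies.

```python
import re
from collections import defaultdict
from typing import Any, Callable, Dict

# Matches e.g. "language_model.model.layers.3.mlp.experts.17.up_proj.scales"
EXPERT_KEY = re.compile(
    r"^(?P<prefix>.*\.mlp)\.experts\.(?P<expert>\d+)\."
    r"(?P<proj>gate_proj|up_proj|down_proj)\.(?P<kind>weight|scales|biases)$"
)


def repack_experts(tensors: Dict[str, Any],
                   stack: Callable[[list], Any] = list) -> Dict[str, Any]:
    """Group per-expert tensors and place them under switch_mlp keys.

    `stack` should combine a list of tensors along a new leading axis.
    The values themselves are never modified, only regrouped.
    """
    grouped = defaultdict(dict)  # fused key -> {expert index: tensor}
    out = {}
    for key, value in tensors.items():
        m = EXPERT_KEY.match(key)
        if m is None:
            out[key] = value  # non-expert tensors pass through unchanged
            continue
        fused = f"{m['prefix']}.switch_mlp.{m['proj']}.{m['kind']}"
        grouped[fused][int(m["expert"])] = value

    for fused, by_expert in grouped.items():
        # Stack in expert-index order so entry E still belongs to expert E.
        ordered = [by_expert[i] for i in sorted(by_expert)]
        out[fused] = stack(ordered)
    return out
```

Sorting by the numeric expert index matters: sorting the key strings instead would place expert 10 before expert 2 and silently permute the experts.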

Sensitivity Allocation

oQ normally measures layer sensitivity by running calibration inference and then allocating higher precision to the layers where quantization error matters most. For this checkpoint, the automatic proxy calibration path was not usable for the reasons above. Instead, quantization used an explicit 40-layer positional sensitivity map:

  • first and last 12.5% of layers: highest sensitivity
  • next outer quartiles: moderate sensitivity
  • middle layers: lower sensitivity

That means this is still mixed-precision oQ5, but it is not a fully data-calibrated oQ artifact. Quality should be reasonable for an oQ5 quantization, but the bit allocation is less tailored than a successful calibration-based oQ run would produce.
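
A positional map of this shape could be generated as in the sketch below. The tier labels, the function name, and the reading of "next outer quartiles" as layers between 12.5% and 25% from either end are assumptions made for illustration. The actual schema of oq_sensitivity_map.json and the exact boundaries used are not specified here.

```python
def positional_sensitivity(num_layers: int = 40) -> dict:
    """Assign a sensitivity tier to each layer from its position alone."""
    tiers = {}
    for i in range(num_layers):
        # Distance from the nearest end, as a fraction of the stack depth.
        edge = min(i, num_layers - 1 - i) / num_layers
        if edge < 0.125:
            tiers[i] = "high"
        elif edge < 0.25:  # assumed reading of "next outer quartiles"
            tiers[i] = "moderate"
        else:
            tiers[i] = "low"
    return tiers


tiers = positional_sensitivity(40)
```

For 40 layers, this reading marks layers 0 to 4 and 35 to 39 as high, 5 to 9 and 30 to 34 as moderate, and the 20 middle layers as low.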

Conversion Procedure

The final artifact was produced locally from the original BF16 safetensors checkpoint using the oMLX app bundle's Python environment.

High-level steps:

  1. Created a temporary symlinked source view of the original model.
  2. Set stale MTP config fields to 0 in that temporary view only.
  3. Added an explicit oq_sensitivity_map.json to bypass the incompatible auto-proxy sensitivity path.
  4. Ran oMLX quantize_oq_streaming with:
    • oq_level=5
    • group_size=64
    • dtype="bfloat16"
    • text_only=False
    • preserve_mtp=False
    • auto_proxy_sensitivity=False
  5. Repacked quantized MoE expert tensors from per-expert keys into switch_mlp keys expected by the current MLX/VLM loader.
  6. Validated the final directory with a lazy MLX/VLM load.

The source model directory was not modified.
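
Steps 1 and 2 can be sketched with the standard library alone. The helper below is hypothetical and only illustrates the approach: every file is symlinked into a fresh directory except config.json, which is written as a patched copy. It updates top-level keys only. A config that nests the MTP field would need additional handling.

```python
import json
import os
from pathlib import Path


def make_patched_view(source_dir: str, view_dir: str, overrides: dict) -> Path:
    """Create `view_dir` with symlinks to every entry in `source_dir`, except
    config.json, which is written as a patched copy. The source is untouched."""
    src, view = Path(source_dir).resolve(), Path(view_dir)
    view.mkdir(parents=True, exist_ok=False)  # refuse to reuse an old view

    for item in src.iterdir():
        if item.name == "config.json":
            continue
        os.symlink(item, view / item.name)

    config = json.loads((src / "config.json").read_text())
    config.update(overrides)  # top-level keys only
    (view / "config.json").write_text(json.dumps(config, indent=2))
    return view
```

A call such as make_patched_view("/path/to/source", "/path/to/view", {"mtp_num_hidden_layers": 0}) would then give the quantizer a view whose config matches the weights, with both paths being placeholders.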

Final Validation

The final model directory passed these checks after repacking:

Output size:              24G
Safetensors shards:       6
Quantization bits:        5
Quantization group size:  64
Quantization mode:        affine
Per-layer overrides:      352
Mapped tensors:           2010
switch_mlp tensors:       360
per-expert MoE keys:      0
MTP layer count:          0
Lazy VLM load:            passed

Lazy load validation used the oMLX app bundle with mlx_vlm.utils.load_model and mlx_lm.tokenizer_utils.load. This validates the config/weight-key layout and catches loader mismatches. It is not the same as a full generation benchmark.
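
Several of the key-layout counts above can be reproduced without loading any weights, by reading the config and the shard index. This is a sketch under assumptions: it presumes the output directory contains model.safetensors.index.json with a weight_map, and that the quantization entry in config.json exposes bits and group_size fields. The summarize_layout helper is hypothetical, and it does not replace the lazy MLX/VLM load.

```python
import json
import re
from pathlib import Path

# Matches leftover per-expert keys such as "...mlp.experts.17.up_proj.weight"
PER_EXPERT = re.compile(r"\.mlp\.experts\.\d+\.")


def summarize_layout(model_dir: str) -> dict:
    """Count tensor-key patterns and read basic quantization settings."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    index = json.loads((root / "model.safetensors.index.json").read_text())
    keys = index["weight_map"]
    quant = config.get("quantization", {})  # assumed field names below
    return {
        "bits": quant.get("bits"),
        "group_size": quant.get("group_size"),
        "mapped_tensors": len(keys),
        "switch_mlp_tensors": sum(".switch_mlp." in k for k in keys),
        "per_expert_keys": sum(bool(PER_EXPERT.search(k)) for k in keys),
    }
```

A repacked directory should report zero per_expert_keys. Any non-zero value would indicate that some experts were not repacked into the switch_mlp layout.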

Usage

This model is meant to be used with oMLX or another runtime that supports MLX safetensors quantization metadata. After downloading or cloning the repository into an oMLX model directory, load it as a normal local MLX model.

For oMLX CLI usage with a local model directory:

omlx serve --model-dir /path/to/models

Then select or request the model by its directory/repository name, depending on your oMLX setup.

The original Ornith model card notes that Ornith is a reasoning model and may produce a <think>...</think> block before the final answer. Preserve the original chat template included with this repository when serving the model.
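
Client code that only needs the final answer can separate it from the reasoning block. The helper below is a hypothetical, minimal sketch that assumes at most one well-formed <think>...</think> block per response. It does not handle truncated output where the closing tag is missing.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer). Reasoning is '' when no block is present."""
    m = THINK_BLOCK.search(text)
    if m is None:
        return "", text.strip()
    # Drop the matched block and keep whatever surrounds it as the answer.
    answer = text[:m.start()] + text[m.end():]
    return m.group(1).strip(), answer.strip()
```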

Known Limitations

  • This is an unofficial community quantization, not an official DeepReinforce release.
  • The oQ sensitivity map was heuristic/positional rather than measured by calibration inference, because the stock oMLX auto-proxy path could not load this checkpoint cleanly.
  • Native MTP/speculative decoding is disabled because the source checkpoint did not contain MTP weights despite advertising an MTP layer in config.
  • Lazy load validation passed, but no benchmark suite was run as part of this conversion.
  • Text and vision weights are present, but image-conditioned generation and image question-answering workflows should be smoke-tested in the target runtime before relying on this upload for production VLM use.

Relationship to the Original Model

Please cite and refer to the original model, deepreinforce-ai/Ornith-1.0-35B, for architecture details, intended use, benchmark results, license, and upstream caveats.

This repository only changes the storage/quantization format for MLX inference. It does not introduce additional training or fine-tuning.
