dhara-250m-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud.

An OptIQ mixed-precision 4-bit quant of codelion/dhara-250m, the second member of OptIQ's Diffusion LLM family, for Apple Silicon.

dhara is a tri-mode 250M model. One set of weights decodes three ways: standard autoregressive (left-to-right), block-diffusion (fill a block of tokens and iteratively un-mask it), and self-speculation (draft a block with the diffusion forward, verify with the AR forward). It is a custom architecture that stock mlx-lm can't load (it adds Canon depthwise-conv layers, QK-norm after RoPE, and a logit soft-cap), so OptIQ ships a vendored, MLX-native port that registers with mlx-lm and is bit-exact to the reference.

At 250M parameters, dhara is a base to fine-tune, in the same way Google's Gemma-270M is: small enough to LoRA on-device for one task, not a general assistant.

Install

pip install mlx-optiq

Usage

import optiq  # registers the dhara architecture with mlx-lm
from mlx_lm import load, generate

model, tok = load("mlx-community/dhara-250m-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain the Mediterranean climate."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tok, prompt))

Block-diffusion and self-speculation are handled by the OptIQ runtime. Running optiq serve --model mlx-community/dhara-250m-OptiQ-4bit serves an OpenAI/Anthropic-compatible API; adding --mtp routes requests through the self-speculative path. LoRA fine-tuning uses the standard autoregressive trainer, optiq lora train.
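Because the server speaks an OpenAI-compatible API, a client can be sketched with the standard library alone. This is a minimal sketch, not the documented client: the base URL (host and port) is an assumption, as is the /v1/chat/completions route, which follows the usual OpenAI convention rather than anything stated here. Check the address printed by optiq serve on startup.

```python
import json
import urllib.request

# Assumption: adjust to wherever `optiq serve` is actually listening.
BASE_URL = "http://localhost:8080"


def build_chat_request(prompt, base_url=BASE_URL, max_tokens=128):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "mlx-community/dhara-250m-OptiQ-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",  # conventional OpenAI route (assumed)
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Explain the Mediterranean climate.")

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect before pointing it at a live server.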

The quant: 4-bit is lossless here

dhara is small enough that the weights aren't the bottleneck, so OptIQ's win is size, not a capability rescue. We measured the full 6-benchmark Capability Score (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop) three ways: full-precision bf16, naive uniform 4-bit, and this OptIQ sensitivity-measured mixed-precision quant. All three land within run-to-run noise.

| Variant | Size | bpw | Capability | MMLU | IFEval |
|---|---|---|---|---|---|
| bf16 (reference) | 460 MB | 16 | 8.34 | 24.7 | 23.3 |
| uniform 4-bit | 130 MB | 4.0 | 8.79 | 24.3 | 27.2 |
| dhara-250m-OptiQ-4bit | 170 MB | 4.86 | 8.54 | 24.9 | 25.0 |

All three variants are within the IFEval noise band, so this quant delivers full quality at 2.7× smaller than bf16. GSM8K, HumanEval, BFCL, and HashHop sit at the 250M floor for every variant. This is a genuine small-model ceiling (the model can't yet do multi-step math or tool calls, verified by inspecting raw generations with the model's own repetition penalty), not a quantization or harness artifact. The takeaway: quantization costs nothing here.
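The size claim can be checked directly from the table's reported sizes. The short sketch below only redoes that arithmetic; the dictionary and function names are illustrative.

```python
# Sizes in MB, copied from the comparison table above.
sizes_mb = {
    "bf16": 460,
    "uniform 4-bit": 130,
    "dhara-250m-OptiQ-4bit": 170,
}


def shrink_factor(variant, reference="bf16"):
    """How many times smaller `variant` is than `reference`."""
    return sizes_mb[reference] / sizes_mb[variant]


for name in sizes_mb:
    print(f"{name}: {shrink_factor(name):.1f}x smaller than bf16")
```

The mixed-precision quant gives up some of the uniform 4-bit size reduction in exchange for keeping its most sensitive tensors at 8-bit.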

Scores are reported honestly. dhara-250m is meant to be fine-tuned on a specific task, where these base scores are the starting point, not the product.

Decode modes: self-speculation is the default

dhara decodes three ways from one set of weights. The recommended default is self-speculation (--mtp): it drafts a block in one parallel forward and verifies it autoregressively (two forwards per round, no commit pass), so the emitted output is identical to plain AR decode while committing ~3–4 tokens per round. The result is AR accuracy at ~1.4× the speed of token-by-token AR. The model is overhead-bound (a 32-token forward costs about the same as a 1-token forward), and the 4-bit and bf16 weights decode at the same speed, so quantization buys size, not throughput.

| Mode | Speed (M3 Max) | Character |
|---|---|---|
| self-speculation (--mtp) | ~1.4× AR | recommended, output identical to AR, several tokens/round |
| autoregressive | ~130 tok/s | the exact reference; pair with a repetition penalty (greedy can loop) |
| block-diffusion | parallel | prefix-cached; bidirectional (infilling), trades denoising steps for speed |

Self-speculation guarantees AR-identical output because the AR verify decides every token; the speedup is free accuracy-wise and largest for fine-tuned models decoded greedily (the deployment case here). Self-spec and block-diffusion are prefix-cached (KV + Canon-conv state), so each step processes only the new block: O(block) per step, not O(sequence).
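The AR-identical guarantee can be illustrated with a toy draft-and-verify loop. This is a sketch of the general technique under stated assumptions, not the OptIQ runtime: ar_next and draft_block are hypothetical stand-ins for the model's greedy AR forward and its diffusion draft, and the verify loop runs sequentially here, where a real model would score the whole draft in one parallel forward.

```python
from typing import List, Tuple

VOCAB = 11  # toy vocabulary size


def ar_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for one greedy AR step on the prefix."""
    return (tokens[-1] * 3 + len(tokens)) % VOCAB


def draft_block(tokens: List[int], block: int) -> List[int]:
    """Hypothetical stand-in for the diffusion draft: mostly right, sometimes wrong."""
    ctx, out = list(tokens), []
    for _ in range(block):
        guess = ar_next(ctx)
        if len(ctx) % 4 == 0:  # inject a deliberate drafting error
            guess = (guess + 1) % VOCAB
        out.append(guess)
        ctx.append(guess)
    return out


def ar_decode(prompt: List[int], n: int) -> List[int]:
    """Plain token-by-token greedy decode (the reference)."""
    toks = list(prompt)
    for _ in range(n):
        toks.append(ar_next(toks))
    return toks[len(prompt):]


def self_spec_decode(prompt: List[int], n: int, block: int = 4) -> Tuple[List[int], int]:
    """Draft a block, then let the AR verify decide every emitted token."""
    toks, rounds = list(prompt), 0
    while len(toks) - len(prompt) < n:
        rounds += 1
        for guess in draft_block(toks, block):
            target = ar_next(toks)  # the AR verify's choice at this position
            toks.append(target)     # always emit the AR token, never the raw draft
            if guess != target:
                break               # first mismatch: discard the rest of the draft
    return toks[len(prompt):len(prompt) + n], rounds


reference = ar_decode([1, 2], 12)
spec_out, spec_rounds = self_spec_decode([1, 2], 12)
print(spec_out == reference, spec_rounds)
```

Every appended token comes from the AR step, so the output matches plain AR decode by construction; the draft only determines how many tokens are committed per round.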

Quantization details

OptIQ measures each layer's quantization sensitivity (KL divergence vs the bf16 reference on calibration data) and assigns per-layer bit-widths under a target budget. This quant has 148 weight tensors at 4-bit and 76 at 8-bit, for 4.86 bits per weight. The Canon depthwise convs, QK-norm, and logit soft-cap are not Linear modules, so they stay at bf16 automatically; only the attention and MLP projections are quantized.
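One simple way to turn sensitivities into bit-widths under a budget is a greedy pass: start everything at the low bit-width, then upgrade the most sensitive layers while the average stays within the target. The sketch below assumes that greedy policy; OptIQ's actual allocation algorithm is not described here, and the layer names, parameter counts, and sensitivity values are made up for illustration. It also ignores the per-group scale and bias overhead that real quantized formats carry.

```python
from typing import Dict, List, Tuple

Layer = Tuple[str, int, float]  # (name, parameter count, sensitivity score)


def assign_bits(
    layers: List[Layer], target_bpw: float, low: int = 4, high: int = 8
) -> Tuple[Dict[str, int], float]:
    """Greedy mixed-precision allocation under an average bits-per-weight budget."""
    total_params = sum(n for _, n, _ in layers)
    bits = {name: low for name, _, _ in layers}
    used_bits = low * total_params
    budget_bits = target_bpw * total_params

    # Visit the most sensitive layers first and upgrade each one that still fits.
    for name, n, _ in sorted(layers, key=lambda layer: layer[2], reverse=True):
        extra = (high - low) * n
        if used_bits + extra <= budget_bits:
            bits[name] = high
            used_bits += extra

    return bits, used_bits / total_params


# Illustrative layers only; not measurements from dhara.
toy_layers = [
    ("attn.q_proj", 100, 0.90),
    ("attn.k_proj", 100, 0.10),
    ("mlp.up_proj", 300, 0.50),
    ("mlp.down_proj", 300, 0.05),
]

plan, achieved_bpw = assign_bits(toy_layers, target_bpw=5.0)
print(plan, achieved_bpw)
```

In this toy run the large mlp.up_proj layer is skipped because upgrading it would exceed the budget, and the smaller attn.k_proj layer is upgraded instead, which shows why the 4-bit and 8-bit split depends on tensor sizes as well as on sensitivity.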

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune

License + provenance

Derived from codelion/dhara-250m. See the Diffusion LLM family guide for details.
