dhara-250m-OptiQ-4bit
Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud.
An OptIQ mixed-precision 4-bit quant of codelion/dhara-250m, the second member of OptIQ's Diffusion LLM family, for Apple Silicon.
dhara is a tri-mode 250M model: one set of weights that decodes three ways. The modes are standard autoregressive (left-to-right), block-diffusion (fill a block of tokens and iteratively un-mask it), and self-speculation (draft a block with the diffusion forward, verify with the AR forward). It is a custom architecture that stock mlx-lm can't load (it adds Canon depthwise-conv layers, QK-norm after RoPE, and a logit soft-cap). OptIQ ships a vendored, MLX-native port that registers with mlx-lm and is bit-exact to the reference.
At 250M parameters, dhara is a base model to fine-tune, much as Google's Gemma-270M is: small enough to LoRA on-device for a single task, but not a general assistant.
Install
pip install mlx-optiq
Usage
import optiq # registers the dhara architecture with mlx-lm
from mlx_lm import load, generate
model, tok = load("mlx-community/dhara-250m-OptiQ-4bit")
prompt = tok.apply_chat_template(
[{"role": "user", "content": "Explain the Mediterranean climate."}],
tokenize=False, add_generation_prompt=True)
print(generate(model, tok, prompt))
Block-diffusion and self-speculation are handled by the OptIQ runtime. Running optiq serve --model mlx-community/dhara-250m-OptiQ-4bit starts an OpenAI/Anthropic-compatible API server; adding --mtp routes requests through the self-speculative path. LoRA fine-tuning uses the standard autoregressive trainer, optiq lora train.
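As a rough illustration of talking to such a server from Python, the sketch below builds (but does not send) a chat request using only the standard library. The base URL and port are placeholders, and the /v1/chat/completions path is the usual OpenAI convention rather than something this card documents for optiq serve, so check the server's startup output before relying on either.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions POST request (nothing is sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        # ASSUMPTION: the conventional OpenAI path; not confirmed for optiq serve.
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# ASSUMPTION: host and port are placeholders; use whatever the server reports.
req = build_chat_request(
    "http://localhost:8080",
    "mlx-community/dhara-250m-OptiQ-4bit",
    "Explain the Mediterranean climate.",
)
print(req.full_url)

# To actually send it once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it easy to point the same code at a different host or port.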
The quant: 4-bit is lossless here
dhara is small enough that the weights aren't the bottleneck, so OptIQ's win is size, not a capability rescue. We measured the full 6-benchmark Capability Score (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop) for three variants: full-precision bf16, naive uniform 4-bit, and this OptIQ mixed-precision quant. All three land within run-to-run noise.
| Variant | Size | bpw | Capability | MMLU | IFEval |
|---|---|---|---|---|---|
| bf16 (reference) | 460 MB | 16 | 8.34 | 24.7 | 23.3 |
| uniform 4-bit | 130 MB | 4.0 | 8.79 | 24.3 | 27.2 |
| dhara-250m-OptiQ-4bit | 170 MB | 4.86 | 8.54 | 24.9 | 25.0 |
All three are within the IFEval noise band, so this quant keeps full quality at 2.7× smaller than bf16. GSM8K, HumanEval, BFCL, and HashHop sit at the 250M floor for every variant. This is a genuine small-model ceiling (the model can't yet do multi-step math or tool calls, verified by inspecting raw generations with the model's own repetition penalty), not a quantization or harness artifact. The takeaway: quantization costs nothing here.
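The size ratios can be rechecked directly from the table. The short sketch below uses only the sizes and Capability scores listed above; the dictionary keys and function names are made up for this example.

```python
# Sizes (MB) and Capability scores copied from the table above.
variants = {
    "bf16": {"size_mb": 460, "capability": 8.34},
    "uniform-4bit": {"size_mb": 130, "capability": 8.79},
    "optiq-4bit": {"size_mb": 170, "capability": 8.54},
}


def compression_ratio(reference: str, variant: str) -> float:
    """How many times smaller `variant` is than `reference`."""
    return variants[reference]["size_mb"] / variants[variant]["size_mb"]


def capability_spread() -> float:
    """Gap between the best and worst Capability score across all variants."""
    scores = [v["capability"] for v in variants.values()]
    return max(scores) - min(scores)


print(f"OptIQ vs bf16: {compression_ratio('bf16', 'optiq-4bit'):.1f}x smaller")
print(f"uniform 4-bit vs bf16: {compression_ratio('bf16', 'uniform-4bit'):.1f}x smaller")
print(f"Capability spread: {capability_spread():.2f} points")
```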
Scores are reported honestly. dhara-250m is meant to be fine-tuned on a specific task, where these base scores are the starting point, not the product.
Decode modes: self-speculation is the default
dhara decodes three ways from one set of weights. The recommended default is self-speculation (--mtp): it drafts a block in one parallel forward and verifies it autoregressively (two forwards per round, no commit pass). The emitted output is therefore identical to plain AR decode while committing ~3–4 tokens per round, which gives AR accuracy at ~1.4× the speed of token-by-token AR. The model is overhead-bound (a 32-token forward costs about the same as a 1-token forward), and the 4-bit and bf16 weights decode at the same speed, so quantization buys size, not throughput.
| Mode | Speed (M3 Max) | Character |
|---|---|---|
| self-speculation (--mtp) | ~1.4× AR | recommended, output identical to AR, several tokens/round |
| autoregressive | ~130 tok/s | the exact reference; pair with a repetition penalty (greedy can loop) |
| block-diffusion | parallel | prefix-cached; bidirectional (infilling), trades denoising steps for speed |
Self-speculation guarantees AR-identical output because the AR verify decides every token, so the speedup costs nothing in accuracy. It is largest for fine-tuned models decoded greedily (the deployment case here). Self-speculation and block-diffusion are both prefix-cached (KV + Canon-conv state), so each step processes only the new block: O(block) work per step, not O(sequence).
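The draft-then-verify loop can be illustrated with a toy model. This is a minimal sketch, not OptIQ's runtime: ar_next and draft_block are hypothetical stand-ins for the AR and diffusion forwards, and the verify step calls ar_next once per position, where a real implementation would score the whole block in a single forward.

```python
from typing import List, Tuple

Token = int


def ar_next(context: List[Token]) -> Token:
    """Toy stand-in for the greedy AR forward: a deterministic function of the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for the parallel draft: mostly right, deliberately wrong at some positions."""
    block: List[Token] = []
    ctx = list(context)
    for _ in range(block_size):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:  # inject a draft error so verification has work to do
            tok = (tok + 1) % 50
        block.append(tok)
        ctx.append(tok)
    return block


def ar_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Plain token-by-token greedy decode (the reference)."""
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        out.append(ar_next(out))
    return out[len(prompt):]


def self_spec_decode(prompt: List[Token], n_new: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Draft a block, then let the AR model decide every committed token."""
    out = list(prompt)
    target = len(prompt) + n_new
    rounds = 0
    while len(out) < target:
        rounds += 1
        for drafted in draft_block(out, block_size):
            expected = ar_next(out)  # AR verdict for this position
            out.append(expected)     # the AR token is always what gets committed
            if drafted != expected or len(out) == target:
                break                # first mismatch ends the round
    return out[len(prompt):], rounds


prompt = [1, 2, 3]
reference = ar_decode(prompt, 20)
speculative, rounds = self_spec_decode(prompt, 20)
print(speculative == reference, rounds)
```

Because only AR-approved tokens are ever appended, the output matches the reference by construction; draft quality affects only how many rounds are needed.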
Quantization details
OptIQ measures each layer's quantization sensitivity (KL divergence against the bf16 reference on calibration data) and assigns per-layer bit-widths under a target budget. This quant has 148 weight tensors at 4-bit and 76 at 8-bit, for 4.86 bits per weight. The Canon depthwise convs, QK-norm, and logit soft-cap are not Linear modules, so they stay at bf16 automatically; only the attention and MLP projections are quantized.
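One simple way to turn sensitivities into bit-widths under a budget is a greedy promotion pass. The sketch below is an assumption about how such an allocator could look, not OptIQ's actual algorithm: the layer names, sizes, and sensitivity values are invented, and it ignores the per-group scale overhead that real quantized formats carry.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    name: str
    n_params: int
    sensitivity: float  # e.g. KL divergence vs. the reference when quantized to low_bits


def allocate_bits(layers: List[Layer], target_bpw: float,
                  low_bits: int = 4, high_bits: int = 8) -> Dict[str, int]:
    """Start every layer at low_bits, then promote the most sensitive ones while the budget allows."""
    bits = {layer.name: low_bits for layer in layers}
    total_params = sum(layer.n_params for layer in layers)
    used = low_bits * total_params
    budget = target_bpw * total_params
    # Rank by sensitivity per parameter: the most damage avoided per extra bit spent.
    ranked = sorted(layers, key=lambda l: l.sensitivity / l.n_params, reverse=True)
    for layer in ranked:
        extra = (high_bits - low_bits) * layer.n_params
        if used + extra <= budget:
            bits[layer.name] = high_bits
            used += extra
    return bits


def average_bpw(layers: List[Layer], bits: Dict[str, int]) -> float:
    """Parameter-weighted average bit-width."""
    total_params = sum(layer.n_params for layer in layers)
    return sum(bits[layer.name] * layer.n_params for layer in layers) / total_params


# Hypothetical layers, for illustration only.
layers = [
    Layer("attn.q_proj", 1000, 0.50),
    Layer("attn.o_proj", 1000, 0.04),
    Layer("mlp.up_proj", 2000, 0.10),
    Layer("mlp.down_proj", 2000, 0.90),
]
bits = allocate_bits(layers, target_bpw=6.0)
print(bits, average_bpw(layers, bits))
```

A greedy pass like this never exceeds the budget but can leave some of it unspent; an exact solver would treat the choice as a knapsack problem.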
Quantize your own
This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:
pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab # full local workbench: chat, compare, quantize, fine-tune
License + provenance
Derived from codelion/dhara-250m. See the Diffusion LLM family guide for details.