Libraries

How to use mlx-community/dhara-250m-OptiQ-4bit with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

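# Generate text with mlx-lm
# dhara is a custom architecture; importing optiq (pip install mlx-optiq) registers it with mlx-lm
import optiq
from mlx_lm import load, generate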

model, tokenizer = load("mlx-community/dhara-250m-OptiQ-4bit")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use mlx-community/dhara-250m-OptiQ-4bit with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/dhara-250m-OptiQ-4bit"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "mlx-community/dhara-250m-OptiQ-4bit"
        }
      ]
    }
  }
}
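If ~/.pi/agent/models.json already lists other providers, the entry above can be merged in rather than pasted over the file. The following is a minimal sketch that assumes only the JSON layout shown above; the helper names add_mlx_provider and update_models_file are hypothetical and are not part of Pi.

```python
import json
from pathlib import Path


def add_mlx_provider(config, model_id, base_url="http://localhost:8080/v1"):
    """Return a copy of `config` with the mlx-lm provider entry added or updated."""
    merged = json.loads(json.dumps(config))  # deep copy of plain JSON data
    provider = merged.setdefault("providers", {}).setdefault("mlx-lm", {})
    provider.update(
        {"baseUrl": base_url, "api": "openai-completions", "apiKey": "none"}
    )
    models = provider.setdefault("models", [])
    # Only append the model if it is not already listed (keeps the merge idempotent).
    if not any(m.get("id") == model_id for m in models):
        models.append({"id": model_id})
    return merged


def update_models_file(path, model_id):
    """Read the JSON file at `path` if it exists, merge the entry, and write it back."""
    path = Path(path).expanduser()
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_mlx_provider(config, model_id), indent=2) + "\n")
```

Calling update_models_file("~/.pi/agent/models.json", "mlx-community/dhara-250m-OptiQ-4bit") leaves any other providers in the file untouched.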

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use mlx-community/dhara-250m-OptiQ-4bit with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "mlx-community/dhara-250m-OptiQ-4bit"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default mlx-community/dhara-250m-OptiQ-4bit

Run Hermes

hermes

MLX LM

How to use mlx-community/dhara-250m-OptiQ-4bit with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "mlx-community/dhara-250m-OptiQ-4bit"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "mlx-community/dhara-250m-OptiQ-4bit"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "mlx-community/dhara-250m-OptiQ-4bit",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
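The same request can be sent from Python using only the standard library. This is a sketch, assuming the server is running locally on mlx_lm.server's default port (8080) and returns an OpenAI-style chat completion body; build_chat_request and chat are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on its default port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "mlx-community/dhara-250m-OptiQ-4bit"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello") returns the model's reply as a string.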

dhara-250m-OptiQ-4bit

Built with mlx-optiq, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud.

An OptIQ mixed-precision 4-bit quant of codelion/dhara-250m, the second member of OptIQ's Diffusion LLM family, for Apple Silicon.

dhara is a tri-mode 250M model. One set of weights decodes three ways: standard autoregressive (left-to-right), block-diffusion (fill a block of tokens and iteratively un-mask it), and self-speculation (draft a block with the diffusion forward, verify with the AR forward). It is a custom architecture that stock mlx-lm can't load (it adds Canon depthwise-conv layers, QK-norm after RoPE, and a logit soft-cap), so OptIQ ships a vendored, MLX-native port that registers with mlx-lm and is bit-exact to the reference.

At 250M parameters, dhara is a base to fine-tune, the way Google's Gemma-270M is: small enough to LoRA on-device for a single task, but not a general assistant.

Install

pip install mlx-optiq

Usage

import optiq  # registers the dhara architecture with mlx-lm
from mlx_lm import load, generate

model, tok = load("mlx-community/dhara-250m-OptiQ-4bit")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain the Mediterranean climate."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tok, prompt))

Block-diffusion and self-speculation are handled by the OptIQ runtime. optiq serve --model mlx-community/dhara-250m-OptiQ-4bit serves an OpenAI/Anthropic-compatible API; --mtp routes through the self-speculative path. LoRA fine-tuning uses the standard optiq lora train autoregressive trainer.

The quant: 4-bit is lossless here

dhara is small enough that the weights aren't the bottleneck, so OptIQ's win is size, not a capability rescue. We measured the full 6-benchmark Capability Score (MMLU + GSM8K + IFEval + BFCL + HumanEval + HashHop) three ways (full-precision bf16, naive uniform 4-bit, and this OptIQ mixed-precision quant), and all three land within run-to-run noise.

Variant	Size	bpw	Capability	MMLU	IFEval
bf16 (reference)	460 MB	16	8.34	24.7	23.3
uniform 4-bit	130 MB	4.0	8.79	24.3	27.2
dhara-250m-OptiQ-4bit	170 MB	4.86	8.54	24.9	25.0

All three are within the IFEval noise band, so this quant keeps full quality at 2.7× smaller than bf16 (460 MB vs 170 MB). GSM8K, HumanEval, BFCL, and HashHop sit at the 250M floor for every variant. This is a genuine small-model ceiling (the model can't yet do multi-step math or tool calls, verified by inspecting raw generations with the model's own repetition penalty), not a quantization or harness artifact. The takeaway: quantization costs nothing here.
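The size ratios and score deltas can be recomputed from the table above. This is just arithmetic on the reported numbers; the variants dictionary and the summarize helper are illustrative names.

```python
# Values copied from the table above:
# (size in MB, bits per weight, Capability, MMLU, IFEval)
variants = {
    "bf16 (reference)": (460, 16.0, 8.34, 24.7, 23.3),
    "uniform 4-bit": (130, 4.0, 8.79, 24.3, 27.2),
    "dhara-250m-OptiQ-4bit": (170, 4.86, 8.54, 24.9, 25.0),
}

ref_size, _, ref_cap, ref_mmlu, ref_ifeval = variants["bf16 (reference)"]


def summarize(name):
    """Compare one variant against the bf16 reference row."""
    size, _bpw, cap, mmlu, ifeval = variants[name]
    return {
        "size_ratio": round(ref_size / size, 1),        # how many times smaller
        "capability_delta": round(cap - ref_cap, 2),    # positive = above bf16
        "mmlu_delta": round(mmlu - ref_mmlu, 1),
        "ifeval_delta": round(ifeval - ref_ifeval, 1),
    }


for name in variants:
    print(name, summarize(name))
```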

Scores are reported as measured. dhara-250m is meant to be fine-tuned on a specific task, where these base scores are the starting point, not the product.

Decode modes: self-speculation is the default

dhara decodes three ways from one set of weights. The recommended default is self-speculation (--mtp): it drafts a block in one parallel forward and verifies it autoregressively (two forwards per round, no commit pass), so the emitted output is identical to plain AR decode while committing ~3–4 tokens per round. The result is AR accuracy at ~1.4× the speed of token-by-token AR. The model is overhead-bound (a 32-token forward costs about the same as a 1-token forward), and the 4-bit and bf16 weights decode at the same speed, so quantization buys size, not throughput.

Mode	Speed (M3 Max)	Character
self-speculation (`--mtp`)	~1.4× AR	recommended, output identical to AR, several tokens/round
autoregressive	~130 tok/s	the exact reference; pair with a repetition penalty (greedy can loop)
block-diffusion	parallel	prefix-cached; bidirectional (infilling), trades denoising steps for speed
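The block-diffusion row can be pictured with a toy loop: start from a fully masked block and commit a few positions per denoising step. This is a sketch only. The toy_denoiser function is a hypothetical stand-in for the model's diffusion forward, and the rule of committing the most confident positions first is an assumption for illustration, not a description of the OptIQ runtime.

```python
MASK = -1  # sentinel for a position that has not been filled yet


def toy_denoiser(prefix, block):
    """Hypothetical stand-in for the diffusion forward.

    Returns one (token, confidence) pair per block position.
    """
    known = sum(t for t in block if t != MASK)
    preds = []
    for i in range(len(block)):
        token = (sum(prefix) + 7 * i + known) % 50
        confidence = 1.0 / (1 + (3 * i + len(prefix)) % 5)
        preds.append((token, confidence))
    return preds


def unmask_block(prefix, block_size, per_step):
    """Fill a masked block, committing `per_step` positions per denoising step."""
    block = [MASK] * block_size
    steps = 0
    while MASK in block:
        preds = toy_denoiser(prefix, block)
        masked = [i for i, t in enumerate(block) if t == MASK]
        # Commit the most confident still-masked positions first (assumed schedule).
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
        steps += 1
    return block, steps
```

Raising per_step fills the same block in fewer forwards, which is the steps-for-speed trade named in the table.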

Self-speculation guarantees AR-identical output because the AR verify decides every token; the speedup costs nothing in accuracy and is largest for fine-tuned models decoded greedily (the deployment case here). Self-spec and block-diffusion are prefix-cached (KV + Canon-conv state), so each step processes only the new block: O(block) per step, not O(sequence).
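The draft-then-verify loop can be sketched with toy functions to show why the output matches plain AR decoding. Everything here is a stand-in: ar_next plays the AR forward, draft_block plays the parallel draft (with errors injected on purpose), and the verify step is written as a Python loop for clarity where the real model would use a single forward. None of these names come from OptIQ.

```python
def ar_next(context):
    """Toy stand-in for the AR forward: a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context, block_size):
    """Toy stand-in for the parallel draft; deliberately wrong at some positions."""
    block, ctx = [], list(context)
    for _ in range(block_size):
        tok = ar_next(ctx)
        if len(ctx) % 4 == 3:  # inject an error so verification has work to do
            tok = (tok + 1) % 50
        block.append(tok)
        ctx.append(tok)
    return block


def verify(context, block):
    """Emit AR's token at each position; stop after the first disagreement with the draft."""
    emitted, ctx = [], list(context)
    for drafted in block:
        target = ar_next(ctx)
        emitted.append(target)  # the AR decision is always what gets emitted
        ctx.append(target)
        if target != drafted:
            break
    return emitted


def decode_ar(prompt, n):
    """Plain token-by-token decoding: one AR call per emitted token."""
    out = list(prompt)
    for _ in range(n):
        out.append(ar_next(out))
    return out[len(prompt):]


def decode_self_spec(prompt, n, block_size=4):
    """Draft a block, verify it, repeat. Returns (tokens, number of rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n:
        out.extend(verify(out, draft_block(out, block_size)))
        rounds += 1
    return out[len(prompt):][:n], rounds
```

Because every emitted token is the AR rule's own choice for the true context, decode_self_spec returns the same tokens as decode_ar, while needing fewer rounds than tokens whenever the draft is partly right.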

Quantization details

OptIQ measures each layer's quantization sensitivity (KL divergence vs the bf16 reference on calibration data) and assigns per-layer bit-widths under a target budget. This quant uses 148 weight tensors at 4-bit and 76 at 8-bit, for 4.86 bits per weight. The Canon depthwise convs, QK-norm, and logit soft-cap are not Linear modules, so they stay at bf16 automatically; only the attention and MLP projections are quantized.
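One way to picture sensitivity-aware allocation is a greedy pass that starts every tensor at the low bit-width and upgrades the most sensitive ones while the budget allows. This is a hypothetical heuristic for illustration, not OptIQ's actual algorithm; the layer names, sizes, and sensitivities below are made up, and the budget counts weight bits only (no scales or other metadata).

```python
def allocate_bits(layers, target_bpw, low=4, high=8):
    """Greedy mixed-precision assignment under a bits-per-weight budget.

    `layers` maps a tensor name to (num_weights, sensitivity), where
    sensitivity is the measured KL divergence when that tensor is
    quantized to `low` bits.
    Returns (bits per tensor, achieved bits per weight).
    """
    bits = {name: low for name in layers}
    total = sum(n for n, _ in layers.values())
    budget = target_bpw * total  # total weight bits allowed
    used = low * total
    # Rank by sensitivity per weight: the most damage avoided per extra bit spent.
    ranked = sorted(layers, key=lambda k: layers[k][1] / layers[k][0], reverse=True)
    for name in ranked:
        extra = (high - low) * layers[name][0]
        if used + extra <= budget:
            bits[name] = high
            used += extra
    return bits, used / total


# Made-up example: (number of weights, KL sensitivity at 4-bit)
example_layers = {
    "attn.q": (100, 0.9),
    "attn.k": (100, 0.1),
    "mlp.up": (400, 0.8),
    "mlp.down": (400, 0.2),
}
example_bits, example_bpw = allocate_bits(example_layers, target_bpw=5.0)
print(example_bits, example_bpw)
```

The achieved bits per weight can land below the target, because a tensor is upgraded only if its whole upgrade fits in the remaining budget.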

Quantize your own

This quant was produced by mlx-optiq. Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune

License + provenance

Derived from codelion/dhara-250m. See the Diffusion LLM family guide for details.

Safetensors

Model size: 49.9M params
Tensor type: BF16, U32, F32

Model tree for mlx-community/dhara-250m-OptiQ-4bit

Base model: codelion/dhara-250m-ar-base
Finetuned: codelion/dhara-250m
Quantized: this model