# DFlash-MLX-Universal: System Usage Guide
How to use `dflash-mlx-universal` on your Apple Silicon Mac (M1/M2/M3/M4).
## Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| macOS | 14+ (Sonoma/Sequoia) | MLX requires Apple Silicon |
| Python | 3.9 - 3.12 | Recommend 3.11 or 3.12 |
| Chip | M1/M2/M3/M4 (Pro/Max/Ultra) | Unified memory required for large models |
| Memory | 16GB+ minimum, 32GB+ recommended | 96GB for 70B+ models |
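Before installing anything, you can confirm the chip, Python version, and unified memory with a short standard-library check (no project APIs involved; `sysctl` is macOS-only):

```python
# Quick prerequisite check: Apple Silicon, Python 3.9-3.12, unified memory
import platform
import subprocess
import sys

arch = platform.machine()  # "arm64" on Apple Silicon, "x86_64" on Intel Macs
print(f"Architecture: {arch} ({'OK' if arch == 'arm64' else 'MLX needs Apple Silicon'})")

major, minor = sys.version_info[:2]
print(f"Python: {major}.{minor} ({'OK' if (3, 9) <= (major, minor) <= (3, 12) else 'use 3.9-3.12'})")

mem_gib = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip()) / 2**30
print(f"Unified memory: {mem_gib:.0f} GiB (16+ minimum, 32+ recommended)")
```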
## 1. Installation (Recommended: uv)
uv is an extremely fast Python package manager written in Rust. It's the recommended way to install dflash-mlx-universal.
Install uv (One-time)
# Option A: Homebrew (macOS)
brew install uv
# Option B: Official installer
curl -LsSf https://astral.sh/uv/install.sh | sh
# Verify
uv --version # Should show 0.6.x or higher
Install DFlash-MLX-Universal with uv
# 1. Clone the repo
git clone https://huggingface.co/tritesh/dflash-mlx-universal.git
cd dflash-mlx-universal
# 2. Create virtual environment with uv (uses .python-version file)
uv venv
# 3. Install in editable mode with all dependencies
uv pip install -e ".[dev,server]"
# Or install directly from the repo
uv pip install "git+https://huggingface.co/tritesh/dflash-mlx-universal.git[dev,server]"
Alternative: uv project workflow (no manual venv)
# 1. Enter project directory
cd dflash-mlx-universal
# 2. uv automatically reads pyproject.toml and .python-version
uv run python -c "import dflash_mlx; print(dflash_mlx.__version__)"
# 3. Lock dependencies (creates uv.lock)
uv lock
# 4. Run any script with automatic dependency resolution
uv run python examples/qwen3_4b_demo.py
# 5. Run tests
uv run pytest tests/ -v
# 6. Start server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./Qwen3-4B-DFlash-mlx --port 8000
With uv and dependency groups
# Install only core dependencies
uv pip install -e .
# Install with server extras (FastAPI + uvicorn)
uv pip install -e ".[server]"
# Install with dev extras (pytest, black, ruff)
uv pip install -e ".[dev]"
# Install everything at once
uv pip install -e ".[dev,server]"
## 1-alt. Installation (Classic pip)
If you prefer pip:
# 1. Create virtual environment
python3 -m venv .venv-dflash
source .venv-dflash/bin/activate # On zsh/bash
# 2. Upgrade pip
pip install --upgrade pip
# 3. Install core dependencies
pip install "mlx-lm>=0.24.0" "transformers>=4.57.0" "huggingface-hub>=0.25.0"
# 4. Install dflash-mlx-universal from your repo
pip install git+https://huggingface.co/tritesh/dflash-mlx-universal.git
# Optional: server mode
pip install fastapi uvicorn
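Whichever installer you use, a two-line sanity check confirms that the package imports and that MLX is targeting the Apple GPU (`Device(gpu, 0)` is what MLX typically reports on Apple Silicon):

```python
import mlx.core as mx
import dflash_mlx

print(mx.default_device())    # typically Device(gpu, 0) on Apple Silicon
print(dflash_mlx.__version__)
```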
## 2. Quick Start: Using a Pre-converted Drafter
Step A: Convert an Official DFlash Drafter to MLX
Official drafters are PyTorch models. You need to convert them to MLX format once:
# With uv (recommended)
uv run python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ~/models/dflash/Qwen3-4B-DFlash-mlx
# With classic pip
python -m dflash_mlx.convert \
--model z-lab/Qwen3-4B-DFlash-b16 \
--output ~/models/dflash/Qwen3-4B-DFlash-mlx
Supported drafters:
# Qwen3 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3-8B-DFlash-b16 --output ~/models/dflash/Qwen3-8B-DFlash-mlx
# Qwen3.5 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-9B-DFlash --output ~/models/dflash/Qwen3.5-9B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.5-27B-DFlash --output ~/models/dflash/Qwen3.5-27B-DFlash-mlx
# Qwen3.6 series
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-27B-DFlash --output ~/models/dflash/Qwen3.6-27B-DFlash-mlx
uv run python -m dflash_mlx.convert --model z-lab/Qwen3.6-35B-A3B-DFlash --output ~/models/dflash/Qwen3.6-35B-DFlash-mlx
# LLaMA
uv run python -m dflash_mlx.convert --model z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat --output ~/models/dflash/LLaMA3.1-8B-DFlash-mlx
# Gemma
uv run python -m dflash_mlx.convert --model z-lab/gemma-4-31B-it-DFlash --output ~/models/dflash/gemma-4-31B-DFlash-mlx
# GPT-OSS
uv run python -m dflash_mlx.convert --model z-lab/gpt-oss-20b-DFlash --output ~/models/dflash/gpt-oss-20b-DFlash-mlx
What this does:
- Downloads the PyTorch weights from the HF Hub
- Transposes linear layers (PyTorch → MLX format)
- Saves `weights.npz` + `config.json`
- Creates `model_info.json` with the target model mapping (see the inspection sketch below)
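Since the converted drafter is just an `.npz` weight file plus two JSON files, you can inspect it with standard tools. The snippet below is a minimal sketch that assumes only the file names listed above; the specific keys printed (e.g. `block_size`) may differ in your converted config.

```python
# Inspect a converted drafter directory: weights.npz, config.json, model_info.json
import json
from pathlib import Path

import numpy as np

drafter_dir = Path("~/models/dflash/Qwen3-4B-DFlash-mlx").expanduser()

config = json.loads((drafter_dir / "config.json").read_text())
print("block_size:", config.get("block_size"))   # key name assumed

info = json.loads((drafter_dir / "model_info.json").read_text())
print("target model mapping:", info)

weights = np.load(drafter_dir / "weights.npz")
for name in list(weights.files)[:5]:             # first few tensors
    print(name, weights[name].shape)
```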
Step B: Generate with DFlash Speculative Decoding
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
# 1. Load target model (any MLX-converted model)
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
# 2. Load converted DFlash drafter
draft_model, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
# 3. Create decoder (auto-detects architecture via adapters)
decoder = DFlashSpeculativeDecoder(
    target_model=model,
    draft_model=draft_model,
    tokenizer=tokenizer,
    block_size=draft_config.get("block_size", 16),
)
# 4. Generate with ~6× speedup
output = decoder.generate(
    prompt="Write a Python function to implement quicksort.",
    max_tokens=1024,
    temperature=0.0,  # Greedy for exact reproduction
)
print(output)
Run with uv:
uv run python my_generate_script.py
Expected output:
[DFlash] Prefill: processing 12 prompt tokens...
[DFlash] Starting speculative decoding (block_size=16)...
[DFlash] Done. Generated 1024 tokens, avg acceptance: 6.23, effective speedup: ~5.8x
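The "effective speedup" is tied to the average acceptance length: each verification step costs one target forward pass (plus a much cheaper draft pass) but yields several accepted tokens instead of one. A back-of-the-envelope estimate of that relationship is sketched below; the 7% drafting cost is an assumption chosen to be consistent with the log above, not a number reported by the library.

```python
# Rough effective-speedup estimate from acceptance length (illustrative only)
def estimate_speedup(avg_acceptance: float, draft_block_cost: float = 0.07) -> float:
    """avg_acceptance: mean tokens accepted per verification step.
    draft_block_cost: cost of drafting one block, relative to one target forward pass."""
    # One verification step = 1 target pass + 1 cheap draft pass, yielding avg_acceptance tokens.
    return avg_acceptance / (1.0 + draft_block_cost)

print(f"~{estimate_speedup(6.23):.1f}x")  # ~5.8x for the run logged above
```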
## 3. Streaming Generation
For real-time output (chat UI, etc.):
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
# Generator-based streaming
for chunk in decoder.generate(
    prompt="Tell me a story about a robot.",
    max_tokens=512,
    stream=True,  # Returns a generator
):
    print(chunk, end="", flush=True)
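Continuing from the setup above, you can wrap the streaming loop to measure live throughput. The token count is approximated by re-tokenizing the streamed text, since the streaming interface is shown yielding text chunks rather than token counts:

```python
import time

start = time.perf_counter()
text = ""
for chunk in decoder.generate(prompt="Tell me a story about a robot.", max_tokens=512, stream=True):
    text += chunk
    print(chunk, end="", flush=True)

elapsed = time.perf_counter() - start
n_tokens = len(tokenizer.encode(text))   # approximate: re-tokenizes the output
print(f"\n~{n_tokens / elapsed:.1f} tok/s over {elapsed:.1f}s")
```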
## 4. Benchmark Mode
Compare DFlash vs baseline speed:
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
model, tokenizer = load("mlx-community/Qwen3-4B-bf16")
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")
decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
# Run benchmark
results = decoder.benchmark(
    prompt="Write a quicksort in Python.",
    max_tokens=512,
    num_runs=5,
)
print(f"Speedup: {results['speedup']:.2f}x")
print(f"Tokens/sec: {results['tokens_per_sec']:.1f}")
Run:
uv run python benchmark_script.py
Sample results (M2 Max, 96GB):
[Benchmark] Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
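If you want to sanity-check the reported numbers yourself, you can time a plain `mlx_lm` generation against the DFlash decoder on the same prompt. This sketch assumes the `model`, `tokenizer`, and `decoder` from the snippet above and the standard `mlx_lm.generate` call; warm both paths up first (see Troubleshooting) or the comparison will be skewed.

```python
import time

from mlx_lm import generate

prompt = "Write a quicksort in Python."

t0 = time.perf_counter()
baseline_text = generate(model, tokenizer, prompt=prompt, max_tokens=512)
baseline_s = time.perf_counter() - t0

t0 = time.perf_counter()
dflash_text = decoder.generate(prompt=prompt, max_tokens=512, temperature=0.0)
dflash_s = time.perf_counter() - t0

print(f"Baseline: {baseline_s:.2f}s | DFlash: {dflash_s:.2f}s | Speedup: {baseline_s / dflash_s:.2f}x")
```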
## 5. Universal Decoder (Any Model Without a Pre-built Drafter)
If your model doesn't have a DFlash drafter yet:
from mlx_lm import load
from dflash_mlx.universal import UniversalDFlashDecoder
# Load ANY mlx_lm model
model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
# UniversalDFlashDecoder:
# 1. Auto-detects architecture (LLaMA in this case)
# 2. Creates a generic 5-layer drafter (~500MB)
# 3. Sets up proper adapter for hidden state extraction
decoder = UniversalDFlashDecoder(
    target_model=model,
    tokenizer=tokenizer,
    draft_layers=5,
    draft_hidden_size=1024,
    block_size=16,
)
# Option A: Train a custom drafter (2-8 hours on Apple Silicon)
decoder.train_drafter(
    dataset="open-web-math",  # or a local JSONL file
    epochs=6,
    lr=6e-4,
    batch_size=16,
    output_path="~/models/dflash/my-llama-drafter",
)
# Option B: Use untrained (low quality, for testing only)
output = decoder.generate(
    prompt="Hello world!",
    max_tokens=100,
)
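If you train from a local JSONL file instead of a named dataset, a one-JSON-object-per-line text file is the natural layout. The exact schema `train_drafter` expects isn't documented here, so treat the `"text"` field below as an assumption and adjust it to whatever your trainer reads:

```python
# Write a tiny local JSONL dataset (field name "text" is an assumption)
import json

samples = [
    {"text": "def quicksort(xs): ..."},
    {"text": "Explain the Pythagorean theorem in one paragraph."},
]
with open("my_dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Then point the trainer at it:
# decoder.train_drafter(dataset="my_dataset.jsonl", epochs=6, lr=6e-4,
#                       batch_size=16, output_path="~/models/dflash/my-llama-drafter")
```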
## 6. OpenAI-Compatible Server
Run a local server compatible with OpenAI clients:
# With uv (recommended)
uv run python -m dflash_mlx.serve \
--target mlx-community/Qwen3-4B-bf16 \
--draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
--block-size 16 \
--port 8000
# Or in background
nohup uv run python -m dflash_mlx.serve \
--target mlx-community/Qwen3-4B-bf16 \
--draft ~/models/dflash/Qwen3-4B-DFlash-mlx \
--port 8000 > dflash.log 2>&1 &
Query the server
# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-4b",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 512,
"temperature": 0.0
}'
# Streaming
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-4b",
"messages": [{"role": "user", "content": "Count to 10"}],
"max_tokens": 100,
"stream": true
}'
# Check metrics
curl http://localhost:8000/metrics
Python client
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Local server, no auth
)
response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Write a haiku about ML"}],
    max_tokens=100,
)
print(response.choices[0].message.content)
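The same client can stream, mirroring the `"stream": true` curl example above (this is the standard OpenAI Python streaming interface, reusing the `client` from the previous snippet):

```python
stream = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Count to 10"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```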
## 7. Using with Ollama, aider, Continue, etc.
Any OpenAI-compatible client works:
aider (AI coding assistant)
aider --model openai/qwen3-4b \
--openai-api-base http://localhost:8000/v1 \
--openai-api-key not-needed
Continue.dev (VS Code extension)
// .continue/config.json
{
"models": [{
"title": "DFlash Qwen3-4B",
"provider": "openai",
"model": "qwen3-4b",
"apiBase": "http://localhost:8000/v1",
"apiKey": "not-needed"
}]
}
Ollama (as custom endpoint)
Configure any OpenAI-compatible frontend to point at http://localhost:8000/v1
## 8. Supported Model Families

| Family | Target Model Example | Drafter Status |
|---|---|---|
| Qwen3 | `mlx-community/Qwen3-4B-bf16` | Pre-built |
| Qwen3.5 | `mlx-community/Qwen3.5-9B-4bit` | Pre-built |
| Qwen3.6 | `mlx-community/Qwen3.6-27B-4bit` | Pre-built |
| LLaMA 3.1 | `mlx-community/Llama-3.1-8B-Instruct-4bit` | Pre-built |
| LLaMA 3.3 | `mlx-community/Llama-3.3-70B-Instruct-4bit` | Pre-built |
| Mistral | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | Train custom |
| Gemma | `mlx-community/gemma-4-31b-it-4bit` | Pre-built |
| Phi | `mlx-community/Phi-3-mini-4k-instruct-4bit` | Generic adapter |
## 9. Troubleshooting
"Unsupported model_type: phi"
# Add a custom adapter for your model
from dflash_mlx.adapters import MLXTargetAdapter, ADAPTERS

class PhiAdapter(MLXTargetAdapter):
    family = "phi"
    # Override methods as needed...

ADAPTERS["phi"] = PhiAdapter
"Vocab size mismatch"
Ensure target model and draft model share the same tokenizer vocabulary. Drafters are trained for specific target families.
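A quick way to spot a mismatch is to compare the target tokenizer's vocabulary size with whatever the converted drafter config records. The key name `vocab_size` below is an assumption about the converted `config.json`; if your config uses a different key, adjust accordingly.

```python
# Compare target tokenizer vocab with the drafter's recorded vocab size
from mlx_lm import load
from dflash_mlx.convert import load_mlx_dflash

_, tokenizer = load("mlx-community/Qwen3-4B-bf16")
_, draft_config = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

target_vocab = tokenizer.vocab_size
draft_vocab = draft_config.get("vocab_size")   # key name assumed
print(f"target={target_vocab} draft={draft_vocab} match={target_vocab == draft_vocab}")
```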
Slow first run
MLX compiles Metal kernels lazily. First generation is slow; subsequent runs are fast. The benchmark method includes warmup.
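If you time generations yourself rather than via `benchmark`, run a throwaway generation first so kernel compilation doesn't land inside your measurement (assuming a `decoder` constructed as in the earlier examples):

```python
# Throwaway warmup so lazy Metal kernel compilation isn't timed
_ = decoder.generate(prompt="warmup", max_tokens=8, temperature=0.0)
# ...now time the real generation
```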
Out of memory
- Reduce `--block-size` (default 16 → 8); see the sketch below
- Use 4-bit quantized target models (`-4bit` suffix)
- Reduce `max_tokens`
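Put together, a lower-memory setup might look like the sketch below; the 4-bit repo name is an example and may differ from what's published on mlx-community.

```python
# Lower-memory configuration: 4-bit quantized target + smaller speculation block
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash

model, tokenizer = load("mlx-community/Qwen3-4B-4bit")   # 4-bit target (repo name may differ)
draft_model, _ = load_mlx_dflash("~/models/dflash/Qwen3-4B-DFlash-mlx")

decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=8)
output = decoder.generate(prompt="Summarize speculative decoding.", max_tokens=256)
print(output)
```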
Draft tokens all rejected
- Drafter may not match the target model (wrong family)
- Use a trained drafter for your specific model
- Check `target_layer_ids` alignment in the config
## Full Example Script
Save as `run_dflash.py`:
#!/usr/bin/env python3
"""Complete DFlash example with error handling."""
import sys
from mlx_lm import load
from dflash_mlx import DFlashSpeculativeDecoder
from dflash_mlx.convert import load_mlx_dflash
def main():
    # Configuration
    TARGET_MODEL = "mlx-community/Qwen3-4B-bf16"
    DRAFT_MODEL = "~/models/dflash/Qwen3-4B-DFlash-mlx"
    PROMPT = "Explain how speculative decoding works."
    MAX_TOKENS = 512

    print(f"Loading target model: {TARGET_MODEL}")
    model, tokenizer = load(TARGET_MODEL)

    print(f"Loading DFlash drafter: {DRAFT_MODEL}")
    try:
        draft_model, draft_config = load_mlx_dflash(DRAFT_MODEL)
    except FileNotFoundError:
        print(f"Error: Drafter not found at {DRAFT_MODEL}")
        print("Convert first: uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ~/models/dflash/Qwen3-4B-DFlash-mlx")
        sys.exit(1)

    print("Creating DFlash decoder...")
    decoder = DFlashSpeculativeDecoder(
        target_model=model,
        draft_model=draft_model,
        tokenizer=tokenizer,
        block_size=draft_config.get("block_size", 16),
    )

    print(f"\nPrompt: {PROMPT}")
    print("-" * 60)
    output = decoder.generate(
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
        temperature=0.0,
    )
    print(output)
    print("-" * 60)
    print("Done!")


if __name__ == "__main__":
    main()
Run:
uv run python run_dflash.py
## Daily Workflow with uv
# cd into your project
cd ~/projects/dflash-mlx-universal
# Run any script β uv handles the virtual env automatically
uv run python examples/qwen3_4b_demo.py
# Run the server
uv run python -m dflash_mlx.serve --target mlx-community/Qwen3-4B-bf16 --draft ./drafter --port 8000
# Run tests
uv run pytest tests/ -v
# Format code
uv run black dflash_mlx/
# Lint
uv run ruff check dflash_mlx/
# Add a dependency
uv add "numpy>=1.26.0"
# Lock dependencies
uv lock
# Sync environment with lock file
uv sync
## Next Steps
- Install uv → `brew install uv`
- Clone the repo → `git clone https://huggingface.co/tritesh/dflash-mlx-universal.git`
- Install → `cd dflash-mlx-universal && uv pip install -e ".[dev,server]"`
- Convert a drafter → `uv run python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./drafter`
- Benchmark → `uv run python examples/qwen3_4b_demo.py`
- Start the server → `uv run python -m dflash_mlx.serve --target ... --draft ...`
- Connect tools → aider, Continue, custom clients
- Train custom drafters → use `UniversalDFlashDecoder` for unsupported models
For questions/issues: https://huggingface.co/tritesh/dflash-mlx-universal/discussions