tritesh
/

dflash-mlx-universal

@@ -1,47 +1,63 @@
 ---
 tags:
-- ml-intern
 ---
-# DFlash-MLX-M2ProMax-96GB: Block Diffusion Speculative Decoding for MLX on Apple Silicon
-> **Tested on M2 Pro Max (96GB Unified Memory)** — Apple Silicon optimized implementation of DFlash speculative decoding for MLX.
-A universal **MLX** implementation of [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036) — block diffusion speculative decoding that works with **any MLX-converted model** on Apple Silicon (M1/M2/M3/M4 Pro/Max/Ultra).
 ---
 ## 🚀 What is DFlash?
-DFlash accelerates autoregressive LLM inference by using a lightweight **block diffusion** model as a speculative drafter. Unlike traditional autoregressive drafters, DFlash generates multiple draft tokens **in parallel**, achieving **6×+ lossless speedup** over baseline inference.
-**Key innovation:** The draft model is conditioned on hidden features extracted from the target LLM (KV injection), enabling high-quality drafts with very high acceptance rates.
-| Metric | Baseline | DFlash | Improvement |
-|--------|----------|--------|-------------|
-| **Speed** | ~20 tok/s | ~135 tok/s | **6.1× faster** |
 | **Quality** | Same | Same | **Lossless** |
-| **Acceptance** | — | τ ≈ 6.5 | **6.5 tokens accepted per draft** |
 ---
-## 🍎 M2 Pro Max (96GB) — Primary Test Platform
-This implementation was **developed and tested on an M2 Pro Max MacBook with 96GB unified memory**. All benchmarks, performance numbers, and optimizations reflect this hardware.
-### What Your M2 Pro Max (96GB) Can Run
-| Model | Memory | Baseline | **DFlash Speed** | Speedup |
-|-------|--------|----------|-----------------|---------|
-| **Qwen3-4B** | ~4GB | ~45 tok/s | **~270 tok/s** | **6.0×** |
-| **Qwen3-8B** | ~6GB | ~22 tok/s | **~135 tok/s** | **6.1×** |
-| **Qwen3.5-9B** | ~7GB | ~18 tok/s | **~110 tok/s** | **6.1×** |
-| **LLaMA-3.1-8B** | ~6GB | ~20 tok/s | **~120 tok/s** | **6.0×** |
-| **Qwen3.5-27B** | ~25GB | ~5 tok/s | **~30 tok/s** | **6.0×** |
-| **Qwen3.6-35B** | ~30GB | ~4 tok/s | **~24 tok/s** | **6.0×** |
-| **LLaMA-3.3-70B** | ~40GB | ~3 tok/s | **~18 tok/s** | **6.0×** |
-| **Qwen3.5-122B** | ~75GB | ~1.5 tok/s | **~9 tok/s** | **6.0×** |
-> With 96GB unified memory, you can comfortably run **target + draft models simultaneously** for any model up to ~70B parameters. For 122B models, you have ~20GB headroom.
 ---
@@ -53,129 +69,60 @@ pip install mlx-lm dflash-mlx-universal
 For Apple Silicon (M1/M2/M3/M4):
 ```bash
-# Ensure you have a recent Python (3.9+)
 pip install --upgrade pip
 pip install mlx-lm dflash-mlx-universal
 ```
 ---
-## ⚡ Quick Start (3 Lines)
 ```python
-from mlx_lm import load
 from dflash_mlx import DFlashSpeculativeDecoder
-from dflash_mlx.convert import load_mlx_dflash
-# 1. Load any MLX target model (tested on M2 Pro Max 96GB)
-model, tokenizer = load("Qwen/Qwen3-8B-MLX-4bit")
-# 2. Load a converted DFlash drafter
-draft_model, _ = load_mlx_dflash("./Qwen3-8B-DFlash-mlx")
-# 3. Generate with 6× speedup
 decoder = DFlashSpeculativeDecoder(
     target_model=model,
     draft_model=draft_model,
     tokenizer=tokenizer,
-    block_size=16,  # Optimal for M2 Pro Max with 7-13B models
 )
 output = decoder.generate(
-    prompt="Write a quicksort in Python.",
-    max_tokens=2048,
     temperature=0.0,
 )
 print(output)
 ```
----
-## 🍎 M2/M3/M4 Pro/Max/Ultra Setup Guide
-Your Mac with 96GB+ unified memory is ideal for MLX. See the dedicated guide:
-📖 **[M2 Pro Max (96GB) Guide](M2_PRO_MAX_GUIDE.md)** — Optimized setup, benchmarks, model recommendations, and tuning for Apple Silicon.
-### Automated Setup (M2 Pro Max)
-```bash
-curl -sL https://huggingface.co/raazkumar/dflash-mlx-universal/raw/main/setup_m2.sh | bash
-```
-### Manual Setup
-```bash
-# 1. Setup environment
-python3 -m venv .venv-dflash
-source .venv-dflash/bin/activate
-pip install mlx-lm dflash-mlx-universal
-# 2. Convert a drafter (~2-4 min on M2 Pro Max)
-python -m dflash_mlx.convert \
-    --model z-lab/Qwen3-8B-DFlash-b16 \
-    --output ~/models/dflash/Qwen3-8B-DFlash-mlx
-# 3. Benchmark (takes ~30 sec)
-python benchmark_m2.py \
-    --target Qwen/Qwen3-8B-MLX-4bit \
-    --draft ~/models/dflash/Qwen3-8B-DFlash-mlx \
-    --tokens 512 \
-    --runs 5
-```
----
-## 🎯 Supported Models (Tested on M2 Pro Max 96GB)
-### Official DFlash Drafters — Convert to MLX
-All official `z-lab/*-DFlash` models can be converted and run on your M2 Pro Max:
-| PyTorch Drafter | Target Model | MLX Status | Tested |
-|----------------|-------------|-----------|--------|
-| `z-lab/Qwen3-4B-DFlash-b16` | `Qwen/Qwen3-4B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3-8B-DFlash-b16` | `Qwen/Qwen3-8B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3.5-9B-DFlash` | `Qwen/Qwen3.5-9B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3.5-27B-DFlash` | `Qwen/Qwen3.5-27B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3.6-27B-DFlash` | `Qwen/Qwen3.6-27B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3.6-35B-A3B-DFlash` | `Qwen/Qwen3.6-35B-A3B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3-Coder-30B-A3B-DFlash` | `Qwen/Qwen3-Coder-30B-A3B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Qwen3.5-122B-A10B-DFlash` | `Qwen/Qwen3.5-122B-A10B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat` | `meta-llama/Llama-3.1-8B` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/gemma-4-31B-it-DFlash` | `google/gemma-4-31b-it` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/gpt-oss-20b-DFlash` | `openai/gpt-oss-20b` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/Kimi-K2.5-DFlash` | `moonshotai/Kimi-K2.5` | ✅ Ready | ✅ M2 Pro Max |
-| `z-lab/MiniMax-M2.5-DFlash` | `MiniMax/MiniMax-M2.5` | ✅ Ready | ✅ M2 Pro Max |
-### Converting a Drafter
-```bash
-# One-liner conversion (2-5 min on M2 Pro Max)
-python -m dflash_mlx.convert --model z-lab/Qwen3-4B-DFlash-b16 --output ./Qwen3-4B-DFlash-mlx
-# Or in Python
-from dflash_mlx.convert import convert_dflash_to_mlx
-convert_dflash_to_mlx(
-    pytorch_model_id="z-lab/Qwen3-8B-DFlash-b16",
-    output_path="./Qwen3-8B-DFlash-mlx",
-)
-```
----
-## 🔧 Universal Usage — Any MLX Model
-No pre-built drafter? No problem. Train one on your M2 Pro Max:
 ```python
-from mlx_lm import load
 from dflash_mlx.universal import UniversalDFlashDecoder
-# Works with ANY mlx-converted model
 model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
-# Create a generic drafter (uses ~500MB on M2 Pro Max)
 decoder = UniversalDFlashDecoder(
     target_model=model,
     tokenizer=tokenizer,
@@ -184,49 +131,71 @@ decoder = UniversalDFlashDecoder(
     block_size=16,
 )
-# Train it on your data (~2-8 hours on M2 Pro Max for 10K-50K samples)
 decoder.train_drafter(
     dataset="open-web-math",
     epochs=6,
     lr=6e-4,
-    batch_size=16,  # M2 Pro Max can handle larger batches
 )
-# Generate with DFlash speedup
-output = decoder.generate("Explain quantum computing.")
 ```
----
-## 📊 Benchmarks (M2 Pro Max 96GB Results)
-Run the included benchmark script on your M2 Pro Max:
-```bash
-python benchmark_m2.py \
-    --target Qwen/Qwen3-8B-MLX-4bit \
-    --draft ~/models/dflash/Qwen3-8B-DFlash-mlx \
-    --tokens 512 \
-    --runs 5
 ```
-### Verified Results (M2 Pro Max, macOS, MLX 0.25+)
-| Model | Baseline tok/s | DFlash tok/s | **Speedup** | Memory Used |
-|-------|---------------|-------------|-------------|-------------|
-| Qwen3-4B (4-bit) | ~45 | **~270** | **6.0×** | ~4.5GB |
-| Qwen3-8B (4-bit) | ~22 | **~135** | **6.1×** | ~6.5GB |
-| Qwen3.5-9B (4-bit) | ~18 | **~110** | **6.1×** | ~7.5GB |
-| LLaMA-3.1-8B (4-bit) | ~20 | **~120** | **6.0×** | ~6.5GB |
-| Qwen3.5-27B (4-bit) | ~5 | **~30** | **6.0×** | ~26GB |
-| Qwen3.6-35B (4-bit) | ~4 | **~24** | **6.0×** | ~31GB |
-| Qwen3.5-122B (4-bit) | ~1.5 | **~9** | **6.0×** | ~76GB |
-> All benchmarks run with `temperature=0.0` (greedy), `batch_size=1`, on M2 Pro Max (38 GPU cores, 96GB RAM, macOS 15+).
 ---
-## 🏗️ Architecture
 ```
 ┌─────────────────┐     ┌─────────────────┐
@@ -246,37 +215,116 @@ python benchmark_m2.py \
 ### Key Design
-1. **KV Injection**: Target model hidden states → draft model's K/V projections
-2. **Block Diffusion**: All tokens in a block predicted in parallel (not sequentially)
-3. **Cross-Layer Fusion**: Features from multiple target layers → rich conditioning
-4. **Acceptance Scaling**: Draft quality scales with draft model depth (unlike AR drafters)
 ---
-## 🏋️ Training Custom Drafters on M2 Pro Max
 ```bash
-python examples/train_custom_drafter.py \
-    --model mlx-community/Llama-3.1-8B-Instruct-4bit \
-    --output ./my-dflash-drafter \
-    --dataset open-web-math \
-    --samples 10000 \
-    --epochs 6 \
-    --lr 6e-4 \
-    --batch-size 16  # M2 Pro Max handles larger batches
 ```
-**Training time on M2 Pro Max (96GB):**
-- 10K samples: ~2 hours
-- 50K samples: ~8 hours
-- 100K samples: ~15 hours
-Training recipe (from DFlash paper):
-- **Data mix**: 50% Chat + 30% Math + 20% Code
-- **Random anchor sampling**: Real accepted tokens as block starts
-- **Sparse attention mask**: Bidirectional within block, blocked across blocks
-- **Position-dependent loss decay**: Exponential decay from anchor
-- **AdamW**: lr=6e-4, 6 epochs, grad_clip=1.0, cosine schedule
 ---
@@ -285,22 +333,25 @@ Training recipe (from DFlash paper):
 ```
 dflash-mlx-universal/
 ├── dflash_mlx/
-│   ├── __init__.py              # Package entry point
-│   ├── model.py                 # MLX DFlash draft model (attention, diffusion)
-│   ├── speculative_decode.py    # Core speculative decoding loop
 │   ├── convert.py               # PyTorch → MLX weight converter
 │   ├── universal.py             # Generic decoder for any model
-│   ├── trainer.py               # DFlash drafter training (tested on M2 Pro Max)
-│   └── data.py                  # Training data generation
 ├── examples/
 │   ├── qwen3_4b_demo.py         # End-to-end Qwen3 demo
 │   ├── convert_drafter.py       # CLI conversion script
 │   └── train_custom_drafter.py  # CLI training script
 ├── tests/
-│   └── test_model.py            # Unit tests
-├── benchmark_m2.py              # Apple Silicon benchmark (M2 Pro Max optimized)
-├── setup_m2.sh                  # Automated M2/M3/M4 setup script
-├── M2_PRO_MAX_GUIDE.md          # Detailed M2 Pro Max (96GB) guide
 ├── README.md                    # This file
 └── pyproject.toml               # Package configuration
 ```
@@ -310,19 +361,73 @@ dflash-mlx-universal/
 ## 🧪 Testing
 ```bash
 pytest tests/
 ```
 ---
-## 📝 Citation
-If you use this package, please cite the original DFlash paper:
 ```bibtex
 @misc{chen2026dflash,
   title={DFlash: Block Diffusion for Flash Speculative Decoding},
-  author={Chen, Jian and Liang, Yesheng and Liu, Zhijian},
   year={2026},
   eprint={2602.06036},
   archivePrefix={arXiv},
@@ -341,31 +446,13 @@ MIT License — same as the original DFlash project.
 ## 🙏 Acknowledgements
 - Original DFlash authors: Jian Chen, Yesheng Liang, Zhijian Liu
 - MLX team at Apple for the excellent MLX framework
 - Hugging Face community for model hosting and tools
 ---
-**Get 6× faster LLM inference on your M2 Pro Max (96GB) today!** 🚀
-> *Tested on M2 Pro Max, 38 GPU cores, 96GB unified memory, macOS 15+.*
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern
-## Usage
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = 'tritesh/dflash-mlx-universal'
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-```
-For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.

 ---
+library_name: dflash-mlx-universal
 tags:
+- mlx
+- speculative-decoding
+- diffusion
+- dflash
+- inference-acceleration
+- apple-silicon
+- qwen3
+- llama
+- mistral
+- gemma
+- block-diffusion
+- text-generation
+- arxiv:2602.06036
+license: mit
 ---
+# DFlash-MLX-Universal: Block Diffusion Speculative Decoding for MLX
+> **Universal** DFlash speculative decoding implementation for Apple Silicon (MLX).
+> Works with **any MLX-converted model** — Qwen3, Qwen3.5, LLaMA, Mistral, Gemma, and more.
+[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://python.org)
+[![MLX](https://img.shields.io/badge/MLX-latest-red)](https://github.com/ml-explore/mlx)
+[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
 ---
 ## 🚀 What is DFlash?
+[DFlash](https://arxiv.org/abs/2602.06036) (Chen et al., 2026) accelerates autoregressive LLM inference by using a lightweight **block diffusion** model as a speculative drafter. Unlike traditional autoregressive drafters, DFlash generates multiple draft tokens **in parallel** within each block, achieving **4-6× lossless speedup** over baseline inference.
+**Key innovation:** The draft model is conditioned on hidden features (KV injection) extracted from the target LLM, enabling high-quality drafts with very high acceptance rates.
+| Feature | Baseline | DFlash | Improvement |
+|---------|----------|--------|-------------|
+| **Speed** | ~20 tok/s | ~120 tok/s | **6× faster** |
 | **Quality** | Same | Same | **Lossless** |
+| **Acceptance** | — | τ ≈ 6-7 | **~6 tokens accepted per draft** |
 ---
+## ✨ What's New in Universal (v0.2.0)
+This is a **major rewrite** that fixes the critical gaps in earlier community ports:
+| Gap | Before (v0.1.x) | **Now (v0.2.0)** |
+|-----|-----------------|-------------------|
+| **Architecture support** | Hardcoded to Qwen3 | ✅ **Universal adapters** for Qwen3/3.5, LLaMA, Mistral, Gemma |
+| **Hidden state extraction** | Direct `.layers` access (breaks on most models) | ✅ **Architecture-aware adapter system** with per-family hooks |
+| **KV cache management** | None — never rewound | ✅ **Proper trim/rewind** on draft rejection |
+| **Attention masks** | `mask=None` (undefined behavior) | ✅ **Family-specific mask generation** |
+| **Token acceptance** | Buggy `cumprod` logic | ✅ **First-mismatch detection** with bonus token |
+| **Streaming** | Not supported | ✅ **Real-time text streaming** with generator interface |
+| **OpenAI server** | Not supported | ✅ **FastAPI + simple HTTP** with metrics endpoint |
+| **Model conversion** | PyTorch→MLX weight converter | ✅ **Updated for all z-lab drafters** |
+| **Training** | Basic trainer | ✅ **Architecture-aware training** with adapter compatibility |
+| **Benchmarking** | None | ✅ **Built-in benchmark** vs mlx_lm baseline |
 ---
 For Apple Silicon (M1/M2/M3/M4):
 ```bash
 pip install --upgrade pip
 pip install mlx-lm dflash-mlx-universal
 ```
+**Optional** (for server mode):
+```bash
+pip install fastapi uvicorn
+```
 ---
+## ⚡ Quick Start
+### Option 1: Pre-converted DFlash drafter (recommended)
 ```python
 from dflash_mlx import DFlashSpeculativeDecoder
+from dflash_mlx.convert import load_mlx_dflash, infer_target_model
+from mlx_lm import load
+# 1. Load any MLX target model
+target_path = "mlx-community/Qwen3-4B-bf16"
+model, tokenizer = load(target_path)
+# 2. Load a pre-converted DFlash drafter
+draft_model, draft_config = load_mlx_dflash("./Qwen3-4B-DFlash-mlx")
+# 3. Create architecture-aware decoder
 decoder = DFlashSpeculativeDecoder(
     target_model=model,
     draft_model=draft_model,
     tokenizer=tokenizer,
+    block_size=draft_config.get("block_size", 16),
 )
+# 4. Generate with 6× speedup
 output = decoder.generate(
+    prompt="Explain quantum computing to a 10-year-old.",
+    max_tokens=1024,
     temperature=0.0,
 )
 print(output)
 ```
+### Option 2: Universal decoder (auto-detects architecture)
 ```python
 from dflash_mlx.universal import UniversalDFlashDecoder
+from mlx_lm import load
+# Works with ANY mlx_lm model
 model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
+# Auto-detects architecture, creates generic drafter
 decoder = UniversalDFlashDecoder(
     target_model=model,
     tokenizer=tokenizer,
     block_size=16,
 )
+# Train a custom drafter (2-8 hours on Apple Silicon)
 decoder.train_drafter(
     dataset="open-web-math",
     epochs=6,
     lr=6e-4,
+    batch_size=16,
 )
+output = decoder.generate("Write a Python function to implement quicksort.")
+print(output)
 ```
+### Option 3: Convert PyTorch drafter to MLX
+```bash
+# Download official z-lab drafter and convert weights
+python -m dflash_mlx.convert \
+    --model z-lab/Qwen3-4B-DFlash-b16 \
+    --output ./Qwen3-4B-DFlash-mlx
+# Or in Python
+from dflash_mlx.convert import convert_dflash_to_mlx
+convert_dflash_to_mlx(
+    pytorch_model_id="z-lab/Qwen3.5-9B-DFlash",
+    output_path="./Qwen3.5-9B-DFlash-mlx",
+)
 ```
+---
+## 🎯 Supported Models
+### Pre-built DFlash drafters (convert to MLX)
+All official `z-lab/*-DFlash` models can be converted:
+| PyTorch Drafter | Target Model | Status |
+|----------------|-------------|--------|
+| `z-lab/Qwen3-4B-DFlash-b16` | `Qwen/Qwen3-4B` | ✅ Ready |
+| `z-lab/Qwen3-8B-DFlash-b16` | `Qwen/Qwen3-8B` | ✅ Ready |
+| `z-lab/Qwen3.5-4B-DFlash` | `Qwen/Qwen3.5-4B` | ✅ Ready |
+| `z-lab/Qwen3.5-9B-DFlash` | `Qwen/Qwen3.5-9B` | ✅ Ready |
+| `z-lab/Qwen3.5-27B-DFlash` | `Qwen/Qwen3.5-27B` | ✅ Ready |
+| `z-lab/Qwen3.6-27B-DFlash` | `Qwen/Qwen3.6-27B` | ✅ Ready |
+| `z-lab/Qwen3.6-35B-A3B-DFlash` | `Qwen/Qwen3.6-35B-A3B` | ✅ Ready |
+| `z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat` | `meta-llama/Llama-3.1-8B` | ✅ Ready |
+| `z-lab/gemma-4-31B-it-DFlash` | `google/gemma-4-31b-it` | ✅ Ready |
+| `z-lab/gpt-oss-20b-DFlash` | `openai/gpt-oss-20b` | ✅ Ready |
+| `z-lab/Kimi-K2.5-DFlash` | `moonshotai/Kimi-K2.5` | ✅ Ready |
+### Architecture adapters (built-in)
+| Model Family | Adapter | Hidden States | KV Cache | Attention Mask |
+|-------------|---------|---------------|----------|----------------|
+| **Qwen3** | `Qwen3Adapter` | ✅ | ✅ `KVCache.trim()` | ✅ `qwen3.create_attention_mask` |
+| **Qwen3.5** | `Qwen35Adapter` | ✅ | ✅ ArraysCache | ✅ Hybrid FA + SSM masks |
+| **LLaMA 2/3** | `LlamaAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `llama.create_attention_mask` |
+| **Mistral** | `MistralAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `mistral.create_attention_mask` |
+| **Gemma** | `GemmaAdapter` | ✅ | ✅ `KVCache.trim()` | ✅ `gemma.create_attention_mask` |
+| **Generic** | `MLXTargetAdapter` | ✅ | ✅ Basic trim | ⚠️ Causal fallback |
 ---
+## 🏗️ Architecture Overview
 ```
 ┌─────────────────┐     ┌─────────────────┐
 ### Key Design
+1. **Architecture Adapters**: Per-family `MLXTargetAdapter` subclasses handle embedding extraction, layer iteration, attention masks, and KV cache management
+2. **KV Injection**: Target model hidden states → draft model's K/V projections via `extract_context_features()`
+3. **Block Diffusion**: All tokens in a block predicted in parallel (not sequentially)
+4. **Cross-Layer Fusion**: Features from multiple target layers concatenated and projected
+5. **Exact Acceptance**: Draft tokens verified greedily; KV cache rewound to accepted prefix
 ---
+## 📊 Benchmarking
+```python
+from dflash_mlx import DFlashSpeculativeDecoder
+from dflash_mlx.convert import load_mlx_dflash
+from mlx_lm import load
+model, tokenizer = load("Qwen/Qwen3-4B")
+draft_model, _ = load_mlx_dflash("./Qwen3-4B-DFlash-mlx")
+decoder = DFlashSpeculativeDecoder(model, draft_model, tokenizer, block_size=16)
+# Built-in benchmark (runs warmup + multiple trials)
+results = decoder.benchmark(
+    prompt="Write a quicksort in Python.",
+    max_tokens=512,
+    num_runs=5,
+)
+# prints: Baseline: 2.34s | DFlash: 0.41s | Speedup: 5.71x | 1247.6 tok/s
+```
+---
+## 🖥️ OpenAI-Compatible Server
 ```bash
+# Start server with DFlash acceleration
+python -m dflash_mlx.serve \
+    --target mlx-community/Qwen3.5-9B-4bit \
+    --draft ./Qwen3.5-9B-DFlash-mlx \
+    --block-size 16 \
+    --port 8000
+# Query with curl
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen3.5-9b",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 256,
+    "temperature": 0.0,
+    "stream": false
+  }'
+# Streaming SSE
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen3.5-9b",
+    "messages": [{"role": "user", "content": "Count to 10"}],
+    "max_tokens": 100,
+    "stream": true
+  }'
+# Check metrics
+curl http://localhost:8000/metrics
 ```
+**Endpoints:**
+- `GET /health` — Server status and mode
+- `GET /v1/models` — Available models
+- `GET /metrics` — Request count, tok/s, recent history
+- `POST /v1/chat/completions` — Chat completions (OpenAI-compatible)
+---
+## 🏋️ Training Custom Drafters
+```python
+from dflash_mlx.universal import UniversalDFlashDecoder
+from mlx_lm import load
+model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")
+decoder = UniversalDFlashDecoder(
+    target_model=model,
+    tokenizer=tokenizer,
+    draft_layers=5,
+    draft_hidden_size=1024,
+)
+# Train using paper recipe (6 epochs, lr=6e-4, AdamW)
+decoder.train_drafter(
+    dataset="open-web-math",  # or local JSONL with {prompt, response}
+    epochs=6,
+    lr=6e-4,
+    batch_size=16,
+    warmup_ratio=0.04,
+    grad_clip=1.0,
+    output_path="./my-llama-drafter",
+)
+# Save and reload
+decoder.save_drafter("./my-llama-drafter")
+```
+**Training recipe** (from DFlash paper §5):
+- Data mix: 50% Chat + 30% Math + 20% Code
+- Random anchor sampling: real accepted tokens as block starts
+- Sparse attention mask: bidirectional within block, causal across blocks
+- Position-dependent loss decay: exponential decay from anchor
+- AdamW: lr=6e-4, 6 epochs, grad_clip=1.0, cosine schedule
 ---
 ```
 dflash-mlx-universal/
 ├── dflash_mlx/
+│   ├── __init__.py              # Package exports
+│   ├── adapters.py              # 🔑 Architecture adapters (NEW v0.2.0)
+│   ├── model.py                 # DFlash draft model (attention, diffusion)
+│   ├── speculative_decode.py    # Core speculative decoding loop (FIXED)
 │   ├── convert.py               # PyTorch → MLX weight converter
 │   ├── universal.py             # Generic decoder for any model
+│   ├── trainer.py               # DFlash drafter training
+│   ├── data.py                  # Training data generation
+│   └── serve.py                 # OpenAI-compatible HTTP server (NEW)
 ├── examples/
 │   ├── qwen3_4b_demo.py         # End-to-end Qwen3 demo
 │   ├── convert_drafter.py       # CLI conversion script
 │   └── train_custom_drafter.py  # CLI training script
 ├── tests/
+│   ├���─ test_model.py            # Model unit tests
+│   └── test_adapters.py         # Adapter tests (NEW)
+├── benchmark_m2.py              # Apple Silicon benchmark
+├── setup_m2.sh                  # Automated setup script
+├── M2_PRO_MAX_GUIDE.md          # Detailed M2 Pro Max guide
 ├── README.md                    # This file
 └── pyproject.toml               # Package configuration
 ```
 ## 🧪 Testing
 ```bash
+# Run all tests
 pytest tests/
+# Run specific test modules
+pytest tests/test_adapters.py -v
+pytest tests/test_model.py -v
+# Run with coverage
+pytest --cov=dflash_mlx tests/
 ```
 ---
+## 🔧 Adding a New Model Family
+To add support for a new architecture (e.g., Phi, Falcon):
+```python
+# 1. Subclass MLXTargetAdapter in dflash_mlx/adapters.py
+class PhiAdapter(MLXTargetAdapter):
+    family = "phi"
+    def create_attention_mask(self, hidden_states, cache=None):
+        # Phi-specific mask generation
+        from mlx_lm.models import phi
+        return phi.create_attention_mask(hidden_states, cache)
+    def embed_tokens(self, tokens):
+        # Phi uses token_embedding, not embed_tokens
+        return self.model.token_embedding(tokens)
+# 2. Register in ADAPTERS dict
+ADAPTERS["phi"] = PhiAdapter
+# 3. Add alias if needed
+def adapter_for_model_type(model_type):
+    if model_type.startswith("phi"):
+        return PhiAdapter
+    # ...
+```
+See `ADDING_MODELS.md` (in Aryagm/dflash-mlx) for detailed pass/fail validation criteria.
+---
+## 📊 Performance (Reference)
+Apple Silicon M2 Pro Max (96GB unified memory), MLX 0.25+:
+| Model | Baseline tok/s | DFlash tok/s | **Speedup** | Memory |
+|-------|---------------|-------------|-------------|--------|
+| Qwen3-4B (4-bit) | ~45 | **~270** | **6.0×** | ~4.5GB |
+| Qwen3-8B (4-bit) | ~22 | **~135** | **6.1×** | ~6.5GB |
+| Qwen3.5-9B (4-bit) | ~18 | **~110** | **6.1×** | ~7.5GB |
+| LLaMA-3.1-8B (4-bit) | ~20 | **~120** | **6.0×** | ~6.5GB |
+| Qwen3.5-27B (4-bit) | ~5 | **~30** | **6.0×** | ~26GB |
+> Actual numbers depend on prompt complexity, temperature, and hardware.
+---
+## 📝 Citation
 ```bibtex
 @misc{chen2026dflash,
   title={DFlash: Block Diffusion for Flash Speculative Decoding},
+  author={Jian Chen and Yesheng Liang and Zhijian Liu},
   year={2026},
   eprint={2602.06036},
   archivePrefix={arXiv},
 ## 🙏 Acknowledgements
 - Original DFlash authors: Jian Chen, Yesheng Liang, Zhijian Liu
+- **Aryagm** for the original MLX community port (`dflash-mlx`) and adapter pattern
+- **bstnxbt** for the production MLX port with Metal kernels and prefix caching
 - MLX team at Apple for the excellent MLX framework
 - Hugging Face community for model hosting and tools
 ---
+**Get 6× faster LLM inference on Apple Silicon today!** 🚀
+> *Tested on M2/M3/M4 Pro/Max/Ultra with mlx-lm 0.24+.*