Carmenest committed on
Commit e00b07f · verified · 1 Parent(s): 0092801

Update model card with B=256 real-prompt benchmarks

Files changed (1)
  1. README.md +24 -15
README.md CHANGED
@@ -18,7 +18,7 @@ GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GS
 
 LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
 
-> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119814) -- C. Esteban, 2026
+> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119813) -- C. Esteban, 2026
 
 ## Available Quantizations
 
@@ -28,21 +28,26 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
 | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
 | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |
 
-## Benchmark (AMD EPYC 4465P 12-Core, 64 tokens, steps=16, threads=12)
+## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
 
 ### Real Prompt Performance (Q4_K_M + entropy_exit)
 
-| Prompt type | tok/s | Steps used | Speedup |
-|---|---|---|---|
-| Factual ("Capital of France?") | **9.22** | 4 | 3.9x |
-| Translation ("Translate to French") | **10.23** | 3 | 4.6x |
-| Arithmetic ("15 x 23?") | **11.49** | 3 | 5.5x |
-| Code (is_prime function) | **2.53** | 15 | 1.1x |
-| Creative (poem, explanation) | 2.33 | 17 | 1.0x |
+| Prompt | B=64 tok/s | B=256 tok/s | Steps | vs llama.cpp |
+|---|---|---|---|---|
+| Capital of France? | 9.22 | **15.60** | 4 | 1.8x |
+| Translate to French | 10.23 | **21.78** | 3 | 2.6x |
+| 15 × 23? | 11.49 | **11.45** | 5 | 1.3x |
+| Translate to Spanish | 4.59 | **7.17** | 8 | 0.8x |
+| Python is_prime() | 2.53 | **3.12** | 17 | 0.4x |
+| Poem about ocean | 2.33 | **3.10** | 17 | 0.4x |
+| Why is sky blue? | 2.21 | **3.18** | 17 | 0.4x |
+| List the planets | 2.33 | **3.19** | 17 | 0.4x |
 
-entropy_exit adapts to prompt difficulty: 3-4 steps for easy, 16 for hard. Never slower than baseline.
+*B = generation buffer size (tokens generated per call). llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware).*
 
-### Quantization Comparison (low_confidence baseline)
+entropy_exit adapts to prompt difficulty: 3–4 steps for easy prompts, up to 17 for hard ones. It is never slower than the fixed-step low_confidence baseline.
+
+### Quantization Comparison (low_confidence baseline, B=64)
 
 | Model | Size | tok/s | vs F16 |
 |-------|------|-------|--------|
@@ -52,10 +57,10 @@ entropy_exit adapts to prompt difficulty: 3-4 steps for easy, 16 for hard. Never
 
 ### Summary
 
-- **~10 tok/s on easy real prompts** (Q4_K_M + entropy_exit)
-- **~6x faster than F16 baseline** on factual/translation tasks
+- **11–22 tok/s on easy real prompts** (Q4_K_M + entropy_exit, B=256)
+- **Up to 2.6x faster than llama.cpp** on the same hardware
+- **256-token generation** with 20% lower per-token cost vs 64-token batches
 - **7.5x thread scaling** from 1 to 12 threads
-- **40+ tok/s peak** on synthetic benchmarks (single forward pass)
 
 Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
 
@@ -68,5 +73,9 @@ cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 
 # Generate with entropy_exit (recommended)
-python tools/generate.py --model-dir /path/to/LLaDA-8B-Instruct --gguf llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12 --remasking entropy_exit
+python tools/generate.py \
+  --model-dir /path/to/LLaDA-8B-Instruct \
+  --gguf llada-8b-q4km.gguf \
+  -p "What is the capital of France?" \
+  -n 256 -s 16 -t 12 --remasking entropy_exit
 ```
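
The updated card leans on the entropy_exit idea: confident prompts finish in a few denoising steps, hard ones run to the step cap. For readers new to diffusion LMs, here is a minimal, self-contained sketch of that exit criterion. Everything in it is illustrative — `step_distributions`, the `sharpening` knob, and the 0.3 threshold are toy stand-ins, not the actual diffuse-cpp implementation, and a real decoder would also choose *which* masked positions to commit at each step:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def step_distributions(step, n_positions, vocab_size, sharpening):
    """Toy stand-in for one forward pass of the denoiser: every masked
    position gets the same distribution, and confidence grows each step
    at a prompt-dependent rate ('sharpening')."""
    peak = min(0.99, 1.0 / vocab_size + sharpening * step)
    rest = (1.0 - peak) / (vocab_size - 1)
    return [[peak] + [rest] * (vocab_size - 1)] * n_positions

def generate(n_positions=64, vocab_size=32, max_steps=16,
             exit_entropy=0.3, sharpening=0.2):
    """Run up to max_steps denoising passes; commit everything and stop
    early once the mean entropy over masked positions falls below the
    exit threshold. Returns the number of steps actually used."""
    for step in range(1, max_steps + 1):
        dists = step_distributions(step, n_positions, vocab_size, sharpening)
        mean_h = sum(entropy(d) for d in dists) / n_positions
        if mean_h < exit_entropy:
            return step  # model is confident everywhere: exit early
    return max_steps

easy_steps = generate(sharpening=0.25)  # sharpens quickly, like a factual prompt
hard_steps = generate(sharpening=0.02)  # stays uncertain, like open-ended prose
```

With these toy numbers the "easy" run exits after 4 steps while the "hard" run uses all 16 — the same qualitative pattern as the step counts in the benchmark table, where per-prompt cost scales with steps used.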