Carmenest committed on
Commit e00b07f · verified · 1 Parent(s): 0092801

Update model card with B=256 real-prompt benchmarks

Files changed (1)
  1. README.md +24 -15
README.md CHANGED
@@ -18,7 +18,7 @@ GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GS
 
 LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
 
-> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119814) -- C. Esteban, 2026
+> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119813) -- C. Esteban, 2026
 
 ## Available Quantizations
 
@@ -28,21 +28,26 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
 | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
 | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |
 
-## Benchmark (AMD EPYC 4465P 12-Core, 64 tokens, steps=16, threads=12)
+## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
 
 ### Real Prompt Performance (Q4_K_M + entropy_exit)
 
-| Prompt type | tok/s | Steps used | Speedup |
-|---|---|---|---|
-| Factual ("Capital of France?") | **9.22** | 4 | 3.9x |
-| Translation ("Translate to French") | **10.23** | 3 | 4.6x |
-| Arithmetic ("15 x 23?") | **11.49** | 3 | 5.5x |
-| Code (is_prime function) | **2.53** | 15 | 1.1x |
-| Creative (poem, explanation) | 2.33 | 17 | 1.0x |
+| Prompt | B=64 tok/s | B=256 tok/s | Steps | vs llama.cpp |
+|---|---|---|---|---|
+| Capital of France? | 9.22 | **15.60** | 4 | 1.8x |
+| Translate to French | 10.23 | **21.78** | 3 | 2.6x |
+| 15 × 23? | 11.49 | **11.45** | 5 | 1.3x |
+| Translate to Spanish | 4.59 | **7.17** | 8 | 0.8x |
+| Python is_prime() | 2.53 | **3.12** | 17 | 0.4x |
+| Poem about ocean | 2.33 | **3.10** | 17 | 0.4x |
+| Why is sky blue? | 2.21 | **3.18** | 17 | 0.4x |
+| List the planets | 2.33 | **3.19** | 17 | 0.4x |
 
-entropy_exit adapts to prompt difficulty: 3-4 steps for easy, 16 for hard. Never slower than baseline.
+*B = generation buffer size (tokens generated per call). llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware).*
 
-### Quantization Comparison (low_confidence baseline)
+entropy_exit adapts to prompt difficulty: 3–4 steps for easy prompts, up to 17 for hard ones. It is never slower than the fixed-step low_confidence baseline.
+
+### Quantization Comparison (low_confidence baseline, B=64)
 
 | Model | Size | tok/s | vs F16 |
 |-------|------|-------|--------|
@@ -52,10 +57,10 @@ entropy_exit adapts to prompt difficulty: 3-4 steps for easy, 16 for hard. Never
 
 ### Summary
 
-- **~10 tok/s on easy real prompts** (Q4_K_M + entropy_exit)
-- **~6x faster than F16 baseline** on factual/translation tasks
+- **11–22 tok/s on easy real prompts** (Q4_K_M + entropy_exit, B=256)
+- **Up to 2.6x faster than llama.cpp** on the same hardware
+- **256-token generation** with 20% lower per-token cost vs 64-token batches
 - **7.5x thread scaling** from 1 to 12 threads
-- **40+ tok/s peak** on synthetic benchmarks (single forward pass)
 
 Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
 
@@ -68,5 +73,9 @@ cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 
 # Generate with entropy_exit (recommended)
-python tools/generate.py --model-dir /path/to/LLaDA-8B-Instruct --gguf llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12 --remasking entropy_exit
+python tools/generate.py \
+  --model-dir /path/to/LLaDA-8B-Instruct \
+  --gguf llada-8b-q4km.gguf \
+  -p "What is the capital of France?" \
+  -n 256 -s 16 -t 12 --remasking entropy_exit
 ```
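
The updated card leans on the entropy_exit idea: confident prompts finish in a few denoising steps, hard ones run to the step cap. For readers new to diffusion LMs, here is a minimal, self-contained sketch of that exit criterion. Everything in it is illustrative — `step_distributions`, the `sharpening` knob, and the 0.3 threshold are toy stand-ins, not the actual diffuse-cpp implementation, and a real decoder would also choose *which* masked positions to commit at each step:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def step_distributions(step, n_positions, vocab_size, sharpening):
    """Toy stand-in for one forward pass of the denoiser: every masked
    position gets the same distribution, and confidence grows each step
    at a prompt-dependent rate ('sharpening')."""
    peak = min(0.99, 1.0 / vocab_size + sharpening * step)
    rest = (1.0 - peak) / (vocab_size - 1)
    return [[peak] + [rest] * (vocab_size - 1)] * n_positions

def generate(n_positions=64, vocab_size=32, max_steps=16,
             exit_entropy=0.3, sharpening=0.2):
    """Run up to max_steps denoising passes; commit everything and stop
    early once the mean entropy over masked positions falls below the
    exit threshold. Returns the number of steps actually used."""
    for step in range(1, max_steps + 1):
        dists = step_distributions(step, n_positions, vocab_size, sharpening)
        mean_h = sum(entropy(d) for d in dists) / n_positions
        if mean_h < exit_entropy:
            return step  # model is confident everywhere: exit early
    return max_steps

easy_steps = generate(sharpening=0.25)  # sharpens quickly, like a factual prompt
hard_steps = generate(sharpening=0.02)  # stays uncertain, like open-ended prose
```

With these toy numbers the "easy" run exits after 4 steps while the "hard" run uses all 16 — the same qualitative pattern as the step counts in the benchmark table, where per-prompt cost scales with steps used.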