Commit bdefa11 (parent: 17d0a5b)
add vllm inference benchmark report and some new rules to agents.md

Files changed:
- AGENTS.md +2 -0
- app/src/content/chapters/infrastructure.mdx +215 -2
AGENTS.md
CHANGED

```diff
@@ -25,6 +25,8 @@ Use these blog posts as inspiration for writing style:
 - **No em-dashes (—)**: Use parentheses, commas, or separate sentences instead
 - **En-dashes (–)** for ranges only: "2020–2024", "pages 10–15"
 - **Minimal semicolons**: Prefer two sentences over one with a semicolon
+- **Angle brackets in MDX**: Use `{'<'}` and `{'>'}` to escape literal `<` and `>` characters in prose, since MDX interprets bare `<` as the start of a JSX tag
+- **Curly braces in MDX**: Avoid bare `{` and `}` in prose since MDX interprets them as JSX expressions. Remove them or replace with parentheses/other punctuation

 ## Formatting
```
app/src/content/chapters/infrastructure.mdx
CHANGED

```python
## Final card generation (runs after inference completes)
datacard_pipeline = [InferenceDatasetCardGenerator(params=params)]
```

### Scaling Throughput from 100M to 100B+ parameters
For synthetic data generation, we may run language model inference for millions of GPU hours. Finding a configuration that maximizes throughput is critical, as it could accelerate generation by days and save thousands of dollars. In this section, we describe our experiments to identify optimal parameters for a selection of popular models.

# vLLM Inference Benchmark

The entire benchmarking code (experiment launcher, analysis scripts, and sample configs) is available as a DataTrove example: [github.com/huggingface/datatrove/tree/main/examples/inference/benchmark](https://github.com/huggingface/datatrove/tree/main/examples/inference/benchmark).

## 1. Benchmarking Setup
We benchmarked **18 models** spanning 4 size categories (tiny to large) on **H100 GPUs** (8 GPUs per node) using vLLM as the inference engine. The goal was to find the optimal serving configuration for each model to maximize output tokens per second per GPU.

### Task

All models were evaluated on the same task: rewriting documents from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (sample-10BT split) as step-by-step tutorials. Each run processed up to 10,000 examples with:

- **Model max context**: 8,192 tokens
- **Max output tokens**: 4,096 tokens
- **Temperature**: 0.0 (deterministic, seed=42 for reproducibility)

Since all runs use temperature 0.0 and a fixed seed, the variance across runs is negligible. We therefore report single-run throughput numbers without confidence intervals.

### Models

| Category | Models |
|----------|--------|
| 🟣 **Tiny** ({'<'}1B) | SmolLM2-135M-Instruct, SmolLM2-360M-Instruct, gemma-3-270m-it, Qwen3-0.6B |
| 🟦 **Small** (1B–10B) | SmolLM2-1.7B-Instruct, gemma-3-1b-it, gemma-3-4b-it, Qwen3-1.7B, Qwen3-4B, Qwen3-8B |
| 🟨 **Medium** (10B–100B) | gemma-3-12b-it, gemma-3-27b-it, Qwen3-14B, Qwen3-32B, Qwen3-30B-A3B, Qwen3-Next-80B-A3B, gpt-oss-20b |
| 🟥 **Large** (100B–500B) | gpt-oss-120b, Qwen3-235B-A22B |

The lineup includes both 🧱 dense transformers and 🔀 Mixture-of-Experts (MoE) architectures.
### Infrastructure

- **Hardware**: NVIDIA H100 80GB GPUs, 8 per node
- **Inference engine**: vLLM with automatic prefix caching enabled and the flash_attn attention backend
- **Job orchestration**: DataTrove's `SlurmPipelineExecutor` via `launch_experiments.py`

## 2. Design Choices

### Tiered Optimization

We adopted a **two-tier sequential optimization** approach. The second tier builds on the best configuration found in the previous tier:

| Tier | Parameters Swept | Goal |
|------|-----------------|------|
| **Tier 0** | `tp` (tensor parallelism), `mns` (max-num-seqs), `mnbt` (max-num-batched-tokens) | Find optimal parallelism and batching configuration |
| **Tier 1** | `gmu` (gpu-memory-utilization), `spec` (speculative decoding method) | Achieve lossless speedup through speculation and memory tuning |

We call it "tier 0" because these parameters are prerequisites: for larger models, getting `tp` right is not an optimization but a necessity -- without sufficient tensor parallelism the model either doesn't fit in memory or leaves almost no room for the KV cache. In earlier exploratory experiments, we found that `tp`, `mns`, and `mnbt` have by far the largest impact on throughput, which is why they form tier 0.

**Tier 0** determines how many GPUs the model needs and how many sequences can be processed in parallel. The sweep covers:
- **tp**: 1, 2, 4 (plus 8 for large models) -- tensor parallelism across GPUs
- **mns**: 256, 512, 1024, 2048, 4096 -- maximum concurrent sequences
- **mnbt**: 8192, 16384, 32768 -- maximum tokens per forward pass

**Tier 1** uses the best tp/mns/mnbt from tier 0 and additionally sweeps:
- **gmu**: 0.9, 0.95 -- fraction of GPU memory vLLM may use overall (model weights, activations, and KV cache); raising it leaves more room for the KV cache
- **spec**: none, ngram-6, ngram-8, suffix-32 -- speculative decoding methods

This tiered approach reduces the search space dramatically. A full Cartesian product of all parameters would require ~600 configurations per model; the tiered approach needs only ~15 + 8 = ~23 per model.
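The sweep counts can be reproduced in a few lines. A minimal sketch, using the parameter lists above for a model that only runs at tp=1:

```python
from itertools import product

# Tier 0 sweep: parallelism and batching (tp fixed at 1 for a tiny model)
tp = [1]
mns = [256, 512, 1024, 2048, 4096]   # max-num-seqs
mnbt = [8192, 16384, 32768]          # max-num-batched-tokens
tier0 = list(product(tp, mns, mnbt))

# Tier 1 sweep: runs on top of the single best tier-0 configuration
gmu = [0.9, 0.95]                    # gpu-memory-utilization
spec = ["none", "ngram-6", "ngram-8", "suffix-32"]
tier1 = list(product(gmu, spec))

print(len(tier0), len(tier1), len(tier0) + len(tier1))  # 15 8 23
```

Models swept over multiple tp values repeat the 15 tier-0 combinations once per tp option.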
### Timeout Strategy

All jobs were given a **~2 hour** SLURM time limit (`1:59:00`). This is deliberately aggressive -- configurations that cannot complete 10,000 examples within 2 hours are not competitive. Bad configurations fail fast via OOM or timeout, and we simply skip them. This lets us cast a wide net without wasting cluster time on hopeless configurations.

Failure modes are automatically classified:
- **OOM**: Out-of-memory during model loading
- **timeout**: SLURM time limit exceeded (configuration too slow)
- **server_fail**: vLLM server failed to start (e.g., engine core initialization failure, insufficient GPU memory for the model at the given tp)
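Such a classifier can be sketched in a few lines. This is a hypothetical helper, not the benchmark's actual code; it keys on the markers PyTorch and SLURM typically emit:

```python
def classify_failure(log_text: str) -> str:
    """Roughly bucket a failed run by scanning its job log (illustrative sketch)."""
    if "CUDA out of memory" in log_text:   # PyTorch allocator failure
        return "OOM"
    if "DUE TO TIME LIMIT" in log_text:    # SLURM's cancellation message
        return "timeout"
    return "server_fail"                   # anything else: the server never came up

print(classify_failure("slurmstepd: *** JOB 42 CANCELLED DUE TO TIME LIMIT ***"))  # timeout
```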
## 3. Scale of the Experiment

The benchmark config defines **801 unique configurations** across 8 experiment groups: roughly 15 tier-0 configurations per tp option and 8 tier-1 configurations per model. (The groups below sum to 819; in each tier-1 grid, the gmu=0.9, no-speculation cell coincides with that model's tier-0 best, which accounts for the 18-configuration difference.)

| Experiment | Configs | Description |
|-----------|---------|-------------|
| tier0-tiny | 60 | 4 models × tp=1 × 5 mns × 3 mnbt |
| tier0-small | 180 | 6 models × tp=1,2 × 5 mns × 3 mnbt |
| tier0-medium | 315 | 7 models × tp=1,2,4 × 5 mns × 3 mnbt |
| tier0-large | 120 | 2 models × tp=1,2,4,8 × 5 mns × 3 mnbt |
| tier1-tiny | 32 | 4 models × 2 gmu × 4 spec |
| tier1-small | 48 | 6 models × 2 gmu × 4 spec |
| tier1-medium | 56 | 7 models × 2 gmu × 4 spec |
| tier1-large | 8 | 1 model × 2 gmu × 4 spec |
## 4. Results

### Optimization Summary

The table below shows the progression from baseline (vLLM defaults) through tier 0 and tier 1 optimization:

| Model | Base tps/gpu | Tier0 tps/gpu | T0 Speedup | Tier1 tps/gpu | T1 Speedup | Best tps/gpu | Speedup |
|-------|-------------|--------------|------------|--------------|------------|-------------|---------|
| 🔀 gpt_oss_120b | 3,138 | 6,117 | **1.95x** | 5,450 | 1.74x | 6,117 | **1.95x** |
| 🔀 Qwen3_30B_A3B | 2,977 | 5,310 | **1.78x** | 5,064 | 1.70x | 5,310 | **1.78x** |
| 🧱 SmolLM2_1.7B | 5,255 | 5,437 | 1.03x | 9,220 | **1.75x** | 9,220 | **1.75x** |
| 🧱 SmolLM2_135M | 28,391 | 31,186 | 1.10x | 45,540 | **1.60x** | 45,540 | **1.60x** |
| 🧱 SmolLM2_360M | 17,887 | 18,844 | 1.05x | 23,996 | **1.34x** | 23,996 | **1.34x** |
| 🔀 Qwen3_Next_80B_A3B | 2,034 | 2,678 | **1.32x** | 2,481 | 1.22x | 2,678 | **1.32x** |
| 🔀 gpt_oss_20b | 12,432 | 14,671 | **1.18x** | 13,004 | 1.05x | 14,671 | **1.18x** |
| 🧱 gemma_3_1b_it | 14,838 | 16,762 | 1.13x | 13,832 | 0.93x | 16,762 | 1.13x |
| 🧱 gemma_3_4b_it | 8,501 | 9,253 | 1.09x | 8,361 | 0.98x | 9,253 | 1.09x |
| 🧱 Qwen3_1.7B | 11,710 | 12,313 | 1.05x | 11,262 | 0.96x | 12,313 | 1.05x |
| 🧱 Qwen3_32B | 1,987 | 2,072 | 1.04x | 2,078 | 1.05x | 2,078 | 1.05x |
| 🧱 Qwen3_0.6B | 13,527 | 14,069 | 1.04x | 12,330 | 0.91x | 14,069 | 1.04x |
| 🧱 Qwen3_14B | 4,414 | 4,549 | 1.03x | 4,158 | 0.94x | 4,549 | 1.03x |
| 🧱 gemma_3_270m_it | 22,996 | 23,585 | 1.03x | 21,030 | 0.91x | 23,585 | 1.03x |
| 🧱 Qwen3_4B | 7,919 | 8,086 | 1.02x | 7,751 | 0.98x | 8,086 | 1.02x |
| 🧱 gemma_3_12b_it | 2,999 | 2,999 | 1.00x | 3,046 | 1.02x | 3,046 | 1.02x |
| 🧱 Qwen3_8B | 6,338 | 6,338 | 1.00x | 6,443 | 1.02x | 6,443 | 1.02x |
| 🧱 gemma_3_27b_it | 1,724 | 1,724 | 1.00x | 1,671 | 0.97x | 1,724 | 1.00x |
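The Speedup column is simply best throughput divided by baseline throughput. A spot check with three rows copied from the table above:

```python
# (base tps/gpu, best tps/gpu) for three rows of the summary table
rows = {
    "gpt_oss_120b": (3138, 6117),
    "SmolLM2_1.7B": (5255, 9220),
    "SmolLM2_135M": (28391, 45540),
}
for name, (base, best) in rows.items():
    print(f"{name}: {best / base:.2f}x")  # 1.95x, 1.75x, 1.60x
```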
### Key Findings

1. **Tier 0 (parallelism/batching) delivers the biggest wins for large/MoE models.** gpt_oss_120b gained 1.95x and Qwen3_30B_A3B gained 1.78x purely from finding the right tp and batch sizes.

2. **Tier 1 (speculative decoding) delivers the biggest wins for small models.** SmolLM2 models gained 1.34x-1.75x from speculative decoding, with the best methods being suffix-32 (SmolLM2-1.7B) and ngram-6 (SmolLM2-135M, SmolLM2-360M).

3. **Tier 1 often hurts performance for models that are already well-tuned.** For 8 out of 18 models, the tier 1 "best" was worse than the tier 0 best. This is because speculative decoding adds overhead that doesn't pay off when the model is already compute-saturated.

4. **Many models are near-optimal with defaults.** gemma_3_27b_it, gemma_3_12b_it, and Qwen3_8B saw essentially no improvement (0-2%), suggesting vLLM's defaults are well-chosen for these model sizes.

## 5. Why Do Some Models See Larger Improvements?

### Background: Memory-Bound vs Compute-Bound Inference

LLM inference has two phases: **prefill** (processing the input prompt in parallel) and **decode** (generating tokens one at a time, reusing the cached KV states). Prefill is typically **compute-bound** -- a single long prompt can saturate the GPU's arithmetic units. Decode is typically **memory-bandwidth-bound** -- each step requires reading the full model weights and KV cache from HBM but produces only one token, leaving the GPU's compute units underutilized ([Prefill Decode, 2025](https://prefilldecode.com/); [Qin et al., 2025](https://arxiv.org/abs/2512.22066)).

**Memory-bound** decode is the typical regime for large models or long sequences: the GPU spends most of its time waiting for data transfers from HBM rather than computing. Increasing tensor parallelism (tp) helps because it splits the model across GPUs, reducing per-GPU memory pressure and freeing space for a larger KV cache, which enables higher batch sizes and better throughput. The [Ultrascale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) provides a thorough treatment of how memory, compute, and communication trade off during distributed training and inference.

**Compute-bound** decode occurs for small models at high batch sizes: the model fits easily in memory, but each forward pass still takes a fixed amount of compute per token. Speculative decoding helps in this regime by generating multiple tokens per verification step, effectively amortizing the per-token compute cost ([Leviathan et al., 2023](https://arxiv.org/abs/2211.17192)).

### Background: When Does Speculative Decoding Help?

Speculative decoding ([Leviathan et al., 2023](https://arxiv.org/abs/2211.17192)) works by generating draft tokens cheaply and then verifying them in a single batched forward pass of the target model. The speedup depends on the **draft acceptance rate** -- how often the draft tokens match what the target model would have generated. When acceptance is high, multiple tokens are produced per forward pass, amortizing the cost. When acceptance is low, the overhead of drafting and verification can make inference *slower* than standard decoding ([vLLM Blog, 2024](https://vllm-project.github.io/2024/10/17/spec-decode.html)).
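The classical analysis makes this concrete. Under the i.i.d. approximation in Leviathan et al., with per-token acceptance rate alpha and draft length gamma, each verification step emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation. Real acceptance is not i.i.d., so this is only a back-of-the-envelope sketch:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens per verification step (Leviathan et al., 2023):
    each of the gamma draft tokens is accepted with probability alpha,
    plus one token from the target model's own forward pass."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# High vs low acceptance with a 6-token draft
print(round(expected_tokens_per_step(0.80, 6), 2))  # 3.95
print(round(expected_tokens_per_step(0.25, 6), 2))  # 1.33
```

At 25% acceptance, barely more than one token comes out per step, so the drafting and verification overhead dominates.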
We benchmarked two **model-free** speculative decoding methods (i.e., no additional draft model weights required):

- **N-gram speculation** ([Prompt Lookup Decoding](https://github.com/apoorvumang/prompt-lookup-decoding)) builds an n-gram lookup table from the prompt and matches the most recent generated tokens against it to propose continuations. It works best when the output closely mirrors the input text (e.g., extraction, summarization, or rephrasing tasks where phrases are reused verbatim). In vLLM, the `num_speculative_tokens` parameter controls how many tokens are proposed per step ([vLLM Docs](https://docs.vllm.ai/en/latest/features/spec_decode.html)).
- **Suffix speculation** ([SuffixDecoding; Qiao et al., 2024](https://arxiv.org/abs/2411.04975)) maintains a suffix tree over the prompt and previous generations to identify repeating token sequences. Unlike n-gram, it uses frequency statistics to propose the most likely continuations and speculates an **adaptive** number of tokens per step (up to `num_speculative_tokens`, default 32). It was designed for agentic workloads with repetitive patterns and achieves up to 5.3x speedup on such tasks ([Snowflake Engineering Blog](https://www.snowflake.com/content/snowflake-site/global/en/engineering-blog/suffixdecoding-arctic-inference-vllm)).
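For reference, the ngram-6 setting corresponds to a `speculative_config` passed to vLLM at engine startup. A hypothetical sketch (key names follow vLLM's speculative decoding docs; the benchmark's actual launcher may differ):

```python
# Hypothetical sketch of the ngram-6 setting as a vLLM speculative_config
speculative_config = {
    "method": "ngram",             # model-free prompt-lookup drafting
    "num_speculative_tokens": 6,   # draft tokens proposed per verification step
    "prompt_lookup_max": 6,        # longest n-gram matched against prior context
}

# It would be passed when constructing the engine, roughly:
#   LLM(model=..., speculative_config=speculative_config)
```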
Speculative decoding adds overhead: the verification step has a compute cost, and in vLLM both model-free methods currently disable asynchronous scheduling. At high QPS (queries per second), the vLLM team has measured up to 1.4-1.8x *slowdowns* from speculative decoding because the extra compute competes with already-saturated GPU resources ([vLLM Blog, 2024](https://vllm-project.github.io/2024/10/17/spec-decode.html)). This is why we observe tier 1 *hurting* performance for many models that are already well-tuned.

### Models with Large Speedups

#### gpt_oss_120b and Qwen3_30B_A3B (1.95x and 1.78x via tp=2)

Both are MoE models that are severely **memory-bound at tp=1**. gpt_oss_120b (120B total, ~12B active) fits on a single GPU but leaves almost no room for the KV cache: server logs show only ~45,520 tokens of KV capacity at tp=1 (5.5x max concurrency) vs ~810,000 tokens at tp=2 (98.8x max concurrency). Moving to tp=2 halves per-GPU model memory, and because nearly all of the freed memory goes to the KV cache, token capacity grows almost 18x in these logs, allowing the scheduler to batch far more sequences. The same pattern holds for Qwen3_30B_A3B (30B total, ~3B active). For these large MoE models, tp>1 is critical not for compute parallelism but for **KV cache headroom** -- the compute overhead of cross-GPU communication is minimal because only the active parameters participate in each forward pass.
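The quoted concurrency figures follow directly from KV capacity divided by the 8,192-token max context. A quick check with the numbers above:

```python
MAX_MODEL_LEN = 8192  # max context used in all benchmark runs

# KV cache capacity in tokens, from the gpt_oss_120b server logs quoted above
for tp, kv_tokens in [(1, 45_520), (2, 810_000)]:
    concurrency = kv_tokens / MAX_MODEL_LEN
    print(f"tp={tp}: {concurrency:.2f}x max concurrency")  # 5.56x, then 98.88x
```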
#### SmolLM2 models (1.34x-1.75x via speculative decoding)

**Root cause: Small models are compute-bound, and speculation amortizes the cost.** These models are tiny enough that a single GPU has abundant memory. The bottleneck is the sequential nature of autoregressive decoding. Speculative decoding generates multiple tokens per verification step:

- **SmolLM2-135M with ngram-6**: Server logs show 72-84% draft acceptance rate with mean acceptance length of 5.3-6.0 tokens. This means each verification step produces ~5-6 tokens instead of 1.
- **SmolLM2-1.7B with suffix-32**: 48-53% acceptance rate with mean acceptance length of 2.5-3.1 tokens.

Interestingly, **ngram works better for the 135M model but suffix wins for the 1.7B model**. The 135M model produces more repetitive, template-like text that closely mirrors input phrasing, giving n-gram matching high acceptance rates (72-84%). The 1.7B model generates more diverse, paraphrased output where n-gram acceptance drops to 63-66%. Despite suffix-32 having a lower per-token acceptance rate (~48%), it speculates 32 tokens per step and verifies them in a single large batch, which is more GPU-efficient than n-gram's smaller 6-8 token batches. The net effect is that suffix-32 achieves ~9.2k tps vs ngram-6's ~8.3k tps for the 1.7B model.

**Contrast with models where speculation hurts.** The server logs reveal a stark difference in draft acceptance rates between models that benefit from speculation and those that don't (all using ngram-6):

| Model | Avg Acceptance Rate | Mean Acceptance Length | Throughput Impact |
|-------|--------------------|------------------------|-------------------|
| SmolLM2-135M | 72-84% | 5.3-6.0 | **+60%** |
| SmolLM2-1.7B | 64-68% | 4.9-5.1 | **+58%** (ngram-6) |
| gemma_3_270m_it | 63-83% | 4.8-6.0 | -2% |
| Qwen3_14B | 23-50% | 2.4-4.0 | -16% |
| gemma_3_12b_it | 20-24% | 2.2-2.4 | -8% |
| gemma_3_27b_it | 19-26% | 2.1-2.6 | -11% |
| gpt_oss_120b | 20-31% | 2.2-2.9 | -16% |

The small SmolLM2 models achieve 64-84% acceptance rates with 5-6 tokens accepted per step, making speculation highly profitable. The medium/large models (Qwen3_14B, gemma_3_12b/27b, gpt_oss_120b) only achieve 20-30% acceptance with ~2.3 tokens per step -- barely better than no speculation. A likely explanation is that larger models generate more diverse, paraphrased text that diverges further from the input prompt, giving n-gram matching fewer opportunities for exact phrase reuse. At these low acceptance rates, the overhead of drafting and verifying rejected tokens outweighs the benefit.
**gemma_3_270m_it is the most instructive outlier**: it achieves high acceptance rates (63-83%) comparable to SmolLM2-135M, yet speculation still hurts throughput by 2%. Despite being similarly small (~270M vs ~135M parameters), two key architectural differences explain the discrepancy:

1. **Vocabulary size**: Gemma 3's vocabulary is **256k tokens** vs SmolLM2's **49k tokens** -- a 5.2x difference. The rejection sampler in vLLM's speculative decoding calls `logits.sort(dim=-1)` over the full vocabulary during verification, so each verification step does 5.2x more work for Gemma 3, making the per-step overhead much higher. In fact, 170M of Gemma 3's 270M parameters (63%) are just embeddings, so the model's "effective" transformer is much smaller than SmolLM2's.
2. **Concurrency**: SmolLM2-135M's best tier1 config uses mns=512 (~500 concurrent sequences, 43% KV cache utilization), while gemma_3_270m uses mns=256 (~250 sequences, only 5% KV cache utilization). SmolLM2 has more concurrent work in flight, which gives the GPU more opportunities to overlap verification compute with ongoing decode work.

The net effect: Gemma 3's large vocabulary makes each speculative verification step disproportionately expensive, and the overhead isn't compensated by enough throughput gain from the accepted tokens.

The tutorial-rewriting task is particularly amenable to speculative decoding because the output frequently contains phrases from the input document, giving both ngram and suffix methods high acceptance rates. Tasks that preserve even more of the input text -- such as summarization, text continuation, or guided rewriting (where the model is explicitly asked to maintain the original author's voice) -- would likely see even larger speedups, since draft acceptance rates would be higher.

### Models with Small/No Speedups

#### gemma_3_27b_it (1.00x -- baseline optimal)

**Root cause: Already well-balanced at tp=2.** The baseline configuration (tp=2, mns=256, mnbt=8192) already achieves 97-98% KV cache utilization with sufficient concurrency. There is no memory bottleneck to relieve and no compute slack for speculation to exploit.

Notably, speculative decoding consistently fails or degrades performance across **all** Gemma 3 model sizes:

- **gemma_3_1b_it**: Crashes with all spec methods (`server_fail`). The root cause is a **CUDA OOM during the rejection sampler warmup**. vLLM's speculative decoding verification step calls `logits.sort(dim=-1)` over the full vocabulary during CUDA graph warmup, and Gemma 3's large vocabulary (~258k tokens) requires ~12 GiB for this sort operation alone. Under the tier1-small config (`mns=4096`, `mnbt=32768`), speculative decoding also reduces the available KV cache (18.8 GiB vs 31.3 GiB without spec), leaving only ~6.5 GiB free -- far short of the 12 GiB needed. This is a vLLM-specific issue: the rejection sampler's full-vocabulary sort during warmup is a memory bottleneck for large-vocabulary models under high-concurrency settings.
- **gemma_3_270m_it**: Spec decoding runs successfully but *hurts* throughput: ngram-6 and ngram-8 show ~2% regression, suffix-32 shows ~18% regression (from 21k to 17.8k tps).
- **gemma_3_4b_it**: 2% regression with spec decoding.
- **gemma_3_27b_it**: 3% regression with spec decoding.

This consistent pattern across all Gemma 3 sizes (where the model doesn't OOM) suggests an architectural incompatibility. Gemma 3 uses a distinctive **5:1 local/global alternating attention** pattern with a 1024-token sliding window, which differs from the standard full attention used by Qwen3 and SmolLM2. The alternating attention pattern may interact poorly with speculative decoding's draft-and-verify mechanism, as the sliding window means that attention patterns are position-dependent in ways that make draft token verification less efficient.
#### gemma_3_12b_it, Qwen3_14B, Qwen3_8B (1.00x-1.03x)

**Root cause: "Goldilocks" model sizes.** These 8B-14B dense models fit comfortably on 1-2 GPUs with enough KV cache for high concurrency (~98% utilization at baseline). They are neither memory-bound (so increasing tp doesn't help -- it just adds cross-GPU communication overhead without freeing meaningful KV cache space) nor compute-bound enough for speculation to pay off (the overhead of disabling async scheduling and running verification passes outweighs the benefit of generating a few extra tokens per step). The vLLM defaults are essentially optimal for this size range on H100 GPUs.

### Summary of Patterns

| Factor | Helps When | Doesn't Help When |
|--------|-----------|-------------------|
| **Increasing tp** | Model is memory-bound (large MoE at tp=1) | Model already fits with good KV headroom |
| **Increasing mns/mnbt** | KV cache has room for more sequences | KV cache is already saturated |
| **Speculative decoding** | Model is compute-bound (small models) AND task has predictable outputs | Model is memory-bound or task outputs are unpredictable |
| **Increasing gmu** | KV cache is the bottleneck | Model weights already consume most memory |

The fundamental insight is that **optimization gains depend on identifying the bottleneck**: memory-bound models benefit from parallelism, compute-bound models benefit from speculation, and well-balanced models have little room for improvement.
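The table above collapses into a simple decision rule. A hypothetical sketch that encodes it (not any real tuner):

```python
def next_knob(memory_bound: bool, compute_bound: bool,
              predictable_output: bool) -> str:
    """Pick the tuning lever suggested by the summary table (sketch)."""
    if memory_bound:
        return "increase tp to free KV cache headroom"
    if compute_bound and predictable_output:
        return "enable speculative decoding"
    return "keep vLLM defaults"

print(next_knob(memory_bound=True, compute_bound=False, predictable_output=False))
print(next_knob(memory_bound=False, compute_bound=True, predictable_output=True))
```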
The Flash-Attn vLLM backend is more than 50% faster than FlashInfer [@flashinfer] across setups. This aligns with vLLM's [backend priority](https://docs.vllm.ai/en/latest/design/attention_backends/#backend-priority-cuda): on Ampere/Hopper (SM 8.x–9.x) Flash Attention is tried first, whereas on Blackwell (SM 10.x) FlashInfer has priority and may be faster there.