joecr-chandra-2-eagle3.1
An EAGLE‑3.1 speculative‑decoding draft head for datalab-to/chandra-ocr-2 (a 5B Qwen3.5‑based vision‑language OCR model). Drop it into vLLM as the speculator to accelerate single‑stream (latency‑bound) OCR decoding losslessly — every drafted token is verified by the target, so output quality is preserved.
Results
Speedup (single‑stream, greedy, vLLM + CUDA graphs, RTX 3090 Ti):
| tok/s | speedup | |
|---|---|---|
| baseline (no spec) | 58.0 | 1.0× |
| + this head (num_spec=5) | 104.0 | 1.79× |
Measured on olmOCR‑bench pages. The ratio is condition‑dependent: it's highest at low GPU power / low concurrency (the regime above) and drops toward ~1.4× at full power, and below 1× at high batch (saturated GPU — see note). Treat ~1.4–1.8× single‑stream as the realistic range.
Serving acceptance (held‑out OmniDocBench, 80 pages, num_spec=5):
| mean accepted length τ | per‑position a0…a4 | |
|---|---|---|
| this head | 2.31 | [0.49, 0.31, 0.22, 0.16, 0.12] |
(Note: in‑distribution training acc0 ≈ 0.92 — that's a teacher‑forced ceiling on the training corpus, not the serving number above. Real‑world acceptance depends on the document distribution.)
Quality: lossless in content — median char‑similarity vs baseline ≈ 0.999; differences are confined to bbox‑pixel jitter / equivalent figure captions (bf16 greedy non‑determinism, not degradation).
Batching note: spec‑decode wins at low concurrency (interactive). At full batch the GPU saturates and the draft becomes overhead (~0.75× baseline) — disable spec or drop num_speculative_tokens to 1–2 for bulk throughput.
Architecture
EAGLE‑3.1 (single decoder layer) over Chandra‑2's Qwen3.5 text backbone:
hidden_size2560,head_dim256, GQA 16/4 — matched to the target.- EAGLE 3.1:
fc_norm(per‑aux‑layer RMSNorm) +norm_output(post‑norm recurrence) — fixes attention drift, holds acceptance at depth. - Vocab pruning:
draft_vocab_size32768 (from 248320; 99.99% token coverage) → ~7.6× smaller lm_head.d2tbuffer maps pruned ids back to the full vocab. - Aux hidden states from the target's full‑attention layers
[3, 15, 27](Chandra‑2 is a hybrid 3:1 linear/full‑attention model).
Usage (vLLM)
from vllm import LLM
llm = LLM(
model="datalab-to/chandra-ocr-2",
speculative_config={
"model": "jbarrow/joecr-chandra-2-eagle3.1",
"method": "eagle3",
"num_speculative_tokens": 5,
},
limit_mm_per_prompt={"image": 1},
)
Use the Chandra ocr_layout prompt in the user turn (image + instruction).
Training
- On‑policy data: Chandra‑2's own HTML‑with‑layout OCR over ~36.6k document pages.
- Features: hidden states extracted offline from the frozen 5B target (decoupled from draft training so it fits a 24 GB GPU and trains with true DDP gradient sharing).
- Objective: forward‑KL distillation with TTT (length 7), in vLLM's EAGLE input‑shift convention.
- Schedule: 2 epochs via chunked‑rolling DDP (extract a chunk → train → continue), since the full rep set exceeds disk. 4× RTX 3090 Ti.
- Downloads last month
- 37
Model tree for jbarrow/joecr-chandra-2-eagle3.1
Base model
datalab-to/chandra-ocr-2