joecr-chandra-2-eagle3.1

An EAGLE‑3.1 speculative‑decoding draft head for datalab-to/chandra-ocr-2 (a 5B Qwen3.5‑based vision‑language OCR model). Drop it into vLLM as the speculator to accelerate single‑stream (latency‑bound) OCR decoding losslessly — every drafted token is verified by the target, so output quality is preserved.

Results

Speedup (single‑stream, greedy, vLLM + CUDA graphs, RTX 3090 Ti):

tok/s speedup
baseline (no spec) 58.0 1.0×
+ this head (num_spec=5) 104.0 1.79×

Measured on olmOCR‑bench pages. The ratio is condition‑dependent: it's highest at low GPU power / low concurrency (the regime above) and drops toward ~1.4× at full power, and below 1× at high batch (saturated GPU — see note). Treat ~1.4–1.8× single‑stream as the realistic range.

Serving acceptance (held‑out OmniDocBench, 80 pages, num_spec=5):

mean accepted length τ per‑position a0…a4
this head 2.31 [0.49, 0.31, 0.22, 0.16, 0.12]

(Note: in‑distribution training acc0 ≈ 0.92 — that's a teacher‑forced ceiling on the training corpus, not the serving number above. Real‑world acceptance depends on the document distribution.)

Quality: lossless in content — median char‑similarity vs baseline ≈ 0.999; differences are confined to bbox‑pixel jitter / equivalent figure captions (bf16 greedy non‑determinism, not degradation).

Batching note: spec‑decode wins at low concurrency (interactive). At full batch the GPU saturates and the draft becomes overhead (~0.75× baseline) — disable spec or drop num_speculative_tokens to 1–2 for bulk throughput.

Architecture

EAGLE‑3.1 (single decoder layer) over Chandra‑2's Qwen3.5 text backbone:

  • hidden_size 2560, head_dim 256, GQA 16/4 — matched to the target.
  • EAGLE 3.1: fc_norm (per‑aux‑layer RMSNorm) + norm_output (post‑norm recurrence) — fixes attention drift, holds acceptance at depth.
  • Vocab pruning: draft_vocab_size 32768 (from 248320; 99.99% token coverage) → ~7.6× smaller lm_head. d2t buffer maps pruned ids back to the full vocab.
  • Aux hidden states from the target's full‑attention layers [3, 15, 27] (Chandra‑2 is a hybrid 3:1 linear/full‑attention model).

Usage (vLLM)

from vllm import LLM
llm = LLM(
    model="datalab-to/chandra-ocr-2",
    speculative_config={
        "model": "jbarrow/joecr-chandra-2-eagle3.1",
        "method": "eagle3",
        "num_speculative_tokens": 5,
    },
    limit_mm_per_prompt={"image": 1},
)

Use the Chandra ocr_layout prompt in the user turn (image + instruction).

Training

  • On‑policy data: Chandra‑2's own HTML‑with‑layout OCR over ~36.6k document pages.
  • Features: hidden states extracted offline from the frozen 5B target (decoupled from draft training so it fits a 24 GB GPU and trains with true DDP gradient sharing).
  • Objective: forward‑KL distillation with TTT (length 7), in vLLM's EAGLE input‑shift convention.
  • Schedule: 2 epochs via chunked‑rolling DDP (extract a chunk → train → continue), since the full rep set exceeds disk. 4× RTX 3090 Ti.
Downloads last month
37
Safetensors
Model size
0.9B params
Tensor type
I64
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for jbarrow/joecr-chandra-2-eagle3.1

Finetuned
(5)
this model