gemma-4-31B-it EAGLE-3 Draft (Korean-optimized)

An EAGLE-3 draft (speculator) for accelerating Korean generation from gemma-4-31B-it via speculative decoding. Publicly available drafts are trained on English and accept Korean tokens poorly, so this draft was retrained on Korean prompts with on-policy responses regenerated by the verifier itself.

Method: EAGLE-3 (vllm-project/speculators)
Verifier (target): BCCard/gemma-4-31B-it-FP8-Dynamic (FP8; used for serving and hidden-state extraction)
Reused weights: BF16 embed/lm_head from google/gemma-4-31B-it (standard EAGLE-3)
Warm start: RedHatAI/gemma-4-31B-it-speculator.eagle3
Training data: ~150k prompts sampled from sh2orc/bccard-maywell-jojo0217-markai-lcw99-kendamarron-microsoft (1.71M-row Korean and English QA; instruction column only). Answers are discarded and regenerated on-policy by the verifier.
Sequence length: 8192

Serving (vLLM)

VLLM_USE_FLASHINFER_SAMPLER=0 vllm serve BCCard/gemma-4-31B-it-FP8-Dynamic -tp 1 \
  --max-model-len 8192 \
  --speculative-config '{
    "model": "BCCard/MoAI-gemma-4-31B-it-speculator.eagle3",
    "num_speculative_tokens": 4,
    "method": "eagle3",
    "draft_tensor_parallel_size": 1
  }'

Tune num_speculative_tokens in the 4–8 range based on measured acceptance / TPS. The draft uses the verifier's tokenizer.

Performance (per-position acceptance at training time, validation)

position	full_acc	cond_acc
0	0.638	0.638
1	0.380	0.595
2	0.235	0.618

Mean accepted length ≈ 2.3 tokens/step, roughly ~2.2x speedup (measure on your own traffic). Train and validation metrics match closely, so there is no overfitting.

Limitations

Trained on general Korean QA. Domain-specific traffic (e.g. finance) may benefit from one more training cycle on domain-matched data, raising acceptance.
Acceptance is measured against the verifier BCCard/gemma-4-31B-it-FP8-Dynamic. Pairing the draft with a different target will change results.

License

Apache 2.0. The base Gemma 4 (Apache 2.0 since 2026-04, the first Gemma family to adopt it), the verifier BCCard/gemma-4-31B-it-FP8-Dynamic (Apache 2.0), and the RedHat EAGLE-3 warm-start checkpoint are all Apache 2.0, so this draft is released under Apache 2.0 as well. Apache 2.0 requires only attribution of the original copyright and disclosure of modifications, with no restrictions on commercial use, modification, or redistribution. (This is informational, not legal advice.)