gemma-4-31B-it EAGLE-3 Draft (Korean-optimized)

An EAGLE-3 draft (speculator) that accelerates Korean generation from gemma-4-31B-it via speculative decoding. Publicly available drafts are trained on English data, so their Korean proposals are accepted at a low rate. This draft was therefore retrained on Korean prompts, with the responses regenerated on-policy by the verifier itself.

  • Method: EAGLE-3 (vllm-project/speculators)
  • Verifier (target): BCCard/gemma-4-31B-it-FP8-Dynamic (FP8; used for serving and hidden-state extraction)
  • Reused weights: BF16 embed/lm_head from google/gemma-4-31B-it (standard EAGLE-3)
  • Warm start: RedHatAI/gemma-4-31B-it-speculator.eagle3
  • Training data: ~150k prompts sampled from sh2orc/bccard-maywell-jojo0217-markai-lcw99-kendamarron-microsoft (1.71M-row Korean and English QA; instruction column only). Answers are discarded and regenerated on-policy by the verifier.
  • Sequence length: 8192
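
The data preparation described above (keep only the instruction column, sample a subset, let the verifier write new answers) can be sketched as follows. This is a minimal illustration, not the actual training pipeline: the function name `build_on_policy_pairs`, the `generate` callable (a stand-in for a request to the verifier), and the `output` column in the toy rows are all hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, List


def build_on_policy_pairs(
    rows: Iterable[Dict[str, str]],
    generate: Callable[[str], str],
    n_prompts: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample prompts from the `instruction` column and regenerate answers.

    `generate` is a hypothetical stand-in for a call to the verifier.
    Any answer column present in `rows` is ignored on purpose.
    """
    # Keep only non-empty prompts; original answers are never read.
    prompts = [row["instruction"] for row in rows if row.get("instruction")]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = rng.sample(prompts, min(n_prompts, len(prompts)))
    # The response always comes from the verifier, which makes it on-policy.
    return [{"prompt": p, "response": generate(p)} for p in sampled]


if __name__ == "__main__":
    toy_rows = [
        {"instruction": "한국의 수도는 어디인가요?", "output": "original answer (dropped)"},
        {"instruction": "Explain speculative decoding briefly.", "output": "original answer (dropped)"},
        {"instruction": "", "output": "row without a prompt is skipped"},
    ]
    # Dummy generator; in practice this would query the verifier.
    pairs = build_on_policy_pairs(
        toy_rows, lambda p: f"<verifier answer to: {p}>", n_prompts=2
    )
    for pair in pairs:
        print(pair)
```

Regenerating the answers with the verifier means the draft learns to imitate the token distribution it will actually be checked against at serving time, rather than the style of the original dataset answers.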

Serving (vLLM)

VLLM_USE_FLASHINFER_SAMPLER=0 vllm serve BCCard/gemma-4-31B-it-FP8-Dynamic -tp 1 \
  --max-model-len 8192 \
  --speculative-config '{
    "model": "BCCard/MoAI-gemma-4-31B-it-speculator.eagle3",
    "num_speculative_tokens": 4,
    "method": "eagle3",
    "draft_tensor_parallel_size": 1
  }'

Tune num_speculative_tokens in the 4–8 range based on the acceptance rate and tokens per second (TPS) measured on your own workload. The draft uses the verifier's tokenizer.
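
To compare draft lengths, one option is to generate one serve command per candidate value and benchmark each in turn. The sketch below only builds the command strings; it reuses the flags and model names from the command above and changes nothing but num_speculative_tokens. The helper name `serve_command` is hypothetical.

```python
import json
import shlex

VERIFIER = "BCCard/gemma-4-31B-it-FP8-Dynamic"
DRAFT = "BCCard/MoAI-gemma-4-31B-it-speculator.eagle3"


def serve_command(num_speculative_tokens: int) -> str:
    """Return the serve command from this card with a different draft length."""
    spec_config = {
        "model": DRAFT,
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle3",
        "draft_tensor_parallel_size": 1,
    }
    return " ".join([
        "VLLM_USE_FLASHINFER_SAMPLER=0",
        "vllm", "serve", VERIFIER, "-tp", "1",
        "--max-model-len", "8192",
        # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
        "--speculative-config", shlex.quote(json.dumps(spec_config)),
    ])


if __name__ == "__main__":
    # One command per candidate in the suggested 4-8 range.
    for k in range(4, 9):
        print(serve_command(k))
```

Each printed line can be pasted into a shell. Throughput and acceptance for each setting still have to be measured separately on representative traffic.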

Performance (per-position acceptance on the validation set, measured during training)

position   full_acc   cond_acc
0          0.638      0.638
1          0.380      0.595
2          0.235      0.618

Mean accepted length ≈ 2.3 tokens/step, or roughly a 2.2x speedup (measure on your own traffic). Train and validation metrics match closely, which suggests no significant overfitting.
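
The table and the mean accepted length can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that the card does not state explicitly: that cond_acc at position k is the acceptance rate given that all earlier positions were accepted (so full_acc is its running product), and that each step also emits one token from the verifier in addition to the accepted draft tokens.

```python
from itertools import accumulate
from operator import mul

# Values copied from the table above.
cond_acc = [0.638, 0.595, 0.618]
full_acc_reported = [0.638, 0.380, 0.235]

# Assumption: full_acc[k] = cond_acc[0] * ... * cond_acc[k].
full_acc = list(accumulate(cond_acc, mul))

# Expected accepted draft tokens per step is the sum of full_acc;
# the extra 1.0 is the token the verifier itself contributes (assumption).
expected_draft_tokens = sum(full_acc)
mean_accepted_length = 1.0 + expected_draft_tokens

if __name__ == "__main__":
    for k, (derived, reported) in enumerate(zip(full_acc, full_acc_reported)):
        print(f"position {k}: derived {derived:.3f}, reported {reported:.3f}")
    print(f"tokens per step over 3 draft positions: {mean_accepted_length:.2f}")
```

Under these assumptions the derived full_acc values agree with the reported ones to three decimals, and the three listed positions give about 2.25 tokens per step, in line with the ≈ 2.3 figure. Draft positions beyond those in the table would add a little more.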

Limitations

  • Trained on general Korean QA. For domain-specific traffic (e.g. finance), a further training round on domain-matched data may raise acceptance.
  • Acceptance is measured against the verifier BCCard/gemma-4-31B-it-FP8-Dynamic. Pairing the draft with a different target will change results.

License

Apache 2.0. The base Gemma 4 (Apache 2.0 since 2026-04, the first Gemma family to adopt it), the verifier BCCard/gemma-4-31B-it-FP8-Dynamic, and the RedHat EAGLE-3 warm-start checkpoint are all released under Apache 2.0, so this draft is as well. Apache 2.0 permits commercial use, modification, and redistribution, provided that copyright and license notices are retained and modified files are marked as changed. (This is informational, not legal advice.)
