EAGLE-3 Draft Model for t-tech/T-pro-it-2.1-FP8
EAGLE-3 draft model for accelerating t-tech/T-pro-it-2.1-FP8 inference with speculative decoding.
EAGLE-3 is lossless: the draft proposes several tokens and the base model verifies them in a single forward pass. Under greedy decoding the output matches standard decoding token for token (up to floating-point effects), and under sampling it follows the same distribution. The quality metric is mean acceptance length (tokens accepted per base-model forward); a value above 2.0 gives a useful speedup. Wall-clock tok/s depends on hardware, batch size and context length, so measure it on your own setup.
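As a rough illustration of the metric, the sketch below averages the number of tokens produced per base-model forward pass. The function name and the sample log are hypothetical, and it assumes each entry already includes the one token the base model contributes on every verification step; check how your serving stack counts this before comparing numbers.

```python
from typing import Sequence


def mean_acceptance_length(tokens_per_step: Sequence[int]) -> float:
    """Average number of tokens emitted per base-model forward pass.

    Assumption: each entry is the number of tokens produced by one
    verification step (accepted draft tokens plus the token supplied
    by the base model itself).
    """
    if not tokens_per_step:
        raise ValueError("need at least one verification step")
    return sum(tokens_per_step) / len(tokens_per_step)


# Hypothetical log of five verification steps (not measured data).
steps = [3, 1, 4, 2, 5]
print(mean_acceptance_length(steps))  # 3.0
```

A value of 3.0 here would mean the base model runs roughly once per three generated tokens; the actual speedup is smaller because the draft model's own forward passes are not free.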
What is EAGLE-3?
EAGLE-3 is a speculative decoding method in which a small (~1B) draft model predicts several tokens ahead, while the larger base model verifies them in a single forward pass. Unlike EAGLE/EAGLE-2, EAGLE-3 predicts tokens directly (without feature prediction) and fuses features from multiple layers of the target model (low-, mid-, and high-level).
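The draft-and-verify loop can be sketched with toy stand-ins for the two models. This is a minimal greedy-decoding illustration, not the EAGLE-3 implementation: `base_model` and `draft_model` are hypothetical functions over token ids, the feature fusion is omitted, and verification is written as a Python loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(base: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One greedy draft-and-verify step; returns the tokens to append."""
    # The draft proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # The base model checks every proposed position.
    accepted: List[int] = []
    for tok in proposal:
        expected = base(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the base token, stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the base model adds one extra token.
    accepted.append(base(context + accepted))
    return accepted


def generate(base: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(base, draft, out, k))
    return out[: len(prompt) + n_new]


def plain_greedy(base: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(base(out))
    return out


def base_model(ctx: List[int]) -> int:
    # Toy deterministic "model": any pure function of the context works.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Toy draft: agrees with the base except when len(ctx) is a multiple of 4.
    guess = base_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


prompt = [7, 3, 9]
assert generate(base_model, draft_model, prompt, 20, k=5) == plain_greedy(base_model, prompt, 20)
```

The final assertion is the point of the sketch: every emitted token is either confirmed or supplied by the base model, so the greedy output is unchanged however often the draft is wrong.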
Usage — vLLM (v0.9+)
This repo is in vLLM EAGLE-3 format (model.safetensors plus an Eagle3LlamaForCausalLM config). Serve it as the speculative model:
vllm serve t-tech/T-pro-it-2.1-FP8 \
--speculative-config '{"model":"VirVen/T-pro-it-2.1-EAGLE_V3","method":"eagle3","num_speculative_tokens":5}' \
--dtype bfloat16 \
--tensor-parallel-size 2
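Hand-quoting the JSON passed to --speculative-config is easy to get wrong. As a convenience, the sketch below builds the same command from Python using only the standard library; `build_command` is a hypothetical helper, and the flags and values are the ones shown above.

```python
import json
import shlex


def build_command(num_speculative_tokens: int = 5,
                  tensor_parallel_size: int = 2) -> str:
    """Return a shell-safe `vllm serve` command line for this draft model."""
    spec_config = {
        "model": "VirVen/T-pro-it-2.1-EAGLE_V3",
        "method": "eagle3",
        "num_speculative_tokens": num_speculative_tokens,
    }
    args = [
        "vllm", "serve", "t-tech/T-pro-it-2.1-FP8",
        "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
        "--dtype", "bfloat16",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # shlex.join quotes the JSON argument so the shell passes it through intact.
    return shlex.join(args)


print(build_command())
```

Varying `num_speculative_tokens` this way makes it straightforward to compare acceptance length across draft depths on your own hardware.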
Training Details
- Base model: t-tech/T-pro-it-2.1-FP8 (Qwen3-32B architecture, FP8)
- Draft model: ~1B parameters, one transformer layer, attention input size of 2×hidden_size (concatenated token embedding and fused features)
- Feature fusion: layers 8 (low-level), 32 (mid-level), and 62 (high-level) of the target model
- Data: ~50k examples from the Saiga dataset (Russian-language conversations)
- Training: DeepSpeed ZeRO-2
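The tensor widths implied by the draft-model and feature-fusion entries above can be walked through as follows. This is a shape-only sketch under two assumptions not stated in this card: that the target's hidden_size is 5120 (check the base model's config.json), and that the three concatenated layer outputs are projected back to hidden_size before being joined with the token embedding. `DraftShapes` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DraftShapes:
    hidden_size: int                 # assumed width of the target model
    fusion_layers: Tuple[int, ...]   # target layers whose outputs are fused

    @property
    def concatenated_features(self) -> int:
        # Hidden states from each tapped layer are concatenated.
        return self.hidden_size * len(self.fusion_layers)

    @property
    def fused_features(self) -> int:
        # Assumption: a linear projection maps the concatenation back to hidden_size.
        return self.hidden_size

    @property
    def attention_input(self) -> int:
        # Token embedding (hidden_size) concatenated with the fused features.
        return self.hidden_size + self.fused_features


shapes = DraftShapes(hidden_size=5120, fusion_layers=(8, 32, 62))
print(shapes.concatenated_features)  # 15360
print(shapes.attention_input)        # 10240
```

Under these assumptions the attention input is 2×hidden_size, which matches the draft-model description in the list above.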