EAGLE-3 Draft Model for t-tech/T-pro-it-2.1-FP8

EAGLE-3 draft model for accelerating t-tech/T-pro-it-2.1-FP8 inference with speculative decoding.

EAGLE-3 is lossless: the draft proposes several tokens and the base model verifies them in a single forward pass, so the output matches standard decoding (token for token under greedy decoding, in distribution under sampling). The quality metric is mean acceptance length (tokens accepted per base-model forward pass); a value above 2.0 gives a useful speedup. Wall-clock tok/s depends on hardware, batch size, and context length, so measure it on your own setup.
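
To make the metric concrete, here is a minimal sketch of how mean acceptance length relates to a rough speedup estimate. The per-step counts, the cost model (one base pass plus a fixed fraction of a base pass per drafted token), and the function names are illustrative assumptions, not measurements from this model.

```python
def mean_acceptance_length(tokens_per_step):
    """Average number of tokens accepted per base-model forward pass."""
    if not tokens_per_step:
        raise ValueError("need at least one verification step")
    return sum(tokens_per_step) / len(tokens_per_step)


def estimated_speedup(mean_accept_len, num_draft_tokens, draft_cost_ratio):
    """Rough speedup over plain decoding (assumed cost model).

    Each step is assumed to cost one base forward pass plus
    num_draft_tokens draft passes, where one draft pass costs
    draft_cost_ratio of a base pass. Plain decoding yields one
    token per base pass.
    """
    step_cost = 1.0 + num_draft_tokens * draft_cost_ratio
    return mean_accept_len / step_cost


# Hypothetical counts of tokens accepted at each verification step.
steps = [3, 1, 4, 2, 3, 2]
tau = mean_acceptance_length(steps)

print(f"mean acceptance length: {tau:.2f}")
print(f"estimated speedup: {estimated_speedup(tau, 5, 0.05):.2f}x")
```

With these made-up numbers the sketch reports a mean acceptance length of 2.50 and an estimated speedup of 2.00x. Real speedups also depend on batching and memory bandwidth, which this simple model ignores.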

What is EAGLE-3?

EAGLE-3 is a speculative decoding method in which a small (~1B) draft model predicts several tokens ahead, while the larger base model verifies them in a single forward pass. Unlike EAGLE/EAGLE-2, EAGLE-3 predicts tokens directly (without feature prediction) and fuses features from multiple layers of the target model (low-, mid-, and high-level).
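
The draft-then-verify loop can be illustrated with a toy sketch. It assumes greedy decoding and a single chain of draft tokens, and it replaces both models with small deterministic functions (`base_next` and `draft_next` are hypothetical stand-ins). It does not model EAGLE-3's feature fusion or tree drafting; it only shows why verification keeps the output identical to plain decoding.

```python
def base_next(context):
    # Toy stand-in for the base model: a deterministic next token.
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    # Toy stand-in for the draft model: agrees with the base model
    # except when the context length is 3 modulo 4.
    tok = base_next(context)
    return tok if len(context) % 4 != 3 else (tok + 1) % 50


def plain_decode(prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(base_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, k):
    out = list(prompt)
    base_passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The base model checks the proposal. A real system scores
        #    all positions in one forward pass; here we count one pass.
        base_passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = base_next(ctx)
            if tok != expected:
                ctx.append(expected)  # first mismatch: keep the base token
                break
            ctx.append(tok)  # match: the draft token is accepted
        else:
            ctx.append(base_next(ctx))  # all accepted: one bonus token
        out = ctx
    return out[len(prompt):][:n], base_passes


tokens, passes = speculative_decode([1, 2, 3], n=20, k=5)
print(tokens == plain_decode([1, 2, 3], 20), passes)
```

Every emitted token is either a draft token the base model agreed with or the base model's own token, so the result equals plain decoding while using fewer base-model passes.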

Usage — vLLM (v0.9+)

This repo is in vLLM EAGLE-3 format (model.safetensors + Eagle3LlamaForCausalLM config). Serve it as the speculative model:

vllm serve t-tech/T-pro-it-2.1-FP8 \
    --speculative-config '{"model":"VirVen/T-pro-it-2.1-EAGLE_V3","method":"eagle3","num_speculative_tokens":5}' \
    --dtype bfloat16 \
    --tensor-parallel-size 2
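
Once the server is running, clients talk to it like any other vLLM deployment; the request names only the base model, and speculation happens server-side. Below is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default address `http://localhost:8000` and exposes the OpenAI-compatible `/v1/completions` endpoint; `build_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: default vLLM host and port
MODEL = "t-tech/T-pro-it-2.1-FP8"   # the base model passed to `vllm serve`


def build_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the request and return the generated text (needs a live server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Hello")` while the server is up returns the generated continuation as a string. Adjust `BASE_URL` if you started the server with a different host or port.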

Training Details

Base model: t-tech/T-pro-it-2.1-FP8 (Qwen3-32B architecture, FP8)
Draft model: ~1B parameters, one transformer layer, attention input size of 2×hidden_size (concatenated token embedding and fused features)
Feature fusion: layers 8 (low-level), 32 (mid-level), and 62 (high-level) of the target model
Data: ~50k examples from the Saiga dataset (Russian-language conversations)
Training: DeepSpeed ZeRO-2
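
The shapes behind "feature fusion" and the 2×hidden_size attention input can be sketched as follows. This is a toy illustration in plain Python, assuming the general EAGLE-3 layout (hidden states from three layers concatenated, projected back to hidden_size, then concatenated with the token embedding). The toy size, the 64-layer count, the 0-based layer indexing, the concatenation order, and the name `fc_weight` are assumptions, not details confirmed by this repo.

```python
import random

HIDDEN = 8                  # toy size; the real hidden_size is much larger
NUM_LAYERS = 64             # assumed layer count of the target model
FUSED_LAYERS = (8, 32, 62)  # target-model layers listed above


def linear(x, weight):
    """Apply a bias-free linear map: weight has out_dim rows of len(x)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def draft_attention_input(hidden_states, token_embedding, fc_weight):
    # Concatenate the three selected layers: 3 * HIDDEN values.
    fused = [v for layer in FUSED_LAYERS for v in hidden_states[layer]]
    # Project the fused vector back down to HIDDEN values.
    features = linear(fused, fc_weight)
    # Concatenate embedding and features: 2 * HIDDEN values.
    return token_embedding + features


rng = random.Random(0)
hidden_states = [
    [rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_LAYERS)
]
token_embedding = [rng.uniform(-1, 1) for _ in range(HIDDEN)]
fc_weight = [
    [rng.uniform(-1, 1) for _ in range(3 * HIDDEN)] for _ in range(HIDDEN)
]

x = draft_attention_input(hidden_states, token_embedding, fc_weight)
print(len(x) == 2 * HIDDEN)
```

The single draft layer therefore sees, for each position, both the token that was just produced and a compressed view of what the target model computed at low, middle, and high depth.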