EAGLE-3 Draft Model for t-tech/T-pro-it-2.1-FP8

EAGLE-3 draft model for accelerating t-tech/T-pro-it-2.1-FP8 inference with speculative decoding.

EAGLE-3 is lossless: the draft proposes several tokens and the base model verifies them in a single forward pass, so the output matches standard decoding (token for token under greedy decoding, and in distribution under sampling). The quality metric is mean acceptance length (tokens accepted per base-model forward); values above 2.0 generally give a useful speedup. Wall-clock tok/s depends on hardware, batch size and context length, so measure it on your own setup.
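As a rough illustration of the metric, mean acceptance length can be computed from a log of how many draft tokens were accepted at each verification step. This is a minimal sketch: the function name, the `count_bonus_token` flag and the sample numbers are hypothetical, and whether the token produced by the base model itself is counted is a convention that differs between tools.

```python
def mean_acceptance_length(accepted_per_step, count_bonus_token=True):
    """Average number of tokens emitted per base-model verification forward.

    accepted_per_step: accepted draft tokens for each verification step.
    count_bonus_token: assumption; if True, also count the one token the
    base model contributes itself at every step.
    """
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    bonus = 1 if count_bonus_token else 0
    total = sum(n + bonus for n in accepted_per_step)
    return total / len(accepted_per_step)


# Hypothetical log: accepted draft tokens over five verification steps.
accepted = [3, 0, 5, 2, 1]
print(f"mean acceptance length: {mean_acceptance_length(accepted):.2f}")
```

Compare numbers only between runs that use the same counting convention and the same `num_speculative_tokens`.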

What is EAGLE-3?

EAGLE-3 is a speculative decoding method in which a small (~1B) draft model predicts several tokens ahead, while the larger base model verifies them in a single forward pass. Unlike EAGLE/EAGLE-2, EAGLE-3 predicts tokens directly (without feature prediction) and fuses features from multiple layers of the target model (low-, mid-, and high-level).
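The draft-then-verify loop described above can be sketched with toy stand-ins for both models. Everything here is hypothetical: `target_next` and `draft_next` are arithmetic toys rather than neural networks, drafting is a simple chain (real EAGLE-3 implementations may draft a tree), and verification is simulated with a loop instead of one batched forward pass. The point is only to show why greedy output stays identical to plain decoding.

```python
def target_next(tokens):
    # Hypothetical "base model": deterministic greedy next token.
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens):
    # Hypothetical "draft model": agrees with the target unless the last
    # token is a multiple of 4, in which case it guesses wrong.
    guess = target_next(tokens)
    return (guess + 1) % 11 if tokens[-1] % 4 == 0 else guess


def greedy_decode(prompt, n_new):
    # Baseline: one target forward per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt, n_new, k=5):
    out = list(prompt)
    goal = len(prompt) + n_new
    forwards = 0
    while len(out) < goal:
        # 1) The draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) One verification step: keep the longest prefix the target agrees with.
        forwards += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))  # bonus token if all k matched
        out.extend(accepted)
    return out[:goal], forwards


prompt = [1, 2]
spec_out, n_forwards = speculative_decode(prompt, n_new=20)
assert spec_out == greedy_decode(prompt, n_new=20)  # same tokens as plain greedy decoding
print(f"20 tokens in {n_forwards} verification steps")
```

Every emitted token is either confirmed or produced by the target, which is why a weaker draft lowers speed but not output quality.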

Usage — vLLM (v0.9+)

This repo is in vLLM EAGLE-3 format (model.safetensors + Eagle3LlamaForCausalLM config). Serve it as the speculative model:

vllm serve t-tech/T-pro-it-2.1-FP8 \
    --speculative-config '{"model":"VirVen/T-pro-it-2.1-EAGLE_V3","method":"eagle3","num_speculative_tokens":5}' \
    --dtype bfloat16 \
    --tensor-parallel-size 2
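
When comparing several values of num_speculative_tokens, it can help to generate the command programmatically so the JSON stays valid. The helper below is a hypothetical convenience wrapper, not part of vLLM; it only reassembles the flags shown above.

```python
import json
import shlex


def build_serve_command(base_model, draft_model,
                        num_speculative_tokens=5, tensor_parallel_size=2):
    """Return the `vllm serve` argument list used in this README."""
    spec = {
        "model": draft_model,
        "method": "eagle3",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", base_model,
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--dtype", "bfloat16",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


cmd = build_serve_command("t-tech/T-pro-it-2.1-FP8", "VirVen/T-pro-it-2.1-EAGLE_V3")
print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```

The list form can also be passed directly to `subprocess.run`, which avoids shell quoting issues around the JSON argument.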

Training Details

  • Base model: t-tech/T-pro-it-2.1-FP8 (Qwen3-32B architecture, FP8)
  • Draft model: ~1B parameters, one transformer layer, attention input size of 2×hidden_size (concatenated token embedding and fused features)
  • Feature fusion: layers 8 (low-level), 32 (mid-level), and 62 (high-level) of the target model
  • Data: ~50k examples from the Saiga dataset (Russian-language conversations)
  • Training: DeepSpeed ZeRO-2
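
The dimensions implied by the list above can be checked with a little arithmetic. This is a sketch under stated assumptions: `HIDDEN_SIZE = 5120` is the value assumed here for Qwen3-32B (verify it against the target's config.json), and the step that concatenates the three layer outputs and projects them back to hidden_size follows the general EAGLE-3 design rather than anything confirmed for this checkpoint. The class name `DraftShapes` is hypothetical.

```python
from dataclasses import dataclass

HIDDEN_SIZE = 5120  # assumption: hidden size of the Qwen3-32B target


@dataclass
class DraftShapes:
    hidden_size: int
    fused_layers: tuple = (8, 32, 62)  # target layers listed in this README

    @property
    def fused_feature_dim(self):
        # Width of the concatenated low-, mid- and high-level hidden states
        # before they are projected back to hidden_size (assumed design).
        return len(self.fused_layers) * self.hidden_size

    @property
    def attention_input_dim(self):
        # Token embedding concatenated with the fused features.
        return 2 * self.hidden_size


shapes = DraftShapes(HIDDEN_SIZE)
print(shapes.fused_feature_dim)    # 15360
print(shapes.attention_input_dim)  # 10240
```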